
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 15, NO. 6, JUNE 2007 637

Gate Level Multiple Supply Voltage Assignment Algorithm for Power Optimization Under Timing Constraint

Jun Cheng Chi, Student Member, IEEE, Hung Hsie Lee, Sung Han Tsai, and Mely Chen Chi, Member, IEEE

Abstract: We propose a multiple supply voltage scaling algorithm for low power designs. The algorithm combines a greedy approach and an iterative improvement optimization approach. In phase I, it simultaneously scales down as many gates as possible to lower supply voltages. In phase II, a multiple way partitioning algorithm is applied to further refine the supply voltage assignment of gates to reduce the total power consumption. During both phases, the timing correctness of the circuit is maintained. Level converters (LCs) are adjusted correctly according to the local connectivity of the different supply voltage driven gates. Experimental results show that the proposed algorithm can effectively convert the unused slack of gates into power savings. We use two of the ISPD2001 benchmarks and all of the ISCAS89 benchmarks as test cases. The 0.13-μm CMOS TSMC library is used. On average, the proposed algorithm improves the power consumption of the original design by 42.5% with a 10.6% overhead in the number of LCs. Our study shows that the key factor in achieving power saving is including the most suitable supply voltage in the scaling process.

Index Terms: Algorithms, low power, multiple voltages assignment, partition, power optimization, voltage scaling.

I. INTRODUCTION

POWER consumption is an important issue in modern VLSI designs. The power consumption of CMOS circuits consists of two factors. One is dynamic power consumption. The other is static power consumption caused by the leakage current. In practice, the leakage power contributes only 1% to the total power consumption when the circuit is in the active mode. Therefore, in this paper we focus on the power minimization for the dynamic power. Dynamic power consumption includes switching power and short-circuit power. Switching power dissipation occurs when current flows to charge and discharge parasitic capacitance. Short-circuit power consumption is caused by the short-circuit current that flows when both n-channel and p-channel transistors are momentarily on at the same time. Switching power consumption in CMOS circuits is proportional to the square of the supply voltage ($V_{DD}$). Applying a voltage scaling technique that changes the supply voltage of gates to a lower value in CMOS circuits is an effective way of reducing power consumption.
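As a quick illustration of this quadratic dependence, consider a single gate whose load and switching behavior are held fixed while its supply is lowered from 1.2 V to 0.6 V (the two voltages used later in the experiments). Writing $P_{sw}$ for the switching power (a symbol introduced here only for illustration), the ratio is:

```latex
\frac{P_{sw}(0.6\,\mathrm{V})}{P_{sw}(1.2\,\mathrm{V})}
  = \left(\frac{0.6}{1.2}\right)^{2}
  = 0.25
```

Under these assumptions the switching power of that gate drops by 75%, before any level converter overhead is counted.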

Manuscript received May 3, 2006; revised September 21, 2006. This work was supported by the National Science Council of Taiwan under Grant NSC94-2215-E-033-003 and Grant NSC95-2221-E-033-077-MY3.

J. C. Chi is with the Department of Electronic Engineering, Chung Yuan Christian University, Chung Li 32023, Taiwan.

H. H. Lee, S. H. Tsai, and M. C. Chi are with the Department of Information and Computer Engineering, Chung Yuan Christian University, Chung Li 32023, Taiwan.

Digital Object Identifier 10.1109/TVLSI.2007.898650

Fig. 1. Average distribution of gates with different slacks for 16 MCNC91 benchmarks [1].

Scaling down the supply voltage of a gate will cause the gate to have a longer gate delay. In order to maintain the correctness of the timing, only the gates along noncritical paths are assigned to a lower supply voltage to convert the unused slack into power savings. The average distribution of gates with different slacks for 16 MCNC91 benchmarks, as presented in [1], is shown in Fig. 1. In Fig. 1, the slack of each gate was normalized to the longest path delay. It may be seen from this figure that the number of gates on critical paths (i.e., gates with zero or close-to-zero slack) accounts for only about 14% of the total number of gates. Gates with a slack larger than 0.2 make up more than 60% of the total number of gates. This means that there is plenty of room for power reduction via the utilization of lower supply voltages on the gates with large slack.

However, in a voltage-scaled circuit, if a lower supply voltage gate (a VDDL gate) drives a higher supply voltage gate (a VDDH gate), a level converter (an “LC”) must be inserted as a bridge between these two gates [2]. This is because the output signal of the VDDL gate will cause a static current flow from VDD to VSS in the VDDH gate. An example is shown in Fig. 2, in which the inverter is a VDDH gate driven by a VDDL gate. Since the voltage of the input signal of the inverter will not be higher than VDDL even when the input signal is at the “HIGH” level, the pMOS in this inverter may not be cut off if $V_{DDL} < V_{DDH} - |V_{th,p}|$, where $V_{th,p}$ represents the threshold voltage of the pMOS. This will cause a static current flow from VDD to VSS through the pMOS and the nMOS. Thus, an LC is needed between a VDDL and a VDDH gate to prevent the creation of a static current. However, the LC also consumes power, adds timing delay, and increases the chip area. An LC is not required if a VDDH gate drives a VDDL gate. The number of LCs in a voltage-scaled design is determined by the voltage scaling algorithm. Fig. 3 shows an example of an LC [2].

1063-8210/$25.00 © 2007 IEEE

Fig. 2. Illustration of the static current flow in a VDDH inverter when it is directly connected to a VDDL gate.

Fig. 3. Conventional level converter [2].

In [2] and [3], Usami et al. proposed a dual supply voltage approach, namely clustered voltage scaling (CVS). CVS is based on a topological constraint that only allows a transition from a VDDH gate to a VDDL gate along paths from the primary inputs (PIs) to the primary outputs (POs). Thus, no LCs are required. In the resultant cluster structure, some gates with a high value of slack cannot be assigned to VDDL. The ECVS [4] relaxes this topological constraint by allowing a VDDL gate to drive a VDDH gate if an LC is inserted. The assignment is performed by visiting gates in a reverse levelized manner. A design methodology and design flow for implementing this netlist structure was proposed in [5]. Both algorithms, CVS and ECVS, assign the appropriate power supply to each gate by traversing a combinational circuit from the POs to the PIs in a levelized order. The levelized approach restricts the possibility of creating a solution with lower power consumption.

Without using the levelization approach, Yeh and Kuo [6] proposed an optimization-based multiple supply voltage scaling algorithm, OB MVS. The OB MVS algorithm first scales all gates to the lowest supply voltage, and then scales up the supply voltage of the gates according to a critical order until the timing constraint is met. Gates that are shared by more critical paths have a higher critical order. This work does not consider the insertion of LCs. The calculation of the critical order of each gate is very time consuming. Kulkarni et al. [7] proposed an algorithm (GECVS) that greedily assigns power supplies based on a sensitivity measure. This sensitivity measure considers slack changes and power saving at each voltage assignment. The algorithm starts from a VDDH netlist and then scales down the supply voltage of the gate with the highest sensitivity. This algorithm saves more power than the CVS and ECVS algorithms. However, these algorithms use greedy approaches without any refinement process, and thus, power consumption may not be optimized.

Chen et al. [1] translated the power optimization problem to a maximal-weighted-independent set (MWIS) problem. The authors first estimate the lower bound of the supply voltage of each gate that meets the timing constraints. Then, the voltage of each gate is assigned by the proposed dual-voltage-power-optimization (DVPO) algorithm. In order to reduce the power penalty of LCs, a constrained F-M algorithm is used to reduce the number of LCs. Kang et al. [8] proposed a scheduling and resource allocation approach for multiple voltage scaling. They used a data-flow graph (DFG) to represent the netlist and then calculated the timing slack of each node in the DFG. After the initial voltage assignment of each node, a pair-wise multiple way graph partition algorithm was performed to further improve the power consumption while not violating the given timing constraints. This work was enhanced in [9]. Manzak and Chakrabarti [10] present resource and latency constrained scheduling algorithms to minimize power/energy consumption when the resources operate at multiple voltages. The proposed algorithms are based on efficient distribution of slack among the nodes in the data-flow graph. The distribution procedure tries to implement the minimum energy relation derived using the Lagrange multiplier method in an iterative fashion. Mohanty and Ranganathan [11] present an integer linear programming (ILP)-based power minimization algorithm. The algorithm simultaneously minimizes the peak and average power during behavioral level datapath scheduling. The datapaths can function in three modes of operation: 1) single supply voltage and single frequency; 2) multiple supply voltages and dynamic frequency clocking; and 3) multiple supply voltages and multicycling. Recently, some research [12], [13] has combined voltage scaling and $V_{th}$ assignment to further reduce the power consumption of a circuit. For example, in [13], Hung et al. proposed an algorithm that utilizes the Genetic Algorithm to perform the multi-$V_{DD}$ assignment, the multi-$V_{th}$ assignment, device sizing, and stack forcing for low power designs simultaneously.

These previous works did not consider wire delay and wire load. Thus, the timing and the total power consumption estimated for a design might differ considerably from those of the design after layout. In this paper, wire delay and load are taken into consideration in both the timing analysis and power calculation processes.

In this paper, we focus on the voltage assignment for gates in a gate-level netlist for dynamic power optimization. We propose an algorithm that combines both greedy and iterative optimization approaches. In phase I, we greedily assign the supply voltage of all gates to VDDL, then reassign the gates with negative slack to a higher supply voltage to guarantee the timing correctness of the circuit. This allows the gates along the timing-loose paths to be scaled down simultaneously. It results in a smaller number of LCs on these paths. The VDDL gates will be fixed at the lowest voltage and the rest of the gates are referred to as “scalable gates.” Only scalable gates will have their supply voltages reassigned in phase II. The problem size will thus be reduced. In phase II, we utilize a partitioning algorithm to perform the voltage assignment to optimize the power consumption. After each scaling, the LCs are inserted into or removed from the netlist according to the local connectivity of the voltage scaled gates. The delays and power consumptions of LCs are counted in the timing analysis and power calculation process. During the scaling process, the wire load and delay are estimated by applying a wire load model. Therefore, the estimation of path delay and the total power consumption of our algorithm may be more accurate than the estimations used in algorithms that do not account for wire load and delay. These considerations are helpful in facilitating timing closure in the physical design flow. The details of the proposed algorithm are described in Section IV. We have carried out several experiments to study the impact of applying different voltage domains on power savings. We have also studied the impact of switching activity on total power consumption. These experiments are discussed in Section V.

This paper is organized as follows. In Sections II and III, we describe the calculation of the slack and power consumption of each gate. In Section IV, we present the multiple supply voltage scaling algorithm. In Section V, we show the experimental results. Finally, Section VI concludes the paper.

II. BACKGROUND-TIMING ANALYSIS

The objective of timing analysis is to calculate the slack of each gate in the design. We first calculate the arrival time of each gate. The breadth-first search (BFS) algorithm is applied in a levelized manner to find the maximum delay of all paths. Each BFS starts from a PI/flip-flop and ends at a PO/flip-flop. At the starting gate of each path, the arrival time of each PI/flip-flop is the delay of the PI/flip-flop. The maximum delay among output ports of a gate is the arrival time of the gate. The maximum arrival time among all input ports of a PO/flip-flop is the arrival time of the PO/flip-flop. The maximum arrival time among all POs/flip-flops is the cycle time of the circuit.

During the timing analysis, both gate and wire delays are counted. Because the wire length of each net is not available at this stage, we use the wire load model [14] that is given in the standard cell library to estimate the wire length of a net. This wire load model is a statistical experimental result obtained by analyzing many layouts of a specific technology. This model provides parameters for estimating the wire length of a net according to the number of fan-outs of the net and the total number of gates in the circuit. After estimating the wire length, we may calculate the capacitance (C) and resistance (R) of the wire. Then, the R*C delay model is used to estimate the wire delay. The summation of the wire capacitance and the input capacitance of fan-out gates is the total load capacitance on the driver gate. Then a gate delay is obtained from a lookup table in the standard cell library according to the total load capacitance of the gate. The input transition time is assumed to be the minimum transition time in the table. The gate delay that is extracted from the library is the delay of the gate at supply voltage VDDH. The delay of a VDDL driver gate may be estimated by using the alpha-power law model (1) [15]

$$D_{VDDL} = D_{VDDH} \cdot \frac{V_{DDL}}{V_{DDH}} \cdot \left( \frac{V_{DDH} - V_{th}}{V_{DDL} - V_{th}} \right)^{\alpha} \qquad (1)$$

where $D_{VDDH}$ and $D_{VDDL}$ are the delays of the gate at VDDH and VDDL, $V_{th}$ is the threshold voltage of a transistor, and $\alpha$ is a technology dependent parameter. We run SPICE on different types of gates in the library (TSMC 0.13-μm CMOS library) to get the dependency of delay on various supply voltages. From these data, we can calculate the value of $\alpha$. The value of $\alpha$ is 1.49.
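The alpha-power law models gate delay as proportional to $V_{DD} / (V_{DD} - V_{th})^{\alpha}$, so a library delay characterized at VDDH can be rescaled to a lower supply by taking the ratio of the two expressions. The sketch below illustrates that rescaling; the function name `scale_delay` and the threshold voltage of 0.3 V are assumptions made for illustration and are not taken from the paper, while the value 1.49 is the fitted parameter reported above.

```python
def scale_delay(delay_vddh, vddh, vddl, vth, alpha=1.49):
    """Estimate a gate delay at supply vddl from its library delay at vddh.

    Uses the alpha-power law: delay is proportional to V / (V - vth)**alpha.
    All names here are illustrative; vth must be supplied by the caller.
    """
    if vddl <= vth:
        raise ValueError("supply voltage must exceed the threshold voltage")
    ratio = (vddl / vddh) * ((vddh - vth) / (vddl - vth)) ** alpha
    return delay_vddh * ratio


# Hypothetical example: a 1.0 ns gate scaled from 1.2 V to 0.6 V,
# with an assumed threshold voltage of 0.3 V.
slow = scale_delay(1.0, vddh=1.2, vddl=0.6, vth=0.3)
print(f"delay at 0.6 V: {slow:.2f} ns")
```

With these assumed numbers the gate becomes a little more than 2.5 times slower, which is why only gates with enough slack can be moved to the lower supply.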

Then, we calculate the required time of each gate. The BFS algorithm is applied backwards from a PO/flip-flop to a PI/flip-flop. The cycle time of the circuit is set to be the required time at each PO/flip-flop. When a gate is visited, its required time is equal to the required time of the previously visited gate (the gate it drives) minus the sum of the delay of that gate and the connecting wire.

Finally, the slack of a gate is calculated by subtracting the arrival time from the required time of the gate. If the slack of any gate is negative, then the circuit has a timing violation.
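The arrival time, required time, and slack computation described above can be sketched on a small combinational graph. This is a simplified illustration rather than the authors' implementation: it uses a topological ordering in place of the levelized BFS, treats every gate without fan-out as a PO, and takes gate and wire delays as plain dictionaries. The function name `compute_slacks` and its data layout are assumptions.

```python
from collections import defaultdict, deque


def compute_slacks(gate_delay, edges, wire_delay=None):
    """gate_delay: {gate: delay}; edges: list of (driver, sink) pairs.

    Returns (arrival, required, slack, cycle_time), all measured at gate outputs.
    """
    wire_delay = wire_delay or {}
    fanin, fanout = defaultdict(list), defaultdict(list)
    for driver, sink in edges:
        fanout[driver].append(sink)
        fanin[sink].append(driver)

    # Topological order (Kahn's algorithm) stands in for the levelized traversal.
    indegree = {g: len(fanin[g]) for g in gate_delay}
    queue = deque(g for g in gate_delay if indegree[g] == 0)
    order = []
    while queue:
        g = queue.popleft()
        order.append(g)
        for s in fanout[g]:
            indegree[s] -= 1
            if indegree[s] == 0:
                queue.append(s)

    # Forward pass: latest input arrival plus the gate's own delay.
    arrival = {}
    for g in order:
        latest = max((arrival[u] + wire_delay.get((u, g), 0.0) for u in fanin[g]),
                     default=0.0)
        arrival[g] = latest + gate_delay[g]
    cycle_time = max(arrival.values())

    # Backward pass: required time of a driven gate minus that gate's delay
    # and the delay of the connecting wire.
    required = {}
    for g in reversed(order):
        required[g] = min((required[s] - gate_delay[s] - wire_delay.get((g, s), 0.0)
                           for s in fanout[g]), default=cycle_time)

    slack = {g: required[g] - arrival[g] for g in gate_delay}
    return arrival, required, slack, cycle_time


gate_delay = {"a": 1.0, "b": 2.0, "c": 1.0, "d": 3.0}
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
arrival, required, slack, cycle_time = compute_slacks(gate_delay, edges)
print(cycle_time, slack)
```

In this toy circuit the path a, b, d is critical (cycle time 6.0, zero slack), while gate c has a slack of 1.0 that a voltage scaling step could consume.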

We assume that the input circuit has no timing violation. Our algorithm will reduce the power consumption of the circuit by scaling down the supply voltages of gates while maintaining the timing correctness of the circuit. During the timing analysis process, the delay of an LC is also extracted from a lookup table in the cell library. The table is created according to a real LC design.

III. POWER CONSUMPTION CALCULATION

The dynamic power consumption of a gate $g$, denoted as $P_g$, may be calculated by (2)

$$P_g = \sum_{i=1}^{n} S_i P_i f + \frac{1}{2} \sum_{j=1}^{m} S_j C_j V_g^2 f \qquad (2)$$

where $n$ and $m$ are the numbers of input and output pins of gate $g$. $P_i$ is the power consumption of the $i$th input pin. The values of $P_i$ may be extracted from the lookup table in the library according to the total capacitance of the fan-out load. $f$ represents the frequency of the circuit. $S_i$ and $S_j$ represent the switching activity of the $i$th input pin and the $j$th output pin, respectively. $C_j$ is the loading capacitance on the $j$th output pin. The value of $C_j$ is the sum of the capacitances of the fan-out net and the driven pins of the net. The capacitance of each net is estimated by applying the wire load model. $V_g$ represents the supply voltage at gate $g$. For example, if $g$ is a VDDH/VDDL gate, then $V_g$ equals VDDH/VDDL. The power consumption of an LC is also calculated by (2). The supply voltage of each LC is assigned as VDDH.

The total power consumption of the circuit is calculated by

$$P_{total} = \sum_{g} P_g \qquad (3)$$

where the sum runs over all gates and LCs in the circuit.
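A per-gate power calculation of this kind can be sketched as follows. The sketch assumes a common first-order form: an input-pin term taken from a library lookup and weighted by switching activity and frequency, plus an output term of one half times activity, load capacitance, supply voltage squared, and frequency. The factor of one half, the parameter names, and the treatment of the input-pin lookup value as a per-transition quantity are all assumptions for illustration.

```python
def gate_dynamic_power(input_pin_power, input_activity,
                       load_caps, output_activity, vdd, freq):
    """Dynamic power of one gate under an assumed first-order model.

    input_pin_power: per-input-pin lookup values (treated as per-transition)
    load_caps:       load capacitance on each output pin, in farads
    """
    internal = sum(s * p for s, p in zip(input_activity, input_pin_power)) * freq
    switching = 0.5 * sum(s * c for s, c in zip(output_activity, load_caps)) \
        * vdd ** 2 * freq
    return internal + switching


def total_power(gates):
    """Sum the dynamic power over all gates (and LCs), one dict per gate."""
    return sum(gate_dynamic_power(**g) for g in gates)


# Hypothetical gate: one output pin with a 2 fF load, activity 1, at 1 GHz.
gate = dict(input_pin_power=[], input_activity=[],
            load_caps=[2e-15], output_activity=[1.0], vdd=1.2, freq=1e9)
low = dict(gate, vdd=0.6)
print(gate_dynamic_power(**gate), gate_dynamic_power(**low))
```

Because the output term is quadratic in the supply voltage, the same gate at 0.6 V dissipates one quarter of the switching power it dissipates at 1.2 V in this model.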

IV. PROPOSED VOLTAGE SCALING ALGORITHM FOR LOW POWER CIRCUITS

The inputs of the algorithm are a circuit that has no timing violation and a standard cell library. A set of voltage domains $V = \{V_1, V_2, \ldots, V_k\}$, for which $V_1 > V_2 > \cdots > V_k$, represents the voltages that may be applied to the gates of the circuit. Initially, the supply voltage of all gates is $V_1$, which is also denoted as VDDH. The lowest voltage $V_k$ is also denoted as VDDL. The objective is to scale down the supply voltage on the gates such that the total power consumption is reduced while maintaining the same cycle time of the design.

Fig. 4. Example of the insertion and removal of an LC. (a) The original netlist of a four terminal net is shown. When gate A is scaled down to a VDDL gate, an LC is inserted at the output of gate A, as shown in (b). When gate D is successively scaled down to a VDDL gate, the netlist becomes (c). Finally, gate A is scaled up to a VDDH gate, the LC is removed, and the netlist becomes (d).

In the voltage scaling process, we need to adjust the netlist by inserting or removing an LC according to the local connectivity of the voltage scaled gates. An example is shown in Fig. 4, in which two voltage domains VDDL and VDDH are used. Fig. 4(a) shows the original connection of a four terminal net. First, when gate A is scaled down to a VDDL gate, an LC is inserted at the output of gate A. The net is divided into two nets, as shown in Fig. 4(b). Then gate D is successively scaled down to a VDDL gate and the netlist becomes Fig. 4(c). Finally, gate A is scaled up to a VDDH gate, the LC is removed, and the netlist becomes Fig. 4(d). Because LCs are inserted into and removed from the netlist, the netlist changes dynamically. This change is considered in the timing analysis process. In the example shown in Fig. 4(b), the delay from an input pin of gate A to the input pin of a fan-out gate is the summation of the delays of gate A, the net from gate A to the LC, the LC, and the net from the LC to the fan-out gate.
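The local rule behind this netlist adjustment can be sketched for a single net: a sink needs to be reached through an LC only when its supply voltage is higher than the driver's. The helper below splits the sinks of one net accordingly and replays the four steps of the Fig. 4 example. The function name `split_net`, the sink names B and C, and the assumption that a scaled-down sink reconnects directly to the driver are illustrative readings, not details confirmed by the paper.

```python
def split_net(driver_voltage, sink_voltages):
    """Split the sinks of one net into (direct, via_lc).

    A sink goes through a level converter only if it runs at a higher
    supply voltage than the gate driving the net.
    """
    direct = sorted(s for s, v in sink_voltages.items() if v <= driver_voltage)
    via_lc = sorted(s for s, v in sink_voltages.items() if v > driver_voltage)
    return direct, via_lc


VDDH, VDDL = 1.2, 0.6
sinks = {"B": VDDH, "C": VDDH, "D": VDDH}

print(split_net(VDDH, sinks))            # (a) all VDDH: no LC needed
print(split_net(VDDL, sinks))            # (b) A scaled down: LC at A's output
sinks["D"] = VDDL
print(split_net(VDDL, sinks))            # (c) D scaled down: D no longer needs the LC
print(split_net(VDDH, sinks))            # (d) A scaled up: LC removed
```

An LC is present on the net exactly when the second list is non-empty, so inserting or removing it only requires looking at the driver and its immediate sinks.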

A. Algorithm Overview

At the beginning, we apply the timing analysis procedure to calculate the cycle time of the circuit and the slack of each gate. This cycle time is used as the timing constraint of the design in order to maintain the timing correctness of the circuit. Then the algorithm proceeds in two phases. In phase I, we apply a greedy approach that scales down the supply voltages of as many gates as possible. It allows all gates along the timing-loose paths to be scaled down simultaneously and results in a smaller number of LCs on these paths. The VDDL gates will be fixed at the lowest voltage and the rest of the gates are referred to as “scalable gates.” Only the supply voltages of scalable gates will be reassigned in phase II. In phase II, we utilize the technique of the multiway partitioning algorithm to perform the voltage assignments. Different voltage domains are treated as different partitions. We refer to a voltage assignment as a “move.” Each scalable gate is moved to the voltage domain of the maximum power gain. The iterative optimization process is executed until the total power of the circuit can no longer be reduced. During both phases, the correctness of timing is maintained. The details of these two phases are described in the following.

B. Phase I: Greedily Scaling Down the Voltages of Gates While Satisfying the Timing Constraint

At this stage, we try to scale down as many gates as possible to reduce the total power consumption while simultaneously maintaining the timing correctness of the circuit. The pseudo code of phase I is shown in Fig. 5.

First, we scale all gates in the netlist to the lowest voltage (VDDL). LCs are inserted before the primary output gates. Static timing analysis is performed to calculate the slack of all gates. The supply voltages of all gates with negative slacks are scaled up to the next higher supply voltage. The netlist is adjusted by inserting or removing LCs according to the local connectivity of the voltage scaled gates. Then, we perform timing analysis again to recalculate the slack of each gate. The delays of LCs and nets along the path are also included in the timing calculation. Then the slacks of these gates are updated. If any gate with negative slack is found, then the supply voltage of the gate is scaled up to the next higher supply voltage. This process is repeated until the slacks of all gates are positive or zero.

An example of phase I is shown in Fig. 6. In the example, dual supply voltages, VDDH and VDDL, are used and the cycle time is 20 ns. The delays of each gate and wire are shown above each gate and wire. Fig. 6(a) represents a portion of the original netlist in which all gates are VDDH gates and the slack of each gate is positive. We first scale all gates to VDDL as shown in Fig. 6(b). Then, we scale all gates with negative slack back to VDDH. According to the connectivity of the scaled gates, two LCs are inserted, one at a gate output and one at a gate input. Then, we recalculate the slack of each gate. The slack of each gate is now positive, so phase I is finished. The resultant netlist is shown in Fig. 6(c). As shown in Fig. 6(c), the supply voltages of the gates on the timing-loose path are scaled down to VDDL simultaneously.

After this phase, the delays of all paths are less than or equal to the cycle time. The effectiveness of this phase lies in the fact that it simultaneously scales down the supply voltage of all gates on the timing-loose paths. The delays on these paths are still less than the cycle time after the voltage scaling. Other paths are composed of gates with different voltage domains.
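The phase I loop (scale everything down, then repeatedly raise every gate with negative slack by one voltage step) can be sketched as below. This is a simplified illustration, not the pseudo code of Fig. 5: LC insertion, LC delay, and wire delay are left out for brevity, the gates are given in topological order, delays are rescaled with the alpha-power law, and the threshold voltage of 0.3 V is an assumed value. All function and parameter names are hypothetical.

```python
def delay_at(d_high, v, v_high, vth=0.3, alpha=1.49):
    """Rescale a delay characterized at v_high to supply v (alpha-power law)."""
    return d_high * (v / v_high) * ((v_high - vth) / (v - vth)) ** alpha


def slacks(delay, order, fanin, fanout, cycle_time):
    """Slack of every gate; `order` must be a topological order of the gates."""
    arrival, required = {}, {}
    for g in order:
        arrival[g] = max((arrival[u] for u in fanin[g]), default=0.0) + delay[g]
    for g in reversed(order):
        required[g] = min((required[s] - delay[s] for s in fanout[g]),
                          default=cycle_time)
    return {g: required[g] - arrival[g] for g in order}


def phase_one(d_high, order, fanin, fanout, voltages, cycle_time, eps=1e-9):
    """voltages is sorted from highest to lowest; returns {gate: voltage index}."""
    level = {g: len(voltages) - 1 for g in order}   # start every gate at VDDL
    while True:
        delay = {g: delay_at(d_high[g], voltages[level[g]], voltages[0])
                 for g in order}
        slack = slacks(delay, order, fanin, fanout, cycle_time)
        violators = [g for g in order if slack[g] < -eps and level[g] > 0]
        if not violators:
            return level
        for g in violators:                         # raise by one voltage step
            level[g] -= 1


# Toy circuit: a -> b -> d is critical at VDDH; e -> f is a timing-loose path.
order = ["a", "b", "d", "e", "f"]
fanin = {"a": [], "b": ["a"], "d": ["b"], "e": [], "f": ["e"]}
fanout = {"a": ["b"], "b": ["d"], "d": [], "e": ["f"], "f": []}
d_high = {"a": 2.0, "b": 2.0, "d": 2.0, "e": 1.0, "f": 1.0}
print(phase_one(d_high, order, fanin, fanout, [1.2, 0.6], cycle_time=6.0))
```

In this toy case the gates a, b, and d return to VDDH (index 0) while e and f stay at VDDL (index 1), so the whole timing-loose path is scaled down together and no LC would be needed inside it.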

C. Phase II: Partition Based Multiple Supply Voltage Scaling Algorithm

At this phase, we apply a partition-based approach to perform the voltage scaling. If the number of voltage domains is $k$, then this problem is treated as a $k$-way partition problem. For example, if we are performing the voltage scaling with dual supply voltages (VDDH and VDDL), the problem is formulated as a two-way partition problem. The gates are moved between the two partitions VDDH and VDDL to obtain a netlist with the lowest power consumption. Moving a gate to the VDDH/VDDL partition means that the supply voltage of the gate is assigned to VDDH/VDDL.

All VDDL gates are marked as unscalable while the others are scalable. We only deal with the scalable gates. Thus, the problem size of phase II is smaller than the original circuit. The scalable gates are moved among the voltage domains in order to reduce the total power consumption. The moves that result in timing violations are disallowed. During the moving process, a move may increase the power. Such moves are allowed because they make it possible to discover better solutions in later moves.

Fig. 5. Pseudo code of phase I.

1) Cost Function of Phase II: The cost function of this phase is the total power consumption of the circuit, which may be calculated by (3). The power consumptions of LCs are also included. Then, we define the power gain of each gate. For each scalable gate $g$, the power gain of $g$, denoted as $G(g, V_x)$, represents the improvement of power consumption when the supply voltage of gate $g$ is scaled to $V_x$. The power gain is defined as follows:

$$G(g, V_x) = \text{Total Power}_{before} - \text{Total Power}_{after} \qquad (4)$$

If the voltage movement requires an insertion or removal of a level converter, then the power associated with the level converter is added to or subtracted from $\text{Total Power}_{after}$. During the scaling process, only the power consumptions of the scaled gate, its fan-in gates, and the affected LCs change. Thus, the power gain of a gate $g$ may be simplified by the following equation:

$$G(g, V_x) = \sum_{y \in A(g)} \left( P_y - P'_y \right) \qquad (5)$$

where $A(g)$ is the set of affected cells (gate $g$, its fan-in gates, and any inserted or removed LCs), and $P_y$ and $P'_y$ represent the power values of gate $y$ before and after gate $g$ is scaled, respectively. The values of $P_y$ and $P'_y$ may be calculated by (2). By applying (5), the power gain of each gate may be incrementally updated. An example is shown in Fig. 7. If the supply voltages of gates A, B, and C are VDDH and the supply voltage of D is VDDL, then the power gain $G(B, \mathrm{VDDL})$ of gate B is the total power of the affected cells before B is scaled minus their total power afterward.

According to the definition of power gain, each gate has $k$ power gains, one per voltage domain, denoted as $G(g, V_x)$ for each $V_x \in V$, where $k$ is the number of voltage domains in $V$. Then, we define the maximum gain value of a gate $g$, denoted as $G_{max}(g)$, as the maximum value among $G(g, V_x)$ for each $V_x \in V$. It is defined as follows:

$$G_{max}(g) = \max_{V_x \in V} G(g, V_x) \qquad (6)$$

After one gate is scaled, the $G_{max}$ value of the affected gates will be updated.
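The power gain and maximum gain of a gate can be sketched directly from their definitions: the gain of moving a gate to a candidate voltage is the total power before the move minus the total power after it, and the maximum gain is the best of these over all voltage domains. The sketch below recomputes the total power for each trial move rather than updating it incrementally from the affected cells only, and it uses a toy power model; the function names and the `total_power` callback are assumptions for illustration.

```python
def power_gains(gate, assignment, voltages, total_power):
    """Gain of moving `gate` to each candidate voltage.

    assignment:  {gate: voltage}
    total_power: callable mapping an assignment to the circuit power
                 (it should include LC power in a real flow)
    """
    before = total_power(assignment)
    gains = {}
    for v in voltages:
        trial = dict(assignment)
        trial[gate] = v
        gains[v] = before - total_power(trial)
    return gains


def max_gain(gains):
    """Return (best voltage, maximum gain) for one gate."""
    best_v = max(gains, key=gains.get)
    return best_v, gains[best_v]


# Toy power model: each gate contributes load * V^2 (hypothetical loads).
loads = {"g1": 1.0, "g2": 2.0}
toy_power = lambda a: sum(loads[g] * v ** 2 for g, v in a.items())

assignment = {"g1": 1.2, "g2": 1.2}
gains = power_gains("g2", assignment, [1.2, 0.6], toy_power)
print(max_gain(gains))
```

Staying at the current voltage always has a gain of zero, so a gate whose maximum gain is negative can only be moved at a power cost; the partitioning pass may still take such a move if it opens up better moves later.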

2) Proposed Algorithm: First, we set up an unlock flag for each scalable gate and calculate the power gain of each gate. Then we select the unlocked gate with the largest maximum gain and move it to the best voltage domain. We insert or remove level converters according to local connectivity. Then we lock the moved gate and calculate the slacks of all affected gates. When a gate $g$ is scaled, the arrival time of the gates on the downstream paths of the fan-in gates of $g$ and the required time of the gates on the upstream paths of $g$ are changed. Thus, the slacks of these gates are updated. If any slack is negative, then the move will cause a timing violation and the move is rejected. If the move is accepted, we will update the power gain of all connected gates.

We use the same example as shown in Fig. 7. If gate B has the maximum gain among all unlocked gates and its best move is to VDDL, we select gate B and scale it to a VDDL gate. Then, a level converter is inserted between gates B and C. If the slacks of all affected gates are positive or zero, then the move is accepted. The power gains of gates A, C, and D are updated. In this process, the output load of gate D is changed. Before scaling, the load of gate D is equal to the sum of the capacitance of a two terminal net and the input capacitance of the LC. After scaling, the load of gate D becomes the sum of the capacitance of a two terminal net and the input capacitance of gate B. Therefore, both the delay and the power consumption of gate D should be updated. The difference in the power consumption of gate D is counted in G(B, VDDL).

Fig. 6. Example of phase I.

The gate-move process is repeated until all gates are locked or all power gains of the last 20 moves are negative. The accumulated sum of power gains of all moved gates is called the partial sum. We calculate the partial sum after each move. Because the power gain may be negative, the value of the partial sum fluctuates. During the gate-move process, we keep the largest partial sum of power gain and record the corresponding state. After an iteration is terminated, the state of the largest partial sum is found.
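The bookkeeping of the partial sum can be sketched as a running prefix sum over the gains of the accepted moves, remembering where it peaked. The function name `best_prefix` is hypothetical; in a full pass, the moves made after the returned position would be undone to restore the recorded state.

```python
def best_prefix(gains):
    """Return (number of moves to keep, largest partial sum).

    gains is the sequence of power gains of the moves in the order they
    were made. A result of (0, 0.0) means no prefix improves the power.
    """
    best_sum, best_len, running = 0.0, 0, 0.0
    for count, gain in enumerate(gains, start=1):
        running += gain
        if running > best_sum:
            best_sum, best_len = running, count
    return best_len, best_sum


# Partial sums are 3, 2, 4, 0, 1, so the first three moves are kept.
print(best_prefix([3, -1, 2, -4, 1]))
```

Accepting the negative second move is what makes the third move reachable here, which is the reason the pass tolerates moves that temporarily increase the power.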


Fig. 7. Example of the calculation of power gain when gate B is scaled. The power consumption of gate A stays the same when gate B is scaled. Due to the removal of an LC, the loading capacitance of gate D is changed, which leads to a change in the power consumption of gate D. Therefore, the power gain G(B, VDDL) of gate B is equal to the total power of the affected cells before scaling minus their total power after scaling.

Fig. 8. Pseudo code of phase II.

We assign the state of the largest partial sum of the previous iteration as the initial state and unlock all scalable gates. Then, we recalculate the slack and power gain of each gate. The iteration of the gate-move process is performed again to reduce the power consumption. This optimization process is repeated until the largest partial sum is equal to or less than zero. Then phase II is completed. The pseudo code of this phase is shown in Fig. 8.

V. EXPERIMENTAL RESULTS

The proposed algorithm is implemented in the C/C++ language. We also implemented the GECVS algorithm of [7] for comparison. The experimental platform is a SUN Blade 1000 running Solaris 9 with 8 GB of RAM. Two ISPD2001 circuit benchmarks, Mac1 and Mac2 [16], and the ISCAS89 benchmarks [17] are used as the test cases. The TSMC 0.13-μm CMOS library is used in our experiments. In these experiments, in order to reduce the complexity of the problem, we only use one type of LC that can convert an input signal that is within the range (0.6 V, 1.2 V) to a 1.2 V output signal. As in previous research [1], [7], we assume the switching activity of all nets is a constant. Because the value will not affect the results of the voltage scaling, we assign 1 to the switching activity to reduce the computational effort. However, we also study the impact of switching activity on the voltage scaling in the last experiment. Since we do not have the switching activity for each net, we use the probability of switching activity at the output of a gate as the switching activity of the output signal. Details are described later in this section.

TABLE I. COMPARISON OF THE RESULTS OF OUR ALGORITHM AFTER PHASE I AND PHASE II

First, we would like to study the effect of phase I in the proposed algorithm on dual voltage domains. We use 1.2 and 0.6 V as the voltage domains. In phase I, many gates are assigned to VDDL and the number of VDDH gates is reduced. These VDDH gates are the scalable gates for phase II. Thus, the problem size is reduced. The results are listed in Table I. The number of gates is listed in column 2. The original power of the circuits is listed in column 3. The results after phase I are listed in columns 4–8. Column 4 shows the ratio of the VDDH gates over the total number of gates. The ratio of the number of VDDL gates is shown in column 5. The number of LCs is shown in column 6. The power saving after phase I, compared to the total power of the original design, is listed in column 7. The CPU time after phase I is shown in column 8. The CPU time listed in column 8 includes the CPU time required to import the input data. The results of our algorithm after phase II are shown in columns 9–12. The total CPU time listed in column 12 is the total execution time of the program.

As shown in Table I, after phase I, on average, the problem size is reduced to 73.8% of the original size. The amount of reduction depends on the timing tightness of the paths in the circuit. More paths with looser timing will exhibit a greater reduction in the problem size after phase I. On average, 12.7% of the total power of the original circuit is saved. Table I also shows that in all test cases, phase I is completed within 7.3 s. After phase II, 61.9% of the gates are scaled to the low supply voltage and 37.2% of the power is saved.

We use s13207 as an example in Fig. 9 to illustrate the variation of slack distribution during the scaling process. The x-axis represents the slack of a gate and the y-axis shows the number of gates. Fig. 9(a) is the slack distribution of the original design. Fig. 9(b) and (c) are the distributions after phase I and phase II, respectively. Fig. 9 shows that this algorithm successively scales gates from high voltage to low voltage such that the slacks of gates are decreased. We find that, originally, slacks are heavily clustered between 7 and 11 ns. After phase I, the slacks are shifted to the interval [4 ns, 9 ns]. The slacks are generally reduced. After phase II, 25% of slacks are less than 1 ns. There are about 1100 gates with slacks close to zero. The cycle time of the design is 12.2 ns. It shows that this algorithm utilizes the slack of each gate to scale down the voltage of the gates. Yet, there are gates that still have large slacks after phase II. This is because these gates are on paths with delays much shorter than the cycle time. Even though the voltages of all gates along these paths have been scaled to the low supply voltage, the slacks of these gates are still much larger than the slacks of other gates. This implies that the supply voltages of these gates could be scaled further downward.

Then, we compared the results of the proposed algorithm with dual supply voltages and the results of GECVS [7]. The two voltage domains used in this experiment are 1.2 and 0.6 V. In this experiment, we implement the GECVS without the backoff on the timing of the circuit. Thus, the total power of GECVS may be further reduced if the backoff delay is allowed. The experimental results are listed in Table II. As shown in Table II, column 2 shows the number of gates. The original power of these circuits is listed in column 3. Columns 4–7 show the results of our algorithm. Column 4 shows the percentage of the number of LCs over the number of total gates (including LCs). The percentage of total power of LCs over the total power of the circuit is shown in column 5. Column 6 shows the saved power compared to the original power. The CPU time is shown in column 7. The results of GECVS are shown in columns 8–11, respectively.

On average, our algorithm reduces total power consumption by 37.2% with an 11.9% overhead in the total number of LCs. The GECVS algorithm reduces total power consumption by only 29.7% with a 16.9% overhead in the total number of LCs. An interesting effect is observed in the test case s1488. As shown in Table II, our algorithm uses more LCs but consumes less LC power than GECVS. This is because the output load on each LC affects its power consumption.


Fig. 9. Slack distribution of s13207 scaled by the proposed algorithm using dual supply voltages (1.2 and 0.6 V). (a) Before scaling. (b) After phase I. (c) After phase II.

The power consumption of any two LCs may not be the same. We calculated the total loading capacitance of all LCs in s1488. The values are 2.047 and 3.259 pF for our algorithm and GECVS, respectively. Therefore, it is possible that a netlist with more LCs consumes less LC power.
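The effect can be illustrated with the standard CMOS switching-power model, P = a·C·Vdd²·f. The sketch below is only an illustration under assumptions: it considers the load-dependent switching component alone and ignores the internal power of the LCs; the 100 MHz frequency and the choice of 1.2 V as the LC output voltage are hypothetical. The two capacitance values are those reported in the text, and the activity of 1 matches the setting used in these experiments.

```python
def switching_power(load_capacitance_f, vdd, frequency_hz, activity=1.0):
    """Standard CMOS switching-power model: P = activity * C * Vdd^2 * f."""
    return activity * load_capacitance_f * vdd ** 2 * frequency_hz


VDD_HIGH = 1.2     # V; assumes the LCs drive high-voltage gates
FREQUENCY = 100e6  # Hz; hypothetical clock frequency for illustration

# Total LC load capacitance in s1488 as reported in the text (pF -> F).
c_ours = 2.047e-12
c_gecvs = 3.259e-12

p_ours = switching_power(c_ours, VDD_HIGH, FREQUENCY)
p_gecvs = switching_power(c_gecvs, VDD_HIGH, FREQUENCY)

# The ratio depends only on the capacitances, not on Vdd or frequency.
ratio = p_ours / p_gecvs
```

Under this model the load-dependent LC power depends on the summed load capacitance rather than on the LC count, so the netlist with the smaller total load draws less LC power even if it contains more LCs.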

Table III shows a comparison of the improvements our algorithm provides compared to GECVS in the number of level converters, power consumption, and CPU time. The percentage of the number of scalable gates in phase II with respect to the number of total gates is shown in column 5. The table shows that our algorithm saves more power and uses fewer level converters. On average, our algorithm improves the number of level converters by 34.2% and the power consumption by 11.0% in 27.0% less CPU time. The same table shows that, on average, the number of scalable gates in phase II of our algorithm is only 73.8% of the number of scalable gates of GECVS. Thus, the problem size of our algorithm is much smaller than that of GECVS. Therefore, even though our algorithm is an iterative optimization process, it still uses less CPU time.

We have also explored the impact on power savings of using multiple voltage domains. We compared the results of using two and four voltage domains in the scaling process. The experimental results of the proposed algorithm using four voltage domains are listed in Table IV. Column 2 shows the number of LCs as a percentage of the total number of gates (including LCs). Column 3 shows the total power of the LCs as a percentage of the total power. Column 4 shows the power saved relative to the original power. The original power is shown in column 3 of Table II. The CPU time is shown in column 5.

Comparing the results in Table IV with the results in Table II, we find that the average power saving with four voltage domains is 5.3% more than the power saving achieved with two voltage domains. Fig. 10 shows a comparison of the power savings of three programs, namely, GECVS, our algorithm with dual voltage domains, and our algorithm with four voltage domains.

These experiments show that using more voltage domains can save more power. In the case Mac1, using four voltage domains in our algorithm saves 12.1% more power than using dual supply voltages. Therefore, we use Mac1 as the test case to further study the impact on power savings when different numbers of voltage domains are used. The experimental results are shown in Table V. Column 1 shows the number of voltage domains. Column 2 is the original power of the circuit. The power saved after phase I and phase II is shown in columns 3 and 4, respectively. The voltage domains used in these experiments are shown in column 5.

As shown in Table V, when dual supply voltages (1.2 and 0.6 V) are used, the power saving is 40.3%. When the supply voltage 0.8 V is added to the voltage domains, the power savings in rows 3 and 4 are the same. When the supply voltage 0.7 V is used, the power savings in rows 5, 6, and 7 are very close. This means that 0.6 V is not a good choice for the lower bound of the voltage domains. Therefore, we use the same test case to study the impact on power saving of applying different lower bounds in the voltage domains. The experimental results are shown in Table VI.

Table VI shows that after phase II, the largest power saving is obtained when 1.2 and 0.7 V are used. Tables V and VI also show comparable power savings whenever 0.7 V is included in the voltage domains. Therefore, the use of the most comportable supply voltage is the main factor in determining total power savings. The most comportable supply voltage of a circuit is determined by the timing tightness of the paths in the circuit.
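One way to see why a slightly higher low voltage can save more total power is that the per-gate saving, 1 − (V_L/V_H)², shrinks while the number of gates whose slack can absorb the delay increase grows. The toy sketch below illustrates this trade-off and does not reproduce Table VI. It assumes an alpha-power-law delay model in the spirit of [15], treats each gate independently, gives every gate equal power at 1.2 V, and ignores level converters. The threshold voltage, the exponent, the gate list, and all function names are hypothetical.

```python
VDD_HIGH = 1.2  # V, high supply used in the experiments
VTH = 0.3       # V, hypothetical threshold voltage
ALPHA = 1.3     # hypothetical velocity-saturation index

# Hypothetical gates: (delay at VDD_HIGH in ns, slack in ns).
GATES = [(1.0, 0.3), (1.0, 0.5), (1.0, 0.8), (1.0, 0.85),
         (1.0, 0.9), (1.0, 1.0), (1.0, 1.2), (1.0, 2.0)]


def alpha_power_delay(vdd):
    """Unnormalized alpha-power-law delay: V / (V - Vth)^alpha."""
    return vdd / (vdd - VTH) ** ALPHA


def delay_factor(vdd):
    """Delay at vdd relative to the delay at VDD_HIGH."""
    return alpha_power_delay(vdd) / alpha_power_delay(VDD_HIGH)


def saving_fraction(vdd_low, gates):
    """Fraction of switching power saved when every gate whose slack covers
    its own delay increase is moved to vdd_low."""
    extra = delay_factor(vdd_low) - 1.0
    scalable = sum(1 for delay, slack in gates if delay * extra <= slack)
    per_gate = 1.0 - (vdd_low / VDD_HIGH) ** 2
    return scalable * per_gate / len(gates)


savings = {v: saving_fraction(v, GATES) for v in (0.6, 0.7, 0.8)}
best = max(savings, key=savings.get)
```

With these made-up numbers, 0.6 V gives the largest saving per gate but only two gates can tolerate it, whereas 0.7 V can be applied to six gates and yields the largest total saving. The best lower bound therefore depends on how tight the paths are, as stated in the text.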

Although using more voltage domains saves more power, it requires more voltage islands, and the designer must make extra efforts to accommodate them. Therefore, for the proposed algorithm, if the most comportable supply voltage for each case is known, the use of dual supply voltages is a good choice that balances power saving and design effort.

TABLE II. EXPERIMENTAL RESULTS OF OUR ALGORITHM AND THE GECVS [7] ALGORITHM. THE TWO SUPPLY VOLTAGES USED ARE 1.2 AND 0.6 V

TABLE III. COMPARISON OF THE IMPROVEMENTS OUR ALGORITHM PROVIDES COMPARED TO GECVS [7] IN THE NUMBER OF LEVEL CONVERTERS, POWER CONSUMPTION, AND CPU TIME

TABLE IV. RESULTS OF OUR ALGORITHM USING FOUR VOLTAGE DOMAINS

Fig. 10. Power saving of the three programs: GECVS, our algorithm with dual supply voltages, and our algorithm with four supply voltages.

TABLE V. COMPARISON OF THE RESULTS OF USING DIFFERENT NUMBERS OF VOLTAGE DOMAINS ON MAC1

TABLE VI. RESULTS OF OUR ALGORITHM USING DIFFERENT LOWER BOUNDS OF VOLTAGE DOMAINS ON MAC1

TABLE VII. EXPERIMENTAL RESULTS OF THE PROPOSED ALGORITHM WITH ESTIMATED SWITCHING ACTIVITY

In the previous experiments, the switching activity of each net is 1. In this experiment, we assign different values of switching activity to different nets. Reference [18] describes a patternless method to estimate the switching activity of simple gates, including NAND, NOR, and XOR. The switching activity at the output of a two-input NAND/NOR gate is 3/8 when the two inputs are independent. The switching activity at the output of a $k$-input NAND/NOR gate is approximately $1/2^{k-1}$ for large $k$. The function of a NAND/NOR gate is the combination of an AND/OR gate and an inverter. Therefore, the switching activity at the output of an AND/OR gate is the same as that of a NAND/NOR gate. The switching activity at the output of any XOR gate is 1/2. In addition, we assume that the switching activity at the output of an INV/BUF/DFF is the same as that of its input signal. For other types of gates, the switching activity of the output is no larger than 0.5 according to [19]. Thus, the switching activity at the output of any other type of gate is a randomly generated value within the range of 0.01 to 0.5. The switching activity is then applied in the cost function of the proposed algorithm. Dual voltage domains (1.2 and 0.6 V) are used and the experimental results are shown in Table VII.
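The values quoted for NAND/NOR and XOR gates can be reproduced under a common simplifying assumption: every input is independent and equals 1 with probability 1/2, and the switching activity of a net whose probability of being 1 is p is 2p(1 − p). The sketch below is an illustration of this estimation idea, not a transcription of the method in [18]; the function names are hypothetical.

```python
from fractions import Fraction


def activity(p_one):
    """Switching activity of a net that is 1 with probability p_one,
    assuming temporally independent values: 2 * p * (1 - p)."""
    return 2 * p_one * (1 - p_one)


def nand_activity(k):
    """k-input NAND with independent inputs, each 1 with probability 1/2."""
    p_one = 1 - Fraction(1, 2 ** k)  # output is 0 only when all inputs are 1
    return activity(p_one)


def xor_activity():
    """XOR output is 1 with probability 1/2 for independent uniform inputs."""
    return activity(Fraction(1, 2))


two_input = nand_activity(2)    # matches the 3/8 quoted in the text
eight_input = nand_activity(8)  # close to 1 / 2**7 for a wide gate
```

Because 2p(1 − p) is unchanged when p is replaced by 1 − p, an inverter does not alter the activity, so the same values hold for NOR, AND, and OR gates under these assumptions.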

As shown in Table VII, on average, we can reduce total power consumption by 39.5% with a 9.5% overhead in the total number of LCs. Compared to Table II, our algorithm with estimated switching activity produces 2.3% more power saving and 2.4% fewer LCs than with constant switching activity. Therefore, considering the switching activity in the voltage scaling process may save more power and use fewer LCs.

VI. CONCLUSION

A two-phase voltage scaling algorithm for VLSI circuits is proposed. The proposed algorithm utilizes the slack of each gate to scale down the voltages of the gates. It combines a greedy approach and an iterative optimization method to scale the supply voltage of gates effectively. On average, it improves total power consumption by 42.5% over the original circuit with a 10.6% overhead in the total number of LCs. Phase I of the algorithm reduces the problem size for the optimization process in phase II. Therefore, even though our algorithm is an iterative optimization process, it reduces more power consumption in less CPU time than GECVS [7]. On average, our algorithm improves the number of level converters by 34.2% and the power consumption by 11.0% in 27.0% less CPU time. Our study also shows that when the most comportable supply voltage is included in the voltage domains, using more voltage domains improves power consumption by only a small amount. The key factor in achieving power saving is including the most comportable supply voltage in the scaling process. If more voltage domains are used, more voltage islands are needed and designers are burdened with the extra voltage islands in their designs. Thus, using dual voltage domains is a good choice both for saving power and for reducing design effort. We also studied the impact of considering switching activity on total power consumption. The results show that the algorithm reduces total power consumption by 39.5% compared to the original circuit. Low power design is always an important issue for VLSI designs. By applying lower supply voltages to non-timing-critical gates, we can greatly reduce the total power consumption.

REFERENCES

[1] C. Chen, A. Srivastava, and M. Sarrafzadeh, “On gate level power optimization using dual-supply voltages,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 9, no. 5, pp. 616–629, Oct. 2001.

[2] K. Usami and M. Horowitz, “Clustered voltage scaling technique for low-power design,” in Proc. Int. Symp. Low Power Design, 1995, pp. 3–8.

[3] K. Usami, T. Ishikawa, M. Kanazawa, and H. Kotani, “Low-power design technique for ASICs by partially reducing supply voltage,” in Proc. 9th Annu. IEEE Int. ASIC Conf., 1996, pp. 301–304.

[4] K. Usami, M. Igarashi, F. Minami, M. Ishikawa, M. Ichida, and K. Nogami, “Automated low-power technique exploiting multiple supply voltages applied to a media processor,” IEEE J. Solid-State Circuits, vol. 33, no. 3, pp. 463–472, 1998.

[5] M. Igarashi, “A low-power design method using multiple supply voltages,” in Proc. Int. Symp. Low Power Design, 1997, pp. 36–41.

[6] Y. J. Yeh and S. Y. Kuo, “An optimization-based low-power voltage scaling technique using multiple supply voltages,” in Proc. IEEE Int. Symp. Circuits Syst., 2001, pp. 535–538.

[7] S. N. Kulkarni, A. N. Srivastava, and D. Sylvester, “A new algorithm for improved VDD assignment in low power dual VDD systems,” in Proc. Int. Symp. Low Power Design, 2004, pp. 200–205.

[8] D. Kang, M. C. Johnson, and K. Roy, “Multiple-Vdd scheduling/allocation for partitioned floorplan,” in Proc. 21st Int. Conf. Comput. Design, 2003, pp. 412–418.

[9] D. Kang, M. C. Johnson, and K. Roy, “Simultaneous multiple-Vdd scheduling and allocation for partitioned floorplan,” in Proc. 5th Int. Symp. Quality Electron. Design, 2004, pp. 98–103.

[10] A. Manzak and C. Chakrabarti, “A low power scheduling scheme with resources operating at multiple voltages,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 10, no. 1, pp. 6–14, Feb. 2002.

[11] S. P. Mohanty and N. Ranganathan, “Simultaneous peak and average power minimization during datapath scheduling,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 6, pp. 1157–1165, Jun. 2005.

[12] A. Srivastava and D. Sylvester, “Minimizing total power by simultaneously Vdd/Vth assignment,” in Proc. Asia South Pacific Design Autom. Conf., 2003, pp. 400–403.


[13] W. Hung, Y. Xie, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, and Y. Tsai, “Total power optimization through simultaneously multiple-Vdd multiple-Vth assignment and device sizing with stack forcing,” in Proc. Int. Symp. Low Power Electron. Design, 2004, pp. 144–149.

[14] Synopsys, Sunnyvale, CA, “Library compiler user guide: Modeling, timing and power technology libraries,” 2003.

[15] T. Sakurai and A. R. Newton, “Alpha-power law MOSFET model and its applications to CMOS inverter and other formulas,” IEEE J. Solid-State Circuits, vol. 25, no. 2, pp. 584–594, Apr. 1990.

[16] Y. C. Chou and Y. L. Lim, “A performance-driven standard-cell placer based on a modified force-directed algorithm,” in Proc. ISPD, 2001, pp. 24–29 [Online]. Available: http://www.cs.nthu.edu.tw/~ylin/ISPD2001NTHUBenchmark/placement.htm

[17] F. Brglez, D. Bryan, and K. Kozminski, “Combinational profiles of sequential benchmark circuits-preliminary results,” in Int. Symp. Circuits Syst., 1989, pp. 1929–1934 [Online]. Available: http://www.fm.vslib.cz/~kes/asic/iscas/

[18] M. Pedram, “Power minimization in IC design: Principles and applications,” ACM Trans. Design Autom. Electron. Syst., vol. 1, no. 1, pp. 3–56, Jan. 1996.

[19] C. Svensson and D. Liu, “Low power circuit techniques,” in Low Power Design Methodologies. Norwell, MA: Kluwer, 1996, pp. 38–64.

Jun Cheng Chi (S’05) received the M.S. degree from the Chung Yuan Christian University, Taiwan, R.O.C., in 2001, and the Ph.D. degree from the Graduate Institute of Electronics Engineering, Chung Yuan Christian University, Taiwan, R.O.C., in 2006. Currently, he is a Senior Engineer with Springsoft Inc., Hsinchu, Taiwan, R.O.C. His research interests include the areas of design automation of timing-driven physical design, signal integrity, and low power design methodologies.

Hung Hsie Lee received the B.S. degree in information and computer engineering from Chung Yuan Christian University, Taiwan, R.O.C., in 2006. Currently, he is an Engineer in the SoC Technology Center, Industrial Technology Research Institute, Hsinchu, Taiwan, R.O.C. His research interests include the area of design automation for low power VLSI IC design.

Sung Han Tsai received the B.S. degree in information and computer engineering from Chung Yuan Christian University, Taiwan, R.O.C., in 2006. Currently, he is an Engineer with Springsoft Inc., Hsinchu, Taiwan, R.O.C. His research interests include the general area of VLSI CAD.

Mely Chen Chi (M’80) received the B.S. degree from National Taiwan Normal University, Taipei, Taiwan, R.O.C., in 1970, and the M.S. and Ph.D. degrees from the Wesleyan University, Middletown, CT, in 1974 and 1978, respectively, all in physics. Since 1999, she has been a Professor in the Department of Information and Computer Engineering, Chung Yuan Christian University, Taiwan, R.O.C., and also the Founding Director of the Electronics and Information Technology Center. She worked at the University of Pennsylvania, Philadelphia, from 1977 to 1978 and worked in the Computer-Aided Design and Test Laboratory, Bell Laboratories, Murray Hill, NJ, from 1978 to 1990. She served as a Senior Researcher and Manager at the Computer and Communication Laboratory, Industrial Technology and Research Institute, Hsinchu, Taiwan, R.O.C., from 1990 to 1999. She has worked in the area of electronics design automation since 1978.

Prof. Chi has served in the technical committees of the IEEE International SOC Conference and the IEEE International Symposium on Quality Electronic Design.
