On Signal-Gating Schemes for Low-Power Adders


Zhijun Huang and Miloš D. Ercegovac

Computer Science Department
University of California Los Angeles
Los Angeles, CA 90095
{zjhuang, milos}@cs.ucla.edu

Abstract

Signal gating schemes for low-power adder design are studied in this paper. Signal gating dynamically deactivates portions of an adder according to the actual precision of the two operands. Based on program analysis, signal gating is developed for two different adders: symmetric adders and asymmetric adders. The effect of signal gating is investigated by incorporating several gating schemes into a RISC pipeline. Experimental results indicate more power saving compared to previous work.

1 Introduction

Adders are fundamental arithmetic units in digital systems. As power consumption is becoming an important concern, low-power adders have been studied extensively on the circuit and logic levels. In [1]-[4], power estimation and comparison of various adder structures are presented. In [5]-[8], power optimization techniques are proposed to reduce switching activity. In these studies, adders have been treated as isolated units with little consideration of application data characteristics. Because power dissipation is directly related to data switching patterns, isolated adder optimization leads to limited power saving. Program analysis has revealed that there are a large number of short-precision additions in most applications [9]-[12]. To take advantage of short-precision data, signal gating can be applied to deactivate portions of an adder to match run-time data precision.

Signal gating is a popular power reduction technique that has been used widely at all levels of abstraction. On the architecture level, an idle functional unit and its input/output registers can be powered down by disabling their clock signals [13][14]. For a busy functional unit, some portion(s) can still be gated according to operand precisions, as first proposed in [10]. The notion of operand gating is further generalized to all stages of the microprocessor pipeline in [12]. In the adaptive power-aware system proposed in [15], an ensemble of functional units with different fixed widths is provided and only one of them is adaptively enabled according to input data precision. A major limitation of these architecture-level signal gating schemes is the lack of arithmetic details and reliable power estimation, which would affect the optimal points and area/power/delay tradeoffs. For example, the power distribution in adders with short-precision data is generally not uniform. [16] attempted to consider adder details with signal gating logic and concluded that little power saving could be achieved because of the overhead. This conclusion is true if the adder design is considered separately. However, the overhead can be reduced or amortized if the related units are also modified to accept short-precision data.

In this paper, we develop signal gating schemes with arithmetic details for different precision patterns of addition and apply the techniques to the DLX microprocessor pipeline [17]. The rest of this paper is organized as follows. Section 2 gives basic definitions. Section 3 presents program analysis results as the motivation of this work. Section 4 discusses signal gating schemes for adders with symmetric-precision operands. Section 5 studies signal gating schemes for special adders with asymmetric-precision operands. Section 6 discusses experimental results including the power optimization of the DLX pipeline. The last section concludes this work.

2 Definitions

An n-bit two's-complement number can be partitioned into two parts: sign extension bits (leading zeros or ones), and significand bits including the sign bit. The sign extension part is denoted as $E$ and the significand part as $D$. $|E|$ and $|D|$ denote the lengths of $E$ and $D$, respectively. Therefore, $n = |E| + |D|$ and $1 \le |D| \le n$. $|D|$ is the actual precision of the number. For two's-complement addition with operands $X$ and $Y$, the operation precision is defined as $|D_{op}| = \max(|D_X|, |D_Y|) + 1$. $P[|D_{op}| = i]$ is the probability of operations with precision $|D_{op}| = i$. We also define the following probability variables for each operand:

$P_i[D]$: the probability of bit $i$ being in part $D$;
$P_i[E]$: the probability of bit $i$ being in part $E$;
$SP_i$: the static probability of bit $i$ being logic '1';
$TR_i$: the rate of bit $i$ toggling per cycle.

$P_i[D]$ and $P_i[E]$ reflect data spatial correlation. If two neighbor bits have similar $P[E]$, they are highly correlated. $SP$ and $TR$ reflect the temporal correlation. The lower the values, the more temporally correlated the data are.
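The partition into sign extension and significand bits can be made concrete with a small sketch. The function names below (`precision`, `extension_length`, `operation_precision`) are hypothetical and not from the paper; the sketch assumes operands are given as signed Python integers that fit in n bits, and it takes the operation precision to be one bit more than the larger operand precision.

```python
def precision(value: int, n: int = 32) -> int:
    """Length |D| of the significand part (sign bit included) of an n-bit value."""
    if not -(1 << (n - 1)) <= value < (1 << (n - 1)):
        raise ValueError("value does not fit in n bits")
    # For negative values, ~value has the same number of significant bits
    # below the sign bit as the two's-complement pattern of value.
    magnitude = value if value >= 0 else ~value
    return magnitude.bit_length() + 1  # +1 for the sign bit


def extension_length(value: int, n: int = 32) -> int:
    """Length |E| of the sign extension part, so that n = |E| + |D|."""
    return n - precision(value, n)


def operation_precision(x: int, y: int, n: int = 32) -> int:
    """Assumed reading: one bit more than the larger operand precision."""
    return max(precision(x, n), precision(y, n)) + 1


if __name__ == "__main__":
    for v in (0, -1, 5, -4):
        print(v, precision(v), extension_length(v))
    print(operation_precision(5, -4))
```

For instance, 5 is `0101` in its shortest two's-complement form, so its precision is 4 bits and the remaining 28 bits of a 32-bit word are sign extension.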

0-7803-7147-X/01/$10.00 ©2001 IEEE

3 Program Analysis

With the execution tracing tool Shade [18], we have analyzed run-time features of operations and operands in 32-bit Mediabench programs [19]. We have the following observations. First, about 70% of the total executed instructions involve addition steps. This is because additions/subtractions, load/store memory address calculation and branches all require addition operations. Moreover, most instruction executions have a PC incrementing step. In program djpeg, for example, the distribution of instruction types is: additions/subtractions 36.08%, mult/div 3.26%, shift and logic 21.02%, load/store 30.49%, branches 7.44%. Second, most arithmetic operations have precisions much smaller than the datapath hardware width. In djpeg, 86% of additions/subtractions have a precision of 20 bits or less, and 57% have a precision of 13 bits or less, as shown in Fig. 1. Third, the precision difference between two operands is significant. In djpeg, the average precision difference between the two operands of an addition/subtraction is 7 bits, while the difference between the two operands in data memory address calculation is 13 bits. Fourth, the $SP$ of each bit is less than 0.5 in most cases, and the $TR$ is often not equal to $2 \times SP \times (1 - SP)$ because of the correlation. These observations have motivated this work.

[Plot: precision distribution of addition/subtraction in djpeg; horizontal axis: precision, 4 to 32 bits.]

Figure 1: Precision distribution of Add/Sub.
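The fourth observation can be illustrated with a short sketch that measures the static probability and toggle rate of each bit in an operand trace. For temporally independent bits the toggle rate would be $2 \times SP \times (1 - SP)$, since a toggle is either a 0-to-1 or a 1-to-0 transition; correlated data deviate from this. The function name `bit_statistics` and the synthetic counter trace are assumptions made for illustration and are not Mediabench data.

```python
def bit_statistics(trace, n=32):
    """Return (SP, TR): per-bit static probability and toggle rate of a trace."""
    mask = (1 << n) - 1
    words = [w & mask for w in trace]
    pairs = list(zip(words, words[1:]))  # consecutive cycles
    sp = [sum((w >> i) & 1 for w in words) / len(words) for i in range(n)]
    tr = [sum(((a ^ b) >> i) & 1 for a, b in pairs) / len(pairs) for i in range(n)]
    return sp, tr


if __name__ == "__main__":
    # Synthetic, strongly correlated trace: a counter from 0 to 1023.
    sp, tr = bit_statistics(range(1024))
    for i in (0, 9):
        independent = 2 * sp[i] * (1 - sp[i])
        print(f"bit {i}: SP={sp[i]:.3f} TR={tr[i]:.4f} 2*SP*(1-SP)={independent:.3f}")
```

In this trace bits 0 and 9 both have a static probability of 0.5, yet bit 0 toggles every cycle and bit 9 toggles once, so neither matches the independent estimate of 0.5.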

4 Signal Gating for Symmetric Adders

We define symmetric adders as n-bit adders in which both operands have a similar data range, which is the general case. In our design, the input data of an adder are stored in two registers. Upon each rising clock edge, new data are loaded into the registers and the addition works on the loaded data. Signal gating is applied to both the input registers and the combinational addition logic. In many cases, signal gating with one gating boundary cannot fully utilize dynamic data precisions. Adders with multiple gating boundaries may be designed for more energy saving. Here we only discuss signal gating with one boundary.

4.1 Signal Gating with One Boundary

Suppose the upper G-bit portion of the adder is identified as the candidate to be gated. The general structure of a signal-gated adder is illustrated in Fig. 2. IDL is identity detection logic and GCTL is gating control logic, as shown in Fig. 3. The behavior of signal gating is as follows. The first step is to detect $|E_X|$ and $|E_Y|$, the sign-extension widths of $X$ and $Y$. Two leading bits from the lower $(N - G)$ part are involved in testing. This is necessary because we need the result of the lower-part computation to be in a correct form without overflow. The OR signal of the upper $(G + 2)$ bits in each operand indicates whether the upper G bits can be viewed as extension bits of '0'. The NAND signal of the upper $(G + 2)$ bits indicates whether the upper G bits can be viewed as extension bits of '1'. To protect the clock signal from glitches and ensure correct timing, $g$ is latched to be $g'$ before controlling the clock and is registered to be $g''$ before controlling the combinational circuit. If $|E_X| > G$ and $|E_Y| > G$, the G-bit portion is gated by disabling the clock of the input registers and blocking the carry signal. The adder works as a short-precision adder and the result is then restored to the full width. Otherwise, the adder works as a normal full-width adder.
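The gating decision and the restore step can be sketched behaviorally as follows. This is a functional model only, not the authors' gate-level design: it ignores clocking, latching of the gating signal, and power. The names `gated_add` and `to_unsigned` are hypothetical, and the defaults N = 32 and G = 11 are chosen to match the G11 scheme of Section 6.

```python
def to_unsigned(value: int, n: int = 32) -> int:
    """n-bit two's-complement bit pattern of a signed integer."""
    return value & ((1 << n) - 1)


def gated_add(x: int, y: int, n: int = 32, g: int = 11):
    """Add two n-bit patterns; return (sum pattern, whether the upper g bits were gated)."""
    full_mask = (1 << n) - 1
    low_bits = n - g
    low_mask = (1 << low_bits) - 1
    x &= full_mask
    y &= full_mask

    def upper_is_sign_extension(word: int) -> bool:
        # Test the upper g + 2 bits: all zeros (OR = 0) or all ones (NAND = 0).
        top = word >> (low_bits - 2)
        return top == 0 or top == (1 << (g + 2)) - 1

    if not (upper_is_sign_extension(x) and upper_is_sign_extension(y)):
        return (x + y) & full_mask, False  # normal full-width addition

    # Short-precision addition on the lower n - g bits; the carry is blocked.
    low_sum = ((x & low_mask) + (y & low_mask)) & low_mask
    # Restore the full width by sign-extending the lower-part result.
    if low_sum >> (low_bits - 1):
        low_sum |= full_mask & ~low_mask
    return low_sum, True


if __name__ == "__main__":
    print(gated_add(5, 7))                   # small operands: gated
    print(gated_add(to_unsigned(-3), 2))     # negative operand: gated
    print(gated_add(1 << 25, 1))             # wide operand: not gated
```

Testing two extra bits below the boundary guarantees that each operand fits in N - G - 1 bits, so the lower-part sum fits in N - G bits and sign extension of that sum reproduces the full-width result.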

[Diagram: inputs X and Y are each split into an upper G-bit part and a lower (N-G)-bit part; the output SUM is split in the same way.]

Figure 2: Symmetric adder with signal gating.

Figure 3: Identity detection and gating control logic: (a) IDL, (b) GCTL.

4.2 Overhead Analysis

To justify the signal gating technique, the energy overhead should not exceed the power reduction.


[Diagram: operand X = s d ... d; operand Y = s ... s d ... d, where the upper portion of Y is its sign portion.]

Figure 6: Simplification in addition with immediates.

Because the logic of the upper portion is reduced, power consumption goes down accordingly even if there is no signal gating. If we want to gate the upper portion when $s \oplus c_d = 0$, output multiplexers are necessary. More importantly, there is a timing problem to be solved. If the upper portion with input registers is gated while waiting for the carry signal $c_d$, the new X input may be lost when the upper portion starts computing. Gating-controlled latches can be inserted to hold the new X at the register outputs. As the overhead is a latch and a multiplexer for each bit, the power saving is minimal, if any. An alternative approach is to put the lower portion and the upper portion into two pipeline stages if possible. The carry signal of the lower-portion stage is used to gate the upper-portion stage.

In implicit asymmetric addition, a partition similar to Fig. 6 can still be made. However, the upper portion of Y is not always the sign portion. It is possible that some or all bits in the upper portion are significant bits. We estimate that the signal gating design for implicit asymmetric addition would be complicated and the power saving would be small.
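The simplification for addition with an immediate can be sketched behaviorally. The sketch assumes that Y is a k-bit immediate whose upper portion is pure sign extension of its sign bit s, and that c_d is the carry out of the lower-portion addition. Under that assumption the upper portion of the sum is the upper portion of X unchanged when s equals c_d, incremented when s = 0 and c_d = 1, and decremented when s = 1 and c_d = 0. The function name `add_immediate` and the 16-bit split are hypothetical choices for illustration.

```python
def add_immediate(x: int, imm: int, n: int = 32, k: int = 16):
    """Add a k-bit two's-complement immediate to an n-bit pattern.

    Returns (sum pattern, whether the upper portion was bypassed).
    """
    g = n - k
    low_mask = (1 << k) - 1
    up_mask = (1 << g) - 1
    x_low, x_up = x & low_mask, (x >> k) & up_mask

    s = (imm >> (k - 1)) & 1            # sign bit of the immediate
    low_sum = x_low + (imm & low_mask)  # lower-portion addition
    c_d = low_sum >> k                  # carry out of the lower portion

    if s == c_d:                        # s XOR c_d = 0: upper portion unchanged
        upper, bypassed = x_up, True
    elif c_d:                           # s = 0, carry in: increment
        upper, bypassed = (x_up + 1) & up_mask, False
    else:                               # s = 1, no carry: decrement
        upper, bypassed = (x_up - 1) & up_mask, False
    return (upper << k) | (low_sum & low_mask), bypassed


if __name__ == "__main__":
    print(add_immediate(0x00010005, 0xFFFF))  # adds -1, upper portion bypassed
    print(add_immediate(0x0001FFFF, 0x0001))  # adds +1, upper portion incremented
```

The bypass case is the one in which the upper portion could be gated, which is why the text ties gating to the carry signal of the lower portion.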

6 Experiments

6.1 Adder with Signal Gating

We first studied the power saving effect of the general symmetric adders with signal gating. These schemes have been designed in gate-level VHDL. The Synopsys design environment is used as a common platform to compare different schemes. Test data are gathered by tracing the execution of addition/subtraction in djpeg. The breakdown of power consumption in different schemes is shown in Table 1. The baseline scheme is a 32-bit two's-complement carry-propagate adder/subtractor. G11 is the scheme with signal gating on the upper 11 bits. G18 is the scheme with gating on the upper 18 bits. 2G is the scheme with dual gating on the upper 11 and 18 bits. The power consumption is divided into two parts: the useful power consumption, $P_{use}$, and the overhead power, $P_{ovh}$. $P_{use}$ includes $P_{clk}$, $P_{din}$, $P_{reg}$, and $P_{add}$. $P_{clk}$ and $P_{din}$ are the power consumptions of the clock and data input ports. $P_{reg}$ and $P_{add}$ are the power consumptions of the input registers and the adder/subtractor core.

Table 1: Power comparison (nW/MHz)

| Schemes   | baseline | G11   | G18  | 2G    |
| $P_{clk}$ | 195.7    | 140.7 | 99.2 | 109.2 |
| $P_{din}$ | 30.6     | 46.2  | 58.9 | 57.6  |
| $P_{reg}$ | 120.8    | 90.5  | 81.1 | 76.0  |
| ...       |          |       |      |       |

As expected, the power is reduced in the following circuit blocks: the clock port, the input registers and the addition core. Considering only $P_{use}$, the power reduction is encouraging: G11 reduces it by 19%, G18 by 27% and 2G by 29%. For the overall power consumption, however, only G11 reduces the power, by 6.2%, while the other two schemes consume even more power than the baseline. To justify the use of signal gating in adders, the power overhead must be reduced or amortized. If we consider adders in a whole design, it is possible to reduce the overhead by amortizing the precision detection cost or by keeping the short-precision data for subsequent computations. For example, if signal gating is applied in a multiplier, there is no need to provide precision detection logic for the final CPA. In the following, we study how to keep short-precision data in the DLX processor pipeline.

6.2 DLX Pipeline with Signal Gating

The DLX pipeline considered here is an improved version of the basic pipeline with reduced stalls from branch hazards [17]. There are five stages: instruction fetch (IF), instruction decode (ID), execute (EX), data memory access (MEM) and write back (WB). IF, ID and EX have addition operations. The adder in stage IF is the PC incrementor. The adder in stage ID calculates the branch target address (BTA), which is the sum of the PC and an immediate. The ALU implements arithmetic and logic operations, as well as data memory address calculation.

Previous work in [14] studied clock gating on whole units and the related pipeline registers, which we call whole-unit gating. For example, the EX stage is idle for 9.17% of the total execution time and MEM is idle for 69.5% of the total time when running djpeg, which makes them good candidates for whole-unit gating. In addition to whole-unit gating, we extend the signal gating schemes described in this paper to the whole pipeline, which we call portion-unit gating. Instead of detecting precision at the two inputs of the ALU, the precision is detected at the outputs of the ALU and data memory. The precision information is kept in the pipeline using some tag bits. In our experiment, we use dual signal gating on the upper 11 and 18 bits of the EX, MEM and WB units as well as their input registers. PC+1 and BTA are asymmetric adders described in Section 5. There is no gating on the IF and ID units because they are always busy and instructions are not data to be computed. Instruction compression techniques can be used to reduce the number of bits to be processed, which is out of our scope here. Based on the simulation data in Table 1, each unit with dual signal gating is assumed to consume 33% less power on average when gated. Table 2 lists the power reduction percentage in pipeline blocks of the two gating schemes compared to the baseline DLX pipeline with no gating.

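Table 2: Power reduction (%) in pipeline blocks compared to the baseline DLX pipeline

| Schemes | PC+1  | BTA   | EX    | MEM   | WB    | REG   |
| W       | 0     | 0     | 9.17  | 69.5  | 9.17  | 26.33 |
| W+P     | 44.53 | 25.00 | 39.42 | 79.65 | 39.42 | 37.77 |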
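One way to read the EX, MEM and WB entries of Table 2 is sketched here; it is an interpretation, not a formula stated in the paper. The sketch assumes that whole-unit gating removes the power of the idle fraction of time, and that portion-unit gating then saves roughly one third of the power in the remaining busy fraction, in line with the assumed 33% saving when gated. The function name `combined_reduction` is hypothetical.

```python
def combined_reduction(idle_pct: float, gated_saving: float = 1 / 3) -> float:
    """Estimated power reduction (%) when whole-unit and portion-unit gating are combined."""
    busy_pct = 100.0 - idle_pct                 # share of time the unit is busy
    return idle_pct + busy_pct * gated_saving   # idle share plus saving in the busy share


if __name__ == "__main__":
    # Idle percentages for djpeg as quoted in the text (EX and WB 9.17%, MEM 69.5%).
    for unit, idle in (("EX", 9.17), ("MEM", 69.5), ("WB", 9.17)):
        print(f"{unit}: W = {idle:.2f}%, W+P estimate = {combined_reduction(idle):.2f}%")
```

Under these assumptions the estimates come out close to the tabulated 39.42% and 79.65%. The PC+1, BTA and REG columns do not follow this pattern, since PC+1 and BTA use the asymmetric adder schemes instead.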

Only those units whose power changes are listed in the table. W is the whole-unit gating scheme. W+P is the combination of whole-unit gating and portion-unit gating. It can be seen that W+P achieves another 10-45% reduction compared to W. With respect to the gating overhead, there is little cost in the W scheme because there is no precision detection and the gating control signals are also pipeline control signals. The W+P scheme has precision detection at both the ALU and MEM outputs. The total overhead may offset the benefit in the EX stage, judging from Table 1. Considering that the ALU and MEM are gated when idle, the overhead would be much less.

7 Conclusions

Signal gating schemes for low-power adder design have been studied in this paper. The program analysis indicates that there is a large number of short-precision additions and that the precision difference between two operands is large. Based on these precision features, signal gating is developed for two types of adders: symmetric adders and asymmetric adders. The effect of signal gating is studied by treating a signal-gated adder as a separate unit as well as by incorporating signal gating into a RISC pipeline. Experimental results indicate 10-45% power saving in the pipeline units compared to previous work.

References

[1] Callaway, T.K.; Swartzlander, E.E., Jr. Estimating the power consumption of CMOS adders, in Proc. IEEE 11th Symp. Computer Arithmetic, pp.210-216, 1993.
[2] Nagendra, C.; Irwin, M.J.; Owens, R.M. Power-delay characteristics of CMOS adders, IEEE Trans. VLSI Systems, vol.2, no.3, Sept. 1994.
[3] Nagendra, C.; Irwin, M.J.; Owens, R.M. Area-time-power tradeoffs in parallel adders, IEEE Trans. Circuits and Systems II: Analog and Digital Signal Processing, vol.43, (no.10), Oct. 1996.
[4] Freking, R.A.; Parhi, K.K. Theoretical estimation of power consumption in binary adders, in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS98), vol.2, pp.453-457, 1998.
[5] M.D. Ercegovac and T. Lang, Reducing transition counts in arithmetic circuits, in IEEE Symp. Low Power Electronics, pp.64-65, 1994.
[6] M.D. Ercegovac and T. Lang, Low-power accumulator (correlator), in IEEE Symp. Low Power Electronics, pp.30-31, Oct. 1995.
[7] C.A. Fabian and M.D. Ercegovac, Input synchronization in low power CMOS arithmetic circuit design, in Proc. 30th Asilomar Conf. Signals, Systems and Computers, pp.172-176, Nov. 1996.
[8] Y. Wang and K.K. Parhi, New low power adders based on new representations of carry signals, in Proc. 35th Asilomar Conf. Signals, Systems and Computers, Nov. 2000.
[9] Bishop, B.; Kelliher, T.P.; Irwin, M.J. A detailed analysis of MediaBench, in IEEE Workshop on Signal Processing Systems (SiPS99), pp.448-455, 1999.
[10] D. Brooks and M. Martonosi, Value-based clock gating and operation packing: dynamic strategies for improving processor power and performance, ACM Trans. Computer Systems, vol.18, no.2, pp.89-126, May 2000.
[11] Stephenson, M.; Babb, J.; Amarasinghe, S. Bitwidth analysis with application to silicon compilation, ACM SIGPLAN Notices, vol.35, (no.5), pp.108-20, May 2000.
[12] Canal, R.; Gonzalez, A.; Smith, J.E. Very low power pipelines using significance compression, in Proc. 33rd Annual IEEE/ACM Int. Symp. on Microarchitecture, pp.181-190, 2000.
[13] Gowan, M.K.; Biro, L.L.; Jackson, D.B. Power considerations in the design of the Alpha 21264 microprocessor, in Proc. 35th Design and Automation Conf., pp.726-731, 1998.
[14] Wu Ye; Irwin, M.J. Power analysis of gated pipeline registers, in 12th Annual IEEE Int. ASIC/SOC Conf., pp.281-285, 1999.
[15] M. Bhardwaj, R. Min, and A. Chandrakasan, Power-aware systems, in Proc. 35th Asilomar Conf. Signals, Systems and Computers, vol.2, pp.1695-1701, Nov. 2000.
[16] J. Choi, J. Jeon, and K. Choi, Power minimization of functional units by partially guarded computation, in Proc. Int. Symp. Low Power Electronics and Design, pp.131-136, Jul. 2000.
[17] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, 2nd Edition, Morgan Kaufmann Publishers, Inc., 1996.
[18] Sun Microsystems, Shade Users Manual, 1993.
[19] Chunho Lee; Potkonjak, M.; Mangione-Smith, W.H. MediaBench: a tool for evaluating and synthesizing multimedia and communications systems, in Proc. 30th Annual IEEE/ACM Int. Symp. Microarchitecture, pp.330-335, Dec. 1997.
