Journal of VLSI Signal Processing 33, 83–103, 2003. © 2003 Kluwer Academic Publishers. Manufactured in The Netherlands.

Energy Efficient Adiabatic Multiplier-Accumulator Design

DUSAN SUVAKOVIC AND C. ANDRE T. SALAMA

Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, 10 King’s College Road, Toronto, Ontario M5S 3G4, Canada

Received October 20, 2000; Revised August 29, 2001

Abstract. This paper presents a strategy for minimizing non-adiabatic dissipation in adiabatic arithmetic units. The non-adiabatic dissipation is minimized by architectural design involving a small number of complex logic gates. Circuit design of complex adiabatic gates, based on ordered binary decision diagrams (OBDD), is introduced. An optimized architecture for adiabatic parallel multipliers is proposed and savings in energy dissipation over competing architectures are estimated. Experimental results obtained from implementation of an adiabatic multiply-accumulate (MAC) unit suggest that the proposed strategy provides substantial improvement in energy efficiency over equivalent non-adiabatic and alternative adiabatic implementations, while achieving a competitive operating speed.

Keywords: adiabatic, arithmetic, circuits, multiplier, low-power

1. Introduction

Unlike other low power design techniques that attempt to minimize the energy used in computation [1], the energy recovery or adiabatic technique involves recycling of that energy. The delivery and recovery of energy is performed virtually without dissipation [2], resulting in potentially better energy efficiency than in conventional digital systems. Since energy consumption is not necessary in order to perform computation [3], energy recovery using CMOS circuits is possible.

Dissipation in adiabatic logic consists of two components: adiabatic and non-adiabatic dissipation. The former component can be reduced asymptotically to zero by slowing down the transfer of charge between digital circuitry and the power supply [2]. The latter component cannot be eliminated [4] and results from erasure of information that occurs in all conventional arithmetic architectures [3, 5]. Since non-adiabatic dissipation introduces a lower bound on the overall dissipation, adiabatic implementation of a digital system is justified only if this lower bound is significantly smaller than the dissipation achievable by conventional low power design techniques.

Three different approaches to the problem of non-adiabatic dissipation have been proposed in previous work. The first approach minimizes the non-adiabatic dissipation at the circuit level [6–12], by replacing conventional logic gates in a digital system by their adiabatic counterparts. An improvement in energy efficiency of 3–4 times over equivalent CMOS circuit implementations has been reported for small adiabatic multipliers and adders designed using this approach [13]. The second approach involves logically reversible system design [14], which eliminates non-adiabatic dissipation, while introducing significant overhead in circuitry and adiabatic dissipation that is particularly pronounced in DSP building blocks [2, 15]. The third approach is based on applying the energy recovery technique only to selected, high capacitance nodes, for which the non-adiabatic dissipation is negligible compared to the savings achieved by energy recovery. Such design is beneficial in system architectures in which switching at a small number of heavily loaded circuit nodes dominates the overall dissipation [16, 17]. None of the approaches described above is convenient for adiabatic implementation of arithmetic units, which are the main source of dissipation in conventional DSP systems.

The work described in this paper builds on the third approach by introducing special architectures for adiabatic implementation of parallel multipliers and multiplier-accumulators (MACs) with a small number of internal nodes, which facilitates energy recovery. In the proposed architectures, non-adiabatic dissipation is minimized by using high fan-in gates, involving transistor networks with a topology of ordered binary decision diagrams (OBDD) [18]. OBDD-style counter circuits, with as many as 15 inputs, are used as major building blocks in the implemented parallel multiplier/MAC. Their feasibility and energy efficiency are experimentally verified.

The paper is organized as follows. The sources of dissipation in adiabatic systems are summarized in Section 2. Section 3 describes the circuit design of complex OBDD logic gates and adiabatic sense amplifiers. The architecture design for adiabatic parallel multipliers that minimizes the number of latches, and the required complex logic networks, are described in Section 4. Section 5 presents the design of a multiply-accumulate arithmetic unit built using high fan-in, OBDD-style counter gates. Conclusions are given in Section 6.

2. Energy Dissipation in Adiabatic Systems

Energy dissipation in adiabatic systems is made up of adiabatic dissipation ($E_a$), power supply losses ($E_{ps}$), non-adiabatic dissipation ($E_{na}$) and CMOS dissipation ($E_c$). The adiabatic dissipation is specified by

$$E_a = \frac{R \cdot C_a}{T} \cdot C_a \cdot V_{max}^2 \qquad (1)$$

where $R$ is the on (triode region) resistance of the transistors responsible for energy recovery, $C_a$ is the adiabatically charged capacitance, and $T$ and $V_{max}$ are the slope (i.e. the rise or fall time) and the amplitude, respectively, of the power clock, which serves as both the clock signal and the supply voltage [2]. The adiabatic dissipation can be reduced by reducing the supply voltage $V_{max}$ or the adiabatic load capacitance $C_a$. Moreover, it can be made arbitrarily low by increasing $T$.

The power supply losses can be expressed as

$$E_{ps} = (1 - \eta) \cdot \frac{1}{2} \cdot C_a \cdot V_{max}^2 \qquad (2)$$

where $\eta$ is the efficiency factor of the power supply, which increases with the power clock period $T$ [19, 20]. Similarly to adiabatic dissipation, $E_{ps}$ depends on the supply voltage $V_{max}$ and the adiabatic load capacitance $C_a$ and can be reduced arbitrarily by increasing the power clock period.

The non-adiabatic dissipation for systems using latches consisting of two cross-coupled CMOS inverters is given by

$$E_{na} = N_{la} \cdot C_{la} \cdot V_t^2 \qquad (3)$$

where $N_{la}$ is the average latch switching rate, $C_{la}$ is the total latch node capacitance and $V_t$ is the threshold voltage of the PMOS transistor [4, 8]. Since $E_a$ and $E_{ps}$ can be made arbitrarily low, the non-adiabatic dissipation, caused by erasure of information in pipelined systems, is exposed as the dominant part of the overall dissipation. $E_{na}$ depends on the latch implementation but, to a greater extent, its reduction is achievable at the architectural design level, as explained in Section 4.

Finally, the CMOS dissipation is given by

$$E_c = N_c \cdot C_c \cdot V_{dd}^2 \qquad (4)$$

where $C_c$ and $N_c$ are the total physical capacitance and its associated switching rate for the part of the system consisting of conventional CMOS gates. Since this paper focuses on adiabatic implementation of arithmetic units, $E_c$ is ignored in further discussion.
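As an illustration of how Eqs. (1) to (3) behave when the power clock is slowed down, the following minimal Python sketch evaluates the three adiabatic-side components. All numerical values are illustrative assumptions and are not taken from the paper; the efficiency factor η is held fixed for simplicity, although the text notes that it increases with the power clock period.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AdiabaticParams:
    """Symbols follow Eqs. (1)-(3); the example values below are assumed."""
    R: float      # on-resistance of the energy-recovery transistors (ohm)
    C_a: float    # adiabatically charged capacitance (F)
    T: float      # power clock rise/fall time (s)
    V_max: float  # power clock amplitude (V)
    eta: float    # power supply efficiency factor, 0..1 (held constant here)
    N_la: float   # average latch switching rate
    C_la: float   # total latch node capacitance (F)
    V_t: float    # PMOS threshold voltage (V)


def adiabatic_dissipation(p: AdiabaticParams) -> float:
    """Eq. (1): E_a = (R*C_a/T) * C_a * V_max^2."""
    return (p.R * p.C_a / p.T) * p.C_a * p.V_max ** 2


def supply_losses(p: AdiabaticParams) -> float:
    """Eq. (2): E_ps = (1 - eta) * 1/2 * C_a * V_max^2."""
    return (1.0 - p.eta) * 0.5 * p.C_a * p.V_max ** 2


def non_adiabatic_dissipation(p: AdiabaticParams) -> float:
    """Eq. (3): E_na = N_la * C_la * V_t^2 (independent of T)."""
    return p.N_la * p.C_la * p.V_t ** 2


# Hypothetical operating point, chosen only to exercise the formulas.
base = AdiabaticParams(R=1e3, C_a=1e-12, T=10e-9, V_max=1.6,
                       eta=0.9, N_la=0.5, C_la=50e-15, V_t=0.5)
slow = replace(base, T=2 * base.T)  # double the rise/fall time

for name, p in (("base", base), ("slow", slow)):
    print(name, adiabatic_dissipation(p), supply_losses(p),
          non_adiabatic_dissipation(p))
```

Doubling T halves the adiabatic term while the non-adiabatic term is unchanged, which is the lower-bound behaviour described above.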

3. Design of Complex Adiabatic Logic Gates

The proposed structure of an adiabatic logic gate is illustrated in Fig. 1. It consists of a complex NMOS logic network, two precharge transistors, a sense amplifier, a latch and two output adiabatic buffers/drivers. The design of the complex NMOS logic networks, and their output detection, are the key issues for the proposed circuit technique.

Figure 1. Adiabatic gate structure. (Labels in the figure: input signals from the preceding pipeline stage, NMOS logic network, sense amplifier, CMOS latch with Vdd, adiabatic drivers, power clocks PWR1 and PWR2, next pipeline stage.)

The topology of the complex NMOS logic network used as part of the adiabatic gate shown in Fig. 1 is that of an ordered binary decision diagram (OBDD), which is known to be a more compact representation of a logic function than conventional representations based on product terms [18]. Figure 2(a) shows an example of an OBDD with four binary inputs, whereas Fig. 2(b) shows the corresponding NMOS transistor network, in which each OBDD edge is replaced with an NMOS pass transistor whose gate is controlled by an input signal. All transistors in the same row are controlled by either the non-inverted or inverted version of the same input signal, depending on the label on the corresponding OBDD edge. For each combination of the input signals, the circuit shown in Fig. 2(b) performs computation by creating a low impedance path between the root node and one of the output nodes, whereas the energy used to control the NMOS switches is completely recoverable from the gate capacitances. Although the computation itself does not produce dissipation, the output detection requires dissipation of a finite amount of energy.
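The path-forming behaviour described above can be mimicked in software. The sketch below is a hedged illustration only: the four-input function f = (x1 AND x2) OR (x3 AND x4) and its node names (`n2`, `n3`, `n4`) are hypothetical and do not reproduce the OBDD of Fig. 2; only the root name NR and the output names NE0 and NE1 are borrowed from the figure labels. Each node tuple corresponds to two pass transistors, one driven by the input and one by its complement.

```python
from itertools import product

# Hypothetical OBDD: node -> (input index, successor if input = 0, successor if input = 1).
OBDD = {
    "NR": (0, "n3", "n2"),    # row x1 (root node)
    "n2": (1, "n3", "NE1"),   # row x2
    "n3": (2, "NE0", "n4"),   # row x3
    "n4": (3, "NE0", "NE1"),  # row x4
}
TERMINALS = {"NE0": 0, "NE1": 1}


def conducting_path(obdd, inputs, root="NR"):
    """Follow the single low-impedance path from the root to an output node."""
    node, path = root, [root]
    while node not in TERMINALS:
        var, low, high = obdd[node]
        node = high if inputs[var] else low  # the 'on' transistor of this node
        path.append(node)
    return path


def reference(x):
    """The Boolean function that the assumed OBDD is meant to encode."""
    return (x[0] & x[1]) | (x[2] & x[3])


paths = {x: conducting_path(OBDD, x) for x in product((0, 1), repeat=4)}
transistor_count = 2 * len(OBDD)  # one NMOS pass transistor per OBDD edge

for x, path in paths.items():
    print(x, " -> ".join(path))
```

Every input combination reaches exactly one of the two output nodes, which is what allows the sense amplifier to detect the result differentially.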

The feasibility of high fan-in logic gates in practical implementations is limited by their complexity, which impacts their speed, physical footprint, input capacitance and energy required for detection.

Figure 2. Logic function representation using OBDD: (a) OBDD and (b) equivalent NMOS logic network. (Labels in the figure: inputs x1 to x4, root node NR, output nodes NE0 and NE1.)

Techniques for reliable output detection for complex, high fan-in NMOS networks, including OBDD-based networks, were reported in previous work [21, 22]. The output detection technique [23] used here minimizes the energy required for detection by using voltage sensing.

The sense amplifier circuit is shown in Fig. 3(a). The key waveforms for the gate operation, including the power clock signals PWR1 and PWR2 for a two-phase non-overlapping adiabatic clocking scheme, are shown in Fig. 3(b).

The logic gate operates in two phases. In the second phase, nodes F and FB are precharged, while all transistors in the OBDD NMOS network are off. In the first phase of the next clock cycle, inputs of the NMOS tree are energized, discharging either node F or FB to ground and creating a small differential voltage between nodes S and SB. Subsequently, PWR2 energizes the sense amplifier and creates full swing differential signals at nodes S and SB reflecting the detected NMOS tree output.

Figure 3. Circuit design for adiabatic OBDD-style gates: (a) sense amplifier, (b) typical waveforms, (c) adiabatic drivers controlled by sense amplifier and (d) adiabatic drivers controlled by latch. (Waveform panels plot the voltages of PWR1 and PWR2, F and FB, and S and SB against time on linear axes.)

The gate following the sense amplifier has small input capacitance, decoupling the sense amplifier nodes from the load capacitance of the following stage. The choice of this gate generally depends on the type of logic in the following stage.

For relatively small capacitive loads, adiabatic drivers powered by the same power clock phase as the adiabatic driver can be controlled directly by the sense amplifier, as shown in Fig. 3(c). However, for larger load capacitances, CMOS signal levels are required to control the adiabatic driver instead of the pulse-shaped voltage at the sense amplifier outputs. The CMOS level signals are obtained by the latch shown in Fig. 3(d), which is identical to the pulse-to-level converter given in [17] with the addition of transistors M3 and M4 controlled by power clock PWR1, added to eliminate the effect of the non-zero voltage at the sense amplifier nodes S and SB during the opposite power clock phase, when PWR2 is high. The outputs of the latch are stable for the duration of the PWR2 pulse, thus allowing complete adiabatic charging and discharging of large capacitive loads.

3.1. Energy Efficiency of Proposed Adiabatic Gates

The addition of CMOS-type dissipation introduced by the pulse-to-level converter is justified if it is much smaller than non-adiabatic dissipation due to incomplete discharging of the load capacitance, eliminated this way. Based on Eqs. (3) and (4), this condition is satisfied when

$$C_{ld} \gg \left( \frac{V_{dd}}{V_t} \right)^2 \cdot C_{la} \qquad (5)$$

where $C_{ld}$ and $C_{la}$ are the load and latch capacitances, respectively. Condition (5) is easily met for low voltage operation, which is preferred in order to achieve high energy efficiency. Moreover, since the fraction of the total power supply losses associated with the energy recovered from the observed load capacitance $C_{ld}$ can be expressed as

$$E_{ps,ld} = (1 - \eta) \cdot \frac{1}{2} \cdot C_{ld} \cdot V_{max}^2 \qquad (6)$$

the energy consumed by the CMOS latch is negligible compared to $E_{ps,ld}$ if

$$C_{ld} \gg \frac{2 \cdot N}{(1 - \eta)} \cdot \left( \frac{V_{dd}}{V_{max}} \right)^2 \cdot C_{la} \qquad (7)$$

Assuming that the latch switching rate $N$ is equal to 0.5, that $\eta = 0.9$ and that the amplitude of the power clock $V_{max}$ is equal to the DC supply voltage $V_{dd}$, condition (7) reduces to

$$C_{ld} \gg 10 \cdot C_{la} \qquad (7a)$$
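The capacitance ratios appearing in conditions (5), (7) and (7a) can be checked numerically. The following sketch is only a convenience calculation: the function names are hypothetical, and the supply and threshold voltages used for condition (5) are assumed values, not figures from the paper. The inputs used for condition (7) are the ones stated in the text (N = 0.5, η = 0.9, Vmax = Vdd).

```python
def ratio_condition_5(v_dd: float, v_t: float) -> float:
    """Factor (V_dd / V_t)^2 that C_ld / C_la must greatly exceed, condition (5)."""
    return (v_dd / v_t) ** 2


def ratio_condition_7(n: float, eta: float, v_dd: float, v_max: float) -> float:
    """Factor 2N / (1 - eta) * (V_dd / V_max)^2 from condition (7)."""
    return 2.0 * n / (1.0 - eta) * (v_dd / v_max) ** 2


# Values stated in the text for (7a); V_dd itself cancels when V_max = V_dd.
factor_7a = ratio_condition_7(n=0.5, eta=0.9, v_dd=1.0, v_max=1.0)
print(f"condition (7) factor: {factor_7a:.3f}")

# Assumed voltages, only to show how quickly the factor in (5) grows with V_dd.
for v_dd in (1.0, 1.6, 3.5):
    print(v_dd, ratio_condition_5(v_dd, v_t=0.5))
```

With the stated assumptions the factor in (7) evaluates to approximately 10, which is the constant appearing in (7a).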

Condition (7a) is more relaxed than (5), especially for higher $V_{dd}$; however, it is not likely to be satisfied for adiabatic implementation of typical arithmetic architectures, in which the fan-out and load capacitance are small for the majority of the logic gates. Consequently, the non-adiabatic dissipation in latches would dominate in such an implementation.

It should be noted that non-adiabatic dissipation is not eliminated by circuit techniques that allow gate pipelining without the use of latches, such as PAL [9] and SCAL [24]. The outputs of such gates are latched dynamically, which also results in dissipation due to erasure of information. In addition, only small logic gates with 2 to 3 inputs are feasible in these circuit techniques due to their inherently poor handling of high fan-in and fan-out. Consequently, gate-pipelined arithmetic architectures involving such gates must consist of a very large number of gates with a substantial overhead in the number of delay-matching gates, causing significant total non-adiabatic dissipation.

For the purpose of comparison between PAL, SCAL and the proposed circuit technique, an adiabatic 15-input counter and a 4-bit adder were implemented in a standard 0.25 µm CMOS process using all three circuit techniques and simulated in HSPICE. The OBDD-style counter operates in a single stage and includes 4 latched logic gates, whereas its PAL and SCAL counterparts operate in 10 stages and include 77 gates each. For the same supply voltage of 1.6 V and the range of operating frequencies between 1 and 50 MHz, the average dissipation per computation for the OBDD-style counter is 2.6–4.2 times less than that of PAL and 2.5–4.4 times less than that of SCAL. The OBDD-style adder operates in a single stage and includes 5 latched logic gates, whereas the PAL and SCAL adders operate in 8 pipelined stages and consist of 56 gates each. Using the same supply voltage and operating frequencies as for the counter, dissipation of the OBDD-style adder is found to be 1.1–1.5 and 1.6–3.7 times less than that of its PAL and SCAL counterparts, respectively.

The obtained results suggest that adiabatic design based on latched OBDD-style gates can achieve better energy efficiency than other adiabatic techniques if the utilization of complex logic gates is high, which means that the number of such gates and the associated latches is small. For this reason, architectural optimizations described in Section 4 are aimed at minimizing the number of latches.

Finally, it should be pointed out that the effect of complexity on the speed of the proposed adiabatic logic gates is less pronounced than in conventional CMOS gates [25]. This is a result of the low voltage swing at internal nodes of OBDD-style transistor networks and the use of sense amplifiers at their outputs. HSPICE simulations indicate that the maximum power clock frequencies achievable for the implemented counter based on 15-input gates and the implemented adder based on 9-input gates are 250 MHz and 600 MHz, respectively, for the power clock voltage of 3.5 V, whereas for the power clock voltage of 1 V, the maximum clock frequencies are 72 and 230 MHz, respectively.

4. Architectural Design of Adiabatic Arithmetic Units in General Purpose DSP Systems

In general purpose DSPs, the major sources of energy dissipation are multiplier-accumulator (MAC) units featuring fast parallel multipliers, with 16 × 16 bit fixed-point multiplication or higher complexity. MAC architectures are typically based on small logic gates implementing Wallace tree [26] or Dadda [27] partial product reduction schemes, usually combined with a Booth encoding algorithm [28] that reduces the initial number of partial products.

In a straightforward approach, an adiabatic parallel multiplier can be designed by substituting each logic gate in a conventional CMOS multiplier with an equivalent adiabatic latched gate. Additional latches are needed to accommodate gate pipelining by providing matched pipeline delay paths. Since the number of latches in a gate-pipelined implementation exceeds the number of combinational logic gates, if latched logic gates performing simple logic functions are used, the latches dominate the overall multiplier area and energy consumption, leading to poor utilization of the proposed circuit technique.

In order to minimize the number of latches and the related non-adiabatic dissipation, multiplier architectures consisting of fewer logic gates must be sought. From that standpoint, an ideal n × n bit multiplier architecture would consist of 2n gates, i.e. of only one gate per output bit. However, the maximum fan-in for these gates would be 2n, hence the gate complexity would limit the feasibility of the ideal architecture to rather small values of n.

In order to estimate the circuit complexity of multipliers achievable by this approach, OBDDs for multiplier output bits were generated for several values of n by an OBDD logic synthesis program. As shown in Fig. 4, the total number of transistors in the OBDD networks approximately triples when n is increased by 1. Therefore, the single stage implementation is only advantageous for small values of n.
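A rough feel for this growth can be obtained by counting reduced-OBDD nodes directly from truth tables. The sketch below is not the synthesis program used by the authors: the interleaved variable ordering a0, b0, a1, b1, ... is an assumption, no nodes are shared between output bits, and the transistor estimate simply takes two pass transistors per non-terminal node (one per outgoing edge). Its numbers are therefore not expected to reproduce Fig. 4 exactly.

```python
from itertools import product


def robdd_node_count(func, nvars):
    """Count non-terminal nodes of the reduced OBDD of func for a fixed order.

    A node exists at level i for every distinct subfunction (cofactor with
    respect to the first i variables) that actually depends on variable i.
    """
    table = tuple(func(bits) for bits in product((0, 1), repeat=nvars))
    level, count = {table}, 0
    for _ in range(nvars):
        next_level = set()
        for t in level:
            half = len(t) // 2
            low, high = t[:half], t[half:]  # cofactors for variable = 0 / 1
            if low != high:
                count += 1                  # this subfunction needs a node
                next_level.update((low, high))
            else:
                next_level.add(low)         # variable skipped, no node
        level = next_level
    return count


def product_bit(n, m):
    """Bit m of A*B; inputs are interleaved a0, b0, a1, b1, ... (assumed order)."""
    def f(bits):
        a = sum(bits[2 * i] << i for i in range(n))
        b = sum(bits[2 * i + 1] << i for i in range(n))
        return ((a * b) >> m) & 1
    return f


def multiplier_transistor_estimate(n):
    """Two pass transistors per OBDD node, summed over all 2n output bits."""
    nodes = sum(robdd_node_count(product_bit(n, m), 2 * n) for m in range(2 * n))
    return 2 * nodes


estimates = {n: multiplier_transistor_estimate(n) for n in range(1, 5)}
for n, count in estimates.items():
    print(n, count)
```

Exhaustive truth tables limit this approach to small n, which is adequate for observing the trend but not for sizing a 16 × 16-bit design.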

Multipliers for which n ≥ 16, as typically used in DSP processors, are not suitable for single-stage architectures due to an impractically large area and input capacitance. Therefore, adiabatic versions of such multipliers need to be pipelined, while keeping the number of logic gates and pipeline stages as low as possible.

A block diagram of a commonly used parallel n × n-bit multiplier architecture, which consists of three stages (the Booth stage, the partial product reduction stage and the carry-propagate adder (CPA)), is shown in Fig. 5. It computes the product

$$P = A \cdot B \qquad (8a)$$

$$A = \sum_{i=0}^{n-1} a_i \cdot 2^i , \qquad (8b)$$

$$B = \sum_{i=0}^{n-1} b_i \cdot 2^i \quad \text{and} \qquad (8c)$$

$$P = \sum_{i=0}^{2n-1} p_i \cdot 2^i \qquad (8d)$$

where $a_i$, $b_i$ ($i = 0 \ldots n-1$) and $p_i$ ($i = 0 \ldots 2n-1$) are the bits in the binary representations of A, B and P, respectively.

The most commonly implemented radix-4 modified Booth algorithm [29] reduces the number of partial products for an n × n bit multiplication from $n^2$ to approximately $n^2/2$. The second multiplier stage, typically a Wallace tree architecture, reduces $n^2/2$ binary product terms to 2 output bits. The final CPA stage performs the two-input addition.
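The reduction in the number of partial products can be illustrated with a short, word-level sketch of radix-4 modified Booth recoding. It assumes a two's-complement multiplier and an even n, which is a common convention but is not stated explicitly in the text; the function names are hypothetical. Each recoded digit selects one of the multiples −2A, −A, 0, +A or +2A, so n/2 partial-product rows (roughly n²/2 bits) replace the n rows of an AND-array multiplier.

```python
def booth_radix4_digits(b: int, n: int) -> list:
    """Recode the n-bit two's-complement pattern b into n/2 digits in {-2..+2}."""
    assert n % 2 == 0, "this sketch assumes an even word length"
    bits = [(b >> i) & 1 for i in range(n)]

    def bit(i: int) -> int:
        return 0 if i < 0 else bits[i]  # b_{-1} is defined as 0

    # Digit k is formed from the overlapping triple b_{2k+1}, b_{2k}, b_{2k-1}.
    return [bit(2 * k - 1) + bit(2 * k) - 2 * bit(2 * k + 1) for k in range(n // 2)]


def booth_multiply(a: int, b: int, n: int) -> int:
    """Sum the n/2 partial products d_k * A * 4^k (b is an n-bit pattern)."""
    return sum(d * a * 4 ** k for k, d in enumerate(booth_radix4_digits(b, n)))


def to_signed(pattern: int, n: int) -> int:
    """Interpret an n-bit pattern as a two's-complement integer."""
    return pattern - (1 << n) if pattern >= 1 << (n - 1) else pattern


digits = booth_radix4_digits(0b11111101, 8)  # bit pattern of -3 in 8 bits
print(digits, booth_multiply(7, 0b11111101, 8))
```

In the gate-level design, each digit would be produced by the recoder and each multiple by the multiplexer stage; the sketch only confirms the arithmetic identity behind them.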

The pipelined multiplier architecture described in this section minimizes the number of latches utilizing OBDD gates of manageable size. It reduces the number of latches significantly when compared with adiabatic architectures based on small logic gates and almost completely eliminates delay-matching latches.

Figure 4. Number of OBDD transistors for single-stage implementation of n × n-bit multipliers. (Horizontal axis: n from 1 to 8; vertical axis: number of OBDD transistors per multiplier, logarithmic scale from 10^0 to 10^5.)

Figure 5. A typical parallel multiplier architecture. (Blocks shown: radix-4 Booth partial product generator with n-bit inputs a and b, compressor stage (Wallace tree), and CPA producing the 2n-bit product p.)

4.1. Complex Adiabatic Gates for Radix-4 Booth Partial Product Generator

The radix-4 modified Booth partial product generator in Fig. 5 includes the recoder and the multiplexer stages as shown in Fig. 6(a). The inputs to the recoder stage are the multiplier bits $b_i$. The multiplicand bits $a_i$, along with the outputs of the recoder stage, are inputs to the multiplexer stage.

In multipliers built with conventional CMOS gates, each recoder and multiplexer consists of several logic gates. In a gate-level pipelined adiabatic implementation, a latch is introduced for each one of these gates. In addition, since recoding of the multiplier precedes multiplexing, at least n latches are required to delay the multiplicand bits $a_0 \ldots a_{n-1}$ to the multiplexer stage. Assuming that there are 3 gates per partial product in the multiplexer stage [30], the total number of latches in such an adiabatic Booth unit is greater than $3n^2/2 + n$, exceeding the $n^2$ latches associated with AND gates in a multiplier without a Booth unit. An alternative, single stage Booth architecture is based on complex OBDD-style gates and involves only $n^2/2$ latches.

The logic function performed by such Booth stage gates is equivalent to the OBDD shown in Fig. 6(b). This OBDD has 5 binary inputs $b_{2k-1}$, $b_{2k}$, $b_{2k+1}$, $a_i$ and $a_{i+1}$. Booth recoding of inputs $b_{2k+1}$, $b_{2k}$, $b_{2k-1}$ is performed in the first three OBDD rows. The nodes in row 4, from left to right, correspond with the recoded values “−2”, “−1”, “0”, “+1” and “+2”, respectively, such that for any combination of $b_{2k+1}$, $b_{2k}$, $b_{2k-1}$, the corresponding OBDD path between the root node and row 4 ends in the node representing the Booth-recoded 3-bit value $b_{2k+1} b_{2k} b_{2k-1}$. The remaining part of the graph below row 4 is equivalent to the multiplexer gate in Fig. 6(a). The subgraphs following nodes “−1” and “+1” in row 4 depend only on input $a_i$, whereas the subgraphs following nodes “−2” and “+2” depend only on $a_{i+1}$. There is no subgraph for node “0”, since the output is decided for that case. Consequently, this node is connected directly to the OBDD ‘zero’ output “out0”.

Figure 6. Radix-4 modified Booth partial product generator: (a) conventional circuit and (b) OBDD for single-stage implementation.

The circuit implementation of the radix-4 Booth OBDD includes 22 transistors and features a simple topology resulting in a compact layout.

Partial product generation for Booth algorithms of radix-8 and higher involves recoded values such as ±3 for which carry propagate addition needs to be performed. Its single stage implementation is impractically complex since it requires the inclusion of all multiplicand bits $a_{n-1} \ldots a_0$ as inputs to the Booth gate.

A compromise solution for single stage implementation of the radix-8 modified Booth algorithm, which avoids the use of a carry-propagate adder as part of the partial product generator in exchange for a certain increase in the number of partial products, is proposed by Bewick [31]. Its single stage implementation using OBDD-style logic gates is described below.

In order to avoid full n-bit addition in computing 3A, the addition

$$3A = A + 2A = \sum_{i=0}^{n-1} a_i \cdot 2^i + \sum_{i=0}^{n-1} a_i \cdot 2^{i+1} \qquad (9a)$$

is broken into sums of two 4-bit inputs as follows:

$$3A = a_0 + \sum_{k=0}^{n/4-1} 2^{4k+1} \cdot \left( \sum_{j=0}^{3} 2^j \cdot (a_{4k+j} + a_{4k+j+1}) \right) \qquad (9b)$$

Since each such sum is generally a 5-bit number, (9b) can be rewritten as

$$3A = a_0 + \sum_{k=0}^{n/4-1} 2^{4k+1} \cdot \left( 2^4 \cdot c_{4k+5} + \sum_{j=0}^{3} 2^j \cdot s_{4k+j+1} \right) \qquad (9c)$$

Figure 7. Single stage OBDD-style radix-8 Booth encoder: (a) regular partial products and (b) additional partial product.

The lower 4 bits of the k-th such sum, $s_{4k+1}$, $s_{4k+2}$, $s_{4k+3}$ and $s_{4k+4}$, can be generated as functions of the 5 multiplicand bits $a_{4k}$, $a_{4k+1}$, $a_{4k+2}$, $a_{4k+3}$ and $a_{4k+4}$ and used instead of the 3A values for bit positions 4k, 4k + 1, 4k + 2 and 4k + 3, respectively. As shown in Fig. 7(a), the maximum number of OBDD inputs for these 4 bit positions is 9.
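The decomposition in (9b) and (9c) can be checked with a few lines of Python. The sketch below is an arithmetic check only, under the assumptions that n is a multiple of 4, that A is unsigned, and that the bit $a_n$ beyond the word is 0; the function name is hypothetical. Each group sum depends on five consecutive multiplicand bits and is split into its lower four bits s and a single carry bit c.

```python
def three_a_by_groups(a: int, n: int):
    """Evaluate 3A from independent 4-bit group sums, following (9b) and (9c)."""
    assert n % 4 == 0, "this sketch assumes n is a multiple of 4"

    def bit(i: int) -> int:
        return (a >> i) & 1 if i < n else 0  # a_n is taken as 0

    total = bit(0)
    groups = []
    for k in range(n // 4):
        # Inner sum of (9b): uses only bits a_{4k} .. a_{4k+4}.
        group_sum = sum((bit(4 * k + j) + bit(4 * k + j + 1)) << j for j in range(4))
        s = group_sum & 0xF   # s_{4k+1} .. s_{4k+4}
        c = group_sum >> 4    # c_{4k+5}, the additional partial-product bit
        groups.append((s, c))
        total += ((c << 4) + s) << (4 * k + 1)
    return total, groups


total, groups = three_a_by_groups(0b10110111, 8)
print(total, groups)
```

Because no carry crosses a group boundary inside the gates, the carries c are passed on as extra partial products rather than being propagated, which is the trade-off described in the text.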

To preserve the correctness of computation, the output carry bit $c_{4k+5}$ in (9c), which results from the 4-bit addition in (9b), is evaluated by a separate gate and represents an additional partial product. It can have a non-zero value only for the recoded multiplier values of “+3” or “−3”. As shown in Fig. 7(b), the OBDD gate producing this additional partial product, at bit position 4k + 4, also has 9 inputs. Single stage, adiabatic implementation of the radix-8 modified Booth algorithm requires n/4 of such partial products, in addition to the $n^2/3$ regular ones. This is a negligible overhead for 16 × 16-bit and larger multipliers, thus making the radix-8 Booth stage an attractive option for reduction of non-adiabatic dissipation. However, the average number of 82 transistors per OBDD gate represents a significant increase compared to the radix-4 Booth stage and creates additional adiabatic dissipation that needs to be compensated by slowing down the system operation.

4.2. Adiabatic Architecture of the Partial Product Reduction Stage

The second building block of the parallel multiplier in Fig. 5 is a Wallace tree [26] or Dadda counter [27], consisting of full adders. 4-to-2 compressors are not considered in this analysis since they inherently consist of two levels of logic and their gate-pipelined version is equivalent to two cascaded full adders.

Typical full adder circuits in low power DSP designs consist of several simpler gates, such as the one shown in Fig. 8(a) [32]. Although only 3 gates are used in this circuit, the gate-pipelined version requires 5 latches. However, full adder circuit implementation without cascading gates is also possible.

For a full adder design that does not include cascaded logic gates, the gate-pipelined version of a Wallace tree is obtained by adding a latch at every full adder output and by insertion of a number of delay-matching latches. An example of a Wallace tree bit slice reducing the number of partial products from 15 to 2 is shown in Fig. 8(b). It includes 13 full adders and has 7 pipeline stages. It also involves a total of 31 latches, 5 of which are delay-matching latches. However, a Wallace tree as in Fig. 8(b) that uses the gate-pipelined version of the full adder shown in Fig. 8(a) involves 75 latches and operates in 14 pipeline stages.

An alternative architecture for the Wallace tree shown in Fig. 8(b), utilizing complex counters, is given in Fig. 8(c). It operates in 3 pipeline stages and features a 15:4 counter, two full-adders and a total of 9 latches, one of which is inserted to provide delay matching. It reduces the number of latches by a factor of 3.44 compared with the circuit shown in Fig. 8(b) and by a factor of 8.33 compared with the same tree built from the full adder shown in Fig. 8(a). The 15:4 counter consists of only four OBDD-style logic gates, whereas the equivalent circuit based on full-adders [33] includes 11 full-adders and at least 22 logic gates.

Figure 8. Partial product reduction schemes: (a) gate-pipelining of a conventional FA circuit, (b) gate-pipelined, 15-input Wallace tree and (c) alternative tree using complex counter.
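The latch counts quoted for the three 15-input reduction schemes can be cross-checked with a few lines of arithmetic. The sketch below only restates figures from the text; the variable names are illustrative and not taken from the paper.

```python
# Latch counts for the 15-input reduction schemes of Fig. 8, as quoted in the text.
FULL_ADDERS = 13      # full adders in the Wallace tree bit slice of Fig. 8(b)
DELAY_LATCHES = 5     # delay matching latches in Fig. 8(b)

# One latch at each full adder output (sum and carry) plus the delay matching latches.
latches_wallace = 2 * FULL_ADDERS + DELAY_LATCHES
latches_cascaded_fa = 75   # quoted for the tree built from the Fig. 8(a) full adder
latches_counter = 9        # quoted for the counter-based tree of Fig. 8(c)

ratio_wallace = latches_wallace / latches_counter
ratio_cascaded = latches_cascaded_fa / latches_counter
print(latches_wallace, round(ratio_wallace, 2), round(ratio_cascaded, 2))
```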

The topology for the counter gates is obtained from the topology of a generalized counter graph with multiple outputs, shown in Fig. 9(a). This graph effectively counts logic ‘ones’ among the input bits and each node in row i represents a possible sum of ‘ones’ in the input bits in1, in2, ..., ini. Since the possible sums range from 0 to i, the number of nodes in the i-th row is i + 1. Therefore, each combination of the n input bits including k ‘ones’ (k ≤ n) corresponds to one of the graph paths that starts at the root node and ends at output node k.
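The path property of the counter graph can be illustrated with a small behavioural model. This is a sketch only: the function names are hypothetical, and nodes are identified simply by the running sum they represent within each row.

```python
from itertools import product

def walk_counter_graph(bits):
    """Follow one root-to-output path and return the node index visited in each row."""
    node = 0                  # root node: no inputs consumed, running sum is 0
    path = [node]
    for bit in bits:          # row i is reached after consuming input in_i
        node += bit           # a '1' steps to the next node index, a '0' keeps it
        path.append(node)
    return path

def nodes_in_row(i):
    """Row i holds one node for each possible sum 0 .. i."""
    return i + 1

def check_all_paths(n):
    """Every n-bit input must stay inside the graph and end at output node k = number of ones."""
    for bits in product((0, 1), repeat=n):
        path = walk_counter_graph(bits)
        assert all(node < nodes_in_row(i) for i, node in enumerate(path))
        assert path[-1] == sum(bits)
    return True
```

For example, `walk_counter_graph([0, 0, 0, 0, 1, 1, 1])` ends at output node 3, the number of ones in the input.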


Figure 9. 7-input counter: (a) counter graph topology, (b) OBDDs encoding output bits and (c) layout.

Since counting ones amounts to binary encoding of k, the number of binary-output OBDD gates in the counter unit equals the number of bits in the binary representation of k.

The OBDD for the counter output bit m is obtained from the graph in Fig. 9(a) by merging all its output nodes whose label number in binary representation has a “1” at the bit position m and by labeling the new output node as “1”. All remaining output nodes are merged into node “0”. Subsequently, Bryant’s reduction algorithm [18] is applied to the OBDD. The resulting OBDDs for the 7-input, 3-output counter [34] are shown in Fig. 9(b).
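The construction can be sketched in software. The code below is not the authors' CAD flow: it merges output nodes by bit m of their label and applies the two standard OBDD reduction rules (skip a node whose children coincide, share structurally identical nodes) while building the graph, rather than running Bryant's bottom-up procedure on a finished graph. All names are illustrative.

```python
from functools import lru_cache
from itertools import product

def build_counter_obdd(n, m):
    """Return (root, nodes) for output bit m of an n-input counter.

    nodes maps a node id to (level, low_child, high_child); ids 0 and 1 are the terminals.
    """
    unique = {}   # (level, low, high) -> node id, so identical nodes are shared
    nodes = {}

    @lru_cache(maxsize=None)
    def make(level, ones):
        if level == n:                       # output node labelled `ones`
            return (ones >> m) & 1           # merged according to bit m of the label
        low = make(level + 1, ones)          # input is 0: running sum unchanged
        high = make(level + 1, ones + 1)     # input is 1: running sum incremented
        if low == high:                      # redundant test: skip this node
            return low
        key = (level, low, high)
        if key not in unique:
            unique[key] = len(unique) + 2
            nodes[unique[key]] = key
        return unique[key]

    return make(0, 0), nodes

def evaluate(root, nodes, bits):
    """Follow the OBDD from the root to a terminal for one input combination."""
    node = root
    while node > 1:
        level, low, high = nodes[node]
        node = high if bits[level] else low
    return node

def check(n, m):
    """Compare the OBDD against a direct count of ones for all 2**n inputs."""
    root, nodes = build_counter_obdd(n, m)
    return all(evaluate(root, nodes, bits) == (sum(bits) >> m) & 1
               for bits in product((0, 1), repeat=n))
```

Calling `len(build_counter_obdd(n, m)[1])` for increasing n gives the number of decision nodes per output bit. How that maps to a transistor count depends on the circuit style and is not modelled here.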

Whereas the OBDD complexity for a general logic function, measured as the number of transistors in the equivalent circuit, increases exponentially, the OBDD complexity of the counter gates increases linearly with the number of inputs. The number of transistors versus the number of inputs for counter output bits 0, 1, 2 and 3 is plotted in Fig. 10. Counter output gates with fan-in as large as 15 have been implemented and experimentally verified.

The regular and local interconnection patterns in the counter OBDDs enable very compact layout of their circuit implementation. Since vertices originating from adjacent nodes in one row end in the same node in the following row, all transistors controlled by the same input signal (including the inverted and non-inverted versions) can be connected by abutment and laid out in one row, as shown in Fig. 9(c). This way, the transistor junction capacitance at the internal nodes of the OBDD network is reduced, along with the dissipation necessary for output detection.

Figure 10. Number of OBDD transistors for counter output gates (horizontal axis: number of input bits, 0 to 15; vertical axis: number of OBDD transistors, 0 to 160; one curve each for the bit 0, bit 1, bit 2 and bit 3 OBDDs).

As explained in Section 3.2, the output detection for OBDD-style transistor networks relies on the differential voltage between the OBDD network output nodes, one of which is discharged to ground, whereas the other retains a small positive voltage. The worst case for output detection occurs when the voltage at the latter node is minimal. This occurs for the combination of input bits that connects this node to the maximum internal capacitance, thus causing maximum voltage drop due to charge sharing.

The maximum internal capacitive load for all counter OBDD networks is listed in Table 1. For each counter size, the maximum worst case load capacitance relative to the total internal capacitance is that for the MSB gate. For all other bit positions, the variation in the internal capacitive load for different input combinations is relatively small. This is explained by the fact that in MSB OBDD networks, all transistors connected to the output node “1” are controlled by the non-inverted input signals, whereas all transistors connected to output node “0” are controlled by the inverted input signals, as shown in Fig. 9(b) for the OBDD corresponding to bit 2 of the (7, 3) counter. If, for example, the inputs to the (7, 3) counter are

(in1, in2, in3, in4, in5, in6, in7) = (0, 0, 0, 0, 1, 1, 1),

a conducting path will be created between output “0” and the root node, whereas output “1” is connected to 12 out of 15 internal nodes since 3 out of 4 transistors connected to it are turned on.

The variation in the internal load capacitance of OBDD output nodes is much smaller for OBDD networks at other bit positions since their output nodes are connected to an equal number of transistors controlled by non-inverted signals and their complements. This way, for all input combinations, both output nodes are connected to approximately equal numbers of internal nodes.

Output detection for the MSB OBDD counter networks can either be achieved by precharging the output nodes with energy sufficient to provide the necessary voltage swing for the worst case input combination or, alternatively, by providing additional energy only when it is needed. The latter solution is illustrated in Fig. 11, in which additional precharged capacitance is connected to an output node of an MSB OBDD network for each turned-on transistor connected to that node. This way, less energy is dissipated, on average, in output detection.

Table 1. Internal capacitance of large counter OBDDs.

| OBDD counter | Number of internal nodes | Maximum capacitive load (number of nodes) | Variation of load capacitance (% of total) |
|---|---|---|---|
| 15-input, bit 3 | 64 | 56 | 12.5–87.5 |
| 15-input, bit 2 | 80 | 44 | 45–55 |
| 15-input, bit 1 | 52 | 26 | 50 |
| 15-input, bit 0 | 28 | 14 | 50 |
| 14-input, bit 3 | 56 | 49 | 12.5–87.5 |
| 14-input, bit 2 | 72 | 40 | 44.4–55.6 |
| 14-input, bit 1 | 48 | 24 | 50 |
| 14-input, bit 0 | 26 | 13 | 50 |
| 13-input, bit 3 | 48 | 42 | 12.5–87.5 |
| 13-input, bit 2 | 64 | 36 | 43.75–56.25 |
| 13-input, bit 1 | 44 | 22 | 50 |
| 13-input, bit 0 | 24 | 12 | 50 |
| 12-input, bit 3 | 40 | 35 | 12.5–87.5 |
| 12-input, bit 2 | 56 | 32 | 42.9–57.1 |
| 12-input, bit 1 | 40 | 20 | 50 |
| 12-input, bit 0 | 22 | 11 | 50 |

Figure 11. Dynamic OBDD charging for MSB of 7-input counter.
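The percentage ranges in Table 1 appear to follow directly from the node counts. The sketch below assumes, as a reading of the table rather than a statement from the paper, that the upper bound is the maximum load divided by the total number of internal nodes and that the lower bound is its complement.

```python
# (counter inputs, output bit) -> (internal nodes, maximum load in nodes), from Table 1
TABLE_1 = {
    (15, 3): (64, 56), (15, 2): (80, 44), (15, 1): (52, 26), (15, 0): (28, 14),
    (14, 3): (56, 49), (14, 2): (72, 40), (14, 1): (48, 24), (14, 0): (26, 13),
    (13, 3): (48, 42), (13, 2): (64, 36), (13, 1): (44, 22), (13, 0): (24, 12),
    (12, 3): (40, 35), (12, 2): (56, 32), (12, 1): (40, 20), (12, 0): (22, 11),
}

def load_range(nodes, max_load):
    """Assumed reading: one output sees max_load nodes, the other sees the remainder."""
    high = 100.0 * max_load / nodes
    return 100.0 - high, high

for (inputs, bit), (nodes, max_load) in sorted(TABLE_1.items(), reverse=True):
    low, high = load_range(nodes, max_load)
    print(f"{inputs}-input, bit {bit}: {low:.2f}% to {high:.2f}% of {nodes} nodes")
```

Under this reading the MSB gates stand out with an 87.5% worst case, while bits 0 and 1 stay at an even split for every counter size listed.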

4.3. Adiabatic Architectures for Carry-Propagate Adders

The number of latches involved in the implementation of the carry-propagate adder (CPA) is minimized if the logic function for each of its output bits is implemented as a single complex logic gate. For the n + n bit addition, the number of output latches in such an implementation is n + 1. The differential NMOS network evaluating the arithmetic sum for bit position i (where the LSB position is labelled as 1), which minimizes the transistor height, is shown in Fig. 12(a). It consists of i − 1 carry propagate networks and one 3-input XOR network. The carry propagation network in Fig. 12(a) does not have an OBDD topology, but rather takes advantage of the specific nature of the carry propagation logic function, reducing the total transistor height to i + 1 for the i-th bit position [35], for a fan-in of 2i.

The transistor height of an OBDD-style network performing the same function is 2i, as shown in Fig. 12(b). Given that the number of stacked transistors is the limiting factor for implementation, larger single stage CPAs are achievable using the carry propagation network in Fig. 12(a) than using the network in Fig. 12(b). However, the advantage of the circuit in Fig. 12(b) is that its intermediate nodes represent valid carry bits for bit positions from 1 to i, thus allowing a very compact OBDD Manchester carry adder implementation, as also shown in Fig. 12(b).

Figure 12. Carry propagate transistor networks: (a) minimum transistor height network and (b) OBDD-style network.

Adiabatic adders of sizes exceeding the maximum number of stackable transistors have to be implemented in more than one pipeline stage. We propose the carry-select adder architecture as the most appropriate, since it allows two-stage implementation for a wide range of adder sizes. The following example describes the design of a 2-stage, 32-bit carry-select adder with minimized transistor height.

The 8 most significant bits in the first stage of the 32-bit carry-select adder architecture are shown in Fig. 13(a). At this stage, the sum and the carry output bit for 8-bit groups 32–25, 24–17 and 16–9 are found for both possible values of the input carry signal c24, c16 and c8, respectively. Assuming that the input carry signal c0 is known at the first stage, only one sum and one carry output are required for the group 8–1. The number of latched gates at the first stage is therefore 63. The carry propagate networks used are those shown in Fig. 12(a) and the maximum transistor height for a single gate is 9 (for gates s32_1, s32_0, s24_1, s24_0, s16_1, s16_0 and s8).

Figure 13. Adiabatic carry-select adder architecture: (a) first stage, bit group 32–25 and (b) second stage, bit position 32.

OBDD-style gates at the second stage evaluate the adder output si by selecting one of si_0 and si_1, based on the carry signals c8, c16_0, c16_1, c24_0 and c24_1. The block diagram of one such gate, for bit position 32, is shown in Fig. 13(b). The maximum number of inputs per gate at the second stage is 7. The number of latches at this stage is 33, hence the total number of latches for the adder is 96.
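A behavioural model helps to see how the two stages divide the work and where the latch counts come from. The sketch below is an assumption-laden software analogue, not the circuit: stage 2 resolves the group carries in a loop, whereas the hardware evaluates each output bit in a single gate. Function and parameter names are illustrative.

```python
def carry_select_add(a, b, c0=0, n=32, group=8):
    """Two-stage carry-select addition of two n-bit words (n divisible by group)."""
    mask = (1 << group) - 1
    # Stage 1: every group above the lowest is evaluated for carry-in 0 and carry-in 1.
    stage1 = []
    for g in range(n // group):
        ga = (a >> (g * group)) & mask
        gb = (b >> (g * group)) & mask
        carries = (c0,) if g == 0 else (0, 1)
        stage1.append({cin: ga + gb + cin for cin in carries})
    # Stage 2: select the precomputed result that matches the actual group carry.
    result, carry = 0, c0
    for g, candidates in enumerate(stage1):
        total = candidates[carry]
        result |= (total & mask) << (g * group)
        carry = total >> group
    return result | (carry << n)

def latch_counts(n=32, group=8):
    """Latched outputs per stage, following the counting used in the text."""
    groups = n // group
    per_group = group + 1                               # sum bits plus one carry-out
    stage1 = per_group + (groups - 1) * 2 * per_group   # lowest group is evaluated once
    stage2 = n + 1                                      # n sum bits plus the final carry
    return stage1, stage2, stage1 + stage2
```

With the default arguments, `latch_counts()` returns the 63, 33 and 96 latches quoted for the 32-bit design.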

The described architecture enables practically feasible 2-stage implementations of larger adders, with a moderate increase in the OBDD size. For example, a 2-stage carry-select architecture for a 64-bit adder, based on the same type of circuits and 11-bit groups, involves transistor networks whose transistor height does not exceed 12.

By comparison, the carry-lookahead architecture requires at least 3 stages of logic gates since it takes at least two stages to generate the carry signal for each bit position and one additional stage to generate the final result. Using the gate count for the 3-stage 32-bit adder based on enhanced multiple output domino logic (EMODL) [35], the total number of latches for the carry-lookahead architecture would be 161, which is 67% more latches than needed for the proposed carry-select architecture.

Further, a gate-pipelined ripple-carry adder (RCA) implementation of an n + n bit adder, consisting of single-stage, latched full-adder gates, would have a latency of n clock cycles and involve a catastrophic number of (3/2)·n² + (1/2)·n latches, the majority of which would be delay matching latches. In the case of 32-bit addition, the number of latches would amount to 1552, thus disqualifying RCA as a candidate architecture for adiabatic implementation.
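The latch counts of the three adder architectures can be put side by side with a short calculation. The ripple-carry formula is the one given in the text; the carry-select and carry-lookahead figures are the quoted totals, and the names are illustrative.

```python
def rca_latches(n):
    """Latches in a gate-pipelined n + n bit ripple-carry adder: (3/2)n^2 + (1/2)n."""
    return (3 * n * n + n) // 2   # n(3n + 1) is always even, so the division is exact

CARRY_SELECT_LATCHES = 96         # proposed 2-stage, 32-bit carry-select adder
CARRY_LOOKAHEAD_LATCHES = 161     # 3-stage EMODL-based estimate quoted in the text

cla_overhead = CARRY_LOOKAHEAD_LATCHES / CARRY_SELECT_LATCHES - 1
rca_overhead = rca_latches(32) / CARRY_SELECT_LATCHES
print(rca_latches(32), f"{cla_overhead:.1%}", f"{rca_overhead:.1f}x")
```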

4.4. Comparison with Previous Designs

In order to assess the overall savings in non-adiabatic dissipation in parallel multipliers, achievable by the proposed architectural optimizations, the gate counts for 16 × 16 and 32 × 32-bit Wallace/Dadda multipliers based on small logic gates [36] were used. The number of latches in adiabatic multipliers obtained from these by replacing each logic gate with an equivalent latched adiabatic gate was estimated to be 15% higher than the total number of logic gates, taking into account the delay matching latches. Also, the number of latches in multipliers of the same size but based on complex gates, as described in this section, was calculated. A radix-4 Booth single stage partial product generator was used for the 16 × 16-bit multiplier, whereas a radix-8 Booth partial product generator was used for the 32 × 32-bit multiplier. As shown in Table 2, the proposed architectural approach reduces the number of latches by a factor of 8 for the 16 × 16-bit multiplier and by a factor of 10.7 for the 32 × 32-bit multiplier.

For both multiplier sizes considered, the number of latches in the proposed architecture is dominated by the number of latches in the partial product reduction tree. It should be noted that for the example of such a tree shown in Fig. 8(c), the reduction of partial products from 15 to 4 involves 4 latches, whereas the elimination of a further 2 partial products involves 5 more latches. This observation suggests that, for DSP algorithms computing sums of a large number of products, better reduction in the number of latches can be achieved by an application-specific architecture that does not reduce the result of each separate multiplication down to 2 partial products, but rather uses the (15, 4) counter as many times as possible and performs 4 to 2 compression only once, to calculate the final result. The asymptotic minimum number of latches for such an architecture compressing X partial products to 4 using (15, 4) counters is 1.45X. The use of (7, 3) counters would involve a minimum of 2.25X latches.

Table 2. Comparison between conventional and proposed multiplier architectures in adiabatic implementation.

| Design | Number of gates | Number of latches |
|---|---|---|
| Conventional 16 × 16-bit multiplier [34] | 2569 | 2920 |
| Proposed 16 × 16-bit adiabatic multiplier architecture | 340 | 364 |
| Conventional 32 × 32-bit multiplier [34] | 10,417 | 11,980 |
| Proposed 32 × 32-bit adiabatic multiplier architecture | 1026 | 1122 |

5. 15 × 15-Bit Adiabatic MAC: Design and Implementation

5.1. Specifications

In order to illustrate the design procedure outlined above, an adiabatic multiply-accumulate (MAC) unit, employing high fan-in, OBDD-based counter gates, was designed in a 0.25 µm CMOS process [23]. The following specifications were adopted for the design:

• It was assumed that the MAC is an adiabatic subsystem in a conventional CMOS environment and that its inputs are driven by non-adiabatic CMOS circuits;

• MAC input word lengths were chosen to be 15 bits for the multiplicand and 15 bits for the multiplier in order to take full advantage of the 15:4 compression rate;

• The MAC was intended for applications, such as FIR filtering, where the result is not a single product but rather a sum of multiple products and is required only once at the end of a sequence of multiply-accumulate computations.

The MAC datapath architecture includes three pipeline stages, as shown in Fig. 14. Stage 1 consists of 15 × 15 = 225 adiabatic, two-input AND/NAND gates generating partial products pp_ij = a_i b_j for the 15-bit multiplication operands A (A = a14 a13 .. a0) and B (B = b14 b13 .. b0). Inputs a_i and b_j are assumed to be non-adiabatic, latched signals that are present at the inputs of the AND/NAND gates during the first clock phase. Since there are no latches in this stage, all circuits are energized and de-energized through power clock PWR1 without non-adiabatic losses.

Pipeline stage 2 consists of adiabatic gates that perform n-to-4 compression, where n ≤ 15. The bit-slice of stage 2 is a counter circuit with a fan-in of up to 15, producing 4-bit outputs. Each output signal is computed by a separate logic gate, whereas all gates in one counter share the same inputs driven by the first stage. Gates in different bit slices are custom sized for the actual number of partial products generated for the particular bit positions. All gate outputs at stage 2 are latched and the latches are powered by power clock PWR2.

Figure 14. Implemented MAC architecture: (a) overall architecture, (b) stage 1, (c) stage 2 and (d) stage 3.
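The first two stages can be mimicked at the bit level to see why a fan-in of 15 and 4 output bits per slice suffice. This is a functional sketch only, with hypothetical function names; it ignores clocking, latching and the differential AND/NAND signalling.

```python
N = 15  # operand width of the implemented MAC

def partial_products(a, b, n=N):
    """Stage 1: pp[i][j] = a_i AND b_j, carrying bit weight 2**(i + j)."""
    return [[(a >> i) & (b >> j) & 1 for j in range(n)] for i in range(n)]

def column_heights(n=N):
    """Number of partial products of weight 2**k, for k = 0 .. 2n-2."""
    return [min(k, n - 1) - max(0, k - n + 1) + 1 for k in range(2 * n - 1)]

def stage2_counts(pp, n=N):
    """Stage 2: each bit slice counts the ones among its own partial products."""
    counts = [0] * (2 * n - 1)
    for i in range(n):
        for j in range(n):
            counts[i + j] += pp[i][j]
    return counts

def value(counts):
    """Numerical value represented by the per-slice counts."""
    return sum(c << k for k, c in enumerate(counts))

heights = column_heights()
print(max(heights), sum(heights))   # tallest slice and total number of AND gates
```

The slice heights range from 1 to 15, which is consistent with the custom sizing of the gates per bit position, and every count fits in 4 bits.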

Pipeline stage 3 is the accumulator stage. The accumulator is based on 7-input counter circuits, each one with 3-bit outputs. The outputs are double latched, with the first set of latches powered by PWR1 and the second by PWR2. The second set of latches provides synchronization of the signals in the feedback path with the inputs from stage 2. As illustrated in Fig. 14, all signals connected to a particular accumulator bit slice have the same bit weight. The direct path inputs to the k-th bit slice are driven by circuits at bit slices k, k − 1, k − 2 and k − 3 of stage 2, whereas the inputs in the feedback path are driven by circuits at bit slices k, k − 1 and k − 2 of stage 3.

The throughput of the described architecture is one multiplication per clock cycle. The sum of products is available at the output of stage 3 with a latency of one and a half clock cycles and it is compressed to three signals per bit position. If one additional multiplication with zeroed inputs is performed at the end of the multiply-accumulate sequence, the output of stage 3 is further compressed to 2 bits and only a carry-propagate adder (CPA) is needed to obtain the final result. In applications such as FIR filtering, which compute not a single product but rather a sum of multiple products, the final result is required only at the end of the computation. In such cases, the activity rate of the final CPA is rather low, making its implementation and energy efficiency less critical at the system level.

Figure 15. Chip micrograph.
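The accumulator wiring and the effect of the final zero-input cycle can be checked with a functional model. The sketch makes several assumptions that are not in the paper: the accumulator width `SLICES` is chosen arbitrarily, pipeline latency and double latching are ignored, and all names are hypothetical.

```python
N = 15
SLICES = 2 * N + 8   # assumed accumulator width; exact for sums below 2**SLICES

def stage2(a, b):
    """4-bit count of ones among the partial products of each bit weight."""
    counts = [0] * SLICES
    for i in range(N):
        for j in range(N):
            counts[i + j] += (a >> i) & (b >> j) & 1
    return counts

def accumulate(state, counts):
    """One clock cycle of stage 3; state[k] is the 3-bit output of slice k."""
    new_state = []
    for k in range(SLICES):
        ones = 0
        for d in range(4):                 # direct path: bit d of stage 2 slice k-d
            if k - d >= 0:
                ones += (counts[k - d] >> d) & 1
        for d in range(3):                 # feedback path: bit d of stage 3 slice k-d
            if k - d >= 0:
                ones += (state[k - d] >> d) & 1
        new_state.append(ones)             # at most 7 ones, so 3 output bits suffice
    return new_state

def value(state):
    return sum(s << k for k, s in enumerate(state))

def mac(pairs):
    state = [0] * SLICES
    for a, b in pairs:
        state = accumulate(state, stage2(a, b))
    return accumulate(state, stage2(0, 0))  # final cycle with zeroed inputs
```

After the final cycle each slice receives only its three feedback inputs, so every slice output is at most 3 and its bit 2 is zero. That leaves two signals per bit position for the carry-propagate adder, in line with the description above.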

5.2. Performance Analysis

The described MAC unit was implemented and the chip micrograph is shown in Fig. 15. The MAC is functional for clock frequencies up to 66 MHz, while operated from a 1 V power supply.

The total non-adiabatic dissipation per clock cycle (i.e. per multiplication) is 0.28 pJ, as shown in Table 3, and it is caused by latch/sense amplifier activity. In addition to non-adiabatic dissipation, the MAC requires 4.5 pJ of recoverable energy in order to perform its operation. 10% of that energy, or 0.45 pJ per clock cycle, is lost due to the internal dissipation in the power supply. Therefore, the total energy consumption per multiplication for the adiabatic MAC and the power supply is 0.28 pJ + 0.45 pJ = 0.73 pJ. Compared with a conventional CMOS implementation of an equivalent MAC using the same process [32], the adiabatic MAC described in this paper consumes 23 times less energy per computation. The comparison between the two units is listed in Table 4.

Table 3. MAC performance analysis.

| Parameter | Value |
|---|---|
| Power supply | 0–1 V adiabatic |
| Maximum clock frequency | 66 MHz |
| Energy used (per multiplication) | 0.28 pJ + 4.5 pJ |
| Energy recovered (per multiplication) | 4.5 pJ |
| Non-adiabatic dissipation (per multiplication) | 0.28 pJ |
| Power supply dissipation (per multiplication) | 0.1 × 4.5 pJ = 0.45 pJ |
| Total dissipation (per multiplication) | 0.28 pJ + 0.45 pJ = 0.73 pJ |

Table 4. MAC performance comparison.

| | CMOS | Adiabatic MAC (conventional MAC architecture) | This design |
|---|---|---|---|
| Total dissipation | 17.6 pJ | 1.57 pJ | 0.73 pJ |
| Non-adiabatic dissipation | N/A | 1.12 pJ | 0.28 pJ |
| Number of latches | N/A | 773 | 190 |
| Number of pipeline stages | 2 | 8 | 3 |
| Latency | 2 cycles | 4 cycles | 1.5 cycles |
| Maximum frequency | 100 MHz | 66 MHz | 66 MHz |
| Number of transistors | 58000 | 14500 | 10450 |

Finally, to demonstrate the advantage of the proposed architecture in adiabatic implementation, the implemented architecture was compared with an alternative adiabatic architecture in which Wallace tree compression is performed using full-adder (FA) gates. The characteristics of the alternative architecture are also listed in Table 4. The FA-based architecture includes 773 latches, whereas the proposed architecture includes only 190 latches, reducing the related non-adiabatic dissipation by a factor of 4. The total charged capacitance per clock cycle is approximately the same for the two architectures. It is dominated by latch capacitance in the case of the FA-based architecture, whereas in the proposed architecture, the combinational circuit capacitance dominates. Generally, the proposed architecture is more energy efficient than the FA-based one and this advantage is more pronounced for higher power supply efficiency. The proposed architecture also has lower latency since it operates in 3 pipeline stages, compared to 8 for the FA-based one.
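The dependence on power supply efficiency can be made concrete with a small energy model. The sketch assumes, based on the statement that the charged capacitance is approximately the same, that both adiabatic architectures recycle the same 4.5 pJ per multiplication; the function name and the swept loss values are illustrative.

```python
RECOVERABLE_PJ = 4.5   # energy delivered and recovered per multiplication

def total_dissipation(non_adiabatic_pj, supply_loss=0.10, recoverable_pj=RECOVERABLE_PJ):
    """Non-adiabatic loss plus the fraction of the recycled energy lost in the supply."""
    return non_adiabatic_pj + supply_loss * recoverable_pj

for loss in (0.20, 0.10, 0.05):
    proposed = total_dissipation(0.28, loss)    # counter-based architecture
    fa_based = total_dissipation(1.12, loss)    # full-adder based architecture
    print(f"supply loss {loss:.0%}: {proposed:.2f} pJ vs {fa_based:.2f} pJ, "
          f"ratio {fa_based / proposed:.2f}")
```

At a supply loss of 10% the model reproduces the 0.73 pJ and 1.57 pJ totals of Table 4, and the ratio between the two architectures grows as the supply loss shrinks.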

6. Conclusions

Issues related to architectural design of parallel multipliers for adiabatic implementation have been analyzed in this paper. Non-adiabatic dissipation in latches associated with adiabatically driven full-swing signals was identified as a lower bound on the overall energy consumption. It has been shown that a significant improvement in achievable energy efficiency of adiabatic arithmetic units can be made by using complex logic gates as building blocks for such units, rather than the small gates that typically comprise equivalent CMOS designs. This way, the number of full-swing signals and the associated latches causing non-adiabatic dissipation is reduced. In addition, such an architectural design approach minimizes the pipeline depth of the inherently gate-pipelined adiabatic systems.

A circuit technique enabling logic design and output detection for complex logic gates has been developed. Logic design based on ordered binary decision diagrams (OBDD) was used to achieve circuit compaction and design automation. Custom built CAD tools for OBDD-style logic synthesis and layout were developed and used in the design of the prototype chip. Special attention was given to the design and analysis of high fan-in counter gates featuring high computational efficiency due to their linear complexity and regular topology.

A low power OBDD output detection scheme involving sense amplifiers was developed. The proposed logic gate design achieves operation at clock speeds comparable to those typically used in DSP systems. In addition, the proposed circuit style allows low voltage operation, which also boosts energy efficiency. The worst case analysis of internal capacitive load for the sense amplifier was performed for the counter gates. The highest capacitive load that represents the worst case for output detection occurs for counter MSB gates. A conditional output precharging technique for such OBDD networks is proposed to minimize the average energy required for output detection.

A parallel multiplier architecture was developed that minimizes non-adiabatic dissipation and its advantage over alternative architectures was demonstrated. A multiply-accumulate (MAC) unit based on counter gates with up to 15 inputs was designed and implemented in a 0.25 µm CMOS process. This design was found to be 27 times more energy efficient than an equivalent conventional design and 4 times more energy efficient than an alternative adiabatic architecture consisting of smaller gates.

Acknowledgment

The work was supported by NSERC, Micronet, Gennum, Mitel, Nortel Networks and PMC Sierra.


References

1. A.P. Chandrakasan, S. Sheng, and R.W. Brodersen, “Low-Power CMOS Digital Design,” IEEE Journal of Solid-State Circuits, vol. 27, 1992, pp. 473–483.

2. W. Athas, L.J. Svensson, J.G. Koller, N. Tzartzanis, and E. Chou, “Low-Power Digital Systems Based on Adiabatic Switching Principles,” IEEE Transaction on VLSI Systems, vol. 2, 1994, pp. 398–407.

3. R. Landauer, “Irreversibility and Heat Generation in the Computing Process,” IBM Journal of Research and Development, vol. 5, 1961, pp. 183–191.

4. J.S. Denker, “A Review of Adiabatic Computing,” in Symposium on Low Power Electronics Proceedings, 1994, pp. 94–97.

5. C.H. Bennett, “Logical Reversibility of Computation,” IBM Journal of Research and Development, vol. 6, 1973, pp. 525–532.

6. A.G. Dickinson and J.S. Denker, “Adiabatic Dynamic Logic,” IEEE Journal of Solid-State Circuits, vol. 30, 1995, pp. 311–315.

7. A. Kramer, J.S. Denker, S.C. Avery, A.G. Dickinson, and T.R. Wik, “Adiabatic Computing with the 2N-2N2D Logic Family,” in IEEE Symposium on VLSI Circuits, 1994.

8. D. Maksimovic, V.G. Oklobdzija, B. Nikolic, and K.W. Current, “Clocked CMOS Adiabatic Logic with Integrated Single-Phase Power-Clock Supply: Experimental Results,” in ISLPED Proceedings, 1997, pp. 323–327.

9. V.G. Oklobdzija, D. Maksimovic, and F. Lin, “Pass-Transistor Adiabatic Logic Using Single Power-Clock Supply,” IEEE TCAS II: Analog and Digital Signal Processing, vol. 44, 1997, pp. 842–846.

10. Y. Moon and D.-K. Jeong, “An Efficient Charge Recovery Logic Circuit,” IEEE Journal of Solid-State Circuits, vol. 31, 1996, pp. 514–522.

11. K.T. Lau and F. Liu, “Improved Adiabatic Pseudo-Domino Logic,” Electronics Letters, vol. 33, 1997, pp. 1982–1983.

12. J. Lim, K. Kwon, and S.-I. Chae, “Reversible Energy Recovery Logic Circuit Without Non-Adiabatic Energy Loss,” Electronics Letters, vol. 34, 1998, pp. 344–345.

13. M.C. Knapp, P.J. Kindlmann, and M.C. Papaefthymiou, “Implementing and Evaluating Adiabatic Arithmetic Units,” in CICC Proceedings, 1996, pp. 115–118.

14. R.C. Merkle, “Reversible Electronic Logic Using Switches,” Nanotechnology, vol. 4, 1993, pp. 21–40.

15. J. Lim, D.G. Kim, and S.I. Chae, “A 16-bit Carry-Lookahead Adder Using Reversible Energy Recovery Logic for Ultra-Low-Energy Systems,” IEEE J. Solid-State Circuits, vol. 34, 1999, pp. 898–903.

16. W.C. Athas, N. Tzartzanis, L.J. Svensson, and L. Peterson, “A Low-Power Microprocessor Based on Resonant Energy,” IEEE Journal of Solid-State Circuits, vol. 32, 1997, pp. 1693–1701.

17. W.C. Athas, N. Tzartzanis, W. Mao, R. Lal, K. Chong, L. Peterson, and M. Bolotski, “Clock-Powered CMOS VLSI Graphics Processor for Embedded Display Controller Application,” in ISSCC Proceedings, 2000, pp. 296–297.

18. R.E. Bryant, “Graph-Based Algorithms for Boolean Function Manipulation,” IEEE Transactions on Computers, vol. C-35, 1986, pp. 677–691.

19. D. Maksimovic and V.G. Oklobdzija, “Integrated Power Clock Generators for Low-Energy Logic,” in IEEE Power Electronic Specialists Conference Proceedings, 1995, pp. 61–67.

20. W. Athas, L. Svensson, and N. Tzartzanis, “A Resonant Signal Driver for Two-Phase, Almost Non-overlapping Clocks,” in ISCAS Proceedings, 1996, pp. 129–132.

21. P. Zhou, J.C. Czilli, G.A. Jullien, and W.C. Miller, “Current Input TSPC Latch for High Speed, Complex Switching Trees,” in ISCAS Proceedings, 1994, pp. 335–338.

22. G.A. Jullien, W.C. Miller, R. Grondin, L. Del Pup, S.S. Bizzan, and D. Zhang, “Dynamic Computational Blocks for Bit-Level Systolic Array,” IEEE Journal of Solid-State Circuits, vol. 29, 1994, pp. 14–22.

23. D. Suvakovic and C.A.T. Salama, “A Pipelined Multiply-Accumulate Unit Design for Energy Recovery DSP Systems,” in ISCAS Proceedings, 2000.

24. S. Kim and M.C. Papaefthymiou, “True Single-Phase Adiabatic Circuitry,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 9, 2001, pp. 52–63.

25. K.W. Martin, Digital Integrated Circuit Design, New York: Oxford University Press, 2000.

26. C.S. Wallace, “A Suggestion for a Fast Multiplier,” IEEE Transactions on Computers, vol. EC13, 1964, pp. 14–17.

27. L. Dadda, “Some Schemes for Parallel Multipliers,” Alta Frequenza, vol. 34, 1965, pp. 349–356.

28. A.D. Booth, “A Signed Binary Multiplication Technique,” Quarterly J. Mechan. Appl. Math., vol. IV, 1951.

29. O.L. MacSorley, “High-Speed Arithmetic in Binary Computations,” IRE Proc., vol. 49, 1961, pp. 67–91.

30. D. Villeger and V.G. Oklobdzija, “Analysis of Booth Encoding Efficiency in Parallel Multipliers Using Compressors for Reduction of Partial Products,” in The Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, 1993, pp. 781–784.

31. G. Bewick, “Fast Multiplication: Algorithms and Implementation,” Ph.D. Thesis, Stanford University, 1994.

32. M. Izumikava et al., “A 0.25-µm CMOS 0.9-V 100-MHz DSP Core,” IEEE J. Solid-State Circuits, vol. 32, 1997, pp. 52–61.

33. E.E. Swartzlander, Jr., “Parallel Counters,” IEEE Transactions on Computers, vol. C-22, 1973, pp. 1021–1024.

34. P.J. Song and G. De Micheli, “Circuit and Architecture Trade-offs for High-Speed Multiplication,” IEEE J. Solid-State Circuits, vol. 26, 1991, pp. 1184–1198.

35. Z. Wang, G.A. Jullien, W.C. Miller, J. Wang, and S.S. Bizzan, “Fast Adders Using Enhanced Multiple-Output Domino Logic,” IEEE J. Solid-State Circuits, vol. 32, 1997, pp. 206–214.

36. T.K. Callaway and E.E. Swartzlander, “Optimizing Arithmetic Elements for Signal Processing,” in Workshop on VLSI Signal Processing Proceedings, 1992, pp. 91–100.

Dusan Suvakovic received his B.S., M.S. and M.A.Sc. degrees in Electrical Engineering from the University of Novi Sad, Yugoslavia in 1988, University of Belgrade, Yugoslavia in 1992 and University of Toronto in 1998, respectively. He is currently working towards the completion of his Ph.D. thesis at the University of Toronto. His research interests are in the area of low energy DSP design as well as low-power, high-speed digital circuits. From 1988 to 1995, he was a research associate at M. Pupin Institute, Belgrade and a design engineer at Perle Systems, Markham Ontario and Mark IV Industries, Mississauga Ontario. In December 2001, he joined Bell Laboratories, Lucent Technologies, Murray Hill NJ as a member of technical staff.

C. Andre T. Salama received the B.A.Sc. (Hons.), M.A.Sc. and Ph.D. degrees, all in Electrical Engineering, from the University of British Columbia in 1961, 1962 and 1966 respectively.

From 1962 to 1963 he served as a Research Assistant at the University of California, Berkeley. From 1966 to 1967 he was employed at Bell Northern Research, Ottawa, as a Member of Scientific Staff working in the area of integrated circuit design. Since 1967 he has been on the staff of the Department of Electrical and Computer Engineering, University of Toronto, where he held the J.M. Ham Chair in Microelectronics from 1987 to 1997. In 1992, he was appointed to his present position of University Professor for scholarly achievements and preeminence in the field of microelectronics. In 1989–90, he was awarded the ITAC/NSERC Research Fellowship in information technology. In 1994, he was awarded the Canada Council I.W. Killam Memorial Prize in Engineering for outstanding career contributions to the field of microelectronics. In 2000, he received the IEEE Millennium Medal.

He was associate editor of the IEEE Transactions on Circuits and Systems in 1986–88 and a member of the International Electron Devices Meeting (IEDM) Technical Program Committee in 1980–82, 1987–89 and 1996–98. He was the chair of the Solid State Devices Subcommittee for IEDM in 1998 and is a member of the editorial board of Solid State Electronics, the Analog IC and Signal Processing Journal and the Technical Program Committee of the International Symposium on Power Semiconductor Devices and ICs (ISPSD). He chaired the technical program committee of ISPSD in 1996 and was the general chair for the conference in 1999.

Dr. Salama is the Scientific Director of Micronet, a network of centres of excellence focussing on microelectronics research and funded by the Canadian Government. He is also a principal investigator for Communications and Information Technology Ontario, a centre of excellence funded by the Province of Ontario.

He has published extensively in technical journals, is the holder of eleven patents and has served as a consultant to the semiconductor industry in Canada and the U.S. His research interests include the design and fabrication of semiconductor devices and integrated circuits with emphasis on deep submicron devices as well as circuits and systems for high speed, low power signal processing applications.

Dr. Salama is a Fellow of the Institute of Electrical and Electronics Engineers, a Fellow of the Royal Society of Canada, a member of the Association of Professional Engineers of Ontario, the Electrochemical Society and the Innovation Management Association of Canada.

