
Circuits Syst Signal Process (2014) 33:1019–1034
DOI 10.1007/s00034-013-9688-y

Designing Dynamic Carry Skip Adders: Analysis and Comparison

Raffaele De Rose · Marco Lanuzza · Fabio Frustaci · Sohan Purohit

Received: 30 November 2012 / Revised: 26 September 2013 / Published online: 31 October 2013
© Springer Science+Business Media New York 2013

Abstract Addition is an important operation that significantly impacts the performance of almost every data processing system. Due to their importance and popularity, addition algorithms and their corresponding circuit implementations have consistently received attention in research circles over the years. One of the most popular implementations for long adders is the carry skip adder. In this paper, we present the design space exploration for a variety of carry skip adder implementations. More specifically, the paper focuses on the implementation of these adders using traditional as well as novel dynamic circuit design styles. 8-, 16-, 32- and 64-bit adders were implemented using traditional domino, footless domino, and data driven dynamic logic (D3L) in the STMicroelectronics 45 nm 1 V CMOS process. In order to further exploit the advantages of the domino and D3L approaches, a new hybrid methodology combining both strategies was implemented and is presented in this work. The adders were analyzed for energy-delay trade-offs at different process corners. They were also examined for their sensitivity to process and supply voltage variations. Comparative simulation results reveal that the full D3L adder ensures a better energy-delay product over all process corners (up to 34 % and 25 % lower than the domino and hybrid implementations, respectively, at the typical corner), while showing process and supply voltage variability similar to that of the other considered carry skip adder configurations.

Keywords Carry skip adders · Dynamic circuits

R. De Rose · M. Lanuzza (corresponding author) · F. Frustaci
Department of Informatics, Modeling, Electronics and System Engineering, University of Calabria, Via P. Bucci 42C, 87036 Arcavacata Di Rende, CS, Italy
e-mail: [email protected]

R. De Rose
e-mail: [email protected]

F. Frustaci
e-mail: [email protected]

S. Purohit
Intel Corporation, Austin, TX 78746, USA
e-mail: [email protected]

1 Introduction

Addition forms an essential operation in any digital system and can significantly impact the performance of the overall system [20]. Addition also forms the backbone of other arithmetic circuits such as multipliers, dividers, comparators, etc. Therefore, high-speed adders can be considered core elements of modern digital signal processors (DSPs) and multimedia processors and, consequently, they critically influence the performance-power profiles of these systems.

Since carry propagation is a major speed-limiting factor, the design of fast carry chains has always garnered great interest from researchers working on high-performance arithmetic circuits and systems [8, 22]. Among several possible addition schemes, Manchester Carry Chain (MCC)-based circuits have become very popular due to their simplicity and efficiency [22]. In the last few years, several efficient MCC-based addition circuits have been proposed in the literature [2, 5, 12]. In particular, the one introduced in [2] allows low-complexity, area-efficient, and high-performance adder implementations.

While speed is crucial, the advent of mobile computing has put a lot of emphasis on reducing the power consumption of digital systems. DSP and multimedia processing systems form part of these mobile computing solutions and, therefore, they are subjected to stringent power constraints. Consequently, arithmetic processing systems and, hence, addition circuits need to reduce their power consumption. As demonstrated in [24], the chosen logic design style, together with the adopted transistor sizing criterion, can significantly affect the energy dissipation. In order to design high-speed adders, dynamic domino logic is usually exploited [22]. As is well known, the correct functionality of a dynamic domino circuit depends on the appropriate design of the system clock distribution tree. However, due to its high switching activity, the clock distribution tree contributes significantly to the total power budget of the system, sometimes accounting for up to 40 % of the total system power [10]. In order to limit the power consumption, while still retaining the speed advantage of traditional domino circuits, the Data Driven Dynamic Logic (D3L) was proposed [18]. This logic design style allows circuits to operate in the precharge-evaluate fashion like conventional dynamic circuits, without the need for a clock signal to generate the precharge and evaluation sequence. As a matter of fact, these circuits make use of input signal vectors to generate the precharge and evaluation patterns. Therefore, D3L circuits retain the speed advantages shown by conventional dynamic circuits, while avoiding the extra power consumption associated with the clock tree. However, in dynamic circuits with long precharge propagation paths, the energy advantages of D3L are typically obtained at the expense of a non-negligible speed penalty [6, 7, 14, 16].

In this work, an extensive analysis of the impact of different design styles on the design of a fast carry skip adder is presented. This adder topology has been chosen as a case study since it is one of the most popular addition implementation strategies in applications where balancing speed and energy consumption is critical [13]. Compared to the faster Carry-Look-Ahead (CLA) approach, the carry skip adder approach has been shown to achieve competitive performance with considerably lower energy dissipation [13]. In this work, detailed evaluations of four different transistor level designs for n-bit (where n = 8, 16, 32, 64) carry skip adders are reported. All circuits exploit the carry-skip chain (CSC) proposed in [2] to speed up the carry propagation and, consequently, improve the overall adder performance. The four designs were implemented using standard domino, footless domino, D3L, and dynamic hybrid (standard domino + D3L) logic design styles, respectively, and they were laid out in the STMicroelectronics 45 nm 1 V CMOS technology. In particular, in this paper, we expand the work reported in [4] by characterizing the considered adder structures post-layout. As additional analysis, post-layout comparative characterizations were performed considering different process corners and the effects of random process variability and power supply variations.

The rest of the paper is organized as follows. Section 2 presents an overview of the carry-skip adder architecture considered for this study. Section 3 presents the transistor level designs of the four adders, followed by detailed post-layout simulation results and analysis in Sect. 4. Finally, the main results of the work are summarized in Sect. 5.

2 The Carry-Skip Adder Architecture

As illustrated in Fig. 1, the generic n-bit carry skip adder uses four basic logic blocks: the carry Propagate Block (PB), the carry Generate Block (GB), the carry propagation block (which contains the Skip Logic) and the Sum Block (SB). The PBs and the GBs calculate the ith carry propagate ($P_i = A_i \oplus B_i$) and carry generate ($G_i = A_i B_i$) signals, respectively. The carry propagation block is based on the MCC circuit, which uses the ith carry propagate and generate signals to generate the output carry bits ($C_{i+1} = G_i + P_i C_i$). Finally, the SBs produce the final sum bits ($S_i = P_i \oplus C_i$).

The use of MCCs allows very simple and efficient carry propagation. As depicted in Fig. 1, the number of carry signals produced by a single MCC block is typically limited to four [17]. The motivation behind this can be explained with reference to Fig. 2, which shows the basic architecture of a 4-bit MCC dynamic standard domino implementation. As shown in Fig. 2, by limiting the size of the smallest basic block to 4 bits, the maximum height of the NMOS transistor pull-down stack is reduced to six transistors, thereby limiting the body-effect-induced rise in transistor threshold voltages [17]. This approach also limits the propagation delay through the pass transistors, which is a quadratic function of the number of bits in the block [22]. The critical path delay of a single 4-bit MCC block also includes the delay of the intermediate buffers inserted between two consecutive MCC blocks. Furthermore, the first stage of the MCC is redundant since it is only used to generate the complement of the input carry signal. The improved MCC circuits proposed in [5] and in [12] speed up the carry propagation, but they still exhibit the area and delay overhead due to the intermediate buffers and redundant input stages of the basic 4-bit MCC block. The solution proposed in [2], called carry-skip chain (CSC), eliminates the redundant stages and intermediate buffers, thus resulting in an area-efficient high-performance MCC circuit implementation. The CSC also incorporates efficient carry-skip speed-up logic [2]. In this work, the CSC approach has been adopted to design fast n-bit carry-skip adders.

Fig. 1 n-Bit carry skip adder using cascaded 4-bit blocks

Fig. 2 4-Bit Manchester carry chain in standard domino

3 Transistor Level Carry Skip Adder Designs

For this study, four different transistor level carry skip adder designs were implemented by exploiting standard domino, footless domino, D3L and dynamic hybrid logic design styles. This section details the different transistor level implementations.

3.1 Standard Domino Carry Skip Adder

Figure 3(a–b) shows the standard domino implementations for the PB and the GB basic sub-circuits, respectively. In both these circuits, the evaluating NMOS transistors were equally sized in order to make the corresponding pull-down network equivalent to a single 0.12 µm wide NMOS device. Since all stages in a domino circuit precharge simultaneously, and due to the presence of a single device in the precharge path, PMOS devices with a channel width of 0.16 µm were used in the precharge pull-up network. This approach allows the capacitive loading presented to the clock tree to be reduced, thus lowering the power consumption of the clock distribution network. As shown in Fig. 3(c), the SBs used for computing the final sum bits were implemented using static 2-input XOR gates. In the XOR gate, pull-down and pull-up networks (PDNs and PUNs, respectively) were both sized to be equivalent to the corresponding pull-down and pull-up devices of the output inverter.

Fig. 3 (a) 2-Input standard domino XOR; (b) 2-input standard domino AND; (c) 2-input static XOR

Fig. 4 Implementation of (a) basic 4-bit carry-skip chain (CSC) block and (b) carry-skip signal generator using standard domino logic (the highlighted NMOS transistors are within the critical evaluation path)

The 4-bit CSC-based MCC, implemented by using standard CMOS dynamic domino logic, is shown in Fig. 4(a). The output node exploits two carry-skip pull-down transistors controlled by the skip signal and the input carry, respectively. As illustrated in Fig. 4(b), the skip signal (sk) is generated by the logical AND of all carry propagate signals in the block (i.e., $sk = P_i P_{i+1} P_{i+2} P_{i+3}$). Note that the carry-skip pull-down not only speeds up the generation of the final carry, but it also restores the signal strength at this node. This eliminates the need for intermediate buffers between CSC blocks. Moreover, it is worth noting that NMOS transistors in the PDNs of both circuits were sized by using a progressive transistor sizing approach with a tapering factor of 1.5 [17].
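The logic equations of Sect. 2 and the skip signal defined above can be checked with a small behavioural model. The following Python sketch is only a bit-level illustration, not a model of the transistor-level circuits: it assumes the propagate signal is the XOR of the operand bits (as suggested by the XOR gate of Fig. 3(a)), and the function names `csc_block` and `carry_skip_add` are hypothetical.

```python
def csc_block(a_bits, b_bits, c_in):
    """Behavioural model of one 4-bit block; bit lists are LSB first."""
    p = [a ^ b for a, b in zip(a_bits, b_bits)]  # propagate: P_i = A_i XOR B_i (assumed)
    g = [a & b for a, b in zip(a_bits, b_bits)]  # generate:  G_i = A_i AND B_i
    carries = [c_in]
    for p_i, g_i in zip(p, g):
        # Manchester chain recurrence: C_{i+1} = G_i + P_i * C_i
        carries.append(g_i | (p_i & carries[-1]))
    sums = [p_i ^ c_i for p_i, c_i in zip(p, carries)]  # S_i = P_i XOR C_i
    sk = int(all(p))  # skip signal: AND of the four propagate signals
    # The skip path (sk AND c_in) is logically redundant with the chain output;
    # in the circuit it only shortens the critical path.
    c_out = carries[-1] | (sk & c_in)
    return sums, c_out


def carry_skip_add(a, b, n=8, c_in=0):
    """Add two n-bit integers with cascaded 4-bit blocks; returns (sum, carry_out)."""
    if n % 4 != 0:
        raise ValueError("n must be a multiple of 4")
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    total, carry = 0, c_in
    for k in range(0, n, 4):
        sums, carry = csc_block(a_bits[k:k + 4], b_bits[k:k + 4], carry)
        for j, s in enumerate(sums):
            total |= s << (k + j)
    return total, carry


if __name__ == "__main__":
    print(carry_skip_add(200, 100, n=8))  # 300 = 256 + 44, so (44, 1)
```

Because the skip path never changes the logical value of the block carry, the model reduces to ordinary binary addition, which makes it easy to verify exhaustively for small widths.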

3.2 Footless Domino Carry Skip Adder

As shown in Fig. 5(a), the speed of the standard domino carry skip adder can be improved by implementing the 4-bit CSC block with the footless domino logic approach. Indeed, in this case, only a single NMOS pull-down transistor is used at each node of the circuit. Similarly, the AND gate used to calculate the skip signal can be realized in footless domino logic (Fig. 5b). Note that NMOS and PMOS transistors were sized as in the standard domino design.

Fig. 5 Implementation of (a) basic 4-bit carry-skip chain (CSC) block and (b) carry-skip signal generator using footless domino logic

In terms of transistor count, each footless domino 4-bit CSC block saves five transistors as compared to the standard domino design. This is a significant reduction in the transistor count of the carry chain, which results in an implementation that is more efficient in speed, energy and area. More importantly, the use of footless domino logic in the CSC blocks reduces the system clock load capacitance, leading to lower dynamic power consumption of the clock distribution system.

Despite its speed advantages, footless domino logic has a severe drawback: removing the foot transistor may result in high static power dissipation due to short-circuit paths during the precharge phase [9]. In order to avoid this effect, a delay element (e.g. a transmission gate) should be inserted in the system clock distribution tree with the aim of delaying the clock signal between two cascaded blocks [9].

3.3 Carry Skip Adder Implementation Using Data Driven Dynamic Logic

The D3L design methodology allows designers to minimize or even eliminate the clock distribution network required by conventional dynamic circuits, thus leading to significantly lower energy consumption [18]. In fact, instead of the traditional clocked precharge, D3L makes use of a combination of input signals to achieve the alternate precharge and evaluation phases [18]. This helps to retain the speed advantage traditionally associated with dynamic circuits, without the extra cost of clock-related power consumption and clock tree design.

Figure 6(a–b) illustrates the D3L implementation of the generic PB and GB sub-circuits, where the clocked precharge transistors are replaced by PMOS precharge transistors driven by the input signals. Note that the clock signal used in domino dynamic logic to coordinate the gate operations is eliminated at the expense of higher capacitance of the input lines. The transistor level schematic of the 4-bit D3L CSC block is shown in Fig. 7(a–b). In this case, the precharge phase is driven by propagate and generate signals. It is worth noting that the removal of the clocked NMOS transistor in each pull-down path reduces the evaluation path delay of the generic block, as in the footless domino version. One impact of the D3L approach from a timing perspective is that the precharge phase is no longer simultaneous for all the blocks. Consequently, there exists a precharge propagation path through cascaded blocks. From a design perspective, this implies that all the precharge networks in D3L circuits have to be properly sized to prevent the precharge propagation from becoming critical for the system [15]. However, in the designed D3L carry skip adder, the precharge path is much shorter than the evaluation path, thus allowing all the PMOS precharge transistors to be sized with a channel width of 0.16 µm without incurring any additional delay penalties.

Fig. 6 Implementation of (a) 2-input XOR gate and (b) 2-input AND gate using D3L

Fig. 7 Implementation of (a) basic 4-bit carry-skip chain (CSC) block and (b) carry-skip signal generator using D3L


Fig. 8 (a) Simulation setup, (b) clock distribution tree

This reduces the loading on the intermediate signals used for achieving precharge and thus reduces the overall power consumption of the circuit. As for the previous implementations, NMOS evaluating transistors were sized with a progressive sizing methodology. In terms of transistor count, a single 4-bit D3L CSC block uses four more transistors than the 4-bit footless domino CSC block.

3.4 Carry Skip Adder Implementation Using Hybrid Dynamic Design

In this design, a hybrid approach comprising a combination of standard domino and D3L circuits was adopted. In fact, the circuits generating both propagate and generate signals were designed in standard domino logic (Fig. 3(a–b)), whereas the CSC blocks were implemented using the D3L design style (Fig. 7(a–b)). In this way, the input capacitance of the circuit is reduced with respect to the full D3L implementation, while still retaining the power advantages of the D3L carry propagation chain.

4 Results

The designs discussed above were laid out in the commercial STMicroelectronics 45 nm 1 V CMOS technology. Figure 8(a) shows the simulation setup used in this work to evaluate the compared circuits. Input buffers were placed between ideal voltage sources and data/clock inputs in order to provide realistic input signals. The characterization phase was performed by loading each output signal with a 0.8 fF capacitance, which corresponds to the input capacitance of a D-type flip-flop in the referred technology.

In order to correctly distribute the clock signal to the dynamic gates used in the circuits, a two-level clock buffer tree, depicted in Fig. 8(b) (where $C_L^{GB}$, $C_L^{PG}$ and $C_L^{CSC}$ are the clock load capacitances due to the clocked transistors within the PB, the GB and the CSC sub-circuits, respectively), was designed. The logical effort method [21] was used for sizing the inverter chains in the clock buffer network. In particular, Eq. (1) was applied, where $C_L$ represents the clock load capacitance of the critical inverter chain and $C_{gl}$ denotes the gate capacitance of the inverter at the lth stage of the clock buffer, with l = 1, ..., 4. We have

$$\frac{C_{g2}}{C_{g1}} = \frac{3\,C_{g3}}{C_{g2}} = \frac{C_{g4}}{C_{g3}} = \frac{C_L}{C_{g4}} \tag{1}$$
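As an illustration of how Eq. (1) fixes the buffer sizes, the Python sketch below solves the four equal ratios for a given first-stage gate capacitance and clock load. It assumes that the factor 3 in Eq. (1) is a branching factor at the second stage; the function name `size_clock_chain` and the numerical values are hypothetical and are not taken from the designed clock tree.

```python
def size_clock_chain(c_g1, c_load, branching=3.0):
    """Solve Eq. (1) for the stage effort f and the gate capacitances C_g1..C_g4.

    Multiplying the four equal ratios gives f**4 = branching * c_load / c_g1.
    """
    if c_g1 <= 0 or c_load <= 0:
        raise ValueError("capacitances must be positive")
    f = (branching * c_load / c_g1) ** 0.25
    c_g2 = f * c_g1              # C_g2 / C_g1 = f
    c_g3 = f * c_g2 / branching  # 3 * C_g3 / C_g2 = f
    c_g4 = f * c_g3              # C_g4 / C_g3 = f
    return f, [c_g1, c_g2, c_g3, c_g4]


if __name__ == "__main__":
    # Hypothetical values in fF, chosen only to give round numbers.
    effort, caps = size_clock_chain(c_g1=1.0, c_load=27.0)
    print(f"stage effort = {effort:.3f}")
    print("gate capacitances [fF]:", [round(c, 3) for c in caps])
```

With these example values the stage effort is 3 and the last stage drives the load with the same ratio, $C_L / C_{g4} = 27 / 9 = 3$, as Eq. (1) requires.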


Table 1 Delay comparison of various adder implementations

| Adder width | Implementation | tbuff-data [ps] | tin-sum [ps] | tin-cout [ps] | tpre [ps] |
|---|---|---|---|---|---|
| 8 bit | Standard domino | 14.9 | 278.6 | 138.8 | 85.3 |
| 8 bit | Footless domino | 14.9 | 273.4 | 138 | 101.9 |
| 8 bit | Hybrid | 14.9 | 278.7 | 141.3 | 147 |
| 8 bit | D3L | 15.6 | 278.7 | 140.8 | 116.3 |
| 16 bit | Standard domino | 14.9 | 377.5 | 249.1 | 89.7 |
| 16 bit | Footless domino | 14.9 | 371.8 | 249.4 | 106.4 |
| 16 bit | Hybrid | 14.9 | 376.4 | 253.1 | 150.1 |
| 16 bit | D3L | 15.6 | 377.1 | 252.2 | 116.3 |
| 32 bit | Standard domino | 14.9 | 599.7 | 472.7 | 95.2 |
| 32 bit | Footless domino | 14.9 | 595.8 | 473.1 | 111.3 |
| 32 bit | Hybrid | 14.9 | 599.2 | 476.4 | 154 |
| 32 bit | D3L | 15.6 | 601.8 | 477.5 | 116.3 |
| 64 bit | Standard domino | 14.9 | 1038.3 | 912.8 | 101.9 |
| 64 bit | Footless domino | 14.9 | 1037.1 | 915.9 | 118 |
| 64 bit | Hybrid | 14.9 | 1039.9 | 917.8 | 158.8 |
| 64 bit | D3L | 15.6 | 1046.2 | 922.9 | 116.3 |

It is worth mentioning that the standard domino circuit required the most complex clock distribution tree, whereas $C_L^{CSC}$ is reduced for the footless domino circuit and completely eliminated for the hybrid adder. As previously mentioned, the full D3L circuit does not require a clock distribution tree.

All four designs were analyzed for delay, energy and energy-delay product (EDP). The analysis was repeated for adder widths of 8, 16, 32 and 64 bits in order to investigate how the performance of the various design styles depends on adder width. The Cadence Spectre simulator was employed to evaluate the speed performance, whereas the average energy dissipation was measured using Synopsys NanoSim. Comparative post-layout delay and energy results, obtained for the typical NMOS, typical PMOS process corner at 27 °C, are reported in Tables 1 and 2, respectively.

Table 1 clearly shows that the D3L circuit implementation exhibits a slightly slower evaluation due to the increased loading capacitance on intermediate data signals in the PB, GB and CSC sub-circuits. Additionally, it shows a higher precharge delay than the standard and footless domino implementations. Furthermore, due to the higher capacitances on the input lines, the D3L circuit also shows approximately 4.7 % higher data input buffer delay compared to the other carry skip adders. This indicates that the D3L circuit effectively presents a larger load to the circuits driving it. In order to correctly evaluate the energy consumption of the different implementations, the energy dissipation of the data input buffers (Ebuff-data), the clock distribution network (Eclk) and the carry skip adder (ECSA) have been separately measured. Comparative energy results, shown in Table 2, demonstrate that, although the D3L implementation exhibits a slightly higher data input buffer energy consumption due to the higher capacitances on its input lines, it always shows the lowest total energy dissipation owing to the complete removal of the clock distribution network. This confirms the full D3L approach as the implementation strategy of choice when designing high-speed circuits for energy-constrained environments.


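Table 2 Energy comparison of various adder implementations

| Adder width | Implementation | Ebuff-data [fJ] | Eclk [fJ] | ECSA [fJ] | ETOT [fJ] |
|---|---|---|---|---|---|
| 8 bit | Standard domino | 38.6 | 58.9 | 113.2 | 210.6 |
| 8 bit | Footless domino | 38.6 | 43.4 | 110.4 | 192.3 |
| 8 bit | Hybrid | 38.6 | 33.6 | 113.8 | 185.9 |
| 8 bit | D3L | 38.7 | – | 100.4 | 139.0 |
| 16 bit | Standard domino | 77.1 | 98.0 | 227.6 | 402.6 |
| 16 bit | Footless domino | 77.1 | 72.6 | 223.1 | 372.8 |
| 16 bit | Hybrid | 77.1 | 56.1 | 228.6 | 361.7 |
| 16 bit | D3L | 77.4 | – | 203.6 | 281.0 |
| 32 bit | Standard domino | 154.2 | 173.1 | 462.9 | 790.2 |
| 32 bit | Footless domino | 154.2 | 126.9 | 454.5 | 735.6 |
| 32 bit | Hybrid | 154.2 | 98.5 | 462.8 | 715.5 |
| 32 bit | D3L | 154.7 | – | 416.2 | 570.9 |
| 64 bit | Standard domino | 308.4 | 322.7 | 957.7 | 1588.8 |
| 64 bit | Footless domino | 308.4 | 234.5 | 942.4 | 1485.3 |
| 64 bit | Hybrid | 308.4 | 181.4 | 959.2 | 1449.0 |
| 64 bit | D3L | 309.4 | – | 860.4 | 1169.8 |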

Fig. 9 Energy-delay-product curves for the different adder implementations

The Energy-Delay Product (EDP) value, calculated as the product of the worst-case DATA-INPUT → SUM-OUTPUT delay and the total dissipated energy, gives a quantitative measure of the speed-energy trade-off and, hence, represents a particularly useful quality metric when designing circuits that must balance high speed and low power. The EDP values for the different circuit implementations are summarized and plotted in Fig. 9.

Due to its better energy results, the D3L adder always exhibits the lowest EDP values, thus achieving the best speed-energy trade-off.
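The EDP definition can be applied directly to the tabulated results. The Python sketch below recomputes the EDP of the 8-bit adders at the typical corner from the tin-sum delays of Table 1 and the ETOT values of Table 2; the helper names `edp` and `edp_reduction` are hypothetical and serve only this illustration.

```python
# (t_in-sum [ps], E_TOT [fJ]) of the 8-bit adders at the typical corner,
# copied from Tables 1 and 2.
ADDERS_8BIT_TT = {
    "Standard domino": (278.6, 210.6),
    "Footless domino": (273.4, 192.3),
    "Hybrid": (278.7, 185.9),
    "D3L": (278.7, 139.0),
}


def edp(delay_ps, energy_fj):
    """Energy-delay product in J*s from a delay in ps and an energy in fJ."""
    return (delay_ps * 1e-12) * (energy_fj * 1e-15)


def edp_reduction(reference, candidate):
    """Fractional EDP reduction of `candidate` with respect to `reference`."""
    return 1.0 - edp(*ADDERS_8BIT_TT[candidate]) / edp(*ADDERS_8BIT_TT[reference])


if __name__ == "__main__":
    for name, (delay_ps, energy_fj) in ADDERS_8BIT_TT.items():
        print(f"{name:16s} EDP = {edp(delay_ps, energy_fj) / 1e-23:.2f}e-23 J*s")
    for ref in ("Standard domino", "Hybrid"):
        print(f"D3L vs {ref}: {100 * edp_reduction(ref, 'D3L'):.0f} % lower EDP")
```

For the 8-bit case this gives roughly 34 % and 25 % lower EDP for the D3L adder with respect to the standard domino and hybrid designs, in line with the figures quoted in the abstract.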

4.1 Corner Analysis

Post-layout corner simulation results are summarized in Table 3. The TT corner involves typical NMOS and PMOS transistors. The FF corner is related to fast NMOS and PMOS transistors, whereas the SS corner considers slow NMOS and PMOS devices.

Table 3 Corner simulation results for adder implementations for varying no. of bits (delay in ps, energy in fJ, EDP in 10^-23 J·s)

| Adder width | Implementation | TT delay | TT energy | TT EDP | FF delay | FF energy | FF EDP | SS delay | SS energy | SS EDP |
|---|---|---|---|---|---|---|---|---|---|---|
| 8 bit | Standard domino | 278.6 | 210.6 | 5.87 | 246.4 | 292.9 | 7.22 | 388.0 | 187.3 | 7.27 |
| 8 bit | Footless domino | 273.4 | 192.3 | 5.26 | 237.0 | 271.6 | 6.44 | 373.9 | 167.5 | 6.26 |
| 8 bit | Hybrid | 278.7 | 185.9 | 5.18 | 241.9 | 264.2 | 6.39 | 381.8 | 162.8 | 6.22 |
| 8 bit | D3L | 278.7 | 139.0 | 3.87 | 237.3 | 192.2 | 4.56 | 373.3 | 122.7 | 4.58 |
| 16 bit | Standard domino | 377.5 | 402.6 | 15.20 | 322.1 | 568.5 | 18.31 | 507.7 | 355.7 | 18.06 |
| 16 bit | Footless domino | 371.8 | 372.8 | 13.86 | 315.5 | 533.1 | 16.82 | 498.9 | 321.9 | 16.06 |
| 16 bit | Hybrid | 376.4 | 361.7 | 13.62 | 318.8 | 520.4 | 16.59 | 505.8 | 314.6 | 15.91 |
| 16 bit | D3L | 377.1 | 281.0 | 10.60 | 315.6 | 390.4 | 12.32 | 498.7 | 247.0 | 12.32 |
| 32 bit | Standard domino | 599.7 | 790.2 | 47.39 | 497.6 | 1121.8 | 55.81 | 789.9 | 693.6 | 54.79 |
| 32 bit | Footless domino | 595.8 | 735.6 | 43.83 | 490.4 | 1058.4 | 51.90 | 780.4 | 632.7 | 49.38 |
| 32 bit | Hybrid | 599.2 | 715.5 | 42.87 | 494.4 | 1037.5 | 51.30 | 788.4 | 621.6 | 49.00 |
| 32 bit | D3L | 601.8 | 570.9 | 34.35 | 493.1 | 792.8 | 39.09 | 784.3 | 500.4 | 39.25 |
| 64 bit | Standard domino | 1038.3 | 1588.8 | 164.97 | 842.4 | 2267.0 | 190.97 | 1339.8 | 1393.6 | 186.71 |
| 64 bit | Footless domino | 1037.1 | 1485.3 | 154.04 | 835.1 | 2143.2 | 178.97 | 1334.1 | 1277.3 | 170.40 |
| 64 bit | Hybrid | 1039.9 | 1449.0 | 150.69 | 842.7 | 2102.8 | 177.21 | 1349.0 | 1257.4 | 169.63 |
| 64 bit | D3L | 1046.2 | 1169.8 | 122.39 | 843.0 | 1622.5 | 136.77 | 1348.2 | 1027.2 | 138.48 |

The obtained results show that, at every process corner, the D3L implementation achieves significantly lower energy dissipation than the other implementations. Moreover, the full D3L implementation provides 34 %, 30 %, 28 % and 26 % improvement in EDP over the 8-, 16-, 32- and 64-bit standard domino implementations, respectively, in the typical corner. It is easy to observe that the D3L circuit always offers the lowest energy-delay product values over all process corners.

Table 4 Comparison of delay variability of different adder implementations for varying no. of bits

| Adder width | Implementation | μ [ps] | σ [ps] | σ/μ [%] |
|---|---|---|---|---|
| 8 bit | Standard domino | 278.6 | 25.0 | 9.0 |
| 8 bit | Footless domino | 273.4 | 26.7 | 9.8 |
| 8 bit | Hybrid | 278.7 | 26.3 | 9.4 |
| 8 bit | D3L | 278.7 | 27.0 | 9.7 |
| 16 bit | Standard domino | 377.5 | 34.5 | 9.1 |
| 16 bit | Footless domino | 371.8 | 37.2 | 10.0 |
| 16 bit | Hybrid | 376.4 | 37.6 | 10.0 |
| 16 bit | D3L | 377.1 | 35.7 | 9.5 |
| 32 bit | Standard domino | 599.7 | 51.4 | 8.6 |
| 32 bit | Footless domino | 595.8 | 53.0 | 8.9 |
| 32 bit | Hybrid | 599.2 | 56.2 | 9.4 |
| 32 bit | D3L | 601.8 | 57.4 | 9.5 |
| 64 bit | Standard domino | 1038.3 | 90.1 | 8.7 |
| 64 bit | Footless domino | 1037.1 | 90.3 | 8.7 |
| 64 bit | Hybrid | 1039.9 | 96.4 | 9.3 |
| 64 bit | D3L | 1046.2 | 94.1 | 9.0 |

4.2 Delay Variability Analysis

The effects of random process variations on the delay variability of all the considered circuit implementations were evaluated through 1000-sample Monte Carlo simulations performed in the Cadence environment. The results of the delay variability analysis are shown in Table 4, which reports the mean (μ), standard deviation (σ) and the relative variation (σ/μ) of the delay. The standard domino circuit always presents the lowest delay variability (i.e., the lowest σ and σ/μ values) for all the evaluated adder sizes. The footless domino circuit exhibits the highest delay variability for adder sizes of 8 and 16 bits, while for the cases of 32 and 64 bits, the D3L circuit and the hybrid circuit show the highest σ/μ values, respectively. Overall, it can be observed that all the adder implementations show similar delay variability (around 8–10 %) in the presence of mismatch variations. Therefore, the delay variability seems to be relatively independent of the different circuit implementations of the carry skip adder.
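The ranking described above can be reproduced from the μ and σ columns of Table 4. The following Python sketch recomputes σ/μ at full precision and reports, for each adder width, the implementations with the lowest and highest relative variation; the names `TABLE4`, `relative_variation` and `extremes` are hypothetical.

```python
# (mu [ps], sigma [ps]) per adder width and implementation, copied from Table 4.
TABLE4 = {
    8: {"Standard domino": (278.6, 25.0), "Footless domino": (273.4, 26.7),
        "Hybrid": (278.7, 26.3), "D3L": (278.7, 27.0)},
    16: {"Standard domino": (377.5, 34.5), "Footless domino": (371.8, 37.2),
         "Hybrid": (376.4, 37.6), "D3L": (377.1, 35.7)},
    32: {"Standard domino": (599.7, 51.4), "Footless domino": (595.8, 53.0),
         "Hybrid": (599.2, 56.2), "D3L": (601.8, 57.4)},
    64: {"Standard domino": (1038.3, 90.1), "Footless domino": (1037.1, 90.3),
         "Hybrid": (1039.9, 96.4), "D3L": (1046.2, 94.1)},
}


def relative_variation(mu, sigma):
    """Relative delay variation sigma/mu in percent."""
    return 100.0 * sigma / mu


def extremes(width):
    """Return the (lowest, highest) sigma/mu implementations for a given width."""
    ratios = {name: relative_variation(mu, sigma)
              for name, (mu, sigma) in TABLE4[width].items()}
    return min(ratios, key=ratios.get), max(ratios, key=ratios.get)


if __name__ == "__main__":
    for width in TABLE4:
        lowest, highest = extremes(width)
        print(f"{width:2d} bit: lowest = {lowest}, highest = {highest}")
```

Several σ/μ entries coincide once rounded to one decimal place (for example at 16 and 64 bits), so the unrounded ratios are what separate the implementations in those cases.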

4.3 Power Supply Variability Analysis

As is well known, varying the value of $V_{DD}$ is an effective strategy to enhance circuit performance ($V_{DD}$ increasing) or, alternatively, to reduce power dissipation ($V_{DD}$ decreasing). However, the power supply in a digital circuit can also experience an unwanted variation $\Delta V_{DD}$ from its nominal value due to noise-related effects [11, 23]. Unlike a designed $V_{DD}$ modification, which has a predetermined and desired impact on circuit performance, an unwanted $V_{DD}$ fluctuation can easily cause a random variation of circuit performance. In practical designs, the variation $\Delta V_{DD}$ can be kept down by proper sizing of the supply distribution rails and by the use of decoupling capacitors, which is primarily a design effort at the full-chip level. However, as a general design practice, individual VLSI circuits are usually designed to tolerate a 5 %–10 % supply voltage variation [3]. Since the ratio $\Delta V_{DD}/V_{DD}$ turns out to be small, the impact of the power supply uncertainty on adder delay can be evaluated by the delay sensitivity with respect to $V_{DD}$ [1]:

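$$S^{\tau}_{V_{DD}} = \lim_{\Delta V_{DD} \to 0} \frac{\Delta\tau / \tau}{\Delta V_{DD} / V_{DD}} = \frac{V_{DD}}{\tau} \cdot \frac{d\tau}{dV_{DD}} \tag{2}$$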


The above figure of merit can be theoretically derived by correctly modeling the carry-skip chain of the analyzed adders. Let us consider the example of the standard domino implementation of Fig. 4(a). The critical path of the chain is the series of the highlighted NMOS devices through which the carry-out node can be discharged. For modeling purposes, this path can be considered as formed by a single equivalent NMOS transistor. According to the alpha-power law proposed in [19], the delay τ of the CSC is inversely proportional to the saturation current $I_{DS}$ of the equivalent NMOS, which can be expressed as

$$I_{DS} = K \cdot (v_{gs} - v_{th})^{\alpha} \tag{3}$$

where K is a technology-dependent constant, proportional to the transistor aspect ratio; $v_{gs}$ is the gate-to-source voltage; $v_{th}$ is the transistor threshold voltage; α is a technology-dependent coefficient ranging from 1 (deep sub-micrometer transistors) to 2 (long-channel transistors). The transient response of the output voltage $v_0$ can be formulated as follows:

$$v_0(t) = V_{DD} - \frac{1}{C} \int_0^t K \cdot (v_{gs} - v_{th})^{\alpha} \, dt \tag{4}$$

In order to express the delay τ introduced by the equivalent NMOS device, the latter can be replaced with its equivalent resistance, whose mean value can be calculated as

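$$R_{equ} = \frac{1}{-V_{DD}/2} \int_{V_{DD}}^{V_{DD}/2} \frac{v_0}{K \cdot (v_{in} - v_{th})^{\alpha}} \, dv_0 = \frac{3}{4} \cdot \frac{V_{DD}}{K \cdot \left(\frac{V_{DD}}{2} - v_{th}\right)^{\alpha}} \tag{5}$$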

where it has been assumed that $v_{in} = V_{DD}/2$ (the CSC delay τ is defined as the difference between the time when $v_0$ crosses the value $V_{DD}/2$ and the time when $v_{in}$ reaches the same value). Therefore, the simple circuit of Fig. 4 can be treated as an RC network, whose delay is given by [13]

$$\tau = \ln 2 \cdot R_{equ} \cdot C = 0.52 \cdot \frac{V_{DD}}{K \cdot \left(\frac{V_{DD}}{2} - v_{th}\right)^{\alpha}} \cdot C \tag{6}$$
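The 3/4 factor of Eq. (5) and the 0.52 coefficient of Eq. (6) can be cross-checked numerically. The Python sketch below averages $v_0 / I_{DS}$ over the output swing from $V_{DD}$ to $V_{DD}/2$ with a midpoint rule and compares the result with the closed form; the parameter values ($v_{th}$, K, C) are hypothetical placeholders and are not taken from the 45 nm technology used in the paper.

```python
import math


def r_equ_numeric(vdd, vth, k, alpha, steps=1000):
    """Mean of v0 / I_DS while v0 falls from VDD to VDD/2, with v_in = VDD/2 (Eq. 5)."""
    i_ds = k * (vdd / 2 - vth) ** alpha  # Eq. (3) evaluated at v_gs = v_in = VDD/2
    lo, hi = vdd / 2, vdd
    dv = (hi - lo) / steps
    # Midpoint-rule integration of v0 / I_DS over [VDD/2, VDD]
    integral = sum((lo + (j + 0.5) * dv) / i_ds for j in range(steps)) * dv
    return integral / (hi - lo)


def r_equ_closed(vdd, vth, k, alpha):
    """Closed form of Eq. (5)."""
    return 0.75 * vdd / (k * (vdd / 2 - vth) ** alpha)


def delay(vdd, vth, k, alpha, c):
    """RC delay of Eq. (6): tau = ln(2) * R_equ * C."""
    return math.log(2) * r_equ_closed(vdd, vth, k, alpha) * c


if __name__ == "__main__":
    # Hypothetical parameters, for illustration only.
    params = dict(vdd=1.0, vth=0.25, k=1e-4, alpha=1.24)
    print("numeric R_equ :", r_equ_numeric(**params))
    print("closed R_equ  :", r_equ_closed(**params))
    print("ln(2) * 3/4   :", round(math.log(2) * 0.75, 2))
    print("delay [s]     :", delay(c=1e-15, **params))
```

Since the integrand is linear in $v_0$, the midpoint rule is exact up to rounding, so the two resistance values agree, and $\ln 2 \cdot 3/4 \approx 0.52$ reproduces the coefficient of Eq. (6).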

By substituting Eq. (6) into Eq. (2), and after some simplifications, the following expression of the delay sensitivity with respect to supply variations can be obtained:

$$S^{\tau}_{V_{DD}} = 1 - \frac{0.5 \cdot \alpha \cdot \frac{V_{DD}}{v_{th}}}{0.5 \cdot \frac{V_{DD}}{v_{th}} - 1} \tag{7}$$
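The step from Eq. (6) to Eq. (7) can be verified numerically by applying the definition of Eq. (2) to the delay model with a central finite difference. In the Python sketch below, α = 1.24 is the value reported in Eq. (8), whereas the threshold voltage of 0.25 V is a hypothetical value chosen only for illustration; K and C cancel in the sensitivity and are set to 1.

```python
def delay(vdd, vth, alpha, k=1.0, c=1.0):
    """Delay model of Eq. (6); k and c cancel out in the sensitivity."""
    return 0.52 * vdd / (k * (vdd / 2 - vth) ** alpha) * c


def sensitivity_closed(vdd, vth, alpha):
    """Closed-form delay sensitivity of Eq. (7)."""
    x = 0.5 * vdd / vth
    return 1.0 - alpha * x / (x - 1.0)


def sensitivity_numeric(vdd, vth, alpha, h=1e-6):
    """Eq. (2) evaluated with a central finite difference on Eq. (6)."""
    dtau = (delay(vdd + h, vth, alpha) - delay(vdd - h, vth, alpha)) / (2 * h)
    return vdd / delay(vdd, vth, alpha) * dtau


if __name__ == "__main__":
    ALPHA = 1.24  # from Eq. (8)
    VTH = 0.25    # hypothetical threshold voltage in volts
    for vdd in (0.8, 1.0, 1.2):
        print(f"VDD = {vdd:.1f} V: closed = {sensitivity_closed(vdd, VTH, ALPHA):.4f}, "
              f"numeric = {sensitivity_numeric(vdd, VTH, ALPHA):.4f}")
```

The sensitivity is negative (the delay decreases as $V_{DD}$ increases) and grows in magnitude as $V_{DD}$ approaches $2 v_{th}$, which is qualitatively consistent with the larger delay variations observed at lower supply voltages.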

As explained in [1], the value of α, which is typical of the adopted technology, can be found using the simulator: first, the values $I_{DS1}$ and $I_{DS2}$ of the NMOS saturation current have to be obtained for $v_{gs}$ equal to the $V_{DD,max}$ allowed by the technology and for $v_{gs}$ equal to $(2/3)V_{DD,max}$, respectively; afterwards, the value of α can be calculated from Eq. (3), as follows:

$$\alpha = \frac{\log(I_{DS1}/I_{DS2})}{\log\left[\dfrac{V_{DD,max} - v_{th}}{\frac{2V_{DD,max}}{3} - v_{th}}\right]} \cong 1.24 \tag{8}$$
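The extraction procedure of Eq. (8) can be sketched in a few lines of Python. Since the simulated currents are not reported, the example below generates two synthetic current values from Eq. (3) with a known α and checks that Eq. (8) recovers it; the values of K, $v_{th}$ and $V_{DD,max}$ are hypothetical, and `extract_alpha` is an illustrative helper name.

```python
import math


def alpha_power_current(v_gs, vth, k, alpha):
    """Saturation current according to the alpha-power law of Eq. (3)."""
    return k * (v_gs - vth) ** alpha


def extract_alpha(i_ds1, i_ds2, vdd_max, vth):
    """Eq. (8): alpha from the currents at v_gs = VDD,max and v_gs = (2/3) VDD,max."""
    numerator = math.log(i_ds1 / i_ds2)
    denominator = math.log((vdd_max - vth) / (2.0 * vdd_max / 3.0 - vth))
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical device parameters; in practice the two currents come from simulation.
    K, VTH, VDD_MAX, ALPHA_TRUE = 2e-4, 0.25, 1.1, 1.24
    i1 = alpha_power_current(VDD_MAX, VTH, K, ALPHA_TRUE)
    i2 = alpha_power_current(2.0 * VDD_MAX / 3.0, VTH, K, ALPHA_TRUE)
    print(f"extracted alpha = {extract_alpha(i1, i2, VDD_MAX, VTH):.4f}")
```

The constant K cancels in the current ratio, so only the threshold voltage and the two bias points need to be known in order to apply Eq. (8).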


Fig. 10 Power supply sensitivity of the various carry skip adder implementations

Fig. 11 Delay under VDD variations for 64 bit carry skip adder (delay values are normalized to their nominal values, i.e. VDD = 1 V)

It is worth noting that Eq. (7) calculates the delay sensitivity for a single CSC, whereas the critical path of the whole carry skip adder is composed of a single GB, a series of cascaded CSCs and a single SB. However, the above modeling is general for any dynamic gate and it can also be successfully applied to the GB and SB sub-circuits. Since the sensitivity of each component of the adder's critical path can be modeled as in Eq. (7), it follows that the same Eq. (7) expresses the delay sensitivity with respect to VDD of the whole adder. Moreover, since Eq. (7) holds for any kind of dynamic gate, it is valid not only for the standard domino implementation of the adder, but also for all the other analyzed implementations. Simulation results (i.e. dotted lines), reported in Fig. 10, confirm the latter consideration. The delay sensitivity with respect to VDD is practically the same for all the adder implementations. Moreover, the proposed model is in very good agreement with simulation results, showing a maximum error of only 9 %, which occurs when VDD = 0.8 V.

For the sake of completeness, in Fig. 11, the adders' delay is plotted for the 64-bit configuration and for different values of the power supply. Delay values have been normalized to their nominal values (i.e., VDD = 1 V). As predicted by the above sensitivity analysis, the different adder implementations have roughly the same percentage delay variation with varying VDD. In particular, higher delay variations can be observed for lower VDD values. As a matter of fact, for VDD = 0.8 V (−20 %), the delay increases by about 45 %. On the contrary, for VDD = 1.2 V (+20 %), the delay reduction is smaller (around −20 %). Figure 12 depicts the variation of the dissipated energy with varying VDD. Once again, energy values have been normalized to their nominal values (i.e., VDD = 1 V). It is worth noting that the hybrid implementation is the most stable design with varying VDD. Indeed, for VDD = 1.2 V (+20 %), the hybrid design shows an energy dissipation increase of +46 %, whereas the standard and footless domino implementations undergo a larger energy overhead (+57 %).

Fig. 12 Energy under VDD variations for 64 bit carry skip adder (energy values are normalized to their nominal values, i.e. VDD = 1 V)

5 Conclusion

This paper has presented four different implementations of carry skip adders using different dynamic circuit design styles. For each implementation, adders of lengths varying from 8 bits to 64 bits have been investigated for energy, delay, and EDP over all process corners, as well as for robustness against random process and power supply variations. Moreover, a thorough study of the power supply sensitivity of these adders has also been presented. Comparative simulation results reveal that the full D3L adder features the best energy-delay trade-off among all the considered implementations at the different process corners, while showing a roughly similar sensitivity to random process and power supply variations with respect to the other designed adder circuits.

References

1. M. Alioto, G. Palumbo, Impact of Supply Voltage Variations on Full Adder Delay: Analysis and Comparison. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 14(12), 1322–1335 (2006)

2. A.A. Amin, Area-efficient high-speed carry chain. Electron. Lett. 43(23), 1258–1260 (2007)

3. A. Chandrakasan, W. Bowhill, F. Fox, Design of High Performance Microprocessor Circuits (IEEE Press, New York, 2001)

4. R. De Rose, M. Lanuzza, F. Frustaci, Design and Evaluation of High-Speed Energy-Aware Carry Skip Adders, in Proc. of IEEE 22nd International Conference on Microelectronics (2010), pp. 124–127

5. H. Eriksson, P. Larsson-Edefors, A. Alvandopour, A 2.8 ns 30 mW/MHz area-efficient 32-b Manchester carry-bypass adder, in Proc. of ISCAS 2001 (2001), pp. 84–87

6. F. Frustaci, M. Lanuzza, P. Zicari, S. Perri, P. Corsonello, Low-Power Split-Path Data-Driven Dynamic Logic (SPD3L). IET Circuits Devices Syst. 3(6), 303–312 (2009)

7. F. Frustaci, M. Lanuzza, P. Zicari, S. Perri, P. Corsonello, Designing High Speed Adders in Power-Constrained Environments. IEEE Trans. Circuits Syst. II 56(2), 172–176 (2009)

8. S. Hauck, M. Hosler, T.W. Fry, High-performance carry chains for FPGA’s. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2(8), 138–147 (2000)

9. P. Hofstee et al., 1 GHz single-issue 64b PowerPC processor, in Proc. of IEEE Int. Solid-State Circuits Conf. (2000), pp. 92–93

10. H. Kawaguchi, T. Sakurai, A reduced clock-swing flip-flop (RCSFF) for 63 % power reduction. IEEE J. Solid-State Circuits 33(5), 807–811 (1998)

11. M. Lanuzza, R. De Rose, F. Frustaci, S. Perri, P. Corsonello, Comparative analysis of yield optimized pulsed flip-flops. Microelectron. Reliab. 52, 1679–1689 (2012)

12. J.H. Lou, J.B. Kuo, A 1.5 V bootstrapped pass-transistor-based Manchester carry chain circuit suitable for implementing low-voltage carry look-ahead adders. IEEE Trans. Circuits Syst. I, Fundam. Theory Appl. 11(45), 1191–1194 (1998)

13. B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs (Oxford University Press, London, 2000)

14. S. Purohit, M. Lanuzza, S. Perri, P. Corsonello, M. Margala, Design and evaluation of an energy-delay-area efficient datapath for coarse-grain reconfigurable computing systems. J. Low Power Electron. 5(3), 326–338 (2009)

15. S. Purohit, M. Lanuzza, M. Margala, New Performance/Power/Area Efficient Reliable Full Adder Design, in Proc. of the ACM Great Lakes Symposium on VLSI, GLSVLSI (2009), pp. 493–498

16. S. Purohit, M. Lanuzza, M. Margala, Design Space Exploration of Split-Path Data Driven Dynamic Full Adder. J. Low Power Electron. 6(4), 469–481 (2010)

17. M. Rabaey, A. Chandrakasan, B. Nikolic, Digital Integrated Circuits (Prentice-Hall, New York, 2002)

18. R. Rafati, S.M. Fakhraie, K.C. Smith, Lower-Power Data-Driven Dynamic Logic (D3L), in Proc. of IEEE International Symposium on Circuits and Systems, ISCAS 2000 (2000), pp. 752–755

19. T. Sakurai, A.R. Newton, Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas. IEEE J. Solid-State Circuits 25, 584–594 (1990)

20. R. Shalem, E. John, L.K. John, A novel low power energy recovery full adder cell, in Proc. of the 9th Great Lakes Symposium on VLSI (1999), pp. 380–383

21. I. Sutherland, R. Sproull, D. Harris, Logical Effort (Morgan Kaufmann, San Mateo, 1999)

22. N. Weste, K. Eshraghian, Principles of CMOS VLSI Design (Addison-Wesley, Reading, 1993)

23. S.S. Yoon, S.R. Yoon, S.W. Kim, C. Kim, Charge-Sharing-Problem Reduced Split-Path Domino Logic, in Proc. of VLSI Design (2004), pp. 201–205

24. R. Zlatanovic, B. Nikolic, Power-Performance Optimization for Custom Digital Circuits, in Proc. of PATMOS (2005), pp. 404–414
