Date post: | 27-May-2017 |
Upload: | umamahesh-mavuluri |
Design of Low Power ALU using Area Efficient Carry Select Adder
CHAPTER 1
OVERVIEW OF THE PROJECT
1.1 Introduction
Designing low-power VLSI circuits with small area and high speed has
become a main concern for digital designers. Low-power VLSI systems are
in high demand because of the fast-growing technology in mobile
communications and computation. Battery technology does not advance at the
same rate as microelectronics technology, so only a limited amount of power is
available to mobile systems. Designers therefore face several simultaneous constraints:
high speed, high throughput, small silicon area, and low power
consumption. Building low-power, high-performance adder cells is thus of great
interest [1]-[5].
Over the past few decades, the electronics industry has experienced an
unprecedented spurt in growth, thanks to the use of integrated circuits in computing,
telecommunications and consumer electronics. We have come a long way from the
single transistor era in 1958 to the present day ULSI (Ultra Large Scale Integration)
systems with more than 50 million transistors in a single chip [6].
As the performance of processors has increased, the demand for high
speed arithmetic blocks has also increased. With clock frequencies approaching 1
GHz, arithmetic blocks must keep pace with the continued demand for more
computational power. The purpose of this thesis is to present methods of
implementing the area and power efficient carry select adder.
To reduce the power and area required by complex computations,
transistor sizes have been shrunk into the deep sub-micron region [7], an effort
handled predominantly by process engineering.
Several adder designs have been proposed to reduce power
consumption. Logic minimization results not only in better system throughput but also
in lower power consumption. For low power results it is always
Department of ECE, MRITS 1
advisable to use CMOS technology, in which the power dissipation is a complex
function of the gate delays, clock frequency, process parameters, circuit topology and
structure, and the input vectors applied. Once the processing and structural parameters
have been fixed, the power dissipation is dominated by the switching
activity (toggle count) of the circuit. The dynamic power is given by
$P = \frac{1}{2} \cdot C_{load} \cdot \frac{V_{dd}^{2}}{T_{cycle}} \cdot E(switching)$,
where $C_{load}$ is the load capacitance of the gate, $T_{cycle}$ is the clock cycle time,
$E(switching)$ is the expected number of signal transitions per cycle and $V_{dd}$ is the
supply voltage [8].
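The dynamic power expression can be evaluated numerically, as in the minimal sketch below. The function name `dynamic_power` and the example values (10 fF load, 1 ns cycle, 0.5 transitions per cycle) are illustrative assumptions, not figures from this thesis; only the 1.2 V supply matches the simulation setup used later.

```python
def dynamic_power(c_load, v_dd, t_cycle, e_switching):
    """Dynamic power in watts: 0.5 * Cload * (Vdd^2 / Tcycle) * E(switching)."""
    return 0.5 * c_load * (v_dd ** 2 / t_cycle) * e_switching


# Assumed example values: 10 fF load, 1.2 V supply, 1 ns cycle, 0.5 transitions/cycle.
p_nominal = dynamic_power(10e-15, 1.2, 1e-9, 0.5)
# Halving the supply voltage cuts the dynamic power to one quarter.
p_half_vdd = dynamic_power(10e-15, 0.6, 1e-9, 0.5)
print(f"nominal: {p_nominal * 1e6:.2f} uW, half Vdd: {p_half_vdd * 1e6:.2f} uW")
```

The quadratic dependence on the supply voltage is why lowering Vdd is usually the first lever considered, while the E(switching) term is the one addressed by logic-level changes such as those studied here.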
1.2 Objective
To design a high-speed Arithmetic Logic Unit (ALU) using an efficient
carry select adder.
The adder is the most important block in an ALU; the speed of the ALU is limited by the
adder because the carry has to propagate through many bit positions. In digital adders, to speed
up the operation the Ripple Carry Adder (RCA) is modified into the Carry Select Adder (CSLA). To achieve more
speed, the CSLA is replaced by the SQRT CSLA. The CSLA is used in many computational
systems to alleviate the problem of carry propagation delay by independently
generating multiple carries and then selecting a carry to generate the sum [9]-[10].
However, the CSLA is not area efficient because it uses multiple pairs of Ripple
Carry Adders (RCAs) to generate partial sums and carries for carry inputs Cin=0 and Cin=1; the
final sum and carry are then selected by multiplexers (mux) [11]-[15].
1.2.1 Existing SQRT Carry Select Adder
In general, the complete SQRT CSLA is divided into blocks. The block
size and the number of blocks depend on the width of the SQRT CSLA, according to the
SQRT technique. From the second block onwards, each block contains three
levels: the first level is a ripple carry adder with input carry zero, the second level is a ripple
carry adder with input carry one, and the third level is a multiplexer that
selects one of the ripple carry adder outputs according to the previous block's carry. The
disadvantage of the SQRT CSLA is its larger area, as it uses two levels of RCAs.
For better area efficiency [13]-[15], a Binary to Excess-1 Converter (BEC)
is used in place of the RCA with Cin=1 in the regular CSLA. To replace an n-bit RCA,
an (n+1)-bit BEC is required.
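Why the BEC needs n+1 bits can be checked with a small behavioural model. The sketch below is an assumption-level illustration in Python (integer arithmetic stands in for the gate-level RCA and BEC; the names `ripple_add` and `bec` are hypothetical): the BEC receives the n sum bits plus the carry-out of the Cin=0 adder and adds one, which reproduces the Cin=1 result.

```python
def ripple_add(a, b, cin, n):
    """Behavioural n-bit adder: return (n-bit sum, carry-out)."""
    total = a + b + cin
    return total & ((1 << n) - 1), total >> n


def bec(value, width):
    """Binary to Excess-1 Converter: value + 1 on `width` bits."""
    return (value + 1) & ((1 << width) - 1)


n = 4
for a in range(1 << n):
    for b in range(1 << n):
        s0, c0 = ripple_add(a, b, 0, n)      # first-level RCA, Cin = 0
        s1, c1 = ripple_add(a, b, 1, n)      # the RCA that the BEC replaces
        packed = (c0 << n) | s0              # (n+1)-bit word {carry, sum}
        assert bec(packed, n + 1) == (c1 << n) | s1
print("an (n+1)-bit BEC reproduces the Cin=1 result for all 4-bit operands")
```

The carry-out must pass through the converter as well, because adding one to an all-ones sum changes the carry; that is the extra bit.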
Although the BEC technique reduces area and power [16], the reduction is not
considerable, and the design is not suitable for sub-threshold level
modifications.
The drawback of this logic structure is that it does not reduce the area and
power to a satisfactory level. There is still scope to reduce the delay. In order to
reduce the power and area, a new logic structure is proposed.
1.2.2 Proposed SQRT Carry Select Adder
The 16-bit SQRT CSLA using a BEC in its second level requires 792
transistors. The proposed logic offers scope to reduce the number of transistors, and with it
the area and power dissipation. Implementing
a 16-bit SQRT CSLA with the proposed logic requires 736
transistors.
The proposed replacement for the second-level RCA is Special
Hardware using Multiplexers (SHM). The inputs are applied to the first-level RCA.
The output of the RCA is applied to the second-level SHM and then to the third-level
multiplexer. The third-level multiplexer selects either the RCA output or the SHM output
according to the previous carry.
Using the proposed logic, an 8-bit Arithmetic Logic Unit (ALU) is designed that
performs arithmetic operations (addition, subtraction, increment and decrement)
and logical operations (AND, OR, XOR and XNOR).
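The eight operations listed for the ALU can be summarised with a behavioural model. The sketch below is only an illustration: the operation names used as dictionary keys and the function `alu` are assumptions, and the thesis's actual opcode encoding and gate-level structure are not reproduced. Subtraction and decrement are written as additions to reflect that an ALU of this kind routes them through the adder.

```python
MASK = 0xFF  # 8-bit datapath

# Hypothetical operation table; each entry maps two 8-bit operands to a result.
OPS = {
    "ADD": lambda a, b: a + b,
    "SUB": lambda a, b: a + ((~b) & MASK) + 1,  # two's-complement subtraction via the adder
    "INC": lambda a, b: a + 1,
    "DEC": lambda a, b: a + MASK,               # adding 0xFF equals -1 modulo 256
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "XNOR": lambda a, b: ~(a ^ b),
}


def alu(op, a, b=0):
    """Return the 8-bit result of the named operation (carry-out discarded)."""
    return OPS[op](a & MASK, b & MASK) & MASK


for op in OPS:
    print(op, format(alu(op, 0b11001010, 0b00001111), "08b"))
```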
1.3 Tools used
SOFTWARE:
Logic Editor: DSCH2.6c
Layout Editor: Microwind 2.6a
The performance of the proposed design is analyzed. The simulations are
performed in a 120 nm (0.12 µm) technology using the simulation tool Microwind2, with a power supply of
1.2 V and a nominal temperature of 27°C, to extract the critical path delay and power
consumption.
1.4 Thesis outline
The next chapter presents the literature survey: different types of adders,
different low power design techniques used in the design of a low power ALU, and
different logic styles are analyzed.
Existing designs, such as an 8-bit ALU using ripple carry adders, are presented in
Chapter 3 along with the implementation of the SQRT CSLA using the BEC technique.
Chapter 4 describes the implementation of the proposed SQRT CSLA and the proposed
ALU using the efficient carry select adder.
Comparative analysis and results are shown in Chapter 5.
Conclusion and future scope are discussed in Chapter 6.
CHAPTER 2
LITERATURE SURVEY
2.1 Introduction
In nearly all digital IC designs today, addition is one of the most
essential and frequent operations. Instruction sets for DSPs and general purpose
processors include at least one type of addition. Other instructions such as subtraction
and multiplication employ addition in their operations, and their underlying
hardware is similar if not identical to addition hardware. Often, an adder or multiple
adders will be in the critical path of the design, hence the performance of a design will
often be limited by the performance of its adders. When looking at other attributes
of a chip, such as area or power, the designer will find that the hardware for addition
is a large contributor to these as well. It is therefore beneficial to choose the
correct adder to implement in a design because of the many factors it affects in the
overall chip. In this chapter we begin with the basic building blocks used for addition,
then go through different algorithms and name their advantages and disadvantages.
2.2 Basic Adder Blocks
2.2.1 Half Adder
The half adder is an example of a simple, functional digital circuit built from
two logic gates. The half adder adds two one-bit binary numbers (A and B). The outputs are
the sum of the two bits (S) and the carry (C). Note how the same two inputs are
directed to two different gates: the inputs to the XOR gate are also the inputs to the
AND gate. The input "wires" to the XOR gate are tied to the input wires of the AND
gate; thus, when voltage is applied to the A input of the XOR gate, the A input to the
AND gate receives the same voltage.
$S = A \oplus B$ 2.1
$C = A \cdot B$ 2.2
Fig.2.1 Half adder
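The half adder's behaviour (an XOR gate for the sum, an AND gate for the carry, both fed by the same two inputs) can be tabulated with a few lines of Python. This is a minimal behavioural sketch; the function name `half_adder` is an assumption made for illustration.

```python
def half_adder(a, b):
    """Return (sum, carry) for two one-bit inputs: XOR gives S, AND gives C."""
    return a ^ b, a & b


print("A B | S C")
for a in (0, 1):
    for b in (0, 1):
        s, c = half_adder(a, b)
        # The two-bit word {C, S} is the arithmetic sum of the inputs.
        assert 2 * c + s == a + b
        print(a, b, "|", s, c)
```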
2.2.2 Full Adder
In electronics, an adder is a digital circuit that performs addition of numbers.
Full adders are fundamental units in various circuits, especially in circuits used for
performing arithmetic operations such as compressors, comparators, parity
checkers, and arithmetic logic units. The full adder takes into account a
carry input so that multiple adders can be used to add larger numbers. To remove
ambiguity between the input and output carry lines, the carry in is labeled Cin while
the carry out is labeled Cout. The full-adder circuit adds three one-bit binary
numbers (Cin, A, B) and outputs two one-bit binary numbers, a sum (S) and a carry
(Cout). The full adder is usually a component in a cascade of adders, which add 8-,
16-, 32-bit, etc. binary numbers. The carry input of a full adder comes from the
carry output of the circuit "above" it in the cascade, and its carry output
is fed to the full adder "below" it in the cascade. Hence, a full
adder is a digital circuit that performs an addition operation on three binary digits
and produces a sum and a carry value, which are both binary digits. It can
be combined with other full adders or work on its own.
Fig.2.2 Schematic Symbol of 1-bit full-adder cell (inputs A, B and CIN; outputs S and CO)
The final OR gate before the carry-out output may be replaced by an XOR gate
without altering the resulting logic. This is because the only discrepancy between OR
and XOR gates occurs when both inputs are 1; for the adder shown here, one can
check this is never possible. Using only two types of gates is convenient if one desires
to implement the adder directly using common IC chips.
A full adder can be constructed from two half adders by connecting A and B
to the inputs of one half adder, connecting the sum from that to an input of the second
half adder, connecting Ci to the other input, and ORing the two carry outputs. Equivalently, S
could be made the three-input XOR of A, B, and Ci, and Co could be made the three-input
majority function of A, B, and Ci. The output of the full adder is the two-bit arithmetic
sum of three one-bit numbers.
Figure 2.3 Circuit diagram of 1-bit full-adder cell
$S = A \oplus B \oplus C_i$ 2.3
$C_o = A \cdot B + C_i \cdot (A \oplus B)$ 2.4
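The claims made about the full adder (two half adders plus an OR gate, the OR being replaceable by an XOR, the sum as a three-input XOR, and the carry as a majority function) can all be checked exhaustively over the eight input combinations. The sketch below is a behavioural illustration with assumed function names.

```python
from itertools import product


def half_adder(a, b):
    """Return (sum, carry) of two one-bit inputs."""
    return a ^ b, a & b


def full_adder(a, b, ci):
    """Full adder built from two half adders and an OR gate."""
    s1, c1 = half_adder(a, b)    # first half adder adds A and B
    s, c2 = half_adder(s1, ci)   # second half adder adds the partial sum and Ci
    return s, c1 | c2


for a, b, ci in product((0, 1), repeat=3):
    s1, c1 = half_adder(a, b)
    _, c2 = half_adder(s1, ci)
    s, co = full_adder(a, b, ci)
    assert not (c1 and c2)                       # both carries are never 1 together...
    assert co == c1 ^ c2                         # ...so XOR can replace the final OR
    assert s == a ^ b ^ ci                       # three-input XOR
    assert co == (a & b) | (b & ci) | (a & ci)   # three-input majority
    assert 2 * co + s == a + b + ci              # two-bit arithmetic sum
print("all 8 input combinations verified")
```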
2.2.3 Partial Full Adder
The Partial Full Adder (PFA) is a structure that implements intermediate
signals that can be used in the calculation of the carry bit. It is an extension of the FA
which includes the signals generate (g), kill (k), and propagate (p). When g=1,
carryout will be 1 (generated) regardless of carry-in. When k=1, carryout
will be 0 (killed) regardless of carry-in. When p=1, carryout will equal
carry-in (carry-in will be propagated). Table 2.1 reflects these three additional signals,
with a comment on the carryout bit in an additional column. Equations 2.5 − 2.7 are
the Boolean equations for generate, kill, and propagate, respectively. It should be
noted that for the propagate signal, the XOR function can also be used, since in the
case of a = b = 1 the generate signal will assert that carryout is 1. The Boolean equations
for the sum and carryout can now be written as functions of g, p, or k, as shown by
Equations 2.8 and 2.9. Figure 2.4 shows a circuit for creating the generate, propagate,
and sum signals. It is a partial full adder because it does not calculate the carryout
signal directly; rather, it creates the signals needed to calculate the carryout signal.
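Generate: $g_i = a_i \cdot b_i$ 2.5
Kill: $k_i = \overline{a_i} \cdot \overline{b_i}$ 2.6
Propagate: $p_i = a_i + b_i$ 2.7
Sum: $sum_i = p_i \oplus \text{carry-in}_i$ (with the XOR form $p_i = a_i \oplus b_i$) 2.8
Carry-out: $\text{carry-out}_{i+1} = a_i \cdot b_i + b_i \cdot \text{carry-in}_i + a_i \cdot \text{carry-in}_i = g_i + p_i \cdot \text{carry-in}_i$ 2.9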
Figure 2.4 Generation of GENERATE, PROPAGATE and SUM
Inputs             Outputs
Carry-in  a  b     Carry-out  Sum  g  k  p  Carry-status
0         0  0     0          0    0  1  0  delete
0         0  1     0          1    0  0  1  propagate
0         1  0     0          1    0  0  1  propagate
0         1  1     1          0    1  0  1  generate/propagate
1         0  0     0          1    0  1  0  delete
1         0  1     1          0    0  0  1  propagate
1         1  0     1          0    0  0  1  propagate
1         1  1     1          1    1  0  1  generate/propagate
Table 2.1 Truth table of partial full adder
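The rows of Table 2.1 can be regenerated from the generate, kill and propagate definitions, which also confirms that the OR and XOR forms of propagate give the same carry-out. This is a behavioural sketch with an assumed function name, `partial_full_adder`; the sum is computed directly as a three-input XOR so that it is independent of which propagate form is used.

```python
from itertools import product


def partial_full_adder(a, b, cin):
    """Return (g, k, p, sum) for one bit position, with p in its OR form."""
    g = a & b                # generate: carry-out is 1 regardless of carry-in
    k = (1 - a) & (1 - b)    # kill: carry-out is 0 regardless of carry-in
    p = a | b                # propagate (OR form, as in Table 2.1)
    s = a ^ b ^ cin
    return g, k, p, s


print("cin a b | cout sum g k p")
for cin, a, b in product((0, 1), repeat=3):
    g, k, p, s = partial_full_adder(a, b, cin)
    cout = g | (p & cin)
    assert cout == (a & b) | (b & cin) | (a & cin)   # matches the majority form
    assert cout == g | ((a ^ b) & cin)               # XOR propagate gives the same carry
    assert 2 * cout + s == a + b + cin
    print(cin, a, b, "|", cout, s, g, k, p)
```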
2.3 Adder Algorithms
2.3.1 Ripple Carry Adder
The Ripple Carry Adder (RCA) is one of the simplest adders to implement.
This adder takes in two N-bit inputs (where N is a positive integer) and produces (N +
1) output bits (an N-bit sum and a 1-bit carryout). The RCA is built from N full adders
cascaded together, with the carryout bit of one FA tied to the carry-in bit of the next
FA. Figure 2.5 shows the schematic for an N-bit RCA. The input operands are labeled
‘a’ and ‘b’, the carryout of each FA is labeled Cout (which is equivalent to the carry-in
(Cin) of the subsequent FA), and the sum bits are labeled sum. Each sum bit requires
both input operands and Cin before it can be calculated. To estimate the propagation
delay of this adder, we should look at the worst case delay over every possible
combination of inputs. This is also known as the critical path. The most significant
sum bit can only be calculated when the carryout of the previous FA is known. In the
worst case (when all the carry-outs are 1), this carry bit needs to ripple across the
structure from the least significant position to the most significant position. Figure 2.6
has a darkened line indicating the critical path.
Hence, the delay of this implementation of the adder is expressed in Equation
2.10, where tRCAcarry is the delay for the carryout of a FA and tRCAsum is the delay for
the sum of a FA.
$t_{RCAprop} = (N-1) \cdot t_{RCAcarry} + t_{RCAsum}$ 2.10
From Equation 2.10, we can see that the delay is proportional to the length of
the adder. An example of a worst case propagation delay input pattern for a 4-bit
ripple carry adder is where the input operands change from 1111 and 0000 to 1111
and 0001, resulting in a sum changing from 01111 to 10000.
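A bit-level model makes the ripple behaviour and the delay expression concrete. The sketch below is a behavioural illustration: the helper names (`to_bits`, `ripple_carry_add`, `rca_delay`) and the unit delays passed to the delay function are assumptions, not measured values.

```python
def to_bits(value, n):
    """Little-endian list of the n low-order bits of value."""
    return [(value >> i) & 1 for i in range(n)]


def ripple_carry_add(a_bits, b_bits, cin=0):
    """Chain one full adder per bit; return (sum_bits, carry_out)."""
    sum_bits, carry = [], cin
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))   # this carry feeds the next full adder
    return sum_bits, carry


def rca_delay(n, t_carry, t_sum):
    """Worst-case delay of an N-bit RCA: (N - 1) carry delays plus one sum delay."""
    return (n - 1) * t_carry + t_sum


# Worst-case pattern from the text: 1111 + 0001 ripples a carry through every stage.
sum_bits, cout = ripple_carry_add(to_bits(0b1111, 4), to_bits(0b0001, 4))
print("sum bits (LSB first):", sum_bits, "carry-out:", cout)
for n in (4, 8, 16, 32):
    print(n, "bits ->", rca_delay(n, t_carry=1, t_sum=1), "unit delays")
```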
From a VLSI design perspective, this is the easiest adder to implement. One
just needs to design and lay out one FA cell, and then array N of these cells to create
an N-bit RCA. The performance of the one FA cell will largely determine the speed of
the whole RCA. From the critical path in Equation 2.10, minimizing the carryout
delay (tRCAcarry) of the FA will minimize tRCAprop. Various implementations of the
FA cell have been proposed to minimize the carryout delay.
Figure 2.5 Schematic for an N-bit Ripple Carry Adder
Figure 2.6 Critical paths for an N-bit Ripple Carry Adder
2.3.2 Carry Skip Adder
From examination of the RCA, the limiting factor for speed in that adder is the
propagation of the Cout bit. The Carry Skip Adder (CSKA, also known as the Carry
Bypass Adder) addresses this issue by looking at groups of bits and determining
whether each group has a carryout or not. This is accomplished by creating a group
propagate signal (PCSKAgroup) to determine whether the group carry-in (carry-inCSKAgroup)
will propagate across the group to the carryout (carry-outCSKAgroup). To explore the
operation of the whole CSKA, take an N-bit adder and divide it into N/M groups,
where M is the number of bits per group. Each group contains a 2-to-1 multiplexer,
logic to calculate M sum bits, and logic to calculate PCSKAgroup. The select line for the
mux is simply the PCSKAgroup signal, and it chooses between carry-inCSKAgroup and cout4.
To aid the explanation, we refer the reader to Figure 2.7, which shows
the hardware for a group of 4 bits (M=4) in the CSKA. There are four full adders
cascaded together and each FA creates a carryout (cout), a propagate (p) signal, and a
sum (sum not shown). The propagate signal from each FA comes at no extra hardware
cost since it is calculated in the sum logic (the hardware is identical to the sum
hardware for the PFA shown in Figure 2.4). For carry-outCSKAgroup to equal
carry-inCSKAgroup, all of the individual propagates must be asserted (Equations 2.11 and 2.12). If
this is true then carry-inCSKAgroup "skips" past the group of full adders and equals
carry-outCSKAgroup. For the case where PCSKAgroup is 0, at least one of the propagate
signals is 0. This implies that a delete and/or a generate occurred in the group. A
delete signal simply means that the carryout for the group is 0 regardless of the carry-in,
and a generate signal means that the carryout is 1 regardless of the carry-in. This is
advantageous because it implies that the carry-out for the group is not dependent
on the carry-in. No hardware is needed to implement these two signals because the
group carryout signal will reflect one of the three cases (a d, a g or a group p occurred).
The additional hardware to realize the group carryout in Figure 2.7 is
a 4-input AND gate and a 2-to-1 multiplexer (mux). In general, an M-input AND
gate and a 2-to-1 mux are required for a group of bits, in addition to the logic to calculate
the sum bits.
$P_{CSKAgroup} = p_0 \cdot p_1 \cdot p_2 \cdot p_3$ 2.11
$\text{carry-out}_{CSKAgroup} = \text{carry-in}_{CSKAgroup} \cdot P_{CSKAgroup}$ 2.12
In examining the critical path for the CSKA, we are primarily concerned with
whether the carry-in can be propagated ("skipped") across a group or not. Assuming
all input bits come into the adder at the same time, each group can calculate the group
propagate signal (mux select line) simultaneously. Every mux then knows which
signal to pass as the carryout of the group. There are two cases to consider after the
mux select line has been determined. In the first case, carry-inCSKAgroup will propagate
to the carryout. This means PCSKAgroup=1 and the carryout is dependent on the carry-in.
In the second case, the carryout signal of the most significant adder will become
the group carryout. This means PCSKAgroup=0 and the carryout is independent of the
carry-in. If we isolate a particular group (as in Figure 2.7), the second case (signal
cout4) always takes longer because the carryout signal must be calculated through
logic, whereas the first case (carry-inCSKAgroup) requires only a wire to propagate the
signal. Looking at the whole architecture, however, this second case is part of the
critical path for only the first CSKA group. Since the second case is not dependent on
the group carry-in, all the groups in the CSKA can compute the carryout in parallel. If
a group needs its carry-in (PCSKAgroup=1), then it must wait until it arrives after being
calculated from a previous group. In the worst case, a carryout must be calculated in
the first group, and every group afterwards needs to propagate this carryout. When the
final group receives this propagated signal, then it can calculate its sum bits. Figure
2.8 shows a 16-bit CSKA with 4-bit groups and Figure 2.9 shows a darkened line
indicating the critical path of the signals in the 16-bit CSKA.
If we assume a 16-bit CSKA with 4-bit groups, with each group containing a
4-bit RCA for the sum logic, then the worst case propagation delay through this adder
is expressed in Equation 2.13. In this equation, tRCAcarry and tRCAsum are the
delays to calculate the carryout and sum signals of an RCA, respectively. Each group
has 4 bits, so the delay through the first group has 4 RCA carryout delays. This
carryout of the first group potentially propagates through 3 muxes, where one mux
delay is expressed as tmuxdelay. Finally, when the carryout signal reaches the final
group, the sum for this group can be calculated. This is represented by the final two
components of Equation 2.13.
Figure 2.7 One group in a Carry Skip Adder, in this case M=4
Figure 2.8 A 16-bit Carry Skip Adder N=16, M=4
Figure 2.9 Critical path through 16-bit CSKA
$t_{CSKA16} = 4 \cdot t_{RCAcarry} + 3 \cdot t_{muxdelay} + 3 \cdot t_{RCAcarry} + t_{RCAsum}$ 2.13
For Equation 2.13, there are some assumptions about the delay through the
circuit. First, we assume in the first CSKA group that the group propagate signal is
calculated before the carryout of the most significant adder. Thus, the mux for this
first group is waiting for the carryout. For the final CSKA group, we assume that it
takes longer for sum15 to be calculated than for sum16 to be calculated. Once the
carry-in for this last group is known, the delay for sum16 is the delay of the mux; for
sum15 it is a delay of 3*tRCAcarry + tRCAsum (3 ripples through the adder before
the last sum bit can be calculated).
For an N-bit CSKA, the critical path equation is expressed in Equation 2.14. M
represents the number of bits in each group. There are N/M groups in the adder, and
every mux except for the last one is in the critical path. As in Equation
2.13, Equation 2.14 assumes that each group contains a ripple carry adder.
$t_{CSKAN} = M \cdot t_{RCAcarry} + \left(\frac{N}{M} - 1\right) \cdot t_{muxdelay} + (M-1) \cdot t_{RCAcarry} + t_{RCAsum}$ 2.14
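The general delay expression can be evaluated for different group sizes to see the trade-off between rippling inside a group and skipping across groups. The sketch below uses an assumed function name, `cska_delay`, and assumed unit delays for the carry, sum and mux; real values depend on the cell implementation.

```python
def cska_delay(n, m, t_carry, t_sum, t_mux):
    """Worst-case delay of an N-bit carry skip adder built from M-bit RCA groups.

    First group: M carry delays; then (N/M - 1) mux delays; then (M - 1) carry
    delays and one sum delay inside the final group.
    """
    groups = n // m
    return m * t_carry + (groups - 1) * t_mux + (m - 1) * t_carry + t_sum


# Assumed unit delays, for illustration only.
for m in (2, 4, 8):
    print(f"N=16, M={m}:", cska_delay(16, m, t_carry=1, t_sum=1, t_mux=1), "unit delays")
```

With N=16 and M=4 the function reduces to the four terms of the 16-bit case: 4 carry delays, 3 mux delays, 3 more carry delays and one sum delay.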
From a VLSI design perspective, this adder shows improved speed over an
RCA without much area increase. The additional hardware comes from the 2-to-1
mux and group propagate logic in each group, which is about 15% more area. One
drawback to this structure is that its delay is still linearly dependent on the width of
the adder; therefore, for large adders where speed is important, the delay may be
unacceptable. Also, there is a long wire between the groups that carry-outCSKAgroup
needs to travel on. This path begins at the carryout of the first CSKA
group and ends at the carry-in to the final CSKA group. This signal also needs to
travel through ((N/M)-1) muxes, and these will introduce long delays and signal
degradation if pass gate muxes are used. If buffers are required in between these
groups to restore the signal, then the critical path is lengthened. An example
of a worst case delay input pattern for a 16-bit CSKA with 4-bit groups is where the
input operands are 1111111111111000 and 0000000000001000. This forces a
carryout in the first group that skips through the middle two groups and enters the
final group. This carry-in to the final group ripples through to the final sum bit
(sum15). To determine the optimal speed for this adder, one needs to find the delay
through a mux and the carryout delay of a FA. It is one of these two delays that will
dominate the delay of the whole CSKA. For short adders (≤ 16 bits), the tcarryout of
a FA will probably dominate the delay, and for long adders the long wire that skips
through stages and muxes will probably dominate the delay.
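The worst-case operand pattern can be traced with a behavioural model of the carry skip adder. In the sketch below (function name and return format are assumptions), integer addition stands in for each group's RCA, and the group propagate is taken as the AND of the per-bit XOR propagates, which is the form the skip mux needs in order to be correct.

```python
def carry_skip_add(a, b, n=16, m=4, cin=0):
    """Behavioural N-bit carry skip adder with M-bit groups.

    Returns (sum, carry_out, group_propagates), least significant group first.
    """
    mask = (1 << m) - 1
    total, carry, group_propagates = 0, cin, []
    for g in range(n // m):
        ga = (a >> (g * m)) & mask
        gb = (b >> (g * m)) & mask
        group_p = int((ga ^ gb) == mask)           # AND of the per-bit propagates
        raw = ga + gb + carry                      # stand-in for the group's RCA
        total |= (raw & mask) << (g * m)
        carry = carry if group_p else raw >> m     # 2-to-1 mux selected by group_p
        group_propagates.append(group_p)
    return total, carry, group_propagates


a, b = 0b1111111111111000, 0b0000000000001000
total, cout, props = carry_skip_add(a, b)
print(f"sum={total:016b} carry-out={cout} group propagates={props}")
```

For this pattern the first group generates a carry (its propagate is 0) and every later group has its propagate asserted, so the carry is skipped across the middle groups and then ripples inside the last one.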
2.3.3 Carry Look Ahead Adder
From the critical path equations in Sections 2.3.1 and 2.3.2, the delay is
linearly dependent on N, the length of the adder. Equations 2.10
and 2.14 also show that the tcarryout signal contributes largely to the delay. An algorithm that
reduces the time to calculate tcarryout and the linear dependency on N can greatly speed
up the addition operation. Equation 2.9 shows that the carryout can be calculated with
g, p, and carry-in. The signals g and p are not dependent on carry-in, and can be
calculated as soon as the two input operands arrive. Weinberger and Smith invented
the Carry Look Ahead (CLA) Adder [19]. Using Equation 2.9, we can write the
carryout equations for a 4-bit adder. These equations are shown in Equations
2.15−2.18, where ci represents the carryout of the ith position (0 ≤ i ≤ (N − 1)) and gi
and pi are the generate and propagate signals of that position. Substituting each
equation into the next lets every carry be computed
with just the input operands and the initial carry-in (c0). This process of calculating ci by
using only the pi, gi and c0 signals can be continued indefinitely; however, each
subsequent carryout generated in this manner becomes increasingly difficult to implement because
of the large number of high fan-in gates [20].
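$c_1 = g_0 + p_0 \cdot c_0$ 2.15
$c_2 = g_1 + p_1 \cdot c_1 = g_1 + p_1 \cdot g_0 + p_1 \cdot p_0 \cdot c_0$ 2.16
$c_3 = g_2 + p_2 \cdot c_2 = g_2 + p_2 \cdot g_1 + p_2 \cdot p_1 \cdot g_0 + p_2 \cdot p_1 \cdot p_0 \cdot c_0$ 2.17
$c_4 = g_3 + p_3 \cdot c_3 = g_3 + p_3 \cdot g_2 + p_3 \cdot p_2 \cdot g_1 + p_3 \cdot p_2 \cdot p_1 \cdot g_0 + p_3 \cdot p_2 \cdot p_1 \cdot p_0 \cdot c_0$ 2.18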
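The four expanded carry equations can be transcribed directly and compared against ordinary addition. The sketch below is a behavioural check with an assumed function name, `cla_carries`; each carry is written as a flat sum of products of g, p and c0 only, which is what lets hardware evaluate all four in parallel.

```python
def cla_carries(a_bits, b_bits, c0):
    """Return [c1, c2, c3, c4] for 4-bit little-endian operands, from g, p and c0 only."""
    g = [a & b for a, b in zip(a_bits, b_bits)]   # generate per bit
    p = [a | b for a, b in zip(a_bits, b_bits)]   # propagate per bit (OR form)
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
          | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c1, c2, c3, c4]


# Exhaustive check: c(i+1) must equal the carry out of the low (i+1) bits.
for a in range(16):
    for b in range(16):
        for c0 in (0, 1):
            a_bits = [(a >> i) & 1 for i in range(4)]
            b_bits = [(b >> i) & 1 for i in range(4)]
            carries = cla_carries(a_bits, b_bits, c0)
            for i in range(4):
                low = (1 << (i + 1)) - 1
                assert carries[i] == ((a & low) + (b & low) + c0) >> (i + 1)
print("look-ahead carries match ripple carries for all 4-bit inputs")
```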
The CLA adder uses partial full adders as described in Section 2.2.3 to
calculate the generate and propagate signals needed for the carryout equations. Figure
2.10 shows the schematic for a 4-bit CLA Adder. The CLA logic block implements
the logic in Equations 2.15−2.18, and the gate schematic for this block is in Figure
2.11. For a 4-bit CLA adder the 4th carryout signal can also be considered as the 5th
sum bit.
Although it is impractical to have a single level of carry look-ahead logic for
long adders, this can be solved by adding another level of carry look-ahead logic. To
achieve this, each adder block requires two additional signals: a group generate and a
group propagate. The equations for these two signals, assuming adder block sizes of
4 bits, are shown in Equations 2.19 and 2.20. A group generate occurs if a carry is
generated within the adder block, and a group propagate occurs if the carry-in to the
adder block will be propagated to the carryout. Figure 2.11 shows the gate schematic
of the two additional signals.
Group Generate = $g_3 + p_3 \cdot g_2 + p_3 \cdot p_2 \cdot g_1 + p_3 \cdot p_2 \cdot p_1 \cdot g_0$ 2.19
Group Propagate = $p_3 \cdot p_2 \cdot p_1 \cdot p_0$ 2.20
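The role of the group signals can be illustrated by computing the block carries of a 16-bit adder from them. This is a behavioural sketch under stated assumptions: the names `block_gp` and `block_carries` are hypothetical, the group generate is taken as the carry produced inside the block with no carry-in, the group propagate as the AND of the four bit propagates, and the second level is evaluated iteratively in software rather than as flattened look-ahead logic.

```python
def block_gp(a, b):
    """Group generate and group propagate of one 4-bit block (a, b are 4-bit integers)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]
    p = [((a >> i) | (b >> i)) & 1 for i in range(4)]
    group_g = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    group_p = p[3] & p[2] & p[1] & p[0]
    return group_g, group_p


def block_carries(a, b, c0=0):
    """Return [c0, c4, c8, c12, c16] of a 16-bit addition using only group signals."""
    carries = [c0]
    for blk in range(4):
        group_g, group_p = block_gp((a >> (4 * blk)) & 0xF, (b >> (4 * blk)) & 0xF)
        carries.append(group_g | (group_p & carries[-1]))
    return carries


print(block_carries(0xFFF8, 0x0008))   # carry generated in block 0, propagated onward
```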
With multiple levels of CLA logic, carry look-ahead adders of any
length can be built. To illustrate the use of another level of CLA logic, Figure 2.11
shows the schematic for a 16-bit CLA Adder. There is a second level of CLA logic
which takes the group generate and group propagate signals from each 4-bit
adder sub cell and calculates the carryout signals for each adder block. If an adder
has multiple levels of CLA logic, only the final level needs to generate the
c4 signal. All other levels replace this c4 signal with the group generate and group
propagate. The CLA logic for this 16-bit adder is identical to the CLA logic for the 4-bit
adder in Figure 2.10; therefore the equations for the carryout signals are in
Equations 2.15−2.18.
Figure 2.10 4-bit carry look-ahead adder
Figure 2.11 Schematic for a 16-bit CLA adder
A third level of CLA logic and four 16-bit adder blocks can be used to build a
64-bit adder. The CLA logic would create the c16, c32, and c48 signals to be used as
carry-ins to the 16-bit adder blocks and the c64 as the sum64 signal. If a design calls
for an adder of length 32, a designer can simply use two 16-bit adder blocks and the
first two carryout signals (c16, c32) from the third level of CLA logic. The identical
hardware in the CLA logic, coupled with the fact that the adder blocks can be
instantiated as sub cells, makes building long adders with this architecture simple.
Determining the critical path for a CLA adder is difficult because the gates in
the carry path have different fan-in. To get a general idea, we first assume that all gate
delays are the same. The delay for a 4-bit CLA adder then requires one gate delay to
calculate the propagate and generate signals, two gate delays to calculate the carry
signals, and one gate delay to calculate the sum signals; this equates to four gate
delays. For a 16-bit CLA adder there is one gate delay to calculate the propagate and
generate signals (from the PFA), two gate delays to calculate the group propagate and
generate in the first level of carry logic, two gate delays for the carryout signals in the
second level of carry logic, and one gate delay for the sum signals. The second level
of carry logic for the 16-bit CLA adder contributes an additional two gate delays over
the 4-bit CLA adder, thus increasing the total to six gate delays. Continuing in this
manner (a 64-bit add takes eight gate delays, a 256-bit add takes ten gate delays), we
see that the delay for a CLA adder is dependent on the number of levels of carry logic,
and not directly on the length of the adder. If a group size of four is chosen, then the number
of levels in an N-bit CLA is expressed in Equation 2.21, and in general the number of
levels in a CLA for a group size of k is expressed in Equation 2.22. For an N-bit CLA
adder, each level of carry logic introduces two gate delays, in addition to a gate delay
for the generate and propagate signals and a gate delay for the sum. The total gate
delay is expressed in Equation 2.23, which shows that the delay of a CLA adder is
logarithmically dependent on the size of the adder. This theoretically results in one of
the fastest adder architectures.
CLA levels (with group size of 4) = $\lceil \log_4 N \rceil$ 2.21
CLA levels (with group size of k) = $\lceil \log_k N \rceil$ 2.22
CLA gate delay = $2 + 2 \cdot \lceil \log_k N \rceil$ 2.23
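The level count and gate delay can be tabulated to confirm the figures quoted in the text (four, six, eight and ten gate delays for 4-, 16-, 64- and 256-bit adders). The sketch below assumes the brackets in the level equations denote the ceiling function, and computes the logarithm with an integer loop to avoid floating-point rounding; the function names are illustrative.

```python
def cla_levels(n, k=4):
    """Smallest number of look-ahead levels L such that k**L >= n."""
    levels, span = 0, 1
    while span < n:
        span *= k
        levels += 1
    return levels


def cla_gate_delay(n, k=4):
    """Two gate delays per carry-logic level, plus one for g/p and one for the sum."""
    return 2 + 2 * cla_levels(n, k)


for n in (4, 16, 64, 256):
    print(f"{n:3d}-bit CLA: {cla_levels(n)} level(s), {cla_gate_delay(n)} gate delays")
```

A 32-bit adder with 4-bit groups needs three levels under this model, the same as a 64-bit adder, which is consistent with building it from part of the 64-bit structure.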
From a VLSI design perspective, this adder may take more time to implement,
but the architecture is regular enough to allow building long adders
fairly easily. The reuse of the CLA logic definitely contributes to the feasibility of
building a long adder without additional design time. Also, after an adder is built, it
can be used as a subcell, as is done with the 4-bit adders as blocks in the 16-bit CLA
adder. A drawback of CLA adders is their larger area. There is a large amount of
hardware dedicated to calculating the carry bits from cell to cell. However, if the
application calls for high performance, then the benefits of decreased delay can
outweigh the larger area.
2.3.4 Carry Select Adder
Adding two numbers by using redundancy can speed addition even further.
That is, for any number of sum bits we can perform two additions, one assuming the
carry-in is 1 and one assuming the carry-in is 0, and then choose between the two
results once the actual carry-in is known. This scheme, proposed by Sklansky in 1960,
is called conditional-sum addition [21]. An implementation of this scheme was first
realized by Bedrij and is called the Carry Select Adder (CSLA) [22].
The CSLA divides the adder into blocks that have the same input operands
except for the carry-in. Figure 2.12 shows a possible implementation for a 16-bit
CSLA using ripple carry adder blocks. The carryout of the first block is used as the
select line for the 9-bit 2-to-1 mux. The second and third blocks calculate the signals
sum16 − sum8 in parallel, with one block having its carry-in hardwired to 0 and
the other hardwired to 1. After one 8-bit ripple adder delay there is only the delay of
the mux to choose between the results of block 2 or 3. Equation 2.24 shows the delay
for this adder. The 16-bit CSLA can also be built by dividing it into even more blocks.
Figure 2.13 shows the block diagram for the adder if it were divided into 4-bit RCA
blocks. Equation 2.25 expresses the delay for this structure.
$t_{CSLA16a} = t_{8bitRCA} + t_{9bitmux}$ 2.24
$t_{CSLA16b} = t_{4bitRCA} + 3 \cdot t_{5bitmux}$ 2.25
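The select mechanism can be sketched behaviourally: every block computes both candidate results, and the incoming carry only drives a mux. In the sketch below the function name `carry_select_add` is an assumption and integer addition stands in for each RCA; for simplicity the first block is modelled with the same two-adder-plus-mux structure, which is behaviourally equivalent to the single RCA it uses in hardware.

```python
def carry_select_add(a, b, n=16, block=4, cin=0):
    """Behavioural linear carry select adder; returns (n-bit sum, carry-out)."""
    mask = (1 << block) - 1
    total, carry = 0, cin
    for i in range(0, n, block):
        ga, gb = (a >> i) & mask, (b >> i) & mask
        result_cin0 = ga + gb          # RCA with carry-in hardwired to 0
        result_cin1 = ga + gb + 1      # RCA with carry-in hardwired to 1
        chosen = result_cin1 if carry else result_cin0   # (block+1)-bit 2-to-1 mux
        total |= (chosen & mask) << i
        carry = chosen >> block        # selects the result of the next block
    return total, carry


total, cout = carry_select_add(0xFFFF, 0x0001)
print(f"sum={total:#06x} carry-out={cout}")
```

Both candidate results exist before the carry arrives, so each block after the first adds only a mux delay to the critical path, at the cost of duplicated adder hardware.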
The CSLA described so far is called the Linear Carry Select Adder, because its delay
is linearly dependent on the length of the adder. In the worst case, the carry signal
must ripple through each mux in the adder. Also, notice that the sub-cells all finish
their additions at the same time, yet the more significant bits wait at the
input of the mux to be selected. From a VLSI design perspective, the CSLA uses a
large amount of area compared to the other adders. There is hardware in this
architecture which computes results that are thrown away on every addition, but the
fact that the delay for an addition can be replaced by the delay of a mux makes this
architecture very fast. Also, the Linear CSLA has regularity that makes it easier to
lay out.
Figure 2.12 Schematic for a 16-bit CSLA with 8-bit RCA blocks
Figure 2.13 Schematic for a 16-bit CSLA with 4-bit RCA blocks
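The carry select scheme described in this section can be sketched behaviourally as follows. This is only an illustrative bit-level model, not the schematic of Figures 2.12 and 2.13; the function names `ripple_add` and `carry_select_add` and the `block_sizes` parameter are assumptions introduced for this example.

```python
def ripple_add(a_bits, b_bits, cin):
    """Ripple carry addition of two equal-length bit lists (LSB first)."""
    out, carry = [], cin
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out, carry


def carry_select_add(a, b, block_sizes, cin=0):
    """Behavioural carry select addition.

    block_sizes lists the block widths from LSB to MSB (hypothetical
    parameter). Returns (sum, carry_out).
    """
    n = sum(block_sizes)
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    result, carry, pos = [], cin, 0
    for index, width in enumerate(block_sizes):
        xa, xb = a_bits[pos:pos + width], b_bits[pos:pos + width]
        if index == 0:
            # The first block is a plain ripple carry adder.
            s, carry = ripple_add(xa, xb, carry)
        else:
            # Both results are computed speculatively ...
            s0, c0 = ripple_add(xa, xb, 0)
            s1, c1 = ripple_add(xa, xb, 1)
            # ... and the incoming carry acts as the mux select line.
            s, carry = (s1, c1) if carry else (s0, c0)
        result.extend(s)
        pos += width
    return sum(bit << i for i, bit in enumerate(result)), carry
```

For example, `carry_select_add(0x00FF, 0x0001, [8, 8])` models the two-block structure with 8-bit RCA blocks, and `[4, 4, 4, 4]` models the four-block version.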
2.3.5 SQRT Carry Select Adder
To increase the speed further, the SQRT technique is developed. In this design the
number of bits per block is not constant; the corresponding delay equation
is shown in 2.26. Using this technique, the bits per block for a 16-bit SQRT CSLA are
2-2-3-4-5. For 8 bits the sequence is 1-3-4. The 16-bit SQRT CSLA is
shown in figure 2.14.
$t_{add} = t_{setup} + (m \times t_{carry}) + \sqrt{2n} \times t_{mux} + t_{sum}$ (2.26)
Figure 2.14 Schematic for a 16-bit SQRT CSLA
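The benefit of unequal block sizes can be illustrated with a simple timing sketch. The model below is an assumption made for this example only: every full adder and every mux is given one unit of delay, all RCA blocks start at time zero, and the helper name `csla_carry_delay` is hypothetical. It is not the delay model of equation 2.26.

```python
def csla_carry_delay(block_sizes, t_fa=1.0, t_mux=1.0):
    """Worst-case time at which the final carry-out is valid.

    block_sizes lists the block widths from LSB to MSB. Unit delays
    for the full adder (t_fa) and mux (t_mux) are assumptions.
    """
    # The first block is a plain RCA, so its carry ripples through it.
    carry_ready = block_sizes[0] * t_fa
    for width in block_sizes[1:]:
        # Both speculative RCAs of this block start at t = 0.
        rca_ready = width * t_fa
        # The mux can switch only when its data and its select are valid.
        carry_ready = max(rca_ready, carry_ready) + t_mux
    return carry_ready


linear_two_blocks = csla_carry_delay([8, 8])          # 8-bit RCA + 1 mux
linear_four_blocks = csla_carry_delay([4, 4, 4, 4])   # 4-bit RCA + 3 muxes
sqrt_blocks = csla_carry_delay([2, 2, 3, 4, 5])       # SQRT grouping
```

Under these assumptions the growing block sizes let each RCA result arrive just as its select signal does, so no block waits idle at the mux input.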
2.4 Low power design techniques
Designing systems aiming for low power is not a straightforward task, as it is
involved in all the IC design stages beginning with the system behavioral description
and ending with the fabrication and packaging processes. In some of these stages
there are guidelines that are clear and there are steps to follow that reduce power
consumption, such as decreasing the power-supply voltage, while in other stages
there are no clear steps to follow, so statistical or probabilistic heuristic methods are
used to estimate the power consumption of a given design.
There are three major components of power dissipation in complementary
metal–oxide–semiconductor (CMOS) circuits.
1) Switching Power: Power consumed by the circuit node capacitances during
transistor switching.
2) Short Circuit Power: Power consumed because of the current flowing from power
supply to ground during transistor switching.
3) Static Power: Due to leakage and static currents.
The first two components are together referred to as dynamic power, as given in
equation 2.1. Dynamic power constitutes the majority of the power dissipated in CMOS
VLSI circuits. It is the power dissipated during charging or discharging the load
capacitances of a given circuit. It depends on the input pattern, which will either
cause the transistors to switch (consuming dynamic power) or not to switch (consuming
no dynamic power) at every clock cycle.
The summation is over all the nodes of the circuit. Reducing any of these
components lowers the power consumption, although it is equally
important to increase the system clock frequency for faster operation. Estimating the
power of a large circuit is a complex task. Heuristic, statistical, and
probabilistic methods are used to generate random input patterns to test the switching
activity of the circuit. These methods become less accurate when the size of the
circuit increases. It is better to decompose the large circuit into smaller modules and
then use these methods to estimate the power consumption of each module. When the
decomposed modules are small enough, exact methods can be used to optimize their
performance.
2.4.1 Transistor sizing optimization
The transistor sizing for optimal performance is technology dependent. As the
demand for high speed, low power consumption and high packing density continues
to grow each year, there is a need to scale devices to smaller dimensions. As the
market trend moves towards a greater scale of integration, the move towards a reduced
supply voltage also has the advantage of improving the reliability of IC components
of ever-reducing dimensions. This can be easily understood if one recalls that
IC components with smaller dimensions have more of a tendency to break down at high
voltages. It has already been accepted that scaled-down CMOS devices, even at 2.5V,
do not sacrifice device performance as they maintain device reliability.
Scaling brings about the following benefits:
1) Improved device characteristics for low voltage operation, due to the
improvement in the current driving capabilities.
2) Reduced capacitance through small geometries and junction capacitances.
3) Improved interconnect technology.
4) Higher density of integration.
The major device problem associated with simple scaling lies in the increase of
the threshold voltage and the decrease of the carrier surface mobility, when the
substrate doping concentration is increased to prevent punch-through.
2.4.2 Low-power clock distribution
The clock network constitutes one of the most important parts of a
synchronous very large scale integration (VLSI) chip as it can significantly influence
the speed, area, and power dissipation of the system. Recent research on clock
network construction has developed procedures for building zero or near-zero skew
clock networks with sharp clock edge rates at the clock utilization points. However,
one major drawback associated with clock networks is their power dissipation.
Studies have shown that the clock network can dissipate 20–50% of the total power
on a chip. In the context of the growing importance of low-power designs for portable
electronics, it is necessary to develop strategies to significantly reduce the power
dissipation of the clock network, since this will lead to a major reduction in the
overall power dissipation of the chip. Using a lower Vdd to distribute the clock signal over the
chip, the clock network can be made to dissipate less power. However, for reasons
related to performance requirements, the rest of the circuitry on the chip may use a
higher Vdd and this implies that the clock levels would have to be converted to this
higher value at the utilization points.
2.4.3 Low power design through voltage scaling
Equation (2.1) shows that the average switching power dissipation is
proportional to the square of the power supply voltage; hence, reduction of VDD will
significantly reduce the power consumption.
If the power supply voltage is scaled down while all other parameters are
kept constant, the propagation delay time will increase. The dependence of circuit
speed on the power supply voltage, together with the above equation, suggests that a
quadratic reduction of power consumption is possible as the power supply
voltage is reduced. If the circuit is always operated at the maximum frequency allowed by
its propagation delay, the operating frequency, or the number of switching events per unit
time will drop as the propagation delay becomes larger with the reduction of power
supply voltage. The net result is that the dependence of switching power dissipation
on the power supply voltage becomes stronger than a simple quadratic equation. The
propagation delay expressions show that the negative effect of reducing the power
supply voltage upon delay can be compensated for, if the threshold voltage of the
transistors (VT) is scaled down accordingly. However, this approach is limited
because the threshold voltage may not be scaled to the same extent as the supply
voltage. When scaled linearly, reduced threshold voltages allow the circuit to produce
the same speed performance at a lower VDD.
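The quadratic dependence on the supply voltage can be illustrated numerically. The sketch below assumes that equation (2.1) has the commonly used form P = alpha x C x VDD^2 x f; the function name, the parameter names and the numeric values are assumptions chosen only for illustration.

```python
def switching_power(alpha, c_load, vdd, f_clk):
    """Average switching power, assuming P = alpha * C * VDD^2 * f.

    alpha  : switching activity factor (0..1)
    c_load : switched capacitance in farads
    vdd    : supply voltage in volts
    f_clk  : clock frequency in hertz
    """
    return alpha * c_load * vdd ** 2 * f_clk


# Illustrative (not measured) values: halve VDD, keep everything else fixed.
p_high = switching_power(alpha=0.5, c_load=1e-12, vdd=2.5, f_clk=100e6)
p_low = switching_power(alpha=0.5, c_load=1e-12, vdd=1.25, f_clk=100e6)
ratio = p_low / p_high  # about one quarter of the original power
```

If the clock frequency is also lowered to track the increased propagation delay, `f_clk` drops as well, which is why the overall dependence becomes stronger than quadratic.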
2.4.4 Reduction of switching activity
Switching activity can be reduced by algorithmic optimization, proper choice
of logic topology, glitch reduction, and gated clock signals.
Algorithmic optimization
This depends heavily on the application and the characteristics of data such
as dynamic range, correlation, and statistics of data transmission. The representation
of data can have a significant impact on switching activity at the system level. In
applications where data bits change sequentially and are highly correlated, the use of
Gray Coding leads to a reduced number of transitions compared to binary coding.
Another example is the use of sign-magnitude representation instead of conventional
two’s complement representation for signed data. A change in sign will cause
transitions of the higher order bits in the two’s complement representation, whereas
only the sign bit will change in sign-magnitude representation. Hence, switching
activity can be reduced by using the sign-magnitude representation in applications
where the data sign changes are frequent.
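The effect of Gray coding on sequential data can be checked with a short sketch. It counts the bit transitions when a 4-bit value steps through 0 to 15 in binary and in Gray code; the helper names `to_gray` and `count_transitions` are introduced here for illustration.

```python
def to_gray(n):
    """Convert a non-negative integer to its reflected Gray code."""
    return n ^ (n >> 1)


def count_transitions(words):
    """Total number of bit flips between consecutive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


binary_seq = list(range(16))                  # 0, 1, 2, ..., 15
gray_seq = [to_gray(n) for n in binary_seq]

binary_flips = count_transitions(binary_seq)  # 26 bit flips
gray_flips = count_transitions(gray_seq)      # 15 bit flips, one per step
```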
Glitch reduction
An important architecture level measure to reduce switching activity is
based on delay balancing and reduction of glitches. In multi-level logic circuits, the
propagation delay from one logic block to the next can cause spurious signal
transitions, or glitches. Glitches occur primarily due to a mismatch or imbalance in
the path lengths in the logic network. Such a mismatch in path lengths results in a
mismatch of signal timing with respect to the primary inputs. Redesigning the logic
network in order to balance the delay paths can significantly reduce glitches, and
consequently, the dynamic power dissipation in complex multi-level networks.
Gated Clock Signals
Another effective design technique for reducing the switching activity in
CMOS logic circuits is the use of conditional or gated clock signals. If certain logic
blocks in a system are not immediately used during the current clock cycle,
temporarily disabling the clock signals of these blocks will obviously save switching
power that is otherwise wasted. As an example, consider an N-bit number comparator, which
compares the magnitudes of two unsigned N-bit binary numbers and produces an output to
indicate which one is larger. In the conventional approach, all input bits are first latched
into two N-bit registers, and subsequently applied to the comparator circuit. In this case,
two N-bit register arrays dissipate power in every cycle. Yet, if only the most
significant bits of the two binary numbers are different from each other, then the
decision can be made by comparing the MSBs only. The two MSBs are latched in a
two-bit register which is driven by the original system clock. At the same time, these
two bits are applied to an XNOR gate and its output is used to generate the gated
clock signal with an AND gate. If the two MSBs are different, the XNOR produces
logic 0 at the output, disabling the clock signal of the lower order registers. If the two
MSBs are the same, the gated clock signal is applied to the lower-order registers and the
decision is made by the (N-1) bit comparator. The gated clock strategy effectively
reduces the overall switching power dissipation of the system by about 50%, since a
large portion of the system is disabled for half of all input combinations.
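The gated-clock comparator can be modelled behaviourally to check the "half of all input combinations" argument. This is a sketch under the assumption of uniformly distributed inputs; the function names are hypothetical, and the model counts only how often the lower-order registers are disabled, not actual power.

```python
def msb(x, n):
    """Most significant bit of an n-bit unsigned value."""
    return (x >> (n - 1)) & 1


def lower_clock_enabled(a, b, n):
    """XNOR of the two MSBs: 1 (clock passes) only when they are equal."""
    return 1 - (msb(a, n) ^ msb(b, n))


def compare(a, b, n):
    """Return 'A' if a > b, 'B' if b > a, 'EQ' if equal."""
    if not lower_clock_enabled(a, b, n):
        # MSBs differ: the decision is made from the MSBs alone.
        return "A" if msb(a, n) else "B"
    # MSBs equal: the (n-1)-bit comparator decides.
    mask = (1 << (n - 1)) - 1
    lo_a, lo_b = a & mask, b & mask
    if lo_a == lo_b:
        return "EQ"
    return "A" if lo_a > lo_b else "B"


def disabled_fraction(n):
    """Fraction of input pairs for which the lower registers are gated off."""
    pairs = [(a, b) for a in range(2 ** n) for b in range(2 ** n)]
    off = sum(1 for a, b in pairs if not lower_clock_enabled(a, b, n))
    return off / len(pairs)
```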
2.4.5 Reduction of switching capacitance
The amount of switched capacitance plays a significant role in the dynamic
power dissipation of the circuit. Hence, reduction of this parasitic capacitance is a
major goal for low-power design of digital integrated circuits.
System-Level Measures
At the system level, one approach to reduce the switched capacitance is to
limit the use of shared resources. If a single shared bus is connected to all modules,
for example, a large bus capacitance comes into play due to the large number of
drivers and receivers sharing the same transmission medium, and the parasitic
capacitance of the long bus line. Obviously, driving the large capacitance will require
a significant amount of power consumption during each bus access. Alternatively, the
global bus structure can be partitioned into a number of smaller dedicated local buses
to handle the data transmission between the neighboring modules. As a result, the
switched capacitance during each bus access is significantly reduced, although
multiple buses may increase the overall routing area on the chip.
Circuit-Level Measures
The type of logic style used to implement a digital circuit also affects the
output load capacitance of the circuit. The capacitance is a function of the number of
transistors that are required to implement a given function. Pass-gate logic design is
attractive since fewer transistors are required for certain functions such as XOR and
XNOR. Pass-transistor structures typically require complementary control signals;
dual-rail logic is used to provide all signals in complementary form. This diminishes
the inherent advantages of pass-transistor logic gates over conventional CMOS logic.
Thus, the use of pass-transistor logic gates to achieve low-power dissipation must be
carefully considered, and the choice of logic design style must ultimately be based on
a detailed comparison of all design aspects such as silicon area, overall delay as well
as switching power dissipation.
Mask-Level Measures
The amount of parasitic capacitance that is switched (i.e., charged or
discharged) during operation can also be reduced at the physical design level, or
mask level. A simple mask-level measure to reduce power dissipation is keeping the
transistors at minimum dimensions whenever possible and feasible, thereby
minimizing the parasitic capacitances. Designing a logic gate with minimum-size
transistors certainly affects the dynamic performance of the circuit, and this trade-off
between dynamic performance and power dissipation should be carefully considered
in critical circuits.
2.5 Different logic styles
Several logic styles, including variants of static CMOS, have been used to implement
low-power 1-bit full adder cells. Each design style has its own merits and demerits.
In general, they can be broadly divided into two major categories:
1) Static logic style and
2) Dynamic logic style
A major distinction, also with respect to power dissipation, must be made
between static and dynamic logic styles. As opposed to static gates, dynamic gates are
clocked and work in two phases, a precharge and an evaluation phase. The logic
function is realized in a single NMOS pull-down or PMOS pull-up network, resulting
in small input capacitances and fast evaluation times. This makes dynamic logic
attractive for high speed applications. However, the large clock loads and the high
signal transition activities due to the precharging mechanism result in excessively high
power dissipation. Also, the usage of dynamic gates is not as straightforward and
universal as it is for static gates, and robustness is considerably degraded. With the
exception of some very special circuit applications, dynamic logic is not a viable
candidate for low-power circuit design.
Although all full adder cells perform the same function, their styles of generating the
intermediate nodes and the outputs are different, the loads on the inputs and
intermediate nodes are different, and the transistor count varies significantly.
The standard implementations of the full adder cell include the following:
1) The double pass-transistor logic full adder, which uses both N- and P-channel
transistors with dual logic paths for every function. It uses 28 transistors.
2) The complementary pass-transistor logic (SR-CPL) full adder, which has 26
transistors and uses the CPL logic family.
3) The multiplexer-based low power full adder, which uses 34 transistors and
relies only on multiplexer operations.
All these adder cells are compared based on power consumption, speed, power delay
product, area, and driving capability.
Classical designs of full adders normally use only one logic style for the whole
full-adder design, while hybrid designs exploit the features of different logic
styles to improve upon the performance of the designs using a single logic style. All
hybrid designs use the best available modules implemented using different logic
styles or enhance the available modules in an attempt to build a low power full-adder
cell. Generally, the main focus in such attempts is to reduce the numbers of transistors
in the adder cell and, consequently, reduce the number of power dissipating nodes. In
doing so, the designers often trade off other vital requirements such as driving
capability, noise immunity, and layout complexity. Most of these adders lack driving
capabilities as the inputs are coupled to the outputs. Their performance as a single unit
or in small chains is good but when large adders are built by cascading these 1-b full-
adder cells, the performance degrades drastically. The performance degradation can
be handled by inserting buffers in between stages to enhance the delay characteristics.
However, this leads to an extra overhead and the initial advantage of having a lesser
number of transistors is lost.
CHAPTER 3
DESIGN OF ALU AND SQRT CSLA
3.1 Introduction to ALU and SQRT CSLA
The arithmetic logic unit (ALU) is one of the main components inside a
microprocessor. It is a digital circuit responsible for performing arithmetic and logic
operations such as addition, subtraction, increment, decrement, logical AND, logical OR,
logical XOR and logical XNOR. Generally the performance of the ALU is limited by the
adder because of carry propagation. Many adders have been proposed to reduce the carry
propagation delay.
In digital adders, to speed up the operation the Ripple Carry Adder (RCA) is
modified into the CSLA. To achieve more speed the CSLA is replaced by the SQRT CSLA. The
CSLA is used in many computational systems to alleviate the problem of carry
propagation delay by independently generating multiple carries and then selecting a carry
to generate the sum [8]-[9]. However, the CSLA is not area efficient because it uses
multiple pairs of Ripple Carry Adders (RCA) to generate the partial sum and carry for
carry inputs Cin=0 and Cin=1; the final sum and carry are then selected by multiplexers
(mux). To achieve better area efficiency [10]-[14], a Binary to Excess-1 Converter (BEC) is
used in place of the RCA with Cin=1 in the regular CSLA.
The 16-bit SQRT CSLA is divided into different blocks. The block size and
the number of blocks depend upon the size of the SQRT CSLA according to the SQRT
technique. From the second block onwards, each block contains three different levels:
the first level is a ripple carry adder with input carry zero, the second level is a
ripple carry adder with input carry one, and the third level is a multiplexer which is
used to select one of the ripple carry adder outputs according to the previous block's
carry. The disadvantage of the SQRT CSLA is its larger area requirement, as it uses two
levels of RCAs. To reduce the area, a BEC is used in place of the second level RCA. In
place of a 2-bit RCA, a 3-bit BEC is used.
3.1.1 Delay and Area evaluation methodology of the basic adder blocks
The AND, OR, and Inverter (AOI) implementation of an XOR gate is shown in Figure 3.1. For the delay evaluation, we add up the number of gates in the longest path of each logic block. Based on this delay and area evaluation approach, the CSLA adder blocks of 2:1 mux, Half Adder (HA), and FA are evaluated and listed in Table 3.1.
Table 3.1 Delay and area for basic gates
Figure 3.1 AOI implementation of XOR gate
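The gate-counting approach can be sketched as follows. The unit-gate model used here (each AND, OR and inverter counts as one unit of delay and one unit of area) and the gate-level decompositions in the comments are assumptions of this example; the resulting figures are derived from that model and are not copied from Table 3.1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    delay: int  # gates on the longest path
    area: int   # total number of AOI gates


# XOR = (a AND NOT b) OR (NOT a AND b):
# longest path NOT -> AND -> OR; gates: 2 NOT + 2 AND + 1 OR.
XOR = Block(delay=3, area=5)

# 2:1 mux = (a AND NOT s) OR (b AND s): 1 NOT + 2 AND + 1 OR.
MUX2 = Block(delay=3, area=4)

# Half adder: sum = XOR, carry = AND computed in parallel with the XOR.
HA = Block(delay=XOR.delay, area=XOR.area + 1)

# Full adder: sum = (a XOR b) XOR cin, carry = ab OR cin(a XOR b).
# Longest path is the two cascaded XORs; gates: 2 XOR + 2 AND + 1 OR.
FA = Block(delay=2 * XOR.delay, area=2 * XOR.area + 3)

for name, block in [("XOR", XOR), ("2:1 mux", MUX2), ("HA", HA), ("FA", FA)]:
    print(f"{name:8s} delay={block.delay} area={block.area}")
```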
3.1.2 Binary to Excess one Converter (BEC)
As stated above, the main idea of this work is to use a BEC instead of the RCA with Cin = 1 in order to reduce the area and power consumption of the regular CSLA. To replace the n-bit RCA, an (n+1)-bit BEC is required. The structure and the function table of a 4-b BEC are shown in Fig. 3.2 and Table 3.1.2, respectively.
Fig. 3.3 illustrates how the basic function of the CSLA is obtained by using the 4-bit BEC together with the mux. One input of the 8:4 mux is (B3, B2, B1, and B0) and the other input of the mux is the BEC output. This produces the two possible partial results in parallel, and the mux is used to select either the BEC output or the direct inputs according to the control signal Cin. The importance of the BEC logic stems from the large silicon area reduction when CSLAs with a large number of bits are designed. The Boolean expressions of the 4-bit BEC are listed as (note the functional symbols ~ NOT, & AND, ^ XOR)
X0 = ~B0
X1 = B0 ^ B1
X2 = B2 ^ (B0 & B1)
X3 = B3 ^ (B0 & B1 & B2)
Figure 3.2 A 4-bit BEC
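The behaviour of a 4-bit BEC and its pairing with the mux can be checked with a short sketch. The expressions used are the usual add-one (incrementer) equations, written out here as an assumption about the circuit of Fig. 3.2; the function names `bec4` and `bec_with_mux` are hypothetical.

```python
def bec4(b):
    """4-bit Binary to Excess-1 Converter: returns (b + 1) modulo 16."""
    b0, b1, b2, b3 = [(b >> i) & 1 for i in range(4)]
    x0 = b0 ^ 1                # NOT B0
    x1 = b0 ^ b1
    x2 = b2 ^ (b0 & b1)
    x3 = b3 ^ (b0 & b1 & b2)
    return x0 | (x1 << 1) | (x2 << 2) | (x3 << 3)


def bec_with_mux(b, cin):
    """Mux selects the direct input for cin = 0 and the BEC output for cin = 1."""
    return bec4(b) if cin else b


# Print the function table of the converter.
for value in range(16):
    print(f"{value:04b} -> {bec4(value):04b}")
```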
Figure 3.3 Functional block of CSLA
Figure 3.4 Block diagram for a 16-bit SQRT CSLA
3.1.3 Delay and area evaluation methodology of regular 16-bit SQRT CSLA
The structure of the 16-b regular SQRT CSLA is shown in Fig. 3.4. It has five
groups of different size RCA. In the delay and area evaluation of each group,
the numerals within [] specify the delay values, e.g., sum2 requires 10 gate delays.
The steps leading to the evaluation are as follows.
1) The group2 has two sets of 2-b RCA. Based on the consideration of the delay
values of Table 3.1, the arrival time of selection input c1 [time (t) = 7] of the
6:3 mux is earlier than s3 [t = 8] and later than s2 [t = 6]. Thus, sum3 [t = 11] is
the summation of s3 and mux [t = 3], and sum2 [t = 10] is the summation of c1 and
mux.
2) Except for group2, the arrival time of the mux selection input is always
greater than the arrival time of the data outputs from the RCAs. Thus, the
delay of group3 to group5 is determined by the arrival time of the mux selection
input plus the mux delay.
3) One set of 2-b RCA in group2 has 2 FA for Cin = 1 and the other set has 1 FA
and 1 HA for Cin = 0. Based on the area count of Table 3.1, the total gate
count of group2 is the sum of the gate counts of 3 FA, 1 HA and the 6:3 mux.
4) Similarly, the estimated maximum delay and area of the other groups in the
regular SQRT CSLA are evaluated and listed in Table 3.2.
Table 3.2 Delay and area for SQRT CSLA
3.1.4 Delay and area evaluation methodology of modified 16-bit SQRT CSLA
The structure of the proposed 16-b SQRT CSLA using a BEC in place of the RCA with Cin = 1
to optimize the area and power is shown in Fig. 3.5. We again split the structure into five
groups. The delay and area estimation of each group are shown in Figure.
Figure 3.5 A 16-bit SQRT CSLA using BEC
1) The group2 has one 2-b RCA which has 1 FA and 1 HA for carry input zero.
Instead of another 2-b RCA with carry input one, a 3-bit BEC is used, which adds one
to the output from the 2-b RCA. The sum3 and final c3 (output from mux) depend on
s3 and mux and on partial c3 (input to mux) and mux, respectively. The sum2 depends on
c1 and mux.
2) For the remaining groups, the arrival time of the mux selection input is
always greater than the arrival time of the data inputs from the BECs. Thus, the delay of
the remaining groups depends on the arrival time of the mux selection input and the mux
delay.
3) The area count of group2 is determined as follows:
Table 3.3 Delay and area for modified SQRT CSLA
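The area saving in group2 can be illustrated by counting gates for the regular and the BEC-based versions. The per-block gate counts below follow from a unit-gate AOI model (XOR = 5, 2:1 mux = 4, HA = 6, FA = 13 gates), which is an assumption of this sketch; the totals are derived from that model and are not copied from Tables 3.2 and 3.3.

```python
# Assumed unit-gate AOI areas (one unit per AND, OR or inverter).
AND, NOT = 1, 1
XOR = 5        # 2 NOT + 2 AND + 1 OR
MUX2 = 4       # 1 NOT + 2 AND + 1 OR
HA = XOR + AND             # 6
FA = 2 * XOR + 2 * AND + 1  # 13: 2 XOR + 2 AND + 1 OR

MUX_6_TO_3 = 3 * MUX2      # a 6:3 mux is three 2:1 muxes

# Regular group2: RCA with Cin = 0 (1 FA + 1 HA), RCA with Cin = 1 (2 FA), mux.
regular_group2 = (FA + HA) + (2 * FA) + MUX_6_TO_3

# Modified group2: RCA with Cin = 0, 3-bit BEC (2 XOR + 1 AND + 1 NOT), mux.
BEC3 = 2 * XOR + AND + NOT
modified_group2 = (FA + HA) + BEC3 + MUX_6_TO_3

saving = regular_group2 - modified_group2
print(regular_group2, modified_group2, saving)
```

The saving comes entirely from replacing the two full adders of the Cin = 1 RCA with the smaller 3-bit BEC.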
3.1.5 Transistor Level design of existing technique
1) Conventional full adder
A conventional full adder takes 28 transistors to implement sum and carry
functions. The conventional full adder is shown in figure 3.6
2) A 2-bit RCA
A two-bit Ripple Carry Adder (RCA) is formed by connecting two full
adders. It takes a total of 56 transistors to implement. It is shown in figure 3.7.
Figure 3.6 A conventional full adder
3) A 3-bit BEC
A 3-bit BEC uses two XOR gates, one AND gate and one NOT gate, which take 32
transistors overall, whereas the 2-bit RCA that the 3-bit BEC replaces
takes 56 transistors. A 3-bit BEC is shown in figure 3.8. A comparison between the 2-bit
RCA and the 3-bit BEC is shown in table 3.4.
Figure 3.7 A 2-bit RCA using conventional full adder
Figure 3.8 Transistor level 3-bit BEC
Table 3.4 Comparison between 2-bit RCA and BEC
Logic for second level   Number of     Critical path   Area     Power dissipation (µW)
                         transistors   delay (ns)      (µm2)    Static   Dynamic   Total
RCA using CMOS           56            1.900           1342     6.706    42.565    49.271
BEC using CMOS           32            1.200           781      3.269    25.746    29.015
Though the BEC technique reduces area and power [16], the reduction is not
considerable, and the design is not suitable for sub-threshold level
modifications. There is also still scope to reduce the delay. In order
to improve the delay, a new logic structure for the full-adder cell is proposed.
3.2 ALU
The arithmetic logic unit (ALU) is one of the main components inside a
microprocessor. It is a digital circuit responsible for performing arithmetic and logic
operations such as addition, subtraction, increment, decrement, logical AND, logical OR,
logical XOR and logical XNOR. The ALU is a fundamental building block of the Central
Processing Unit (CPU) of a computer, and even the simplest microprocessors contain
one. Modern CPUs and Graphics Processing Units (GPUs) contain very powerful ALUs. We
have designed the ALU using multiplexers and a full adder circuit. The input and output
sections consist of 4x1 and 2x1 multiplexers, and the logic is implemented using the full adder.
The full adder performs the computing function of the ALU. A full adder
could be defined as a combinational circuit that forms the arithmetic sum of three
input bits. It consists of three inputs and two outputs.
We have designed the ALU using 4x1 mux, 2x1 mux and an 8T full adder.
Here all the blocks in the ALU are designed using Gate Diffusion Input (GDI).
3.2.1 GDI Technique
There is scope to reduce power, area and delay using the GDI cell technique.
A simple GDI cell is shown in Fig. 3.9. We can implement any Boolean function using
GDI cells. Low swing problems arise because inputs are applied directly to the
sources of the P and N transistors: an N transistor is weak at passing logic high and a
P transistor is weak at passing logic low. When a transition occurs from high to low at
the P transistor source, or from low to high at the N transistor source, the low swing
problem arises. What demands special emphasis is that in 50% of the cases, the GDI cell
operates as a regular CMOS inverter, which is widely used as a digital buffer for logic-level
restoration. In some of these cases, when Vdd=1 without a swing drop from the
previous stages, a GDI cell functions as an inverter buffer and recovers the voltage
swing. Basic logic gates are shown in figure 3.10.
Figure 3.9 Simple GDI cell
Figure 3.10 Basic logic gates GDI cell
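How a single GDI cell realizes different gates can be sketched with an idealized switch-level model. The model below is an assumption for illustration: the P transistor conducts when the gate input G is 0, the N transistor conducts when G is 1, and the swing degradation discussed above is ignored. The terminal assignments follow the commonly published GDI function table and are not taken from figure 3.10.

```python
def gdi(g, p, n):
    """Ideal GDI cell: output follows P when G = 0 and N when G = 1."""
    return n if g else p


def gdi_not(a):
    return gdi(g=a, p=1, n=0)   # P tied to Vdd, N tied to ground


def gdi_and(a, b):
    return gdi(g=a, p=0, n=b)   # A = 0 -> 0, A = 1 -> B


def gdi_or(a, b):
    return gdi(g=a, p=b, n=1)   # A = 0 -> B, A = 1 -> 1


def gdi_mux(sel, d0, d1):
    return gdi(g=sel, p=d0, n=d1)  # 2:1 mux from one cell


# Print the truth tables of the two-input gates.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", gdi_and(a, b), "OR:", gdi_or(a, b))
```

Each function above uses one two-transistor cell, which is the source of the transistor count reduction compared with static CMOS.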
3.2.2 A 10-transistor full adder
A full adder using the GDI technique takes 10 transistors, whereas a conventional
full adder takes 28 transistors. It is shown in figure 3.11.
3.2.3 An 8-transistor full adder
A full adder can be implemented with 8 transistors by using the GDI technique. The
10-transistor full adder differs from the 8-transistor full adder by two pull-up
transistors. It is shown in figure 3.12.
Figure 3.11 A 10-transistor full adder
Figure 3.12 An 8-transistor full adder
3.2.4 A 1-bit ALU
The ALU is designed using multiplexers and a full adder circuit. The input and
output sections consist of 4x1 and 2x1 multiplexers, and the logic is implemented using
the full adder. A set of three select signals has been incorporated in the design to
determine the operation being performed and the inputs and outputs being selected.
Figure 3.13 shows the block diagram of the 1-bit ALU using two 4x1 multiplexers and
one 2x1 multiplexer. The complement of B is used for the SUBTRACTION operation.
The full adder performs the SUBTRACT operation by the two’s complement method.
Table 3.5 shows the truth table for the operations performed by the ALU based on the
status of the select signals.
Table 3.5 Truth table of one bit ALU
s2 s1 s0 Operation
0 0 0 AND
0 0 1 XOR
0 1 0 XNOR
0 1 1 OR
1 0 0 DECREMENT
1 0 1 ADDITION
1 1 0 SUBTRACTION
1 1 1 INCREMENT
Figure 3.13 A 1-bit ALU
3.2.5 8-bit ALU using ripple carry adders
An 8-bit ALU is formed by connecting eight 1-bit ALUs in series. The 8-bit ALUs
using 10-transistor and 8-transistor full adders are shown in figure 3.14.
Figure 3.14 Eight bit ALU using 10 and 8 transistor full adders
An eight bit ALU using ripple carry adders has a large propagation delay. The
speed of the ALU is limited by the propagation of the carry. To reduce the carry
propagation delay, the proposed design is implemented using a carry select adder.
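The operation encoding of Table 3.5 can be captured in a small behavioural model of the 8-bit ALU. This is a functional sketch only: it does not reproduce the multiplexer and full adder wiring of Figure 3.13, and it assumes that INCREMENT and DECREMENT act on operand A and that the carry or borrow out is discarded. The name `alu` and its parameters are hypothetical.

```python
def alu(sel, a, b, width=8):
    """Behavioural ALU following the select encoding s2 s1 s0 of Table 3.5."""
    mask = (1 << width) - 1
    if sel == 0b000:
        result = a & b            # AND
    elif sel == 0b001:
        result = a ^ b            # XOR
    elif sel == 0b010:
        result = ~(a ^ b)         # XNOR
    elif sel == 0b011:
        result = a | b            # OR
    elif sel == 0b100:
        result = a - 1            # DECREMENT (assumed to act on A)
    elif sel == 0b101:
        result = a + b            # ADDITION
    elif sel == 0b110:
        result = a + (~b & mask) + 1  # SUBTRACTION by two's complement
    elif sel == 0b111:
        result = a + 1            # INCREMENT (assumed to act on A)
    else:
        raise ValueError("sel must be a 3-bit value")
    return result & mask          # keep only the low `width` bits
```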
CHAPTER 4
DESIGN OF ALU USING MODIFIED SQRT CSLA
4.1 Introduction to different transistor types
Combinational logic forms the core of most digital integrated circuits such as
fast arithmetic units and controllers. The design requirements imposed on the logic
circuitry can vary widely. Area is often the prime concern, as it has direct impact on
cost. In many state-of-the-art designs, speed tends to be the dominating
requirement. Contemporary microprocessors are excellent examples of designs in this
class. For other applications, minimizing the power consumption is crucial, as in the
design of portable applications such as mobile telephones. These different design
requirements generally translate into the use of different circuit styles, or even
different manufacturing technologies.
Static CMOS has excellent properties in many areas: low sensitivity to
noise and process variations, excellent speed, and low power consumption. Most of
those properties carry over to more complex static CMOS gates, but gates such as NAND
gates with three or more inputs become large and slow. Other design styles like
complementary, ratioed and pass-transistor logic have been devised to address this issue,
all of which belong to the class of static circuits.
4.1.1 Complementary CMOS
A static CMOS gate is a combination of two networks, called the pull-up
network (PUN) and the pull-down network (PDN). The PUN consists solely of PMOS
transistors and provides a conditional connection to Vdd. The PDN potentially
connects the output to Vss and contains only NMOS devices. The PUN and PDN
networks should be designed so that, whatever the value of the inputs, one and only
one of the networks is conducting in steady state. In this way, a path always exists
between Vdd and the output, realizing a high output (one) or alternatively, between Vss
and output for a low output (zero).
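The complementary property of the two networks can be checked with a small switch-level sketch. The two-input NAND gate used here is an assumption chosen for illustration (it is not a circuit taken from this chapter), and the function names are hypothetical.

```python
from itertools import product

def nand2_pdn(a: int, b: int) -> bool:
    """Pull-down network: two NMOS devices in series conduct only when a = b = 1."""
    return a == 1 and b == 1

def nand2_pun(a: int, b: int) -> bool:
    """Pull-up network: two PMOS devices in parallel conduct when a = 0 or b = 0."""
    return a == 0 or b == 0

def nand2_output(a: int, b: int) -> int:
    """Steady-state output of the gate for one input combination."""
    pun, pdn = nand2_pun(a, b), nand2_pdn(a, b)
    # Exactly one of the two networks may conduct in steady state.
    if pun == pdn:
        raise RuntimeError("PUN and PDN are not complementary")
    return 1 if pun else 0

# Evaluate every input combination of the assumed NAND gate.
truth_table = {(a, b): nand2_output(a, b) for a, b in product((0, 1), repeat=2)}
```

For every input combination exactly one network conducts, so the output is always driven and no static path from Vdd to Vss exists.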
Properties of complementary CMOS
Complementary CMOS gates inherit all the good properties of the basic CMOS
inverter: high noise margins, no static power consumption (as there is never a direct
path between Vdd and Vss in steady state), and comparable rise and fall times.
The complementary gate is inverting (implementing functions such as NAND,
NOR and XNOR). Implementing a non-inverting Boolean function (such as AND, OR or
XOR) in one stage is not possible and requires the addition of an extra inverter stage.
4.1.2 Pseudo NMOS
A PMOS device with its gate grounded presents an even better load. This configuration,
which is called pseudo-NMOS because it resembles the depletion NMOS load, is
superior to the depletion-load approach. First of all, the PMOS transistor does not
experience any body effect, as its Vsb is constant and equal to 0. Secondly, the PMOS
device is driven by a Vgs equal to –Vdd, resulting in a higher load-current level for
similarly sized devices.
Figure 4.1 Pseudo NMOS
An important disadvantage is that it consumes static power when the output is
low, because a direct path exists between Vdd and ground through the load and the
driver devices.
The grounded PMOS load is a good imitation of an ideal current-source load.
For certain circuit configurations, some simple modifications can further improve
either the speed or the power consumption. The following approach completely
eliminates the static current.
4.1.3 Differential cascode voltage switch logic (DCVSL)
Assume that the complement of each signal is always available. This
requires each gate to generate both polarities of the output signal. Such a gate, called
Differential Cascode Voltage Switch Logic (DCVSL), is shown in Figure 4.2. The
pull-down networks PDN1 and PDN2 are complementary and implement the required
logic function and its inverse.
Assume now that, for a given set of inputs, PDN1 conducts while PDN2 does not.
Node out is pulled down. This turns on the load transistor M2, pulling up out’. This in
turn cuts off load transistor M1. The gate is clearly free of static current paths, as only
PDN1 and M2 are conducting.
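The steady-state behaviour described above can be sketched at switch level. This is an illustrative model only: the function names are hypothetical, and the assignment of PDN1 to the XNOR condition (so that node out carries XOR) is an assumption made for the example.

```python
from typing import Tuple

def dcvsl_gate(pdn1_conducts: bool, pdn2_conducts: bool) -> Tuple[int, int]:
    """Return (out, out_bar) of a DCVSL gate in steady state."""
    if pdn1_conducts == pdn2_conducts:
        raise ValueError("PDN1 and PDN2 must be complementary")
    # The conducting network pulls its node low. That low node turns on the
    # cross-coupled PMOS load of the opposite node, which is pulled high.
    out = 0 if pdn1_conducts else 1
    out_bar = 0 if pdn2_conducts else 1
    return out, out_bar

def dcvsl_xor(a: int, b: int) -> Tuple[int, int]:
    """Assumed example: PDN1 conducts when a == b, PDN2 when a != b."""
    return dcvsl_gate(a == b, a != b)
```

Both output polarities are produced by one gate, which is why no extra inverter stage is needed.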
Figure 4.2 DCVSL logic gate Basic Principle
Figure 4.3 XOR-XNOR gates
The availability of complementary signals eliminates extra inverter stages. The
example circuit in Figure 4.3 implements a two-input XOR and XNOR gate. The
transistors connected to the A inputs are shared between the two PDNs. DCVSL has, for
instance, been used for the implementation of fast error-correcting logic in memories.
The DCVSL gate has a speed advantage: the reduction of the parasitic
capacitances at the output nodes produces a faster response. At the same time, the
static power consumption is eliminated. This comes at the expense of extra area, as
each gate requires two pull-down networks.
4.1.4 Pass transistor logic
This is another promising approach to implement complex logic by realizing it
as a logical network of switches or pass transistors. The pass transistor approach has
the advantage of being simple and fast. Complex CMOS combinational logic is
implemented with a minimal number of transistors. This reduces the parasitic
capacitances and results in fast circuits. The static and transient performance of such a
structure strongly depends upon the availability of a high-quality switch with low
parasitic capacitance and resistance. Although the MOS transistor in itself is a switch
of reasonable performance, it has some deficiencies. Pass transistor
logic networks are, therefore, often constructed from bidirectional transmission gates
(pass gates). These gates are composed of an NMOS transistor and a PMOS device in
a parallel arrangement. The transmission gate acts as a bidirectional switch controlled
by the gate signal C. When C=1, both MOSFETs are on, allowing the signal to pass
through the gate, i.e., A=B if C=1. On the other hand, C=0 places both transistors in
cutoff, creating an open circuit between nodes A and B.
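The switch behaviour can be captured in a few lines. This is a minimal logic-level sketch, not a circuit simulation: `None` is used here as an assumed stand-in for the open-circuit (high-impedance) state, and the two-gate multiplexer is a hypothetical illustration of how such switches are combined.

```python
from typing import Optional

def transmission_gate(a: int, c: int) -> Optional[int]:
    """Value seen at node B: equal to A when C = 1, open circuit (None) when C = 0."""
    if a not in (0, 1) or c not in (0, 1):
        raise ValueError("a and c must be 0 or 1")
    return a if c == 1 else None

def mux2(in0: int, in1: int, sel: int) -> int:
    """2:1 multiplexer from two transmission gates with complementary controls."""
    branch0 = transmission_gate(in0, 1 - sel)  # conducts when sel = 0
    branch1 = transmission_gate(in1, sel)      # conducts when sel = 1
    # Exactly one branch conducts, so the shared output node is always driven.
    return branch1 if branch1 is not None else branch0
```

Because the controls are complementary, the two branches never drive the output at the same time.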
Figure 4.4 Pass transistor logic
Although the transmission gate possesses some excellent properties, such as an
almost constant resistance and no threshold loss, it has the disadvantage that it
requires both an NMOS and a PMOS transistor, which have to be located in different
wells. This reduces the layout efficiency of the design. Also, the control signal has to
be available in both polarities, which once again has a negative influence on the
layout density. Furthermore, the parallel connection of PMOS and NMOS results in
increased node capacitances and reduced performance. It would therefore be
advantageous if we could implement the switch using NMOS transistors only.
Unfortunately, NMOS-only pass transistors are subject to a threshold voltage loss when
passing a logic one. This is not a problem if the voltage levels are subsequently
restored by a complementary CMOS inverter. Even so, such a circuit suffers from two
major drawbacks: reduced noise margin due to the threshold voltage drop, and static
power consumption. Several techniques have been proposed to get around this problem.
4.1.5 Transmission Gate logic
Transmission gate logic includes at least two field-effect transistor elements
used as pass transistors, each having a channel of conductivity type opposite to that of
the other (i.e., complementary FETs).
A transmission gate is a switching element which connects the input to the output
according to the gate input. It is a parallel connection of an n-transistor,
which is good at passing logic zero, and a p-transistor, which is good at passing logic
one. The basic arrangement of a transmission gate is shown in figure 4.5.
Figure 4.5 A simple Transmission gate
4.2. Special Hardware using Multiplexers (SHM)
Although the BEC technique reduces area and power, the reduction is not
considerable, and the design is not suitable for sub-threshold level modifications.
The 16-bit SQRT CSLA using BEC in its second level requires 792
transistors. The proposed logic offers scope to reduce the number of transistors, and
with it the area and the power dissipation. With the proposed logic, the
implementation of a 16-bit SQRT CSLA requires 736 transistors.
The proposed logic for the second level, used in place of the second-level RCA, is
Special Hardware using Multiplexers (SHM), shown in figure 4.6. The inputs are applied
to the first-level RCA. The output of the RCA is applied to the second-level SHM and
then to the third-level multiplexer. The third-level multiplexer selects either the RCA
output or the SHM output according to the previous carry.
b0, b1 and b2 are the inputs to the 3-bit SHM, and x0, x1 and x2 are the
corresponding outputs. The SHM takes the first-level RCA output as its input and
increments its value by one. A 3-bit SHM uses three multiplexers and three inverters.
The first inverter generates the first output bit x0 from input bit b0, and x0 is also
used as the select line of the first multiplexer. The first multiplexer passes either the
second bit b1 or the inversion of b1 to the output, because the first inverter output
determines the carry into the second bit. The first multiplexer gives the second output
bit x1, which is used as the select line of the second multiplexer. Based on x1 and b1,
the second multiplexer generates the carry for input bit b2. One input of the second
multiplexer is b1 and the other input is grounded. The output of the second multiplexer
is connected to the select line of the third multiplexer. The third multiplexer passes
the third bit b2 or the inversion of b2 to the output according to this carry bit.
This logic can be extended to any number of bits. Here it is implemented for the
second block, which has two input bits. As the number of input bits increases, the
savings produced by the proposed technique become larger. Despite the above
advantages, the delay is increased, because the carry has to pass through 2(n-1) levels
in an n-bit SHM before it appears at the output. The comparison of the number of
transistors is shown in table 4.1.
Figure 4.6 A 3-bit SHM
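\[ x_0 = \overline{b_0} \]
\[ x_1 = x_0 \cdot b_1 + \overline{x_0} \cdot \overline{b_1} \]
\[ x_2 = (x_1 + \overline{b_1}) \cdot b_2 + \overline{x_1} \cdot b_1 \cdot \overline{b_2} \]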
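The behaviour of the 3-bit SHM can be checked with a short logic-level sketch. The multiplexer input ordering and the function names below are assumptions made for illustration. They follow the description of the select lines in the text, but are not taken from the schematic in figure 4.6.

```python
def mux2(in0: int, in1: int, sel: int) -> int:
    """2:1 multiplexer: returns in1 when sel = 1, otherwise in0."""
    return in1 if sel else in0

def shm3(b0: int, b1: int, b2: int):
    """3-bit SHM sketch: three inverters and three multiplexers, computing b + 1."""
    x0 = 1 - b0                      # first inverter gives x0
    x1 = mux2(1 - b1, b1, x0)        # b1 passes when x0 = 1, inverted b1 otherwise
    carry = mux2(b1, 0, x1)          # grounded input selected when x1 = 1
    x2 = mux2(b2, 1 - b2, carry)     # b2 is inverted only when a carry arrives
    return x0, x1, x2

def shm3_value(b: int) -> int:
    """Apply the SHM to a 3-bit integer and return the 3-bit result."""
    x0, x1, x2 = shm3(b & 1, (b >> 1) & 1, (b >> 2) & 1)
    return x0 | (x1 << 1) | (x2 << 2)
```

Evaluating `shm3_value` over all eight input values shows that the outputs equal the input plus one, modulo 8, which is the same function the 3-bit BEC provides.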
Table 4.1 Area comparison between 3-bit BEC and 3-bit SHM
| Type of logic | Gates | Number of transistors | Total number of transistors |
|---------------|-------|-----------------------|-----------------------------|
| 3-bit BEC     | 2 XOR | 24                    | 32                          |
|               | 1 AND | 6                     |                             |
|               | 1 NOT | 2                     |                             |
| 3-bit SHM     | 3 MUX | 18                    | 24                          |
|               | 3 NOT | 6                     |                             |
4.2.1 Transistor level design of SHM
A 3-bit SHM takes 24 transistors, as shown in figure 4.7. The corresponding
functional verification waveforms and critical path details are shown in figure 4.8,
and the waveforms with the power dissipation window are shown in figure 4.9.
Figure 4.7 Transistor level 3-bit SHM
Figure 4.8 Critical path details of a 3-bit SHM
Figure 4.9 Power dissipation of a 3-bit SHM
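The power and area of the existing technique (BEC) and the proposed technique (SHM) are compared in table 4.2.
Table 4.2 Power and delay comparison between 3-bit BEC and 3-bit SHM
| Logic for second level | Number of transistors | Critical path delay (ns) | Area (µm²) | Static power (µW) | Dynamic power (µW) | Total power (µW) |
|------------------------|-----------------------|--------------------------|------------|-------------------|--------------------|------------------|
| BEC using CMOS         | 32                    | 1.200                    | 781        | 3.269             | 25.746             | 29.015           |
| SHM using CMOS         | 24                    | 2.350                    | 486        | 3.100             | 22.843             | 25.943           |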
4.3 An 8-bit ALU using proposed carry select adder
The proposed technique is applied to an 8-bit ALU built with the 10-transistor full adder, with the corresponding circuit diagram shown in the figure, and to an 8-bit ALU built with the 8-transistor full adder, with the circuit diagram shown in figure 4.10.
Figure 4.10 Eight bit ALU using modified SQRT CSLA
4.4 Wave forms
The output waveforms are obtained by applying a 20 ns clock to every input. For the 8-bit ALU using the proposed technique with the 10-transistor full adder, the output waveforms and power dissipation are shown in figure 4.11. For the 8-transistor full adder, the waveforms are shown in figure 4.12.
Figure 4.11 Wave forms of 8- bit ALU for 10- transistor full adder
Figure 4.12 Wave forms of 8- bit ALU for 8- transistor full adder
CHAPTER 5
RESULTS
5.1 Comparative analysis of existing CSLA and modified CSLA
In designing the 8-bit ALU using the efficient carry select adder, all the blocks
of the 16-bit SQRT CSLA, including the second level of the second block as both a
3-bit BEC and a 3-bit SHM, are implemented in the Dsch 2.6c logic editor and laid out
in the Microwind 2.6a layout editor in 0.12 µm technology, with 1.2 V as the logic
high voltage.
The first level of the second block in the 16-bit SQRT CSLA is a two-bit RCA,
which requires 56 transistors when implemented in CMOS logic. The second level of
the second block is a 3-bit SHM in the proposed design, which uses 24 transistors. The
third level of the second block is a multiplexer stage. A simple 2x1 multiplexer uses
six transistors in CMOS technology. Block 2 needs three 2x1 multiplexers, hence
eighteen transistors are required for this stage. The total number of transistors
required for the complete block 2 is only 98 when SHM is used, compared with
106 transistors with the BEC technique. With SHM, block 3 requires only 146
transistors, block 4 requires 194 and block 5 requires 242. With the BEC technique,
block 3 requires 158, block 4 requires 210 and block 5 requires 262 transistors.
A 16-bit SQRT CSLA implemented with SHM requires 736 transistors, whereas it
requires 792 transistors with the BEC technique.
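The transistor counts quoted above can be cross-checked with simple arithmetic. The sketch below only uses numbers stated in the text. The variable and function names are hypothetical.

```python
# Transistor counts quoted in the text for the CMOS implementation.
RCA_2BIT = 56          # first level of block 2
BEC_3BIT = 32          # existing second level
SHM_3BIT = 24          # proposed second level
MUX_2TO1 = 6           # one 2x1 multiplexer
MUXES_IN_BLOCK2 = 3    # third level of block 2

def block2_transistors(second_level: int) -> int:
    """Total for block 2: first-level RCA + second-level logic + three multiplexers."""
    return RCA_2BIT + second_level + MUXES_IN_BLOCK2 * MUX_2TO1

# Per-block totals quoted in the text for blocks 2 to 5.
BLOCK_TOTALS_BEC = {2: 106, 3: 158, 4: 210, 5: 262}
BLOCK_TOTALS_SHM = {2: 98, 3: 146, 4: 194, 5: 242}

# Saving of SHM over BEC in each block, and summed over blocks 2 to 5.
savings_per_block = {
    blk: BLOCK_TOTALS_BEC[blk] - BLOCK_TOTALS_SHM[blk] for blk in BLOCK_TOTALS_BEC
}
total_savings = sum(savings_per_block.values())
```

The summed saving over blocks 2 to 5 matches the difference between the two quoted totals for the complete adder (792 and 736).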
Finally, the complete second block of the 16-bit SQRT CSLA is implemented with
both BEC and SHM using CMOS technology, and the observed results are shown from
Table 5.1 onward.
5.2 Comparative analysis of existing ALU and modified ALU
All the basic gates in the ALU, such as AND, XOR, multiplexer and full adder,
are designed using the GDI technique. The full adder is designed using 10 transistors as
well as 8 transistors. The final comparison of the 8-bit ALU is made between the
ripple carry adder and the carry select adder.
The 8-bit ALU using the efficient carry select adder is faster
than the 8-bit ALU using ripple carry adders. Based on the critical path delays, the
ALU using the efficient carry select adder gives a 41.6% advantage with the
10-transistor adder and a 44.7% advantage with the 8-transistor adder. The
corresponding results are shown in tables 5.3 and 5.4.
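Table 5.1 Comparison of second level 2-bit RCA, 3-bit BEC and 3-bit SHM implemented using CMOS technology
| Logic for second level | Number of transistors | Critical path delay (ns) | Area (µm²) | Static power (µW) | Dynamic power (µW) | Total power (µW) |
|------------------------|-----------------------|--------------------------|------------|-------------------|--------------------|------------------|
| RCA using CMOS         | 56                    | 1.900                    | 1342       | 6.706             | 42.565             | 49.271           |
| BEC using CMOS         | 32                    | 1.200                    | 781        | 3.269             | 25.746             | 29.015           |
| SHM using CMOS         | 24                    | 2.350                    | 486        | 3.100             | 22.843             | 25.943           |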
Table 5.2 Comparison between second block with BEC and second block with SHM using CMOS
| Design type | Number of transistors | Critical path delay (ns) | Area (µm²) | Static power (µW) | Dynamic power (µW) | Total power (µW) |
|-------------|-----------------------|--------------------------|------------|-------------------|--------------------|------------------|
| RCA-BEC-MUX | 106                   | 3.240                    | 3465       | 21.005            | 106                | 127.005          |
| RCA-SHM-MUX | 98                    | 3.770                    | 2996       | 20.138            | 98.624             | 118.762          |
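Table 5.3 Comparison of 8-bit ALU using 10 transistor adder
| Model (ALU)                          | Number of transistors | Critical path delay (ns) | Area (µm²) | Power (mW) |
|--------------------------------------|-----------------------|--------------------------|------------|------------|
| 8-bit ALU using 10-transistor RCA    | 448                   | 3.195                    | 12384      | 0.204      |
| 8-bit ALU using 10-transistor CSLA   | 508                   | 1.865                    | 24682      | 0.205      |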
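Table 5.4 Comparison of 8-bit ALU using 8 transistor adder
| Model (ALU)                         | Number of transistors | Critical path delay (ns) | Area (µm²) | Power (mW) |
|-------------------------------------|-----------------------|--------------------------|------------|------------|
| 8-bit ALU using 8-transistor RCA    | 432                   | 3.745                    | 11832      | 0.221      |
| 8-bit ALU using 8-transistor CSLA   | 494                   | 2.070                    | 20988      | 0.262      |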
CHAPTER 6
CONCLUSIONS AND FUTURE SCOPE
6.1 Conclusions
In the process of designing a low power ALU, various tradeoffs between area,
delay and power dissipation were encountered. As the adder is the main block in the
ALU, an efficient adder is always preferred. For that reason, the SQRT carry select
adder is modified to be more power and area efficient.
In this process all second-level RCA blocks of the 16-bit SQRT CSLA are
replaced by SHM, and the results are compared with the existing BEC technique.
From the comparison in Table 5.1, replacing the 2-bit RCA with the proposed
3-bit SHM reduces the number of transistors by 57.1%, the area by 63.7% and the
power dissipation by 47.3%. Replacing the 2-bit RCA with the existing 3-bit BEC
gives only a 42.8% reduction in the number of transistors, a 41.8% reduction in
area and a 41.1% reduction in power dissipation.
Finally, the second block of the 16-bit SQRT CSLA is designed using the logic-level
modification SHM in place of BEC. From table 5.2, it is observed that the number of
transistors is reduced by 7.5%, the area is reduced by 13.5% and the power is reduced
by 6.4%, but the critical path delay is increased by 16.3%. This again shows the
tradeoff between area, power and delay: the design is optimized for power and area
at the cost of a delay overhead. This delay overhead can also be overcome by using
various existing low power circuit-level modifications.
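The percentages in this paragraph can be recomputed from the values in Tables 5.1 and 5.2. The sketch below is a simple check with hypothetical names. Values computed this way can differ in the last digit from the quoted ones (for example the area reduction of SHM over RCA), which suggests the quoted figures were truncated rather than rounded.

```python
def percent_reduction(reference: float, value: float) -> float:
    """Reduction of `value` relative to `reference`, in percent (negative = increase)."""
    return (reference - value) / reference * 100.0

# (transistors, delay in ns, area in um^2, total power in uW) from Table 5.1.
RCA = (56, 1.900, 1342, 49.271)
BEC = (32, 1.200, 781, 29.015)
SHM = (24, 2.350, 486, 25.943)

# Same quantities for the complete second block, from Table 5.2.
BLOCK_BEC = (106, 3.240, 3465, 127.005)
BLOCK_SHM = (98, 3.770, 2996, 118.762)

LABELS = ("transistors", "delay", "area", "power")

def compare(reference, candidate):
    """Percent reduction of each quantity of `candidate` against `reference`."""
    return {
        label: percent_reduction(ref, val)
        for label, ref, val in zip(LABELS, reference, candidate)
    }

shm_vs_rca = compare(RCA, SHM)
bec_vs_rca = compare(RCA, BEC)
block_shm_vs_bec = compare(BLOCK_BEC, BLOCK_SHM)
```

A negative entry, such as the delay of the SHM-based block, indicates an increase rather than a reduction.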
Using the proposed efficient carry select adder and the GDI technique, an 8-bit
ALU is designed with both 10-transistor and 8-transistor full adders and compared with
the existing 8-bit ALU using ripple carry adders in tables 5.3
and 5.4. It is observed that speed is increased by 41.6% in the case of the 10-transistor
full adder and by 44.7% in the case of the 8-transistor full adder.
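The speed figures follow from the critical path delays in Tables 5.3 and 5.4. The sketch below recomputes them and also reports the transistor and area overhead of the CSLA-based ALU from the same tables. The names are hypothetical and the code is only an arithmetic check.

```python
def delay_reduction(rca_delay_ns: float, csla_delay_ns: float) -> float:
    """Percent reduction in critical path delay when moving from RCA to CSLA."""
    return (rca_delay_ns - csla_delay_ns) / rca_delay_ns * 100.0

def overhead(rca_value: float, csla_value: float) -> float:
    """Percent increase of a cost metric when moving from RCA to CSLA."""
    return (csla_value - rca_value) / rca_value * 100.0

# Table 5.3: 10-transistor full adder (transistors, delay ns, area).
ALU10_RCA = (448, 3.195, 12384)
ALU10_CSLA = (508, 1.865, 24682)

# Table 5.4: 8-transistor full adder (transistors, delay ns, area).
ALU8_RCA = (432, 3.745, 11832)
ALU8_CSLA = (494, 2.070, 20988)

speed_gain_10t = delay_reduction(ALU10_RCA[1], ALU10_CSLA[1])
speed_gain_8t = delay_reduction(ALU8_RCA[1], ALU8_CSLA[1])
transistor_overhead_10t = overhead(ALU10_RCA[0], ALU10_CSLA[0])
area_overhead_10t = overhead(ALU10_RCA[2], ALU10_CSLA[2])
```

The overhead values make the tradeoff explicit: the faster CSLA-based ALU uses more transistors and more area than the RCA-based ALU.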
The proposed design has been shown to outperform the existing BEC-based design
in area and power. A satisfactory level of power consumption and propagation delay
can be achieved using the proposed technique without the need to purchase new
technology libraries, which may lead to design cost reduction. Consequently, the
proposed design is suitable for application in high-performance arithmetic and VLSI
circuits in the future.
6.2 Future Scope
The proposed work can be extended by increasing the number of bits and by
moving to newer technologies such as 0.08 and 0.06 micrometer processes. The
resulting design, with fewer transistors, will in turn reduce the total area and the
power consumption.
REFERENCES
[1] Arun Prakash Singh, Rohit Kumar, “Implementation of 1-bit Full Adder Using
Gate Diffusion Input (GDI) cell”, International Journal of Electronics and
Computer Science Engineering.
[2] N. M. Chore, R. N. Mandavgane , “ A survay of low power high speed one bit
full adder”,recent advances in networking, VLSI and signal processing, ISSN:
1790-5117. ISBN: 978-960-474-162-5.
[3] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, A System
Perspective . Reading, MA: Addison- Wesley, 1993.
[4] Pardeep Kumar / International Journal of Engineering Research and
Applications(IJERA) ISSN: 2248-9622 Vol. 2, Issue 6, November- December
2012, pp.599-606
[5] M.sreedevi and p.jeno.paul “ Design and Optimization of a High Performance
Low-Power CMOS Flex Cell “, International Journal of Signal System Control
and Engineering Application, 2010, vol.3, no.4, pp.65-69. DOI:
10.3923/ijssceapp.2010.65.69.
[6] A good over view of leakage and reduction methods are explained in the book
Leakage and reduction in Nanometer CMOS Technologies ISBN 0-387-25737-3.
[7] M.Parvathi, N.Vasantha, K. Satya Prasad “Design of High Speed -Low Power-
High Accurate (HS-LP-HA) Adder “, ICECT, Internation conference on
Electronics Computer Technology Proceedings, 2012, pp: 523-527, 978-1-4673-
1850-1/12@2012, IEEE.
[8] K Allipeera, S Ahmed Basha, “An Efficient 64-Bit Carry Select Adder With Less
Delay And Reduced Area Application“, International Journal of Engineering
Research and Applications( IJERA) .ISSN: 2248-9622 www.ijera.com Vol. 2,
Issue 5, September- October 2012, pp.550-554
[9] O.J.Bedrij, “Carry Select Adder”, IRE Trans. Electron. Comput.pp. 340-344,1962.
[10] U.Sreenivasulu, T.Venkata Sridhar, “Implementation of An 4 Bit - ALU Using
Low-Power And Area-Efficient Carry Select Adder”, International Conference on
Electronics and Communication Engineering, 20th, May 2012, Bangalore, ISBN:
978-93-81693-29-2.
[11] A.Andamuthu, S.Rithanyaa, ”Design Of 128 Bit Low Power and Area
Efficient Carry Select Adder”, International Journal of Advanced Research in
Engineering (IJARE) Vol 1, Issue 1,2012 Page 31-34.
[12] B.Ramkumar, H.M.Kittur, and P .M.Kannan, “ASIC implementation of
modified faster carry save adder”, EUR .J. Sci .Res. vol.42, no.1, pp.53-58, 2010.
[13] T.Y.Ceaing and M.J.Hsaio, “carry –select adder using single ripple carry
adder”, Electron. Lett. Vol.34,no.22,pp.2101-2103, oct.1998
[14] Y.Kim and L.S.Kim, “64-bit carry select adder with reduced area”, Electron.
Lett. Vol.37,no.10,pp.614-615, May.2001.
[15] B RamKumar and Harish M Kittur, “Low –Power And Area -Efficient Carry
Select Adder”, IEEE Transactions on Very Large Scale Integration(VLSI)Systems
APPENDIX
About Microwind2
The MICROWIND2 program allows the student to design and simulate an
integrated circuit at physical description level. The package contains a library of
common logic and analog ICs to view and simulate. MICROWIND2 includes all the
commands for a mask editor as well as original tools never gathered before in a single
module (2D and 3D process view, Verilog compiler, tutorial on MOS devices). You
can gain access to Circuit Simulation by pressing one single key. The electric
extraction of your circuit is automatically performed and the analog simulator
produces voltage and current curves immediately. This includes details on the device
modeling, simulation at logic and layout levels.
Figure A: MICROWIND window as it appears at the initialization stage.
We use MICROWIND2 to draw the MOS layout and simulate its behavior.
Go to the directory in which the software has been copied (By default microwind2).
Double-click on the MicroWind3 icon. The MICROWIND2 display window includes
four main windows: the main menu, the layout display window, the icon menu and
the layer palette. The layout window features a grid; scaled in lambda (λ) units. The
lambda unit is fixed to half of the minimum available lithography of the technology.
The default technology is a CMOS 6-metal layers 0.12μm technology, consequently
lambda is 0.06μm (60nm).
Simulation of a layout
MICROWIND3 includes a 3D process viewer for that purpose. Click
Simulate → Process steps in 3D. The simulation of the CMOS fabrication process is
performed, step-by-step by a click on Next Step.
The picture on the left represents the nMOS device, pMOS device, common
polysilicon gate and contacts. The picture on the right represents the same portion of
layout with the metal layers stacked on top of the active device.
The inverter simulation is conducted as follows. Firstly, a VDD supply source
(1.2V) is fixed to the upper metal2 supply line, and a VSS supply source (0.0V) is
fixed to the lower metal2 supply line. The properties are located in the palette menu.
Simply click the desired property, and click on the desired location in the layout. Add
a clock on the inverter input node (the default node name clock1 has been changed
into Vin) and a visible property on the output node Vout.
The command Simulate → Run Simulation gives access to the analog
simulation. Select the simulation mode Voltage vs. Time. The analog simulation of
the circuit is performed. The time domain waveform, proposed by default, details the
evolution of the voltages in1 and out1 versus time. This mode is also called transient
simulation.
The command Simulate → Run Simulation gives access to four simulation
modes: voltage vs. time, voltage and current vs. time, static voltage vs. voltage, and
frequency vs. time. All of these simulation modes are applicable to the inverter
simulation. Because the layout InvSteps.MSK includes not only the correctly polarized
inverter but also several other MOS devices without any simulation properties, a
warning window appears prior to the analog simulation. In this case you may click
Simulate as is. In normal cases, all n-well regions should be tied to VDD.
The inverter consumes power during transitions, due to two separate effects.
The first is short circuit power, arising from the momentary short circuit current that
flows from VDD to VSS while the transistors are in an incomplete on/off state. The
second is charging/discharging power, which depends on the output wire capacitance.
With small loading, the short circuit power loss is dominant. With heavy loading, that
is a large output node capacitance, the load power is dominant.
The power consumption occurs briefly during transitions of the output, either
from 0 to 1 or from 1 to 0. The simulation shows the supply currents in the upper
window and all voltage waveforms in the lower window. The current consumption is
significant only during a very short period corresponding to the charge or discharge of
the output node. Without any switching activity, the current is almost zero.
Delay
As the number of gates connected to the inverter output node increases, the
load capacitance increases. The fan-out corresponds to the number of gates connected
to the cell output. Physically, a large fan-out means a large number of connections,
that is, a large load capacitance.
An inverter circuit is simulated using different clock, fan-out and supply
conditions. The initial configuration is based on one inverter controlled by a 2 GHz
clock, with its output connected either to a single inverter or to four inverters. The
supply voltage is 1.2V, with a 0.12μm CMOS technology.
Now we connect four inverter circuits to the output node, thus increasing the
load capacitance. In the simulation chronograms the inverter delay is significantly
increased. When we investigate the delay variation with the output capacitance load,
the curve shows that the gate delay varies almost linearly with the loading
capacitance. A 100fF load leads to around 300ps delay in CMOS 0.12μm technology.
In Microwind this type of plot is obtained with the command
parametric analysis. Load the file Invcapa.MSK and invoke the command parametric
analysis. By default the capacitance of the output node is increased step-by-step from
its default value Cdef to Cdef + 100fF. For each value of the output capacitance, the
analog simulation is performed, and the last computed rise time is plotted, appearing
as one single red dot in the graph. The complete graph is built once all analog
simulations have been completed. The memory button enables us to store one curve
prior to a new parametric simulation, for comparison purposes. Three main
parameters may be varied in the parametric analysis: the capacitance, the voltage and
the temperature. Several analog parameters may be monitored: rise and fall delay,
oscillating frequency, power consumption, final voltage of a node, cross talk, etc.
Power consumption
The power consumption P is computed by Microwind as the average of the product
of the supply voltage VDD and the supply current IDD, evaluated at each iteration
step. In other words,
\[ P = \frac{1}{N_{steps}} \sum_{k=1}^{N_{steps}} I_{DD}(k) \, V_{DD} \]
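The averaging can be written out in a few lines. This is a minimal sketch of the formula only, not of Microwind's internal implementation. The function name is hypothetical, and any current samples passed to it are made-up illustration values.

```python
from typing import Sequence

def average_power(idd_samples: Sequence[float], vdd: float) -> float:
    """Average of IDD * VDD over all simulation steps (amperes and volts give watts)."""
    if not idd_samples:
        raise ValueError("at least one current sample is required")
    return sum(i * vdd for i in idd_samples) / len(idd_samples)
```

For example, with a 1.2 V supply and a hypothetical current that alternates between 0 and 1 mA on successive steps, `average_power([0.0, 1e-3, 0.0, 1e-3], 1.2)` gives about 0.6 mW.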
Three main factors contribute to the power consumption P: the load capacitance C, the
supply voltage VDD and the clock frequency f. For a CMOS inverter, this relation is
usually represented by the first order approximation below. The equation
shows a linear dependence of the power consumption P on the total capacitance C
and the operating frequency f. The power consumption is also proportional to the
square of the supply voltage VDD.
\[ P = \frac{1}{2} \, \eta \, C \, V_{DD}^{2} \, f \]
η = switching activity factor
C = output load capacitance
VDD = supply voltage
f = clock frequency
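The first order approximation is easy to evaluate numerically. The sketch below uses a hypothetical function name. The operating point combines values that appear elsewhere in this appendix (1.2 V supply, 2 GHz clock, 100 fF load) with an assumed switching activity factor of 1, so the result is illustrative only.

```python
def dynamic_power(eta: float, c_load: float, vdd: float, freq: float) -> float:
    """First order dynamic power in watts: P = 0.5 * eta * C * VDD^2 * f."""
    return 0.5 * eta * c_load * vdd ** 2 * freq

# Illustrative operating point (eta = 1.0 is an assumption).
p_nominal = dynamic_power(eta=1.0, c_load=100e-15, vdd=1.2, freq=2e9)

# Doubling the frequency doubles the power. Halving VDD divides it by four.
p_double_f = dynamic_power(1.0, 100e-15, 1.2, 4e9)
p_half_vdd = dynamic_power(1.0, 100e-15, 0.6, 2e9)
```

The two extra evaluations show the linear dependence on frequency and the quadratic dependence on the supply voltage.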
Frequency dependence
We can verify the linear dependence of the power consumption on the
operating frequency by simulating a CMOS inverter circuit. At each time domain
analog simulation, we get a value of the power consumption, which is computed by
Microwind as the average product of the supply voltage VDD and the supply current
IDD. As the power consumption is linearly proportional to the clock frequency, a usual
metric found in most cell libraries is μW/GHz.
Supply voltage dependence
It can be considered as a first order approximation that the average power
consumption is proportional to VDD^2. We use the parametric analysis tool in
Microwind to sweep the supply voltage from 0.5 to 2.0 V in steps of 0.1 V. In the
measurement window, the item dissipation is selected. The result shows a nonlinear
dependence of the power dissipation on VDD. The square law fits the experimental
data from 0.8 to 1.5 V. We notice a very sharp rise of the power consumption above
1.5 V, due to avalanche effects in n-channel MOS devices. The simulation
demonstrates the benefit of operating at the minimum supply voltage to achieve
low power operation.
Minimum supply voltage
We need to know the supply voltage below which the inverter no longer works.
The answer is given by the parametric analysis, focusing this time on the
dependence of the inverter delay on the supply voltage. Load the file cmosload.msk for
this study. Invoke the command parametric analysis of the analysis menu. Click the
layout region corresponding to the node VDD. Verify that the voltage menu is selected
in the parametric analysis window. Verify that the node VDD is selected. Modify the
VDD voltage range to run from 0.5 to 1.5 V with a step of 0.1 V. Finally, in the
measurement menu, select the item rise delay and click start analysis.
We observe that the delay increases significantly as we decrease VDD from
its nominal value of 1.2V down to 0.6V. Below 0.7V the inverter delay is longer than
the default transient simulation time, so the delay evaluator no longer works.
Static characteristics
The static characteristics of the inverter correspond to the variation plot of the
output voltage versus the input voltage. The simulation involves a step by step
increase of Vin, and the monitoring of Vout. In the simulation window, the static
characteristics are obtained by a click on the item voltage versus voltage situated in
the selection menu, at the bottom of the chronograms.
When Vin is low, Vout is high, which corresponds to one logic state of the
inverter. When Vin increases, Vout starts to decrease slowly and then suddenly
crosses the VDD/2 boundary. At that point the value of Vin is the commutation point
of the inverter, called Vc. Then, when Vin rises to VDD, Vout reaches 0, which
corresponds to the other logic state of the inverter.
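The commutation point can be extracted from a sampled static characteristic by locating the VDD/2 crossing. The sketch below is one possible way to do this using linear interpolation between samples. The function name is hypothetical and this is not how Microwind itself reports Vc.

```python
from typing import Optional, Sequence

def commutation_point(vin: Sequence[float], vout: Sequence[float],
                      vdd: float) -> Optional[float]:
    """Return the Vin at which Vout falls through VDD/2, or None if it never does."""
    if len(vin) != len(vout):
        raise ValueError("vin and vout must have the same length")
    half = vdd / 2.0
    for k in range(len(vin) - 1):
        hi, lo = vout[k], vout[k + 1]
        if hi >= half >= lo and hi != lo:
            # Linear interpolation between the two samples around the crossing.
            return vin[k] + (hi - half) * (vin[k + 1] - vin[k]) / (hi - lo)
    return None
```

Applied to a made-up curve such as Vin = [0.0, 0.4, 0.5, 0.6, 0.7, 1.2] and Vout = [1.2, 1.15, 1.0, 0.2, 0.05, 0.0] with VDD = 1.2 V, the crossing lies between the samples at 0.5 V and 0.6 V.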
About DSCH3
The DSCH3 program is a logic editor and simulator. DSCH3 is used to
validate the architecture of the logic circuit before the microelectronics design is
started. DSCH3 provides a user-friendly environment for hierarchical logic design,
and fast simulation with delay analysis, which allows the design and validation of
complex logic structures. Some techniques for low power design are described in the
manual. DSCH3 also features the symbols, models and assembly support for 8051.
DSCH3 also includes an interface to SPICE.
Features
Figure B: DSCH schematic editor
• User-friendly environment for rapid design of logic circuits.
• Supports hierarchical logic design.
• Handles both conventional pattern-based logic simulation and intuitive on-screen mouse-driven simulation.
• Built-in extractor, which generates a SPICE netlist from the schematic diagram (compatible with PSPICE™ and WinSpice™).
• Current and power consumption analysis.
• Generates a VERILOG description of the schematic for the layout editor.
• Immediate access to symbol properties (delay, fan-out).
• Models and supports the 8051 microcontroller.
An example of designing a schematic diagram in DSCH and generating its layout
in MICROWIND is shown here. The CMOS inverter design is detailed in
figure C below. First click New in the main menu, then draw the circuit diagram in the
DSCH window by dragging the components from the symbol library, as shown below.
Figure C: Inverter circuit
Save the file and click Simulate → Start simulation in the main menu. Then
click inside the buttons situated on the left part of the diagram. The result is displayed
on the LED. Here the p-channel MOS and the n-channel MOS transistors function as
switches, as shown in figure D. When the input signal is logic 0, as shown in figure
5.4, the NMOS is switched off while the PMOS passes VDD to the output. When the
input signal is logic 1, as shown in figure 6.12, the PMOS is switched off while the
NMOS passes VSS to the output.
Figure D: Circuit diagram of the CMOS inverter, and the CMOS inverter during simulation
The fan-out corresponds to the number of gates connected to the inverter
output. Physically, a large fan-out means a large number of connections, that is, a large
load capacitance. If we simulate an inverter loaded with a single gate, the
switching delay is small. If we instead load the inverter with several gates, the delay
and the power consumption both increase. The power consumption increases linearly
with the load capacitance.
This is mainly due to the current needed to charge and discharge that
capacitance. Click the button Stop simulation shown in the figure below to return
to the editor.
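The dependence of delay and power on fan-out described above can be summarized with the standard first-order CMOS model. This is a sketch under stated assumptions, not a result taken from the DSCH simulation: N (fan-out), C_in (input capacitance of each driven gate), C_wire (wiring capacitance), R_eq (equivalent on-resistance of the driving transistor), alpha (switching activity) and f (switching frequency) are symbols introduced here for illustration.

```latex
\[
C_L \;\approx\; N\,C_{in} + C_{wire}, \qquad
t_p \;\approx\; 0.69\,R_{eq}\,C_L, \qquad
P_{dyn} \;=\; \alpha\,C_L\,V_{DD}^{2}\,f
\]
```

In this model both the propagation delay and the dynamic power are proportional to C_L, so each additional gate on the output increases them by a fixed increment.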
Figure E: Timing diagram of inverter
Click the chronogram icon to get access to the chronograms of the previous
simulation. As seen in the waveform, the value of the output is the logic opposite of
that of the input.
Generation of layout of the schematic diagram
Next, open the Microwind window and click Open in the main menu. Then
open the CMOS inverter circuit diagram. Click Compile Verilog File to compile
the Verilog description of the corresponding circuit diagram. This generates the
corresponding stick diagram of the inverter circuit, as shown in the figure. Then click
the Simulate icon in the main menu to generate the waveforms.
Verilog program
// DSCH Ver 3.0
// G:\project\dsch microwind\self\example.sch
module example (in1, out1);
input in1;
output out1;
pmos #(17) pmos_1(out1,vdd,in1); // 2.0u 0.12u
nmos #(17) nmos_2(out1,vss,in1); // 1.0u 0.12u
endmodule
// Simulation parameters in Verilog Format
always #1000 in1=~in1;
// in1 CLK 10
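The netlist above can be checked outside DSCH with a small self-checking testbench. The following is a minimal sketch, not DSCH output: the module names inv_sketch and inv_sketch_tb are hypothetical, the supply1/supply0 declarations for vdd and vss are an assumption added so that the rails are driven in a stand-alone simulator, and the #(17) switch delays are left out for simplicity.

```verilog
// Sketch only: inv_sketch and inv_sketch_tb are hypothetical names, not DSCH output.
`timescale 1ns/1ps

module inv_sketch (in1, out1);
  input  in1;
  output out1;
  supply1 vdd;  // logic-1 rail (assumed; the generated netlist leaves vdd undeclared)
  supply0 vss;  // logic-0 rail (assumed; the generated netlist leaves vss undeclared)
  // Switch primitive terminal order is (output, data, control).
  pmos pmos_1 (out1, vdd, in1);  // conducts when in1 = 0, pulling out1 to vdd
  nmos nmos_2 (out1, vss, in1);  // conducts when in1 = 1, pulling out1 to vss
endmodule

module inv_sketch_tb;
  reg  in1;
  wire out1;

  inv_sketch dut (.in1(in1), .out1(out1));

  initial begin
    in1 = 1'b0;
    #10;
    if (out1 !== 1'b1) $display("FAIL: in1=0 out1=%b", out1);
    in1 = 1'b1;
    #10;
    if (out1 !== 1'b0) $display("FAIL: in1=1 out1=%b", out1);
    $display("done: in1=%b out1=%b", in1, out1);
    $finish;
  end
endmodule
```

The testbench applies both input values and reports a FAIL line if the output is not the logical complement of the input, which mirrors the manual check made with the button and LED in DSCH.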
Layout
This section describes the procedure for manually creating the layout of a CMOS
inverter. Click the MOS generator icon on the palette. The following
window appears. By default, the proposed length is the minimum length available in
the technology (2 lambda), and the width is 10 lambda. In 0.12μm technology, where
lambda is 0.06μm, the corresponding size is 0.12μm for the length and 0.6μm for the
width. Click on the top of the nMOS to fix the pMOS device. The result is displayed
in figure F.
Figure F: Layout of inverter in MICROWIND
Figure G: Selecting the NMOS device
Connection between devices
Within CMOS cells, metal and polysilicon are used as interconnects
for signals. Metal is a much better conductor than polysilicon. Consequently,
polysilicon is only used to interconnect gates, such as the bridge (1) between pMOS
and nMOS gates, as described in the schematic diagram of figure G. Polysilicon is
rarely used for long interconnects, except if a huge resistance value is expected. In the
layout shown in figure G, the polysilicon bridge links the gate of the n-channel MOS
with the gate of the p-channel MOS device. The polysilicon serves as the gate control
and the bridge between MOS gates.
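The preference for metal on long interconnects follows from the usual sheet-resistance estimate of a wire. The relation below is a sketch for illustration: R_sq (sheet resistance of the layer, in ohms per square), L (wire length) and W (wire width) are symbols introduced here, and no numerical values for the 0.12μm process are assumed.

```latex
\[
R_{wire} \;=\; R_{sq}\,\frac{L}{W}
\]
```

For a fixed geometry L/W, the wire resistance scales directly with the sheet resistance of the layer. A long polysilicon track therefore presents a much larger resistance than a metal track of the same shape, which is acceptable only for short gate-to-gate bridges or where a resistor is intended.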
Figure H: Connections required to build the inverter (CmosInv.SCH)