B.Supmonchai July 5, 2005
2102-545 Digital ICs 1
Chapter 12
Arithmetic Building Blocks
Boonchuay Supmonchai
Integrated Design Application Research (IDAR) Laboratory
August 20, 2004; Revised - July 5, 2005
2102-545 Digital ICs Arithmetic Building Blocks 2
B.Supmonchai
Goals of This Chapter
• Designing for performance, area, or power
  - Adders
  - Multipliers
  - Shifters
• Logic and system optimizations for datapath modules
• Power-delay trade-offs in datapaths
Review: A Generic Processor
[Figure: block diagram of a generic processor with four parts]
• Datapath: adder, multiplier, shifter, comparator, etc.
• Memory: RAM, ROM, shift register
• Control: FSM, PLA, counter, random logic
• Input/Output: switches, arbiters, bus drivers
Bit-Sliced Architecture
[Figure: a datapath unit built from identical bit slices (Bit 0, Bit 1, …, Bit n-2, Bit n-1); each slice contains a register, adder, shifter, and multiplexer; n-bit data in, n-bit data out, shared control signals]
• Identical processing elements
• Modular
  - Easy to design and verify
  - Easy to expand
• Potential to be fast
Example: Itanium Bit-Sliced Design
[Figure: Itanium bit-sliced datapath (bit slices 0 to 63): three adder stages separated by wiring channels, sum select, shifter, multiplexers, and loopback buses; inputs from register files / cache / bypass, outputs to register files / cache]
Example: Itanium Integer Datapath
Itanium has 6 integer execution units (ALU)
One-Bit Binary Full Adder (FA)
[Figure: 1-bit full adder (FA) block with inputs A, B, Cin and outputs S, Cout]
    A  B  Cin | Cout  S | Carry status
    0  0  0   |  0    0 | kill
    0  0  1   |  0    1 | kill
    0  1  0   |  0    1 | propagate
    0  1  1   |  1    0 | propagate
    1  0  0   |  0    1 | propagate
    1  0  1   |  1    0 | propagate
    1  1  0   |  1    0 | generate
    1  1  1   |  1    1 | generate
    S = A ⊕ B ⊕ Cin
    Cout = AB + ACin + BCin
• A very common operation, so it is worth spending some time optimizing it
  - Often in the critical path, so both logic-level and circuit-level optimizations need to be considered
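As a quick check of the truth table and the two Boolean equations, the sketch below enumerates all eight input combinations. The function names `full_adder` and `carry_status` are illustrative and do not come from the slides.

```python
from itertools import product

def full_adder(a, b, cin):
    """Return (s, cout) for one-bit inputs, using the slide's equations."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def carry_status(a, b):
    """Classify an input pair by what it does to an incoming carry."""
    if a & b:
        return "generate"
    if a ^ b:
        return "propagate"
    return "kill"

for a, b, cin in product((0, 1), repeat=3):
    s, cout = full_adder(a, b, cin)
    # The two output bits must encode the arithmetic sum a + b + cin.
    assert 2 * cout + s == a + b + cin
    print(a, b, cin, "|", cout, s, "|", carry_status(a, b))
```

The carry status depends only on A and B, which is what makes the generate/propagate formulation possible.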
Propagate, Generate, and Delete (Kill)
• Define 3 new variables that depend ONLY on A and B
    Generate (G) = AB          (FA itself generates a carry)
    Propagate (P) = A ⊕ B      (FA passes along carry)
    Delete (D) = !A !B         (FA stops propagation of carry)
• Then we can write S and Cout in terms of G, P, and Cin
    S(G,P,Cin) = P ⊕ Cin
    Cout(G,P,Cin) = G + PCin
• We can also write S and Cout in terms of D, P, and Cin
• Sometimes an alternative definition for P can be used
    Propagate (P) = A + B
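The following sketch checks the G/P/D formulation exhaustively. The D-based carry expression used here, Cout = !(D + P !Cin), is one possible form and is an assumption, since the slides do not spell it out. The check also shows that the alternative P = A + B still yields the correct Cout but can no longer be used to form S.

```python
from itertools import product

def gpd(a, b):
    """Return (G, P, D) with the XOR definition of propagate."""
    return a & b, a ^ b, (1 - a) & (1 - b)

def check_formulation():
    """Return True if all identities hold for every input combination."""
    for a, b, cin in product((0, 1), repeat=3):
        g, p, d = gpd(a, b)
        s_ref = a ^ b ^ cin
        cout_ref = (a & b) | (a & cin) | (b & cin)
        ok = (
            g + p + d == 1                                # exactly one status is active
            and s_ref == p ^ cin                          # S = P xor Cin
            and cout_ref == g | (p & cin)                 # Cout = G + P Cin
            and cout_ref == 1 - (d | (p & (1 - cin)))     # assumed D-based form
            and cout_ref == g | ((a | b) & cin)           # Cout with P = A + B
        )
        if not ok:
            return False
    return True

def or_propagate_gives_sum(a, b, cin):
    """Does S = (A + B) xor Cin hold for this input? (It fails when A = B = 1.)"""
    return (a ^ b ^ cin) == ((a | b) ^ cin)

print(check_formulation())
```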
FA CMOS Implementation: First Try
[Figure: complementary static CMOS schematic of the full adder with inputs A, B, Cin and outputs Cout and S]
• 32 transistors
• Majority function Maj(A,B,C): outputs 0 or 1, whichever value occurs more often at the inputs
Improved CMOS Implementation
• A more compact design is based on the observation that S can be factored to reuse the Cout term
    S = ABCin + (A + B + Cin)!Cout
[Figure: gate-level implementation built around a minority function of A, B, and Cin, with outputs Cout and S]
Improved CMOS Implementation II
[Figure: transistor-level schematic of the improved static CMOS full adder with inputs A, B, Ci, internal node X, and output Co, between VDD and ground]
• 28 transistors
Notes on Improved CMOS FA
• Note that the PMOS network is identical to the NMOS network rather than being its complement.
  - This is possible because of the inversion property, which says that the function of complemented inputs is equal to the complement of the function.
  - This simplification reduces the number of series transistors and makes the layout more uniform.
• This design has a greater delay to compute S than Cout.
  - Most of the time, the extra delay in computing S has little effect on the critical path, because the carry is the signal that propagates.
  - With proper sizing, this delay on S can be minimized.
Inversion Property
[Figure: a full adder with inputs A, B, Ci and outputs S, Co is equivalent to the same full adder with all inputs and outputs complemented]
    !S(A, B, Ci) = S(!A, !B, !Ci)
    !Co(A, B, Ci) = Co(!A, !B, !Ci)
• The function must be symmetric
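The inversion property can be confirmed by brute force over the eight input combinations. The sketch below also checks the factored sum S = ABCin + (A + B + Cin)!Cout used by the compact CMOS implementations. Function names are illustrative.

```python
from itertools import product

def full_adder(a, b, cin):
    """Return (s, cout) from the standard full-adder equations."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def inv(x):
    """Complement of a single bit."""
    return 1 - x

def inversion_property_holds():
    """Complementing all inputs must complement both outputs."""
    for a, b, cin in product((0, 1), repeat=3):
        s, cout = full_adder(a, b, cin)
        if full_adder(inv(a), inv(b), inv(cin)) != (inv(s), inv(cout)):
            return False
    return True

def factored_sum_holds():
    """S must equal ABCin + (A + B + Cin)!Cout for every input."""
    for a, b, cin in product((0, 1), repeat=3):
        s, cout = full_adder(a, b, cin)
        if s != (a & b & cin) | ((a | b | cin) & inv(cout)):
            return False
    return True

print(inversion_property_holds(), factored_sum_holds())
```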
TG-Based FA
[Figure: transmission-gate full adder built from two XOR stages and a 2-to-1 MUX; inputs A, B, Cin, internal propagate signal P, outputs S and Cout]
• 16 transistors
• Extra delay: slower
Complementary PT Logic (CPL) FA
[Figure: CPL full adder with dual-rail inputs (A, B, Cin and their complements) and dual-rail outputs (S, Cout and their complements)]
• 28 transistors, dual rail
• Voltage drop problems
• Faster, lower power, and smaller area than full static CMOS
Mirror Adder
[Figure: mirror adder schematic producing !Cout and !S, with the kill, generate, 0-propagate, and 1-propagate paths marked and transistor sizes annotated]
• 24 + 4 transistors
    Cout = AB + ACin + BCin
    S = ABCin + (A + B + Cin)!Cout
• PUN and PDN are symmetrical, not complementary
Mirror Adder Features
• The NMOS and PMOS chains are completely symmetrical, with a maximum of two series transistors in the carry circuitry. This guarantees identical rise and fall transitions if the NMOS and PMOS devices are properly sized.
• When laying out the cell, the most critical issue is minimizing the capacitance at node !Cout (four diffusion capacitances, two internal gate capacitances, and two inverter gate capacitances).
  - Shared diffusions can reduce the stack node capacitances.
• The transistors connected to Cin are placed closest to the output.
Mirror Adder Sizing Issues
• Only the transistors in the carry stage have to be optimized for speed. All transistors in the sum stage can be minimum size.
• Assume a PMOS/NMOS ratio of 2. Each input in the carry circuit has a logical effort of 2, so the optimal fan-out for each is also 2.
• Since !Cout drives 2 internal and 2 inverter transistor gates (to form Cout for the bit adder), the carry circuit should be oversized.
Mirror Adder Stick Diagram
[Figure: stick diagram of the mirror adder cell between the VDD and GND rails, with inputs A, B, Ci and outputs Co and S]
Ripple Carry Adder (RCA)
[Figure: 4-bit ripple carry adder: four FAs in a chain with inputs A0..A3 and B0..B3, carry-in C0 = Cin, internal carries C1, C2, C3, sums S0..S3, and Cout = C4]
• Worst-case delay: $t_{ripple} = O(N)$
    $t_{ripple} \approx t_{FA}(A,B \to C_{out}) + (N - 2)\,t_{FA}(C_{in} \to C_{out}) + t_{FA}(C_{in} \to S)$
• Slow!
• Make the fastest possible carry path
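A behavioral sketch of the ripple carry adder, together with the worst-case delay estimate, is given below. Bits are stored least significant first, and the default unit delays in `ripple_delay` are assumptions chosen only for illustration.

```python
def to_bits(x, n):
    """Little-endian list of the n low bits of x."""
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    """Integer value of a little-endian bit list."""
    return sum(bit << i for i, bit in enumerate(bits))

def ripple_carry_add(a_bits, b_bits, cin=0):
    """Chain one full adder per bit; return (sum_bits, cout)."""
    carry = cin
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | (a & carry) | (b & carry)
    return sum_bits, carry

def ripple_delay(n, t_ab_cout=1.0, t_cin_cout=1.0, t_cin_s=1.0):
    """Worst-case delay: one A,B->Cout, (n - 2) Cin->Cout, one Cin->S."""
    return t_ab_cout + (n - 2) * t_cin_cout + t_cin_s

s_bits, cout = ripple_carry_add(to_bits(11, 4), to_bits(6, 4))
print(from_bits(s_bits), cout)   # 11 + 6 = 17 -> sum bits 1, carry out 1
print(ripple_delay(4))           # 4.0 with unit delays
```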
Exploiting the Inversion Property
[Figure: 4-bit ripple carry adder with alternating regular and inverted FA cells; inputs A0..A3 and B0..B3, carry-in C0 = Cin, carries C1, C2, C3, sums S0..S3, Cout = C4]
• Now need two “flavors” of FAs
• Minimizes the critical path (the carry chain) by eliminating inverters between the FAs
  - Requires increasing the transistor sizes on the carry chain portion of the mirror adder
Fast Carry Chain Design
• The key to fast addition is a low-latency carry network
• What matters is whether in a given position a carry is
  - Generated: Gi = AiBi
  - Propagated: Pi = Ai ⊕ Bi (sometimes Ai | Bi is used)
  - Annihilated (killed): Ki = !Ai !Bi
• Giving a carry recurrence of Ci+1 = Gi + PiCi
    C1 = G0 + P0C0
    C2 = G1 + P1G0 + P1P0C0
    C3 = G2 + P2G1 + P2P1G0 + P2P1P0C0
    C4 = G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0C0
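To see that the expanded expressions are just the recurrence unrolled, the sketch below computes the carries both ways and compares them for every combination of G, P, and C0 on four bits. The function names are illustrative.

```python
from itertools import product

def carries_recurrence(g, p, c0):
    """C[i+1] = G[i] + P[i] C[i], evaluated one position at a time."""
    c = [c0]
    for gi, pi in zip(g, p):
        c.append(gi | (pi & c[-1]))
    return c

def carries_expanded(g, p, c0):
    """C[i+1] = G[i] + P[i]G[i-1] + ... + P[i]...P[0]C0, as a sum of products."""
    c = [c0]
    for i in range(len(g)):
        terms = []
        for j in range(i, -1, -1):
            term = g[j]                    # carry generated at position j ...
            for k in range(j + 1, i + 1):
                term &= p[k]               # ... and propagated through j+1..i
            terms.append(term)
        term = c0                          # incoming carry propagated through 0..i
        for k in range(i + 1):
            term &= p[k]
        terms.append(term)
        c.append(int(any(terms)))
    return c

def forms_agree(n=4):
    """Compare the two evaluations for all G, P, and C0 values."""
    for g in product((0, 1), repeat=n):
        for p in product((0, 1), repeat=n):
            for c0 in (0, 1):
                if carries_recurrence(g, p, c0) != carries_expanded(g, p, c0):
                    return False
    return True

print(forms_agree())
```

The expanded form removes the serial dependency between carries at the cost of product terms whose fan-in grows with the bit position.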
Manchester Carry Chain
• Switches controlled by Gi and Pi
• Components of total delay
  - time to form the switch control signals Gi and Pi
  - setup time for the switches
  - signal propagation delay through N switches in the worst case
[Figure: static Manchester carry gate (signals Ci, Co, Gi, Di, Pi, VDD) and domino Manchester carry gate (signals Ci, Co, Gi, Pi, clock φ, VDD)]
4-bit Sliced MCC Adder
[Figure: 4-bit Manchester carry chain adder: each bit slice forms G (AND) and P (XOR) from Ai and Bi; the clocked chain carries !C0 through !C1, !C2, !C3 to !C4; XOR gates produce S0..S3]
Domino MCC Circuit
[Figure: 4-bit domino Manchester carry chain with carry-in Ci,0, carry-out Ci,4, pass transistors controlled by P0..P3, transistors controlled by G0..G3, clocked (clk) devices, and transistor sizes annotated]
• Values on the successive carry nodes:
    G0 + P0C0
    G1 + P1G0 + P1P0C0
    G2 + P2G1 + P2P1G0 + P2P1P0C0
    G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0C0
MCC Stick Diagram
[Figure: stick diagram of two Manchester carry chain bit slices between the VDD and GND rails: a propagate/generate row with inputs Pi, Gi, φ and Pi+1, Gi+1, φ, carry nodes Ci-1, Ci, Ci+1, and an inverter/sum row]
Notes on MCC Adder
• When the clock is low, the carry nodes precharge; when the clock goes high, if Gi is high, Ci+1 is asserted (goes low).
• To prevent Gi from affecting Ci, the signal Pi must be computed as the XOR (rather than the OR) of Ai and Bi. This is not a problem, since the XOR of Ai and Bi is needed for computing the sum anyway.
• Delay is roughly proportional to n^2 (as n pass transistors are connected in series).
  - We usually limit each group to 4 stages, then buffer the carry chain with an inverter between each group.
Binary Adder Landscape
[Figure: taxonomy of binary adders annotated with delay (t) and area (A) complexity]
• Bit-serial adders
• Asynchronous adders
• Synchronous word-parallel adders
  - Ripple carry adders (RCA)
  - Carry-propagation-minimizing adders
    - Signed-digit adders
    - Residue adders
    - Fast carry-propagate adders: Manchester carry chain, carry select, conditional sum, carry skip, parallel prefix
• Complexity annotations shown in the figure: t = O(N), A = O(N); t = O(1), A = O(N); t = O(√N), A = O(N); t = O(log N), A = O(N log N)
Carry-Skip (Carry-Bypass) Adder
[Figure: 4-bit ripple carry block (four FAs with inputs A0..A3 and B0..B3, carry-in Ci,0, carries C1, C2, C3, sums S0..S3) followed by a 2-to-1 multiplexer, controlled by BP, that selects between the rippled carry-out and Ci,0 to form Co,3]
    BP = P0 P1 P2 P3 (“Block Propagate”)
• If (P0 & P1 & P2 & P3 = 1) then Co,3 = Ci,0; otherwise the block itself kills or generates the carry internally
Carry-Skip Chain Implementation
[Figure: Manchester carry chain (Cin, P0..P3, G0..G3) with a bypass (BP) path from the block carry-in to the block carry-out, producing Cout]
• Only 10% to 20% area overhead
• Only two “gate delays” to produce Cout if a skip occurs
4-bit Block Carry-Skip Adder
[Figure: 16-bit carry-skip adder made of four 4-bit blocks (bits 0 to 3, 4 to 7, 8 to 11, 12 to 15), each with setup, carry propagation, and sum stages; carry-in Ci,0; the critical path is annotated with tsetup, tcarry, tskip, and tsum]
• Worst-case delay → carry from bit 0 to bit 15: the carry is generated in bit 0, ripples through bits 1, 2, and 3, skips the middle two groups, and ripples in the last group from bit 12 to bit 15 (B is the group size in bits)
    $t_{add} = t_{setup} + B\,t_{carry} + ((N/B) - 1)\,t_{skip} + B\,t_{carry} + t_{sum}$
Optimal Block Size and Time
• Assuming one stage of ripple (tcarry) has the same delay as one skip logic stage (tskip) and both are 1
    $t_{CSkA} = 1 + B + (N/B - 1) + B + 1 = 2B + N/B + 1$
    (terms in order: tsetup, ripple in block 0, skips, ripple in last block, tsum)
• So the optimal block size, B, is
    $dt_{CSkA}/dB = 0 \Rightarrow B_{opt} = \sqrt{N/2}$
• And the optimal time is
    $t_{CSkA,opt} = 2\sqrt{2N} + 1$
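The optimum can be checked numerically under the same unit-delay assumption. The sketch below evaluates the delay model for N = 32 and confirms that a block size of 4 is optimal. The helper `best_integer_block`, which searches only block sizes that divide N evenly, is an illustrative addition.

```python
import math

def carry_skip_delay(n, b):
    """Unit-delay model: t = 2B + N/B + 1."""
    return 2 * b + n / b + 1

def optimal_block_size(n):
    """Closed-form optimum obtained by setting dt/dB = 0."""
    return math.sqrt(n / 2)

def optimal_delay(n):
    """Delay at the closed-form optimum: 2*sqrt(2N) + 1."""
    return 2 * math.sqrt(2 * n) + 1

def best_integer_block(n):
    """Best block size among the divisors of n (brute-force search)."""
    divisors = [b for b in range(1, n + 1) if n % b == 0]
    return min(divisors, key=lambda b: carry_skip_delay(n, b))

n = 32
print(optimal_block_size(n))    # 4.0
print(carry_skip_delay(n, 4))   # 17.0
print(optimal_delay(n))         # 17.0
print(best_integer_block(n))    # 4
```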
Variations of Carry-Skip Adders I
• Variable-block-size carry-skip adders
  - A carry that is generated in, or absorbed by, one of the inner blocks travels a shorter distance through the skip blocks
  - Hence a carry-skip adder can have bigger blocks for the inner carries without increasing the overall delay
[Figure: carry-skip adder from Cin to Cout with N_B blocks of varying size]
    $t_{CSkA} = 2B + O(N_B)$, where $N_B$ is the number of blocks
Variations of Carry-Skip Adders II
• Multiple levels of skip logic
  - Carry-skip adders with a large number of bits suffer from linear carry propagation delay.
  - With added higher levels of skip logic, a carry-skip adder can skip more blocks at a time.
[Figure: carry-skip adder from Cin to Cout with skip level 1 and skip level 2; the level-2 skip signal is the AND of the first-level skip signals (BPs)]
    $t_{CSkA} = 2B + O(\log_B N)$
Carry-Skip Adder Comparisons
[Figure: delay comparison of RCA, CSkA, and VSkA at 8, 16, 32, 48, and 64 bits (vertical axis 0 to 70), with block sizes B = 2, 3, 4, 5, and 6 annotated]
Carry Select Adders
• Idea: precompute the carry out of each block for both carry_in = 0 and carry_in = 1 (can be done for all blocks in parallel), and then select the correct one
• More cost-effective than the ripple carry adder
[Figure: 4-bit carry select block: a setup stage forms the P’s and G’s from the A’s and B’s; “0” carry propagation and “1” carry propagation run in parallel; a multiplexer controlled by Cin selects the C’s and produces Cout; sum generation forms the S’s]
Carry Select Adder: Critical Path
[Figure: 16-bit linear carry select adder made of four 4-bit blocks (bits 0 to 3, 4 to 7, 8 to 11, 12 to 15), each with setup, “0” carry, “1” carry, mux, and sum generation stages; Cin enters the first block and Cout leaves the last]
    $t_{add} = t_{setup} + B\,t_{carry} + (N/B)\,t_{mux} + t_{sum}$
Square Root Carry Select Adders
[Figure: 20-bit carry select adder with blocks of increasing size (bits 0-1, 2-4, 5-8, 9-13, 14-19), each with setup, "0" carry, "1" carry, multiplexer, and sum generation stages; carry-in Ci,0; outputs S0-1, S2-4, S5-8, S9-13, S14-19; signal arrival times are annotated in parentheses along the carry and mux paths]
• Balance delay by making later blocks bigger
    $t_{add} = t_{setup} + 2\,t_{carry} + \sqrt{N}\,t_{mux} + t_{sum}$
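The delay models for the ripple, linear carry select, and square-root carry select adders can be compared numerically. In the sketch below every elementary delay is set to 1, which is an assumption made only to expose the growth rates, and `sqrt_block_sizes` is an illustrative helper that reproduces the 2, 3, 4, 5, 6 block sizes of the 20-bit example.

```python
import math

def ripple(n, t_fa=1):
    """Ripple carry: N full-adder delays in the worst case."""
    return n * t_fa

def linear_select(n, b=4, t_setup=1, t_carry=1, t_mux=1, t_sum=1):
    """t_add = t_setup + B t_carry + (N/B) t_mux + t_sum."""
    return t_setup + b * t_carry + (n / b) * t_mux + t_sum

def sqrt_select(n, t_setup=1, t_carry=1, t_mux=1, t_sum=1):
    """t_add = t_setup + 2 t_carry + sqrt(N) t_mux + t_sum."""
    return t_setup + 2 * t_carry + math.sqrt(n) * t_mux + t_sum

def sqrt_block_sizes(first, count):
    """Block sizes that grow by one bit per stage."""
    return [first + i for i in range(count)]

for n in (16, 64):
    print(n, ripple(n), linear_select(n), sqrt_select(n))

print(sqrt_block_sizes(2, 5), sum(sqrt_block_sizes(2, 5)))   # [2, 3, 4, 5, 6] 20
```

Each later block can be one bit longer because its select signal arrives one multiplexer delay later, so its ripple chains have that much extra time to settle.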
Adder Delays - Comparison
[Figure: delay (axis 0 to 40) versus number of bits N (axis 0 to 60) for the ripple adder, linear select, and square root select adders]
LookAhead - Basic Idea
[Figure: a carry network feeding bit slices 0 to N-1; slice k takes Ak and Bk, forms Pk, and combines it with the carry Ci,k to produce Sk]
    Co,k = f(Ak, Bk, Co,k-1) = Gk + PkCo,k-1
Look-Ahead: Topology
• By expanding carry generation all the way:
    C1 = G0 + P0C0
    C2 = G1 + P1G0 + P1P0C0
    C3 = G2 + P2G1 + P2P1G0 + P2P1P0C0
    C4 = G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0C0
    …
[Figure: transistor-level look-ahead carry gate connected to VDD that computes Co,3 from Ci,0, P0..P3, and the generate signals]
Logarithmic Look-Ahead Adder
[Figure: computing F from inputs A0..A7 with a linear chain of gates (tp ~ N) versus a balanced tree of gates (tp ~ log2(N))]
Parallel Prefix Adders (PPAs)
• Define the carry operator ∘ on (G,P) signal pairs
    (G,P) = (G’’,P’’) ∘ (G’,P’)
    where G = G’’ + P’’G’ and P = P’’P’
  - ∘ is associative, i.e.,
    [(g’’’,p’’’) ∘ (g’’,p’’)] ∘ (g’,p’) = (g’’’,p’’’) ∘ [(g’’,p’’) ∘ (g’,p’)]
[Figure: gate implementation of the G output of a ∘ cell, with inputs G’, G’’, P’’ and output !G]
PPA General Structure
• Given P and G terms for each bit position, computing all the carries is equal to finding all the prefixes in parallel
    (G0,P0) ∘ (G1,P1) ∘ (G2,P2) ∘ … ∘ (GN-2,PN-2) ∘ (GN-1,PN-1)
• Since ∘ is associative, we can group them in any order
  - but note that it is not commutative
• Measures to consider
  - number of ∘ cells
  - tree cell depth (time)
  - tree cell area
  - cell fan-in and fan-out
  - max wiring length
  - wiring congestion
  - delay path variation (glitching)
[Figure: three-layer PPA structure: Pi, Gi logic (1 unit delay); Ci parallel prefix logic tree (1 unit delay per level); Si logic (1 unit delay)]
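The two algebraic claims about the carry operator, that it is associative but not commutative, can be checked exhaustively, since each operand is just a pair of bits. In this sketch the left operand is taken to be the more significant pair (G’’, P’’), following the definition G = G’’ + P’’G’, P = P’’P’. The function name `op` is illustrative.

```python
from itertools import product

def op(left, right):
    """Carry operator: left = (G'', P''), right = (G', P')."""
    g2, p2 = left
    g1, p1 = right
    return (g2 | (p2 & g1), p2 & p1)

PAIRS = list(product((0, 1), repeat=2))   # all possible (G, P) values

def is_associative():
    """Check (x op y) op z == x op (y op z) for all operand values."""
    return all(
        op(op(x, y), z) == op(x, op(y, z))
        for x in PAIRS for y in PAIRS for z in PAIRS
    )

def is_commutative():
    """Check x op y == y op x for all operand values."""
    return all(op(x, y) == op(y, x) for x in PAIRS for y in PAIRS)

print(is_associative())   # True
print(is_commutative())   # False
print(op((1, 0), (0, 0)), op((0, 0), (1, 0)))   # (1, 0) (0, 0)
```

Associativity is what allows the prefix terms to be grouped into a tree of any shape. The lack of commutativity means the bit order of the operands must be preserved.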
Brent-Kung PPA
[Figure: 16-bit Brent-Kung parallel prefix computation network of ∘ cells; inputs (G0,P0) through (G15,P15) and Cin, outputs C1 through C16. Annotations: T = log2 N, T = log2 N - 2, A = 2 log2 N, A = N/2]
Kogge-Stone PPF Adder
[Figure: 16-bit Kogge-Stone parallel prefix computation network of ∘ cells; inputs (G0,P0) through (G15,P15) and Cin, outputs C1 through C16. Annotations: T = log2 N, A = log2 N, A = N]
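A behavioral sketch of a Kogge-Stone style prefix adder is shown below. It models only the logic, not the wiring or cell counts of the figure. Folding Cin into the bit-0 generate term is a modeling choice made here so that the prefix outputs are the carries directly. The function name and its return tuple are illustrative.

```python
def kogge_stone_add(a, b, n=16, cin=0):
    """Add two n-bit integers; return (sum, cout, prefix_levels)."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]   # bit generate
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]   # bit propagate

    # Group signals, starting as single-bit groups. Cin is folded into bit 0.
    G, P = g[:], p[:]
    G[0] = g[0] | (p[0] & cin)

    levels = 0
    dist = 1
    while dist < n:
        G_next, P_next = G[:], P[:]
        for i in range(dist, n):
            # Carry operator: merge group ending at i with group ending at i - dist.
            G_next[i] = G[i] | (P[i] & G[i - dist])
            P_next[i] = P[i] & P[i - dist]
        G, P = G_next, P_next
        dist *= 2
        levels += 1

    carries = [cin] + G            # carries[i] is the carry into bit i
    total = 0
    for i in range(n):
        total |= (p[i] ^ carries[i]) << i
    return total, carries[n], levels

print(kogge_stone_add(12345, 54321))   # (1130, 1, 4): 66666 = 65536 + 1130
```

For n = 16 the loop runs four times, matching the log2 N depth annotated on the slide.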
More Adder Comparisons
[Figure: delay comparison of RCA, CSkA, VSkA, and KS PPA at 8, 16, 32, 48, and 64 bits (vertical axis 0 to 70)]
Adder Speed Comparisons
[Figure: speed comparison of the RCA, MCC, CCSkA, VCSkA, CCSlA, and B&K adders at 16, 32, and 64 bits (vertical axis 10 to 70)]
Adder Average Power Comparisons
[Figure: average power comparison of the RCA, MCC, CCSkA, VCSkA, CCSlA, and B&K adders at 16, 32, and 64 bits (vertical axis 0 to 35)]
Binary Multiplication - Basics
• Given two unsigned binary numbers X (M bits) and Y (N bits)
    $X = \sum_{i=0}^{M-1} X_i 2^i \qquad Y = \sum_{j=0}^{N-1} Y_j 2^j$
    where $X_i, Y_j \in \{0, 1\}$
• The multiplication operation Z = X × Y is
    $\sum_{k=0}^{M+N-1} Z_k 2^k = \left(\sum_{i=0}^{M-1} X_i 2^i\right)\left(\sum_{j=0}^{N-1} Y_j 2^j\right) = \sum_{i=0}^{M-1}\left(\sum_{j=0}^{N-1} X_i Y_j 2^{i+j}\right)$
Binary Multiplication Operation
• Binary multiplication as repeated additions
              1 0 1 0 1 0      multiplicand
        ×         1 0 1 1      multiplier
              1 0 1 0 1 0
            1 0 1 0 1 0        partial product array
          0 0 0 0 0 0          (can be formed in parallel)
        1 0 1 0 1 0
        1 1 1 0 0 1 1 1 0      double precision product
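Shift-and-Add Multiplication
• Right shift and add (N bits × N bits)
[Figure: shift-and-add datapath: the multiplier register is shifted out one bit at a time (bit out); a 2-to-1 multiplexer selects the N-bit multiplicand or “0”; an N-bit adder produces an (N+1)-bit result that is shifted right into the accumulating product]
    * Left shift requires a 2N-bit adder
    $t_{shift\&add\_mult} = O(N \cdot t_{adder}) = O(N^2)$ for an RCA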
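The right-shift-and-add scheme can be sketched behaviorally as follows. The register names `acc` and `low` are illustrative, and the model assumes the usual arrangement in which the low half of the product shifts into the register that initially holds the multiplier.

```python
def shift_add_multiply(multiplicand, multiplier, n):
    """Multiply two unsigned n-bit integers using an n-bit adder and right shifts."""
    mask = (1 << n) - 1
    multiplicand &= mask
    acc = 0                    # upper product half (n + 1 bits right after an add)
    low = multiplier & mask    # multiplier register, refilled with low product bits
    for _ in range(n):
        # Mux: add the multiplicand if the current multiplier bit is 1, else add 0.
        addend = multiplicand if (low & 1) else 0
        acc += addend
        # Shift the pair {acc, low} right by one position.
        low = (low >> 1) | ((acc & 1) << (n - 1))
        acc >>= 1
    return (acc << n) | low

print(shift_add_multiply(0b101010, 0b1011, 6))   # 462
```

Only an n-bit adder is needed because the bits shifted out to the right are already final. The left-shift variant has to add into the full 2N-bit product.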
Improving Multipliers
• Making them faster (therefore, bigger area)
  - Use faster adders
  - Use higher radix (e.g., base 4) multiplication
    - Use multiplier recoding to simplify multiple formation
  - Form partial product array in parallel and add it in parallel
• Making them smaller (i.e., slower)
  - Use array multipliers
    - Very regular structure with only short wires to nearest-neighbor cells. Thus, very simple and efficient layout in VLSI
    - Can be easily and efficiently pipelined
Array (or Tree) Multiplier Structure
[Figure: three-phase multiplier structure]
• PP generation: multiple-forming circuits (muxes) select 0 or a multiple of the multiplicand D under control of the multiplier Q
• PP accumulation: partial product array reduction tree (log N)
• Final addition: fast carry-propagate adder (CPA, log N) producing the product P
Partial Product (PP) Generation
• Each row in the partial-product array is either a copy of the multiplicand or a row of zeros
• Careful optimization of the PP generation can lead to substantial delay and area reduction.
  - Booth’s and modified Booth’s recoding
[Figure: multiplicand bits X7..X0 are each gated by multiplier bit Yi to form PP7..PP0]
Array Multiplier Implementation
[Figure: 4×4 array multiplier with inputs X3..X0 and Y3..Y0 and product bits up to Z7; three adder rows (HA, FA, FA, HA in the first row; FA, FA, FA, HA in the next two), each row being the hardware for one partial product; two critical paths, CP1 and CP2, are marked]
    HA: half adder, FA: full adder, CP: critical path
    $t_{array\_mult} = [(M - 1) + (N - 2)]\,t_{carry} + (N - 1)\,t_{sum} + t_{and} = O(N)$
    * Assume tadd = tcarry
Carry-Save Multiplier
[Figure: 4×4 carry-save multiplier: rows of HAs and FAs pass their carries to the next row instead of along the same row; a vector merging adder forms the final result; annotated "6 HAs, 6 FAs" and "unique and shorter CP"]
• The idea is to “save” the (PP) carry and add it in the next adder stage
• In the final addition, a fast carry-propagate (e.g., carry-lookahead) adder is used.
    $t_{CSM} = (N - 1)\,t_{carry} + t_{merge} + t_{and} = O(N)$
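The carry-save idea can be sketched with ordinary integers, treating each bit position as an independent full adder. The running result is kept as a sum word and a carry word, and only the last step performs a carry-propagating addition, which plays the role of the vector merging adder. Function names are illustrative.

```python
def carry_save_step(s, c, x):
    """3:2 compression of three words: one full adder per bit, no carry ripple."""
    new_s = s ^ c ^ x
    new_c = ((s & c) | (s & x) | (c & x)) << 1   # carries move one position left
    return new_s, new_c

def carry_save_multiply(x, y, n):
    """Multiply unsigned x by the n-bit unsigned y using carry-save accumulation."""
    s, c = 0, 0
    for j in range(n):
        pp = (x << j) if (y >> j) & 1 else 0     # partial product row j
        s, c = carry_save_step(s, c, pp)
    return s + c                                  # final carry-propagate addition

print(carry_save_multiply(0b101010, 0b1011, 4))   # 462
```

Each step preserves the invariant that the sum word plus the carry word equals the total accumulated so far, so no carry needs to ripple until the end.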
CSM Floorplan
[Figure: floorplan of a 4×4 carry-save multiplier: rows of HA multiplier cells and FA multiplier cells, each with sum (S) and carry (C) outputs, and a row of vector merging cells; inputs X0..X3 and Y0..Y3, outputs Z0..Z7]
• X and Y signals are broadcast through the complete array.
• Regularity makes the generation of the structure amenable to automation.
Wallace-Tree Multiplier
[Figure: dot diagrams over bit positions 6 to 0 showing the partial products, the first stage, the second stage, and the final adder, with HA and FA groupings marked]
• Rearranging PPs: cover the tree with HAs and FAs, starting from the densest part
• Any type of adder can be used for the final adder
• GOAL: minimize depth (number of stages) with the minimum number of adder elements
Wallace-Tree Multiplier Implementation
[Figure: 4×4 Wallace-tree multiplier: the partial products xiyj feed a first stage and a second stage of HAs and FAs, followed by the final adder producing z7..z0]
• 3 HAs and 3 FAs for the reduction process (stage 1 + stage 2)
• Any type of adder can be used for the final adder
Notes on Wallace-Tree Multiplier
• Wallace tree substantially saves hardware for large multipliers
  - The number of partial products is reduced to two-thirds per stage (each layer of 3-2 compressors turns three rows into two)
• The propagation delay is bounded by
    $t_{WTM} = O(\log_{3/2} N)$
• Although substantially faster than the CSM, the WTM structure is very irregular
  - Difficulty in finding an efficient VLSI layout
• Many of today’s high-performance multipliers use higher-order (e.g., 4-2) compressors instead of 3-2 compressors (FAs)
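The logarithmic bound follows from counting how many layers of 3-2 compressors are needed to reduce the partial product rows to two. The sketch below uses a simple row-wise model, in which every group of three rows becomes two and leftover rows pass through. This is an approximation of the column-wise covering described on the slides, and the function name is illustrative.

```python
def wallace_stages(rows):
    """Number of 3-2 compressor layers needed to reduce `rows` operands to 2."""
    stages = 0
    while rows > 2:
        rows = 2 * (rows // 3) + rows % 3   # each full group of 3 rows becomes 2
        stages += 1
    return stages

for n in (4, 8, 16, 32, 64):
    print(n, wallace_stages(n))
# 4 -> 2, 8 -> 4, 16 -> 6, 32 -> 8, 64 -> 10
```

Doubling the operand width adds only about two stages in this model, in contrast to the linear growth of the array and carry-save multipliers.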
Parallel Programmable Shifters
• Shifting a data word left or right over a constant amount is a trivial hardware operation and is implemented by the appropriate signal wiring
• Shifters are used in multipliers and floating-point units
[Figure: shifter block with Data In, Data Out, and Control inputs]
• Control = shift amount, shift direction, shift type (logical, arithmetic, circular)
• Programmable shifters consume a lot of area if done in random logic gates
A Programmable Binary Shifter
    Ai  Ai-1 | right  nop  left | Bi  Bi-1
    A1  A0   |   0     1    0   | A1  A0
    A1  A0   |   1     0    0   | 0   A1
    A1  A0   |   0     0    1   | A0  0
• Exactly one of the control signals (right, nop, left) is active
[Figure: bit-slice i of the shifter: pass transistors controlled by Right, nop, and Left connect inputs Ai and Ai-1 to outputs Bi and Bi-1]
4-bit Barrel Shifter
[Figure: array of pass transistors connecting inputs A0..A3 to outputs B0..B3, controlled by the shift lines Sh0..Sh3]
• Example (arithmetic shift):
    Sh0 = 1: B3B2B1B0 = A3A2A1A0
    Sh1 = 1: B3B2B1B0 = A3A3A2A1
    Sh2 = 1: B3B2B1B0 = A3A3A3A2
    Sh3 = 1: B3B2B1B0 = A3A3A3A3
• Area dominated by wiring
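The example can be reproduced with a small behavioral model in which each output is connected to one input through the pass transistor whose shift line is active. Bits are stored least significant first, and the function name is illustrative.

```python
def barrel_shift_arith_right(a_bits, shift):
    """Arithmetic right shift of a little-endian bit list [A0, A1, ..., A(n-1)]."""
    n = len(a_bits)
    sh = [int(k == shift) for k in range(n)]   # one-hot shift lines Sh0..Sh(n-1)
    b_bits = []
    for i in range(n):
        # B_i sees A_(i+k) through the transistor gated by Sh_k. Positions past
        # the top bit connect to the sign bit A(n-1) instead.
        b_bits.append(max(sh[k] & a_bits[min(i + k, n - 1)] for k in range(n)))
    return b_bits

a = [0, 1, 0, 1]                       # A0=0, A1=1, A2=0, A3=1
print(barrel_shift_arith_right(a, 1))  # [1, 0, 1, 1] -> B3B2B1B0 = A3 A3 A2 A1
```

Because exactly one shift line is high, every output is driven through a single pass transistor regardless of the shift amount.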
Notes on Barrel Shifter
• Note that the signal goes through at most one FET (so constant propagation delay, in theory)
• Also note that the FET diffusion capacitance on an output wire increases linearly with the shift width, but the FET diffusion capacitance on the input data lines increases quadratically (i.e., N^2 for a circular shifter)
• The size of the cell is bounded by the pitch of the metal wires.
• A decoder is usually needed for the shift control signals, since the shift amount is normally given as an (encoded) binary number.
4-bit Barrel Shifter Layout
[Figure: layout of the 4-bit barrel shifter with inputs A0..A3, shift lines Sh0..Sh3, and an output buffer; the dimension Width_barrel is marked]
    $Width_{barrel} \approx 2\,p_m\,N$, where N = max shift distance and $p_m$ = metal pitch
8-bit Logarithmic Shifter
[Figure: logarithmic shifter with log N stages, controlled by the signal pairs Sh1/!Sh1, Sh2/!Sh2, and Sh3/!Sh3; inputs A0..A3 and outputs B0..B3 are shown]
8-bit Logarithmic Shifter Layout Slice
[Figure: layout slice of the logarithmic shifter with inputs A0..A3, outputs B0..B3, and stages that shift by 1, 2, and 4]
    $Width_{log} \approx p_m\,(2K + (1 + 2 + \dots + 2^{K-1})) = p_m\,(2K + 2^K - 1)$, where $K = \log_2 N$
Shifter Implementation Comparisons
    N    K  | Barrel width  Barrel speed | Logarithmic width   Logarithmic speed
    N    K  | 2 N pm        1 + N diffs  | pm(2K + 2^K - 1)    K + 2 diffs
    8    3  | 16 pm         1 + 8        | 13 pm               3 + 2
    16   4  | 32 pm         1 + 16       | 23 pm               4 + 2
    32   5  | 64 pm         1 + 32       | 41 pm               5 + 2
    64   6  | 128 pm        1 + 64       | 75 pm               6 + 2
• The barrel shifter is better for small shifters (faster, not much bigger), while the log shifter is preferred for larger shifters.
  - Log shifters are always smaller
• For large shifters, we may have to start worrying about the number of pass transistors in series.
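The width entries in the comparison follow directly from the two layout formulas. The sketch below recomputes them in units of the metal pitch pm. The function names are illustrative, and N is assumed to be a power of two.

```python
def barrel_width(n):
    """Barrel shifter width in metal pitches: 2 N."""
    return 2 * n

def log_width(n):
    """Logarithmic shifter width in metal pitches: 2K + 2^K - 1, K = log2 N."""
    k = n.bit_length() - 1     # exact log2 for a power of two
    return 2 * k + 2 ** k - 1

for n in (8, 16, 32, 64):
    print(n, barrel_width(n), log_width(n))
# 8 16 13 / 16 32 23 / 32 64 41 / 64 128 75
```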
Decoders
• Decodes inputs to activate one of many outputs
[Figure: 2-to-4 decoder with inputs In0, In1, Enable and outputs Out0..Out3]
    Out0 = !In0 !In1
    Out1 = In0 !In1
    Out2 = !In0 In1
    Out3 = In0 In1
• Cost of a 2-to-4 decoder
  - two inverters, four 2-input NAND gates, and four inverters, plus enable logic
  - how about the cost for a 3-to-8, 4-to-16, etc. decoder?
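A gate-level sketch of the 2-to-4 decoder, following the inverter, NAND, inverter structure listed in the cost estimate, is shown below. The active-high enable and the output ordering (Out_k is active when In1 In0 encodes k) are assumptions, since the slide text does not preserve them.

```python
def decode_2to4(in0, in1, enable=1):
    """Return [Out0, Out1, Out2, Out3]; exactly one output is 1 when enabled."""
    n0, n1 = 1 - in0, 1 - in1                      # two input inverters
    nand_inputs = ((n0, n1), (in0, n1), (n0, in1), (in0, in1))
    nands = [1 - (x & y) for x, y in nand_inputs]  # four 2-input NAND gates
    return [(1 - v) & enable for v in nands]       # four output inverters + enable

for in1 in (0, 1):
    for in0 in (0, 1):
        print(in1, in0, decode_2to4(in0, in1))
```

With this structure, an n-to-2^n decoder needs n input inverters, 2^n n-input NAND gates, and 2^n output inverters, so the gate count roughly doubles with each added input bit.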
Dynamic NOR Decoder
[Figure: dynamic 2-to-4 NOR decoder: output lines B0..B3 are precharged to Vdd and discharged to GND through parallel transistors driven by the address inputs A0 and A1]
• Active HIGH outputs
• Capacitance of the output wires increases linearly with the decoder size
Dynamic NAND Decoder
[Figure: dynamic 2-to-4 NAND decoder: precharged output lines B0..B3 with series transistor chains to GND driven by the address inputs A0 and A1]
Notes on Dynamic Decoders
• In the dynamic NOR decoder, the signal goes through at most one FET
  - So constant propagation delay (in theory)
  - However, some output wires may have two or more parallel paths to GND, effectively shortening the transition time
• In contrast, the signal in the dynamic NAND decoder passes through a series of FETs
  - The number of FETs rises linearly with the decoder size
  - Thus it will be slower than the NOR implementation if the gate capacitance dominates the diffusion capacitance
• For the NAND decoder, all the input signals must be low during precharge, or else Vdd and GND will be connected!
Building Bigger Decoders
[Figure: hierarchical decoder for address bits A4..A0 built from a 1x2 decoder and several 2x4 decoders chained through their enable inputs; the example input 0 0 0 0 1 is traced, with output transitions → 0 and → 1 marked]
• Active-low enable, active-low output
• Need to catch the output that goes to zero before it precharges again
Layout of Bit-Sliced Datapaths
[Figure: layout of a bit-sliced datapath, with the following annotations]
• Must have enough drive capacity to handle large fan-out
• Sized for peak current
• Horizontal gap for feeding signals to the cells downstream
Optimizing Bit-sliced Datapaths
[Figure: three layouts of the same datapath]
• Without feedthroughs or pitch matching (4.2 mm^2)
• With feedthroughs (3.2 mm^2)
• With feedthroughs and pitch matching (2.2 mm^2)