Arithmetic Building Blocks · 2009. 11. 12. · Cout drives 2 internal and 2 inverter transistor...

B.Supmonchai July 5, 2005

2102-545 Digital ICs 1

Chapter 12

Arithmetic Building Blocks

Boonchuay Supmonchai

Integrated Design Application Research (IDAR) Laboratory

August 20, 2004; Revised - July 5, 2005


Goals of This Chapter

• Designing for performance, area, or power
  - Adders
  - Multipliers
  - Shifters

• Logic and system optimizations for datapath modules

• Power-delay trade-offs in datapaths


Review: A Generic Processor

[Figure: block diagram of a generic processor with four interconnected parts]

• Datapath: adder, multiplier, shifter, comparator, etc.
• Memory: RAM, ROM, shift register
• Control: FSM, PLA, counter, random logic
• Input/Output: switches, arbiters, bus drivers


Bit-Sliced Architecture

[Figure: datapath unit built from identical processing elements, one per bit slice (Bit 0, Bit 1, …, Bit n-2, Bit n-1); the register, adder, shifter, and multiplexer stages span all slices, with n-bit data in, n-bit data out, and shared control signals]

• Modular
  - Easy to design and verify
  - Easy to expand
• Potential to be fast




Example: Itanium Bit-Sliced Design

[Figure: Itanium bit-sliced datapath, bit slices 0 to 63. Labeled blocks: multiplexers, shifter, adder stages 1, 2, and 3 separated by wiring channels, sum select, and loopback buses. Inputs come from the register files / cache / bypass; outputs go to the register files / cache.]



Example: Itanium Integer Datapath

Itanium has 6 integer execution units (ALU)


One-Bit Binary Full Adder (FA)

[Figure: 1-bit full adder (FA) block with inputs A, B, Cin and outputs S, Cout]

A  B  Cin | Cout  S | Carry status
0  0  0   |  0    0 | kill
0  0  1   |  0    1 | kill
0  1  0   |  0    1 | propagate
0  1  1   |  1    0 | propagate
1  0  0   |  0    1 | propagate
1  0  1   |  1    0 | propagate
1  1  0   |  1    0 | generate
1  1  1   |  1    1 | generate

S = A ⊕ B ⊕ Cin

Cout = AB + ACin + BCin

• A VERY common operation, so it is worth spending some time trying to optimize it
  - Often in the critical path, so both logic-level and circuit-level optimizations need to be considered
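The truth table and the two equations above can be cross-checked with a few lines of Python. This is only a behavioral sketch; the function names `full_adder` and `carry_status` are illustrative and do not come from the slides.

```python
from itertools import product

def full_adder(a, b, cin):
    """Return (s, cout) for one-bit inputs, using S = A xor B xor Cin
    and Cout = AB + ACin + BCin."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def carry_status(a, b):
    """Classify a bit position from A and B alone."""
    if a & b:
        return "generate"
    if a ^ b:
        return "propagate"
    return "kill"

# Print the truth table and check it against ordinary integer addition.
for a, b, cin in product((0, 1), repeat=3):
    s, cout = full_adder(a, b, cin)
    assert a + b + cin == 2 * cout + s
    print(a, b, cin, cout, s, carry_status(a, b))
```

The assertion inside the loop ties the Boolean equations back to arithmetic: the pair (Cout, S) is the two-bit binary value of A + B + Cin.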


Propagate, Generate, and Delete (Kill)

• Define three new variables that depend ONLY on A and B

  Generate (G) = AB           (FA itself generates a carry)
  Propagate (P) = A ⊕ B       (FA passes along carry)
  Delete (D) = !A !B          (FA stops propagation of carry)

• Then we can write S and Cout in terms of G, P, and Cin

  S(G,P,Cin) = P ⊕ Cin
  Cout(G,P,Cin) = G + PCin

• We can also write S and Cout in terms of D, P, and Cin

• Sometimes an alternative definition for P is used

  Propagate (P) = A + B
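A short exhaustive check makes these identities concrete. The sketch below verifies the G/P forms of S and Cout, and also one possible D/P form, !Cout = D + P·!Cin, which is a derivation added here for illustration and is not written out on the slide. It also shows that the alternative P = A + B still gives the correct carry but cannot be used for the sum. All function names are hypothetical.

```python
from itertools import product

def pgd(a, b):
    """Generate, propagate (XOR form), and delete signals for one bit."""
    g = a & b
    p = a ^ b
    d = (1 - a) & (1 - b)
    return g, p, d

def identities_hold():
    """Check S and Cout rewritten in terms of G, P, D for all inputs."""
    for a, b, cin in product((0, 1), repeat=3):
        g, p, d = pgd(a, b)
        s_ref = a ^ b ^ cin
        cout_ref = (a & b) | (a & cin) | (b & cin)
        if s_ref != p ^ cin:
            return False
        if cout_ref != g | (p & cin):
            return False
        # Assumed D/P form (not on the slide): !Cout = D + P*!Cin
        if cout_ref != 1 - (d | (p & (1 - cin))):
            return False
        # Alternative P = A + B still yields the correct carry.
        if cout_ref != g | ((a | b) & cin):
            return False
    return True

def or_propagate_breaks_sum():
    """With P = A + B, the expression P xor Cin is no longer the sum."""
    return any(((a | b) ^ cin) != (a ^ b ^ cin)
               for a, b, cin in product((0, 1), repeat=3))
```

The second function explains why adders that use the OR form of P for the carry chain still need the XOR of A and B for the sum bit.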




FA CMOS Implementation: First Try

[Figure: complementary static CMOS full adder schematic with inputs A, B, Cin and outputs S, Cout; 32 transistors]

Majority function: Maj(A,B,C) outputs 0 or 1, whichever value is in the majority at the inputs.


Improved CMOS Implementation

• A more compact design is based on the observation that S can be factored to reuse the Cout term

  S = ABCin + (A + B + Cin) !Cout

[Figure: block diagram in which a minority-function gate computes !Cout from A, B, and Cin, and !Cout is reused to form S]
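The factored sum expression relies on the complemented carry, S = ABCin + (A + B + Cin)·!Cout. The following sketch checks that form against the plain XOR definition for all eight input combinations; `sum_from_carry` is an illustrative name.

```python
from itertools import product

def sum_from_carry(a, b, cin):
    """Compute S by reusing the (inverted) carry output."""
    cout = (a & b) | (a & cin) | (b & cin)
    not_cout = 1 - cout
    return (a & b & cin) | ((a | b | cin) & not_cout)

def factoring_is_correct():
    return all(sum_from_carry(a, b, c) == a ^ b ^ c
               for a, b, c in product((0, 1), repeat=3))
```

Intuitively, S is 1 either when all three inputs are 1, or when at least one input is 1 but no carry is produced (exactly one input is 1).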


Improved CMOS Implementation II

[Figure: transistor-level schematic of the improved full adder with inputs A, B, Ci, internal node X, and outputs S and Co; 28 transistors]


Notes on Improved CMOS FA

• Note that the PMOS network is identical to the NMOS network rather than being its complement.
  - This is possible because of the inversion property, which says that the function of the complemented inputs is equal to the complement of the function.
  - This simplification reduces the number of series transistors and makes the layout more uniform.
• This design has a greater delay to compute S than Cout.
  - Most of the time the extra delay in computing S has little effect on the critical path, because the carry is the signal that propagates.
  - With proper sizing, this delay on S can be minimized.




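Inversion Property

[Figure: a full adder with all inputs and outputs inverted is equivalent to the original full adder]

$\overline{S}(A, B, C_i) = S(\overline{A}, \overline{B}, \overline{C_i})$

$\overline{C_o}(A, B, C_i) = C_o(\overline{A}, \overline{B}, \overline{C_i})$

• The function must be symmetric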
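The inversion property can be tested mechanically for any three-input function. The helper below, whose name `has_inversion_property` is made up for this sketch, confirms that both the sum (3-input XOR) and the carry (majority) satisfy it, while a 3-input AND, used here only as a counterexample, does not.

```python
from itertools import product

def has_inversion_property(f):
    """True if f(!a, !b, !c) == !f(a, b, c) for every input combination."""
    return all(f(1 - a, 1 - b, 1 - c) == 1 - f(a, b, c)
               for a, b, c in product((0, 1), repeat=3))

def fa_sum(a, b, c):
    return a ^ b ^ c

def fa_carry(a, b, c):
    return (a & b) | (a & c) | (b & c)

def and3(a, b, c):
    # Counterexample only; not part of the adder.
    return a & b & c
```

Because both full adder outputs pass the check, inverting every input of an FA simply inverts every output.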


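TG-Based FA

[Figure: transmission-gate full adder built from two XOR stages and a 2-to-1 MUX, with inputs A, B, Cin, internal propagate signal P, and outputs S and Cout; 16 transistors]

Extra delay, so slower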


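Complementary PT Logic (CPL) FA

[Figure: complementary pass-transistor logic full adder with dual-rail inputs (A, B, Cin and their complements) and dual-rail outputs (S, Cout and their complements); 28 transistors]

• Dual rail
• Voltage drop problems
• Faster, lower power, and smaller area than full static CMOS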


Mirror Adder

[Figure: mirror adder schematic with kill, generate, 0-propagate, and 1-propagate paths producing !Cout and !S; transistor sizes annotated; 24 + 4 transistors]

Cout = AB + ACin + BCin

S = ABCin + (A + B + Cin) !Cout

PUN and PDN are symmetrical, not complemented




Mirror Adder Features

• The NMOS and PMOS chains are completely symmetrical, with a maximum of two series transistors in the carry circuitry, guaranteeing identical rise and fall transitions if the NMOS and PMOS devices are properly sized.
• When laying out the cell, the most critical issue is minimizing the capacitance at node !Cout (four diffusion capacitances, two internal gate capacitances, and two inverter gate capacitances).
  - Shared diffusions can reduce the stack node capacitances.
• The transistors connected to Cin are placed closest to the output.


Mirror Adder Sizing Issues

• Only the transistors in the carry stage have to be optimized for speed. All transistors in the sum stage can be minimum size.
• Assume a PMOS/NMOS ratio of 2. Each input in the carry circuit has a logical effort of 2, so the optimal fan-out for each is also 2.
• Since !Cout drives 2 internal and 2 inverter transistor gates (to form Cout for the bit adder), the carry circuit should be oversized.


Mirror Adder Stick Diagram

[Figure: stick diagram of the mirror adder cell between the VDD and GND rails, with input tracks for A, B, and Ci and outputs Co and S]


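Ripple Carry Adder (RCA)

[Figure: 4-bit ripple carry adder; four FAs in a chain with inputs A0..A3 and B0..B3, carry-in C0 = Cin, internal carries C1, C2, C3, sums S0..S3, and Cout = C4]

Worst-case delay: $t_{ripple} = O(N)$

$t_{ripple} \approx t_{FA}(A,B \to C_{out}) + (N-2)\,t_{FA}(C_{in} \to C_{out}) + t_{FA}(C_{in} \to S)$

Slow! Make the carry path as fast as possible.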
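A behavioral model of the ripple carry adder shows why the delay is linear: each stage needs the carry of the previous one before it can finish. The sketch below uses LSB-first bit lists; the helper names `to_bits` and `from_bits` are illustrative.

```python
def to_bits(x, n):
    """Integer to a list of n bits, least significant bit first."""
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    """LSB-first bit list back to an integer."""
    return sum(b << i for i, b in enumerate(bits))

def ripple_carry_add(a_bits, b_bits, cin=0):
    """Chain of full adders; the carry ripples from bit 0 upward."""
    sums = []
    c = cin
    for a, b in zip(a_bits, b_bits):
        sums.append(a ^ b ^ c)               # S = A xor B xor Cin
        c = (a & b) | (a & c) | (b & c)      # Cout = AB + ACin + BCin
    return sums, c

# 4-bit example: 11 + 6 = 17, i.e. sum bits 0001 with carry out 1.
s_bits, cout = ripple_carry_add(to_bits(11, 4), to_bits(6, 4))
print(from_bits(s_bits), cout)
```

The loop body is executed N times in sequence, which mirrors the (N - 2) carry-to-carry term in the delay expression.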




Exploiting the Inversion Property

[Figure: 4-bit ripple carry adder (inputs A0..A3, B0..B3, C0 = Cin, sums S0..S3, carries C1, C2, C3, Cout = C4) in which regular cells and inverted cells alternate along the carry chain]

• Now need two "flavors" of FAs
• Minimizes the critical path (the carry chain) by eliminating inverters between the FAs
  - Requires increasing the transistor sizes on the carry-chain portion of the mirror adder

Fast Carry Chain Design

• The key to fast addition is a low-latency carry network
• What matters is whether, in a given position, a carry is
  - Generated: Gi = AiBi
  - Propagated: Pi = Ai ⊕ Bi (sometimes Ai | Bi is used)
  - Annihilated (killed): Ki = !Ai !Bi
• Giving a carry recurrence of Ci+1 = Gi + PiCi

C1 = G0 + P0C0

C2 = G1 + P1G0 + P1P0C0

C3 = G2 + P2G1 + P2P1G0 + P2P1P0C0

C4 = G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0C0
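The expanded expressions for C1 through C4 follow from unrolling the recurrence Ci+1 = Gi + PiCi. The sketch below evaluates both forms for a 4-bit slice and checks that they agree on every operand pair; the function names are illustrative.

```python
from itertools import product

def recurrence_carries(g, p, c0):
    """C1..C4 from the recurrence C(i+1) = G(i) + P(i)C(i)."""
    carries, c = [], c0
    for gi, pi in zip(g, p):
        c = gi | (pi & c)
        carries.append(c)
    return carries

def expanded_carries(g, p, c0):
    """C1..C4 from the fully expanded sum-of-products forms."""
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c1, c2, c3, c4]

def forms_agree():
    """Compare both forms for every 4-bit A, B and both values of C0."""
    for a, b, c0 in product(range(16), range(16), (0, 1)):
        g = [((a >> i) & (b >> i)) & 1 for i in range(4)]
        p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]
        if recurrence_carries(g, p, c0) != expanded_carries(g, p, c0):
            return False
    return True
```

The expanded form trades the sequential dependency of the recurrence for wider gates, which is the basic trade-off behind look-ahead carry networks.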


Manchester Carry Chain

• Switches controlled by Gi and Pi
• Components of total delay
  - Time to form the switch control signals Gi and Pi
  - Setup time for the switches
  - Signal propagation delay through N switches in the worst case

[Figure: two Manchester carry gate implementations between Ci and Co. Static: switches controlled by Pi, Gi, and Di. Domino: clocked (φ) precharge to VDD with switches controlled by Pi and Gi.]


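4-bit Sliced MCC Adder

[Figure: four bit slices with inputs A0 B0 to A3 B3; each slice forms G (AND) and P (XOR), a clocked Manchester chain carries !C0 through !C1, !C2, !C3 to !C4, and an XOR per slice produces S0..S3]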




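Domino MCC Circuit

[Figure: 4-bit domino Manchester carry chain with clocked precharge devices, pass transistors P0..P3 between Ci,0 and Ci,4, and pull-down transistors G0..G3; transistor sizes annotated]

Carry functions at successive chain nodes:

G0 + P0C0

G1 + P1G0 + P1P0C0

G2 + P2G1 + P2P1G0 + P2P1P0C0

G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0C0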


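MCC Stick Diagram

[Figure: stick diagram of the Manchester carry chain between the VDD and GND rails, with a propagate/generate row and an inverter/sum row; input tracks Pi, Gi, φ and Pi+1, Gi+1, φ; the carry runs from Ci-1 through Ci to Ci+1]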


Notes on MCC Adder

• When the clock is low, the carry nodes precharge; when the clock goes high, if Gi is high, Ci+1 is asserted (goes low).
• To prevent Gi from affecting Ci, the signal Pi must be computed as the XOR (rather than the OR) of Ai and Bi. This is not a problem, since the XOR of Ai and Bi is needed for computing the sum anyway.
• Delay is roughly proportional to n² (as n pass transistors are connected in series).
  - Each group is therefore usually limited to 4 stages, with an inverter buffering the carry chain between groups.
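The quadratic growth can be illustrated with a first-order Elmore delay model of n identical series RC sections. This is a rough sketch under simplifying assumptions introduced here: unit resistance and capacitance per pass transistor, and a fixed delay `t_buf` per inserted buffer. None of these numbers come from the slides.

```python
def elmore_chain_delay(n, r=1.0, c=1.0):
    """Elmore delay to the end of n identical series RC sections.
    Node k sees resistance k*r back to the driver, so the sum is
    r*c*n*(n+1)/2, i.e. roughly proportional to n squared."""
    return sum(k * r * c for k in range(1, n + 1))

def buffered_chain_delay(n, group=4, t_buf=1.0, r=1.0, c=1.0):
    """Split the chain into groups with a buffer between groups."""
    sizes = [min(group, n - start) for start in range(0, n, group)]
    chain = sum(elmore_chain_delay(size, r, c) for size in sizes)
    return chain + (len(sizes) - 1) * t_buf

for n in (4, 8, 16, 32):
    print(n, elmore_chain_delay(n), buffered_chain_delay(n))
```

In this model the unbuffered delay grows quadratically, while the buffered chain grows linearly with the number of groups, which is the motivation for limiting each group to about 4 stages.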


Binary Adder Landscape

Synchronous word-parallel adders
  • Ripple carry adders (RCA): t = O(N), A = O(N)
  • Carry-propagation-minimizing adders: t = O(1), A = O(N)
      - Signed-digit adders
      - Residue adders
  • Fast carry-propagation adders
      - Manchester carry chain: t = O(N), A = O(N)
      - Carry skip, carry select: t = O(√N), A = O(N)
      - Parallel prefix, conditional sum: t = O(log N), A = O(N log N)

Other classes: bit-serial adders, asynchronous adders




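Carry-Skip (Carry-Bypass) Adder

[Figure: 4-bit block of FAs with inputs A0 B0 to A3 B3, sums S0..S3, carry-in Ci,0, and internal carries C1, C2, C3; a 2-to-1 multiplexer controlled by BP selects between the rippled carry (input 0) and Ci,0 (input 1) to drive Co,3]

BP = P0 P1 P2 P3 ("Block Propagate")

If (P0 & P1 & P2 & P3 = 1) then Co,3 = Ci,0; otherwise the block itself kills or generates the carry internally.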
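The bypass rule can be modeled directly: compute the block propagate signal, ripple inside the block, and let a multiplexer pick the block carry-out. The sketch below is behavioral only (it shows the logic, not the timing benefit), and the function names and the 16-bit, 4-bit-block defaults are assumptions chosen for illustration.

```python
def carry_skip_block(a_bits, b_bits, cin):
    """One carry-skip block (LSB-first bit lists).
    Returns (sum_bits, cout)."""
    p = [a ^ b for a, b in zip(a_bits, b_bits)]
    block_propagate = int(all(p))            # BP = P0 P1 P2 P3
    sums, c = [], cin
    for a, b, pi in zip(a_bits, b_bits, p):
        sums.append(pi ^ c)
        c = (a & b) | (pi & c)               # ripple inside the block
    cout = cin if block_propagate else c     # 2-to-1 bypass multiplexer
    return sums, cout

def carry_skip_add(x, y, n=16, block=4, cin=0):
    """Add two n-bit integers block by block. Returns (sum, cout)."""
    result, c = 0, cin
    for lo in range(0, n, block):
        width = min(block, n - lo)
        a_bits = [(x >> (lo + i)) & 1 for i in range(width)]
        b_bits = [(y >> (lo + i)) & 1 for i in range(width)]
        sums, c = carry_skip_block(a_bits, b_bits, c)
        for i, s in enumerate(sums):
            result |= s << (lo + i)
    return result, c

print(carry_skip_add(0xFFFF, 1))
```

When BP is 1 the rippled carry equals the block carry-in anyway, so the multiplexer never changes the result; it only shortens the path the carry has to travel in hardware.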


Carry-Skip Chain Implementation

[Figure: carry chain with inputs Cin, P0..P3, and G0..G3; a bypass (BP) path connects the block carry-in to the carry-out Cout in parallel with the block carry-out]

• Only 10% to 20% area overhead
• Only two "gate delays" to produce Cout if a skip occurs


4-bit Block Carry-Skip Adder

[Figure: 16-bit carry-skip adder made of four 4-bit blocks (bits 0 to 3, 4 to 7, 8 to 11, 12 to 15), each with setup, carry propagation, and sum stages; the critical path from Ci,0 is annotated with tsetup, tcarry, tskip, and tsum]

Worst-case delay → carry from bit 0 to bit 15: the carry is generated in bit 0, ripples through bits 1, 2, and 3, skips the middle two groups, and ripples in the last group from bit 12 to bit 15. With B the group size in bits:

$t_{add} = t_{setup} + B\,t_{carry} + ((N/B) - 1)\,t_{skip} + B\,t_{carry} + t_{sum}$


Optimal Block Size and Time

• Assuming one stage of ripple (tcarry) has the same delay as one skip logic stage (tskip) and both are 1

$t_{CSkA} = \underbrace{1}_{t_{setup}} + \underbrace{B}_{\text{ripple in block 0}} + \underbrace{(N/B - 1)}_{\text{skips}} + \underbrace{B}_{\text{ripple in last block}} + \underbrace{1}_{t_{sum}} = 2B + N/B + 1$

• So the optimal block size, B, is

$\dfrac{dt_{CSkA}}{dB} = 0 \;\Rightarrow\; B_{opt} = \sqrt{N/2}$

• And the optimal time is

$t_{CSkA,opt} = 2\sqrt{2N} + 1$
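The closed-form optimum can be compared with a brute-force search over integer block sizes. The sketch below uses the unit-delay model stated above, tCSkA = 2B + N/B + 1; the function names are illustrative.

```python
import math

def t_cska(n, b):
    """Unit-delay carry-skip adder time: 2B + N/B + 1."""
    return 2 * b + n / b + 1

def optimal_block(n):
    """Continuous optimum B = sqrt(N/2)."""
    return math.sqrt(n / 2)

def optimal_time(n):
    """Delay at the continuous optimum: 2*sqrt(2N) + 1."""
    return 2 * math.sqrt(2 * n) + 1

def best_integer_block(n):
    """Integer block size with the smallest modeled delay."""
    return min(range(1, n + 1), key=lambda b: t_cska(n, b))

for n in (16, 32, 64):
    print(n, optimal_block(n), optimal_time(n), best_integer_block(n))
```

For N = 32 the continuous optimum is exactly B = 4 with a delay of 17 units; for other word lengths the best integer block size is one of the neighbors of √(N/2).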




Variations of Carry-Skip Adders I

• Variable-block-size carry-skip adders
  - A carry that is generated in, or absorbed by, one of the inner blocks travels a shorter distance through the skip blocks
  - Hence a carry-skip adder can have bigger blocks for the inner carries without increasing the overall delay

[Figure: carry-skip adder from Cin to Cout built from N_B blocks of varying size]

$t_{CSkA} = 2B + O(N_B)$, where $N_B$ is the number of blocks


Variations of Carry-Skip Adders II

• Multiple levels of skip logic
  - Carry-skip adders with a large number of bits suffer from linear carry propagation delay.
  - By adding higher levels of skip logic, a carry-skip adder can skip more blocks at a time.

[Figure: carry chain from Cin to Cout with skip level 1 over each block and skip level 2 over groups of blocks; each level-2 skip signal is the AND of the first-level skip signals (BPs)]

$t_{CSkA} = 2B + O(\log_B N)$


Carry-Skip Adder Comparisons

[Figure: chart comparing RCA, CSkA, and VSkA for 8-, 16-, 32-, 48-, and 64-bit adders; vertical axis from 0 to 70; annotated with block sizes B = 2, 3, 4, 5, and 6]


Carry Select Adders

• Idea: precompute the carry out of each block for both carry_in = 0 and carry_in = 1 (this can be done for all blocks in parallel) and then select the correct one
• More cost effective than the ripple carry adder

[Figure: one carry-select block; a 4-bit setup stage turns the A's and B's into P's and G's, which feed a "0" carry propagation chain and a "1" carry propagation chain; a multiplexer controlled by Cin selects the C's and Cout; sum generation produces the S's]
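The precompute-then-select idea can be sketched behaviorally as follows. Each block is added twice, once per possible carry-in, and the incoming carry only drives the final selection. The function names and the 16-bit, 4-bit-block defaults are illustrative assumptions.

```python
def ripple_block(a_bits, b_bits, cin):
    """Ripple add of one block (LSB-first). Returns (sum_bits, cout)."""
    sums, c = [], cin
    for a, b in zip(a_bits, b_bits):
        sums.append(a ^ b ^ c)
        c = (a & b) | (a & c) | (b & c)
    return sums, c

def carry_select_add(x, y, n=16, block=4, cin=0):
    """Add two n-bit integers with carry select. Returns (sum, cout)."""
    result, c = 0, cin
    for lo in range(0, n, block):
        width = min(block, n - lo)
        a_bits = [(x >> (lo + i)) & 1 for i in range(width)]
        b_bits = [(y >> (lo + i)) & 1 for i in range(width)]
        # Both candidates depend only on A and B, so in hardware every
        # block can compute them in parallel.
        s0, c0 = ripple_block(a_bits, b_bits, 0)   # "0" carry chain
        s1, c1 = ripple_block(a_bits, b_bits, 1)   # "1" carry chain
        sums, c = (s1, c1) if c else (s0, c0)      # multiplexer
        for i, s in enumerate(sums):
            result |= s << (lo + i)
    return result, c

print(carry_select_add(1234, 4321))
```

Only the multiplexer selection sits on the inter-block carry path, which is why the critical path contains one multiplexer delay per block rather than a full ripple per block.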




Carry Select Adder: Critical Path

[Figure: 16-bit linear carry-select adder with four 4-bit blocks (bits 0 to 3, 4 to 7, 8 to 11, 12 to 15); each block has setup, "0" carry, "1" carry, mux, and sum generation stages fed by the A's and B's and producing P's, G's, C's, and S's; the critical path runs from Cin through the multiplexers to Cout]

$t_{add} = t_{setup} + B\,t_{carry} + (N/B)\,t_{mux} + t_{sum}$


Square Root Carry Select Adders

[Figure: carry-select adder with progressively larger blocks: bits 0-1, 2-4, 5-8, 9-13, and 14-19, producing S0-1, S2-4, S5-8, S9-13, and S14-19; each block has setup, "0" carry, "1" carry, multiplexer, and sum generation stages; signal arrival times are annotated in parentheses along the carry and multiplexer paths, starting from Ci,0]

Balance the delays by making later blocks bigger:

$t_{add} = t_{setup} + 2\,t_{carry} + \sqrt{N}\,t_{mux} + t_{sum}$
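To get a feel for the difference between the linear and the square root organization, both delay expressions from the slides can be evaluated side by side. The sketch assumes unit values for tsetup, tcarry, tmux, and tsum; these defaults are an assumption made only for illustration.

```python
import math

def t_linear_select(n, b, t_setup=1, t_carry=1, t_mux=1, t_sum=1):
    """Linear carry select: t_setup + B*t_carry + (N/B)*t_mux + t_sum."""
    return t_setup + b * t_carry + (n / b) * t_mux + t_sum

def t_sqrt_select(n, t_setup=1, t_carry=1, t_mux=1, t_sum=1):
    """Square root carry select, as given on the slide:
    t_setup + 2*t_carry + sqrt(N)*t_mux + t_sum."""
    return t_setup + 2 * t_carry + math.sqrt(n) * t_mux + t_sum

for n in (16, 64, 256):
    print(n, t_linear_select(n, 4), t_sqrt_select(n))
```

With 4-bit blocks the linear version grows as N/4, while the square root version grows only as √N, so the gap widens quickly for wide adders.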


Adder Delays - Comparison

[Figure: delay versus word length N (0 to 60) for the ripple adder, the linear select adder, and the square root select adder; delay axis from 0 to 40]


LookAhead - Basic Idea

[Figure: N bit slices with operand inputs up to (AN-1, BN-1); each slice forms Pi and receives its carry-in Ci,i from a shared carry network to produce Si]

$C_{o,k} = f(A_k, B_k, C_{o,k-1}) = G_k + P_k C_{o,k-1}$




Look-Ahead: Topology

By expanding carry generation all the way:

C1 = G0 + P0C0

C2 = G1 + P1G0 + P1P0C0

C3 = G2 + P2G1 + P2P1G0 + P2P1P0C0

C4 = G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0C0

…

[Figure: single-gate implementation of Co,3 with transistor networks controlled by the G and P signals (P0..P3) and Ci,0, connected to VDD]


Logarithmic Look-Ahead Adder

[Figure: two ways to combine inputs A0..A7 into an output F: a linear chain with tp ~ N and a balanced tree with tp ~ log2(N)]


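Parallel Prefix Adders (PPAs)

• Define carry operator ∘ on (G,P) signal pairs

  (G,P) = (G'',P'') ∘ (G',P')

  where G = G'' + P''G' and P = P''P'

  - ∘ is associative, i.e.,

  [(g''',p''') ∘ (g'',p'')] ∘ (g',p') = (g''',p''') ∘ [(g'',p'') ∘ (g',p')]

[Figure: the carry operator cell with inputs (G'',P'') and (G',P') and output (G,P), a tree of such cells, and a gate-level implementation producing !G from G', G'', and P'']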
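The carry operator is small enough to test exhaustively. The sketch below implements G = G'' + P''G', P = P''P' on pairs of bits and checks associativity over all inputs; it also exhibits one pair of operands for which swapping the arguments changes the result. The name `carry_op` is illustrative.

```python
from itertools import product

def carry_op(hi, lo):
    """Combine a more significant (G'', P'') pair with a less
    significant (G', P') pair."""
    g2, p2 = hi
    g1, p1 = lo
    return (g2 | (p2 & g1), p2 & p1)

PAIRS = list(product((0, 1), repeat=2))   # all possible (G, P) values

def is_associative():
    return all(carry_op(carry_op(x, y), z) == carry_op(x, carry_op(y, z))
               for x in PAIRS for y in PAIRS for z in PAIRS)

def is_commutative():
    return all(carry_op(x, y) == carry_op(y, x)
               for x in PAIRS for y in PAIRS)

print(is_associative(), is_commutative())
```

Associativity is what allows the carries to be computed with a tree instead of a chain; the lack of commutativity means the bit order of the operands must be preserved inside that tree.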


PPA General Structure

• Given P and G terms for each bit position, computing all the carries is equal to finding all the prefixes in parallel

  (G0,P0) ∘ (G1,P1) ∘ (G2,P2) ∘ … ∘ (GN-2,PN-2) ∘ (GN-1,PN-1)

• Since the carry operator ∘ is associative, we can group them in any order
  - but note that it is not commutative
• Measures to consider
  - number of ∘ cells
  - tree cell depth (time)
  - tree cell area
  - cell fan-in and fan-out
  - max wiring length
  - wiring congestion
  - delay path variation (glitching)

[Figure: three-layer adder structure: Pi, Gi logic (1 unit delay); Ci parallel prefix logic tree (1 unit delay per level); Si logic (1 unit delay)]




Brent-Kung PPA

[Figure: 16-bit Brent-Kung parallel prefix computation; inputs (G0,P0) to (G15,P15) and Cin, outputs C1 to C16; annotations on the two halves of the tree: T = log2 N, T = log2 N - 2, A = 2 log2 N, A = N/2]


Kogge-Stone PPF Adder

[Figure: 16-bit Kogge-Stone parallel prefix computation; inputs (G0,P0) to (G15,P15) and Cin, outputs C1 to C16; annotations: T = log2 N, A = log2 N, A = N]
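A Kogge-Stone style prefix computation can be sketched in software as log2 N passes, where pass d combines each position with the one 2^d places below it. The code below is a behavioral model under two assumptions made for this sketch: the carry-in is folded into the generate signal of bit 0, and operands are plain Python integers. The function name is illustrative.

```python
def kogge_stone_add(a, b, n=16, cin=0):
    """Add two n-bit integers with a Kogge-Stone style prefix network.
    Returns (sum, cout)."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]

    # Group signals; fold Cin into position 0 (assumption of this sketch).
    G = g[:]
    P = p[:]
    G[0] = g[0] | (p[0] & cin)

    d = 1
    while d < n:                          # log2(n) prefix levels
        new_G, new_P = G[:], P[:]
        for i in range(d, n):
            new_G[i] = G[i] | (P[i] & G[i - d])
            new_P[i] = P[i] & P[i - d]
        G, P = new_G, new_P
        d *= 2

    # After the last level G[i] is the carry out of bit i.
    carries_in = [cin] + G[:-1]
    total = 0
    for i in range(n):
        total |= (p[i] ^ carries_in[i]) << i
    return total, G[n - 1]

print(kogge_stone_add(1234, 4321))
```

Every level updates all positions at once, which corresponds to the dense rows of cells in the Kogge-Stone graph; the depth is log2 N levels at the cost of many cells and long wires.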


More Adder Comparisons

[Figure: chart comparing RCA, CSkA, VSkA, and KS PPA for 8-, 16-, 32-, 48-, and 64-bit adders; vertical axis from 0 to 70]


Adder Speed Comparisons

[Figure: chart comparing RCA, MCC, CCSkA, VCSkA, CCSlA, and B&K adders at 16, 32, and 64 bits; vertical axis from 10 to 70]




Adder Average Power Comparisons

[Figure: chart comparing RCA, MCC, CCSkA, VCSkA, CCSlA, and B&K adders at 16, 32, and 64 bits; vertical axis from 0 to 35]


Binary Multiplication - Basics

• Given two unsigned binary numbers X (M bits) and Y (N bits)

$X = \sum_{i=0}^{M-1} X_i 2^i \qquad Y = \sum_{j=0}^{N-1} Y_j 2^j$

where $X_i, Y_j \in \{0, 1\}$

• The multiplication operation Z = X × Y is

$\sum_{k=0}^{M+N-1} Z_k 2^k = \left(\sum_{i=0}^{M-1} X_i 2^i\right)\left(\sum_{j=0}^{N-1} Y_j 2^j\right) = \sum_{i=0}^{M-1}\left(\sum_{j=0}^{N-1} X_i Y_j 2^{i+j}\right)$
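The double sum translates directly into two nested loops over the bits of X and Y, where each term X_i·Y_j is a single AND gate weighted by 2^(i+j). A minimal sketch, with an illustrative function name:

```python
def multiply_by_partial_products(x, y, m, n):
    """Evaluate sum over i, j of X_i * Y_j * 2^(i+j) for an m-bit X
    and an n-bit Y (both unsigned)."""
    total = 0
    for i in range(m):
        xi = (x >> i) & 1
        for j in range(n):
            yj = (y >> j) & 1
            total += (xi & yj) << (i + j)    # one AND term, weight 2^(i+j)
    return total

# 6-bit multiplicand 101010 (42) times 4-bit multiplier 1011 (11).
print(multiply_by_partial_products(42, 11, 6, 4))
```

The result needs at most M + N bits, which matches the upper limit M + N - 1 of the index k in the product sum.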


Binary Multiplication Operation

• Binary multiplication as repeated additions

              1 0 1 0 1 0     multiplicand
    ×             1 0 1 1     multiplier
    ---------------------
              1 0 1 0 1 0
            1 0 1 0 1 0       partial product array
          0 0 0 0 0 0         (can be formed in parallel)
        1 0 1 0 1 0
    ---------------------
        1 1 1 0 0 1 1 1 0     double precision product


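Shift-and-Add Multiplication

• Right shift and add (N bits × N bits)

[Figure: datapath with N-bit multiplicand and multiplier registers, a multiplexer selecting between the multiplicand and "0", an N-bit adder with an (N+1)-bit result, and a right-shifting partial product register that shifts one bit out per step]

* Left shift requires a 2N-bit adder

$t_{shift\&add\_mult} = O(N \cdot t_{adder}) = O(N^2)$ for an RCA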
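A register-level sketch of the right-shift-and-add scheme is given below. It assumes one common arrangement, which may differ in detail from the figure: an accumulator holds the upper half of the running product, and the multiplier register is gradually replaced by the low product bits as they are shifted out. The function name is illustrative.

```python
def shift_add_multiply(multiplicand, multiplier, n):
    """Unsigned n-bit by n-bit multiply using n add-and-right-shift steps."""
    mask = (1 << n) - 1
    acc = 0                      # upper partial product (up to n+1 bits)
    q = multiplier & mask        # multiplier, later the low product bits
    for _ in range(n):
        if q & 1:                # current multiplier bit selects the add
            acc += multiplicand & mask
        # Shift the (acc, q) register pair right by one position.
        q = (q >> 1) | ((acc & 1) << (n - 1))
        acc >>= 1
    return (acc << n) | q

print(shift_add_multiply(42, 11, 6))
```

Only an N-bit adder is needed because the partial product moves right past a fixed multiplicand; each of the N iterations costs one adder delay, giving the O(N · t_adder) total.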




Improving Multipliers

• Making them faster (therefore, bigger area)
  - Use faster adders
  - Use higher radix (e.g., base 4) multiplication
      · Use multiplier recoding to simplify multiple formation
  - Form the partial product array in parallel and add it in parallel
• Making them smaller (i.e., slower)
  - Use array multipliers
      · Very regular structure with only short wires to nearest-neighbor cells; thus a very simple and efficient layout in VLSI
      · Can be easily and efficiently pipelined


Array (or Tree) Multiplier Structure

[Figure: three stages of a parallel multiplier. PP generation: multiple-forming circuits (mux) build the partial product array from the multiplicand D and the multiplier Q. PP accumulation: a reduction tree (log N). Final addition: a fast carry-propagate adder (CPA) (log N) produces the product P.]


Partial Product (PP) Generation

• Each row in the partial-product array is either a copy of the multiplicand or a row of zeros
• Careful optimization of the PP generation can lead to substantial delay and area reductions
  - Booth's and modified Booth's recoding

[Figure: one partial-product row; multiplier bit Yi gates multiplicand bits X7..X0 to produce PP7..PP0]
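As an illustration of how recoding reduces the number of partial products, the sketch below applies radix-4 (modified Booth) recoding. It rests on assumptions not stated on the slide: the multiplier is treated as an n-bit two's-complement number with n even, and each digit is formed from an overlapping triplet as y(2k-1) + y(2k) - 2·y(2k+1). Function names are illustrative.

```python
def booth_radix4_digits(y, n):
    """Recode an n-bit two's-complement multiplier (n even) into
    n/2 digits from {-2, -1, 0, 1, 2}, least significant first."""
    if n % 2:
        raise ValueError("n must be even for radix-4 recoding")
    bits = [(y >> i) & 1 for i in range(n)]
    digits = []
    prev = 0                                  # implicit bit y(-1) = 0
    for i in range(0, n, 2):
        digits.append(prev + bits[i] - 2 * bits[i + 1])
        prev = bits[i + 1]
    return digits

def booth_multiply(x, y, n):
    """Multiply x by the n-bit two's-complement value encoded in y,
    using one partial product per radix-4 digit."""
    return sum(d * x * 4 ** k
               for k, d in enumerate(booth_radix4_digits(y, n)))

print(booth_radix4_digits(0b0110, 4), booth_multiply(7, 0b0110, 4))
```

Each digit selects 0, ±X, or ±2X, all of which are cheap to form (a shift and an optional negation), so an n-bit multiplier yields only n/2 partial products.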


Array Multiplier Implementation

[Figure: 4 × 4 array multiplier with inputs X3..X0 and Y0..Y3; each row of HA and FA cells is the hardware for one partial product; product outputs Z are taken along the right edge and bottom row; two critical paths, CP1 and CP2, are marked]

HA: half adder, FA: full adder, CP: critical path

$t_{array\_mult} = [(M-1) + (N-2)]\,t_{carry} + (N-1)\,t_{sum} + t_{and} = O(N)$

* Assume tadd = tcarry




Carry-Save Multiplier

• The idea is to "save" the (PP) carry and add it in the next adder stage
• In the final addition a fast carry-propagate (e.g., carry-lookahead) adder is used

[Figure: 4 × 4 carry-save multiplier array of HA and FA cells (6 HAs, 6 FAs) followed by a vector merging adder; the critical path is unique and shorter]

$t_{CSM} = (N-1)\,t_{carry} + t_{merge} + t_{and} = O(N)$
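The "save the carry" idea can be modeled on whole words: a row of full adders turns three operands into a sum word and a carry word without any carry propagating along the row. The sketch below accumulates the partial products this way and performs a single ordinary addition at the end, standing in for the vector merging adder. Function names are illustrative.

```python
def carry_save_add(a, b, c):
    """Bitwise 3-to-2 compression: a + b + c == s + carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry

def carry_save_multiply(x, y, n):
    """Multiply unsigned x by an n-bit unsigned y, keeping the running
    result in redundant (sum, carry) form until the very end."""
    s, c = 0, 0
    for j in range(n):
        pp = (x << j) if (y >> j) & 1 else 0   # partial product row j
        s, c = carry_save_add(s, c, pp)
    return s + c                                # vector merging adder

print(carry_save_multiply(42, 11, 4))
```

Because each compression step has the delay of a single full adder regardless of word width, only the final merge needs a fast carry-propagate adder.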


CSM Floorplan

[Figure: floorplan of the 4 × 4 carry-save multiplier; rows of cells with sum (S) and carry (C) outputs, inputs X0..X3 and Y0..Y3, outputs Z0..Z7; cell types: HA multiplier cell, FA multiplier cell, vector merging cell]

• X and Y signals are broadcast through the complete array
• Regularity makes the generation of the structure amenable to automation


Wallace-Tree Multiplier

GOAL: minimize depth (number of stages) with the minimum number of adder elements

[Figure: dot diagrams over bit positions 6 to 0 showing the partial products, the rearranged partial products, the first stage, the second stage, and the final adder, with the HA and FA groupings marked]

• Rearrange the PPs
• Cover the tree with HAs and FAs, starting from the densest part
• Any type of adder can be used for the final adder


Wallace-Tree Multiplier Implementation

[Figure: 4 × 4 Wallace-tree multiplier; the partial products xiyj feed a first stage and a second stage of FAs and HAs, followed by the final adder producing z7..z0]

• 3 HAs and 3 FAs for the reduction process (stage 1 + stage 2)
• Any type of adder can be used for the final adder




Notes on Wallace-Tree Multiplier

• The Wallace tree substantially saves hardware for large multipliers
  - The number of partial products is reduced to two-thirds per stage
• The propagation delay is found to be bounded by

  $t_{WTM} = O(\log_{3/2} N)$

• Although substantially faster than the CSM, the WTM structure is very irregular
  - Difficult to find an efficient VLSI layout
• Many of today's high-performance multipliers use higher-order (e.g., 4-2) compressors instead of 3-2 compressors (FAs)
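The logarithmic bound can be made concrete by counting reduction stages. The sketch assumes an idealized row-level model, introduced here for illustration: every complete group of three rows is compressed to two by a layer of 3-2 compressors, and leftover rows pass through unchanged until only two rows remain for the final adder.

```python
def wallace_stages(rows):
    """Number of 3-2 compressor stages needed to reduce the given
    number of partial product rows to two."""
    stages = 0
    while rows > 2:
        rows = 2 * (rows // 3) + rows % 3   # each group of 3 becomes 2
        stages += 1
    return stages

for n in (4, 8, 16, 32, 64):
    print(n, wallace_stages(n))
```

For 4 rows the model gives 2 stages, in line with the first and second stage of the 4 × 4 example; doubling the operand width adds only about two more stages.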


Parallel Programmable Shifters

• Shifting a data word left or right over a constant amount is a trivial hardware operation and is implemented by the appropriate signal wiring
• Shifters are used in multipliers and floating-point units
• Shifters consume lots of area if done in random logic gates

[Figure: shifter block between Data In and Data Out; control = shift amount, shift direction, shift type (logical, arithmetic, circular)]


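A Programmable Binary Shifter

[Figure: bit-slice i of a one-bit programmable shifter; pass transistors controlled by Right, nop, and Left connect Ai and Ai-1 to Bi and Bi-1; exactly one control signal is active]

Ai  Ai-1 | right  nop  left | Bi  Bi-1
A1  A0   |   0     1    0   | A1  A0
A1  A0   |   1     0    0   | 0   A1
A1  A0   |   0     0    1   | A0  0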
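The table can be reproduced with a small model in which each output bit is the OR of three gated inputs, one per control line. The sketch assumes LSB-first bit lists and zero fill at the ends; the function name is illustrative.

```python
def shift_by_one(a_bits, right, nop, left):
    """One-position programmable shifter with one-hot control.
    a_bits is LSB first; vacated positions are filled with 0."""
    if right + nop + left != 1:
        raise ValueError("exactly one control signal must be active")
    n = len(a_bits)
    b_bits = []
    for i in range(n):
        a_hi = a_bits[i + 1] if i + 1 < n else 0   # neighbor above
        a_lo = a_bits[i - 1] if i > 0 else 0       # neighbor below
        b_bits.append((nop & a_bits[i]) | (right & a_hi) | (left & a_lo))
    return b_bits

# A1 A0 = 1 0, i.e. a_bits = [A0, A1] = [0, 1]
print(shift_by_one([0, 1], 1, 0, 0))   # right shift
print(shift_by_one([0, 1], 0, 0, 1))   # left shift
```

The one-hot requirement mirrors the hardware: if two pass transistors on the same output were on at once, two inputs would be shorted together.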


4-bit Barrel Shifter

[Figure: 4 × 4 array of pass transistors connecting inputs A0..A3 to outputs B0..B3, with control lines Sh0..Sh3; area dominated by wiring]

Example (arithmetic shift):

Sh0 = 1: B3B2B1B0 = A3A2A1A0

Sh1 = 1: B3B2B1B0 = A3A3A2A1

Sh2 = 1: B3B2B1B0 = A3A3A3A2

Sh3 = 1: B3B2B1B0 = A3A3A3A3
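The example rows follow one rule: with Shk active, output Bi takes input A(i+k), and positions that would run past the top repeat the sign bit A3. The sketch below implements that rule for symbolic inputs so the output can be compared with the example directly; the function name is illustrative.

```python
def barrel_shift_arith_right(a, sh):
    """Arithmetic right shift with one-hot control.
    a  : inputs [A0, A1, ..., A(n-1)], LSB first (any values or labels)
    sh : one-hot list [Sh0, Sh1, ..., Sh(n-1)]
    Returns [B0, B1, ..., B(n-1)]."""
    if sum(sh) != 1:
        raise ValueError("exactly one shift control must be active")
    n = len(a)
    k = sh.index(1)                       # shift distance
    # Each output connects to exactly one input through one switch.
    return [a[min(i + k, n - 1)] for i in range(n)]

inputs = ["A0", "A1", "A2", "A3"]
for k in range(4):
    sh = [1 if j == k else 0 for j in range(4)]
    b = barrel_shift_arith_right(inputs, sh)
    print("Sh%d = 1: B3B2B1B0 =" % k, "".join(reversed(b)))
```

Every output is reached through a single selected connection, which corresponds to the signal passing through at most one pass transistor in the array.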




Notes on Barrel Shifter

• Note that a signal goes through at most one FET (so, in theory, the propagation delay is constant)
• Also note that the FET diffusion capacitance on an output wire increases linearly with the shift width, but the FET diffusion capacitance on the input data lines increases quadratically (i.e., N² for a circular shifter)
• The size of the cell is bounded by the pitch of the metal wires
• A decoder is usually needed for the shift control signals, since the shift amount is normally given as an (encoded) binary number


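4-bit Barrel Shifter Layout

[Figure: layout of the 4-bit barrel shifter with input rows A0..A3, control columns Sh0..Sh3, and an output buffer; the total width is marked as Width_barrel]

$Width_{barrel} \approx 2\,p_m N$, where N = max shift distance and $p_m$ = metal pitch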


8-bit Logarithmic Shifter

[Figure: logarithmic shifter slice with inputs A0..A3 and outputs B0..B3 passing through log N stages controlled by Sh1/!Sh1, Sh2/!Sh2, and Sh3/!Sh3]
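A logarithmic shifter decomposes the shift amount into its binary digits, with stage k either passing the word unchanged or shifting it by 2^k. The sketch below assumes a logical right shift with zero fill, which is a choice made here for illustration; the slide does not state the shift type. Helper names are illustrative.

```python
def to_bits(x, n):
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

def log_shift_right(a_bits, amount):
    """Logical right shift by 0..n-1 positions using log2(n) stages.
    a_bits is LSB first; stage k shifts by 2**k when bit k of the
    (binary encoded) shift amount is set."""
    n = len(a_bits)
    bits = list(a_bits)
    stage = 0
    while (1 << stage) < n:
        if (amount >> stage) & 1:
            d = 1 << stage
            bits = [bits[i + d] if i + d < n else 0 for i in range(n)]
        stage += 1
    return bits

print(from_bits(log_shift_right(to_bits(0b10110100, 8), 3)))
```

Unlike the barrel shifter, the control inputs are the binary-encoded shift amount itself, so no decoder is needed, but a signal passes through one switch per stage.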


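8-bit Logarithmic Shifter Layout Slice

[Figure: layout slice with inputs A0..A3 and outputs B0..B3; stage shift distances 1, 2, and 4]

$Width_{log} \approx p_m\,(2K + (1 + 2 + \dots + 2^{K-1})) = p_m\,(2K + 2^K - 1)$, where $K = \log_2 N$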




Shifter Implementation Comparisons

          Barrel                        Logarithmic
N    K    Width        Speed            Width                 Speed
          (2 N pm)     (1 + N diffs)    (pm(2K + 2^K - 1))    (K + 2 diffs)
8    3    16 pm        1 + 8            13 pm                 3 + 2
16   4    32 pm        1 + 16           23 pm                 4 + 2
32   5    64 pm        1 + 32           41 pm                 5 + 2
64   6    128 pm       1 + 64           75 pm                 6 + 2

• The barrel shifter is better for small shifters (faster, not much bigger), while the log shifter is preferred for larger shifters
  - Log shifters are always smaller
• For large shifters we may have to start worrying about the number of pass transistors in series


Decoders

• Decodes inputs to activate one of many outputs

[Figure: 2-to-4 decoder block with inputs In0, In1, and Enable and outputs Out0..Out3]

  Out0 = !In0 !In1
  Out1 = In0 !In1
  Out2 = !In0 In1
  Out3 = In0 In1

• Cost of 2-to-4 decoder
  - two inverters, four 2-input NAND gates, four inverters, plus enable logic
  - how about the cost of a 3-to-8, 4-to-16, etc. decoder?
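A behavioral model of the 2-to-4 decoder, together with a rough gate count for larger sizes, is sketched below. The output numbering (In0 as the least significant input) and the scaling rule (n input inverters, 2^n n-input NAND gates, 2^n output inverters, enable logic not counted) are assumptions made for this sketch, extrapolated from the 2-to-4 cost given above.

```python
def decoder_2to4(in0, in1, enable=1):
    """Return [Out0, Out1, Out2, Out3]; In0 is taken as the LSB."""
    n0, n1 = 1 - in0, 1 - in1
    outs = [n0 & n1, in0 & n1, n0 & in1, in0 & in1]
    return [o & enable for o in outs]

def decoder_gate_count(n):
    """Assumed cost of a flat n-to-2^n decoder built the same way."""
    return {
        "input_inverters": n,
        "nand_gates": 2 ** n,
        "nand_fan_in": n,
        "output_inverters": 2 ** n,
    }

print(decoder_2to4(1, 0))
for n in (2, 3, 4):
    print(n, decoder_gate_count(n))
```

In this flat style both the number of gates and the fan-in of each gate grow with n, which is one reason larger decoders are usually built hierarchically from smaller ones.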


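Dynamic NOR Decoder

[Figure: 2-to-4 dynamic NOR decoder; output lines B0..B3 are precharged to Vdd and pulled down to GND by transistors controlled by the address inputs A0 and A1; active HIGH outputs]

Capacitance of the output wires increases linearly with the decoder size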


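Dynamic NAND Decoder

[Figure: 2-to-4 dynamic NAND decoder; precharged output lines B0..B3, each discharged to GND through series transistors controlled by the address inputs A0 and A1]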




Notes on Dynamic Decoders

• In the dynamic NOR decoder, a signal goes through at most one FET
  - So the propagation delay is constant (in theory)
  - However, some output wires may have two or more parallel paths to GND, effectively shortening the transition time
• By contrast, a signal in the dynamic NAND decoder passes through a series of FETs
  - The number of FETs rises linearly with the decoder size
  - Thus it will be slower than the NOR implementation if the gate capacitance dominates the diffusion capacitance
• For the NAND decoder, all the input signals must be low during precharge, or else Vdd and GND will be connected!


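Building Bigger Decoders

[Figure: larger decoder built hierarchically from a 1x2 decoder and several 2x4 decoders, with address inputs A4..A0 and an enable input; example input pattern 0 0 0 0 1, with signal transitions → 0 and → 1 marked]

• Active-low enable, active-low output
• Need to catch the output that goes to zero before it precharges again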


Layout of Bit-Sliced Datapaths

[Figure: layout of a bit-sliced datapath with the following annotations]

• Must have enough drive capacity to handle large fan-out
• Sized for peak current
• Horizontal gap for feeding signals to the cells downstream


Optimizing Bit-sliced Datapaths

[Figure: three layouts of the same datapath]

• Without feedthroughs or pitch matching (4.2 mm²)
• With feedthroughs (3.2 mm²)
• With feedthroughs and pitch matching (2.2 mm²)
