Comp Arith Notes


Eidgenössische Technische Hochschule Zürich

Swiss Federal Institute of Technology Zurich
Politecnico federale di Zurigo
Ecole polytechnique fédérale de Zurich

Institut für Integrierte Systeme
Integrated Systems Laboratory

Lecture notes on

Computer Arithmetic: Principles, Architectures, and VLSI Design

March 16, 1999

Reto Zimmermann

Integrated Systems Laboratory
Swiss Federal Institute of Technology (ETH)
CH-8092 Zürich, Switzerland
[email protected]

Copyright © 1999 by Integrated Systems Laboratory, ETH Zürich
http://www.iis.ee.ethz.ch/~zimmi/publications/comp_arith_notes.ps.gz


Contents

1 Introduction and Conventions
  1.1 Outline
  1.2 Motivation
  1.3 Conventions
  1.4 Recursive Function Evaluation

2 Arithmetic Operations
  2.1 Overview
  2.2 Implementation Techniques

3 Number Representations
  3.1 Binary Number Systems (BNS)
  3.2 Gray Numbers
  3.3 Redundant Number Systems
  3.4 Residue Number Systems (RNS)
  3.5 Floating-Point Numbers
  3.6 Logarithmic Number System
  3.7 Antitetrational Number System
  3.8 Composite Arithmetic
  3.9 Round-Off Schemes

4 Addition
  4.1 Overview
  4.2 1-Bit Adders, (m, k)-Counters
  4.3 Carry-Propagate Adders (CPA)
  4.4 Carry-Save Adder (CSA)
  4.5 Multi-Operand Adders
  4.6 Sequential Adders

5 Simple / Addition-Based Operations
  5.1 Complement and Subtraction
  5.2 Increment / Decrement
  5.3 Counting
  5.4 Comparison, Coding, Detection
  5.5 Shift, Extension, Saturation
  5.6 Addition Flags
  5.7 Arithmetic Logic Unit (ALU)

6 Multiplication
  6.1 Multiplication Basics
  6.2 Unsigned Array Multiplier
  6.3 Signed Array Multipliers
  6.4 Booth Recoding
  6.5 Wallace Tree Addition
  6.6 Multiplier Implementations
  6.7 Composition from Smaller Multipliers
  6.8 Squaring

7 Division / Square Root Extraction
  7.1 Division Basics
  7.2 Restoring Division
  7.3 Non-Restoring Division
  7.4 Signed Division
  7.5 SRT Division
  7.6 High-Radix Division
  7.7 Division by Multiplication
  7.8 Remainder / Modulus
  7.9 Divider Implementations
  7.10 Square Root Extraction

8 Elementary Functions
  8.1 Algorithms
  8.2 Integer Exponentiation
  8.3 Integer Logarithm

9 VLSI Design Aspects
  9.1 Design Levels
  9.2 Synthesis
  9.3 VHDL
  9.4 Performance
  9.5 Testability

Bibliography


1 Introduction and Conventions

1.1 Outline

Basic principles of computer arithmetic [1, 2, 3, 4, 5, 6, 7]

Circuit architectures and implementations of main arithmetic operations

Aspects regarding VLSI design of arithmetic units

1.2 Motivation

Arithmetic units are, among others, the core of every data path and addressing unit

The data path is the core of:
  microprocessors (CPU)
  signal processors (DSP)
  data-processing application-specific ICs (ASIC) and programmable ICs (e.g. FPGA)

Standard arithmetic units are available from libraries

Design of arithmetic units is necessary for:
  non-standard operations
  high-performance components
  library development


1.3 Conventions

Naming conventions

Signal buses: $A$ (1-D), $A_i$ (2-D), $a_{i:k}$ (subbus, 1-D)

Signals: $a$, $a_i$ (1-D), $a_{i,k}$ (2-D), $A_{i:k}$ (group signal)

Circuit complexity measures: $A$ (area), $T$ (cycle time, delay), $AT$ (area-time product), $L$ (latency, # cycles)

Arithmetic operators: $+$, $-$, $\cdot$, $/$, $\log$ ($= \log_2$)

Logic operators: $\lor$ (or), $\land$ (and), $\oplus$ (xor), $\odot$ (xnor), $\bar{a}$ (not)

Circuit complexity measures

Unit-gate model (≈ gate-equivalents (GE) model):
  Inverter, buffer: $A = 0$, $T = 0$ (i.e. ignored)
  Simple monotonic 2-input gates (AND, NAND, OR, NOR): $A = 1$, $T = 1$
  Simple non-monotonic 2-input gates (XOR, XNOR): $A = 2$, $T = 2$
  Complex gates: composed from simple gates
  ⇒ Simple $m$-input gates: $A = m - 1$, $T = \lceil \log m \rceil$

Wiring not considered (acceptable for comparison purposes, local wiring, multilevel metallization)

Only estimations given for complex circuits


1.4 Recursive Function Evaluation

Given: inputs $a_i$, outputs $z_i$, function $f$ (graph symbol: •)

Non-recursive functions (n.)

  Output $z_i$ is a function of input $a_i$ (or $a_{i+j}$, $j$ const.)

  $z_i = f(a_i)$ ; $i = 0, \ldots, n-1$

  parallel structure: $A = O(n)$, $T = O(1)$

  [Figure funn.epsi: parallel structure, inputs $a_0 \ldots a_3$, outputs $z_0 \ldots z_3$]

Recursive functions (r.)

  Output $z_i$ is a function of all inputs $a_k$, $k \le i$

a) with single output $z = z_{n-1}$ (r.s.):

  $t_i = f(a_i, t_{i-1})$ ; $i = 0, \ldots, n-1$ ; $t_{-1} = 0/1$, $z = t_{n-1}$

  1. $f$ is non-associative (r.s.n.)
     serial structure: $A = O(n)$, $T = O(n)$
     [Figure funrsn.epsi: serial chain, inputs $a_0 \ldots a_3$, output $z$]

  2. $f$ is associative (r.s.a.)
     serial or single-tree structure: $A = O(n)$, $T = O(\log n)$
     [Figure funrsa.epsi: single tree, inputs $a_0 \ldots a_3$, output $z$]

b) with multiple outputs $z_i$ (r.m.) (≡ prefix problem):

  $z_i = f(a_i, z_{i-1})$ ; $i = 0, \ldots, n-1$ ; $z_{-1} = 0/1$

  1. $f$ is non-associative (r.m.n.)
     serial structure: $A = O(n)$, $T = O(n)$
     [Figure funrmn.epsi: serial chain, inputs $a_0 \ldots a_3$, outputs $z_0 \ldots z_3$]

  2. $f$ is associative (r.m.a.)
     serial or multi-tree structure: $A = O(n^2)$, $T = O(\log n)$
     [Figure funrma1.epsi: one tree per output, inputs $a_0 \ldots a_3$, outputs $z_0 \ldots z_3$]
     or shared-tree structure: $A = O(n \log n)$, $T = O(\log n)$
     [Figure funrma2.epsi: shared tree, inputs $a_0 \ldots a_3$, outputs $z_0 \ldots z_3$]
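
The difference between the serial and the shared-tree structure for an associative function with multiple outputs (r.m.a.) can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names are hypothetical, and the tree used here is one possible shared-tree arrangement (a Sklansky-style one), not necessarily the one drawn in the figures.

```python
from operator import add

def serial_prefix(a, f):
    """Serial structure: z_i = f(a_i, z_{i-1}), n-1 operations in series."""
    z = [a[0]]
    for x in a[1:]:
        z.append(f(x, z[-1]))
    return z

def shared_tree_prefix(a, f):
    """Shared-tree structure for an associative f.

    Returns the prefix results and the number of levels (about log2 n).
    """
    z = list(a)
    n, span, levels = len(a), 1, 0
    while span < n:
        for i in range(n):
            if i & span:  # i lies in the upper half of a block of 2*span
                k = (i & ~(2 * span - 1)) + span - 1  # last index of lower half
                z[i] = f(z[i], z[k])
        span *= 2
        levels += 1
    return z, levels

if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7, 8]
    print(serial_prefix(data, add))
    print(shared_tree_prefix(data, add))
```

Both functions produce the same outputs; the serial version needs n-1 dependent steps, while the tree version finishes in about log n levels at the cost of more operations.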




2 Arithmetic Operations

2.1 Overview

[Figure arithops.epsi: arithmetic operations ordered by complexity for fixed-point numbers (the floating-point side is the same as the fixed-point side), with links marking "based on operation" and "related operation"]

Operations shown in the figure:
   1 shift/extension
   2 comparison
   3 increment/decrement
   4 complement
   5 addition/subtraction
   6 multiplication
   7 division
   8 square root extraction
   9 exponential function
  10 logarithm function
  11 trigonometric functions
  12 hyperbolic functions


2.2 Implementation Techniques

Direct implementation of dedicated units:
  always: 1-5
  in most cases: 6
  sometimes: 7, 8

Sequential implementation using simpler units and several clock cycles (⇒ decomposition):
  sometimes: 6
  in most cases: 7, 8, 9

Table look-up techniques using ROMs:
  universal: simple application to all operations
  efficient only for single-operand operations of high complexity (8-12) and small word length (note: ROM size $= 2^n \times n$)

Approximation techniques using simpler units: 7-12
  Taylor series expansion
  polynomial and rational approximations
  convergence of recursive equation systems
  CORDIC (COordinate Rotation DIgital Computer)
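
As an illustration of the last approximation technique, the following is a minimal rotation-mode CORDIC sketch that computes sine and cosine from shift-and-add style iterations. It is an assumption-laden example, not taken from the notes: floating-point numbers are used for readability (hardware would use fixed-point values and shifts), and the function name and iteration count are hypothetical.

```python
import math

def cordic_sin_cos(theta, iterations=32):
    """Rotation-mode CORDIC for |theta| <= pi/2 (assumed input range)."""
    # Elementary rotation angles atan(2^-i) would be table constants in hardware.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Constant gain of the micro-rotations, compensated in the start vector.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = 1.0 / gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0           # rotate towards residual angle 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x                                # (sin, cos)

if __name__ == "__main__":
    print(cordic_sin_cos(0.5), (math.sin(0.5), math.cos(0.5)))
```

Each iteration only needs additions, shifts by i positions, and one table constant, which is why the technique suits simple hardware units.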



3 Number Representations

3.1 Binary Number Systems (BNS)

Radix-2, binary number system (BNS): irredundant, weighted, positional, monotonic [1, 2]

$n$-bit number $A$ is an ordered sequence of bits (binary digits):
  $A = (a_{n-1}, a_{n-2}, \ldots, a_0)_2$, $a_i \in \{0, 1\}$

Simple and efficient implementation in digital circuits

MSB/LSB (most-/least-significant bit): $a_{n-1}$ / $a_0$

Represents an integer or fixed-point number, exact

Fixed-point numbers:
  $A = (\underbrace{a_{m-1}, \ldots, a_0}_{m\text{-bit integer}} \,.\, \underbrace{a_{-1}, \ldots, a_{m-n}}_{(n-m)\text{-bit fraction}})$

Unsigned: positive or natural numbers

  Value: $A = a_{n-1} 2^{n-1} + \cdots + a_1 2 + a_0 = \sum_{i=0}^{n-1} a_i 2^i$

  Range: $[0, 2^n - 1]$

Two's (2's) complement: standard representation of signed or integer numbers

  Value: $A = -a_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i$

  Range: $[-2^{n-1}, 2^{n-1} - 1]$

  Complement: $-A = 2^n - A = \bar{A} + 1$, where $\bar{A} = (\bar{a}_{n-1}, \bar{a}_{n-2}, \ldots, \bar{a}_0)$

  Sign: $a_{n-1}$

  Properties: asymmetric range, compatible with unsigned numbers in many arithmetic operations (i.e. same treatment of positive and negative numbers)

One's (1's) complement: similar to 2's complement

  Value: $A = -a_{n-1} (2^{n-1} - 1) + \sum_{i=0}^{n-2} a_i 2^i$

  Range: $[-(2^{n-1} - 1), 2^{n-1} - 1]$

  Complement: $-A = 2^n - A - 1 = \bar{A}$

  Sign: $a_{n-1}$

  Properties: double representation of zero, symmetric range, modulo $(2^n - 1)$ number system

Sign-magnitude: alternative representation of signed numbers

  Value: $A = (-1)^{a_{n-1}} \cdot \sum_{i=0}^{n-2} a_i 2^i$

  Range: $[-(2^{n-1} - 1), 2^{n-1} - 1]$

  Complement: $-A = (\bar{a}_{n-1}, a_{n-2}, \ldots, a_0)$

  Sign: $a_{n-1}$

  Properties: double representation of zero, symmetric range, different treatment of positive and negative numbers in arithmetic operations, no MSB toggles at sign changes around 0 (⇒ low power)

Graphical representation

[Figure numrep.epsi: values of the bit patterns 00...0, 01...1, 10...0, 11...1 (binary number representation) under the unsigned, 2's complement, 1's complement, and sign-magnitude interpretations, on a value axis from $-2^{n-1}$ to $2^n - 1$]

Conventions

2's complement is used for signed numbers in these notes

Unsigned and signed numbers can be treated equally in most cases, exceptions are mentioned
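
The four value definitions of this section can be compared on the same bit pattern with a short Python sketch. The function name and the string labels for the number systems are hypothetical and only serve this illustration.

```python
def value(bits, system):
    """Value of an n-bit pattern, given MSB first as a list of 0/1."""
    n = len(bits)
    msb = bits[0]
    # Weighted sum of the lower n-1 bits: sum of a_i * 2^i for i = 0..n-2.
    lower = sum(b << (n - 2 - i) for i, b in enumerate(bits[1:]))
    if system == "unsigned":
        return (msb << (n - 1)) + lower
    if system == "twos":
        return -(msb << (n - 1)) + lower
    if system == "ones":
        return -msb * ((1 << (n - 1)) - 1) + lower
    if system == "sign-magnitude":
        return -lower if msb else lower
    raise ValueError("unknown number system: " + system)

if __name__ == "__main__":
    pattern = [1, 0, 1, 1]
    for name in ("unsigned", "twos", "ones", "sign-magnitude"):
        print(name, value(pattern, name))
```

The patterns 1111 (1's complement) and 1000 (sign-magnitude) both evaluate to 0, which shows the double representation of zero mentioned above.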



3.2 Gray Numbers

Gray numbers (code): binary, irredundant, non-weighted, non-monotonic

+ Property: unit-distance coding (i.e. exactly one bit toggles between adjacent numbers)

Applications: counters with low output toggle rate (low-power signal buses), representation of continuous signals for low-error sampling (no false numbers due to switching of different bits at different times)

- Non-monotonic numbers: difficult arithmetic operations, e.g. addition, comparison (the 2-bit Gray sequence is 00, 01, 11, 10: 00 < 01 as expected, but 11 < 10 although the LSBs compare as 1 > 0)

binary → Gray: $g_i = b_{i+1} \oplus b_i$ ; $b_n = 0$ ; $i = 0, \ldots, n-1$ (n.)

Gray → binary: $b_i = b_{i+1} \oplus g_i$ ; $b_n = 0$ ; $i = n-1, \ldots, 0$ (r.m.a.)

  value   binary ($b_3 b_2 b_1 b_0$)   Gray ($g_3 g_2 g_1 g_0$)
    0        0 0 0 0                     0 0 0 0
    1        0 0 0 1                     0 0 0 1
    2        0 0 1 0                     0 0 1 1
    3        0 0 1 1                     0 0 1 0
    4        0 1 0 0                     0 1 1 0
    5        0 1 0 1                     0 1 1 1
    6        0 1 1 0                     0 1 0 1
    7        0 1 1 1                     0 1 0 0
    8        1 0 0 0                     1 1 0 0
    9        1 0 0 1                     1 1 0 1
   10        1 0 1 0                     1 1 1 1
   11        1 0 1 1                     1 1 1 0
   12        1 1 0 0                     1 0 1 0
   13        1 1 0 1                     1 0 1 1
   14        1 1 1 0                     1 0 0 1
   15        1 1 1 1                     1 0 0 0
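
The two conversions can be sketched in Python on integers. The binary-to-Gray direction is a single non-recursive XOR per bit, whereas Gray-to-binary is a running XOR over all higher bits (a prefix problem). The function names are hypothetical.

```python
def binary_to_gray(b):
    """g_i = b_{i+1} XOR b_i for all bits at once (non-recursive)."""
    return b ^ (b >> 1)

def gray_to_binary(g):
    """b_i = XOR of g_i and all higher Gray bits (recursive, associative)."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

if __name__ == "__main__":
    for v in range(16):
        print(v, format(v, "04b"), format(binary_to_gray(v), "04b"))
```

Printing the 16 values gives the same codes as the 4-bit table of this section.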


3.3 Redundant Number Systems

Non-binary, redundant, weighted number systems [1, 2]

Digit set larger than radix (typically radix 2) ⇒ multiple representations of same number ⇒ redundancy

+ No carry-propagation in adders ⇒ more efficient impl. of adder-based units (e.g. multipliers and dividers)

- Redundancy ⇒ no direct implementation of relational operators ⇒ conversion to irredundant numbers

- Several bits used to represent one digit ⇒ higher storage requirements

- Expensive conversion into irredundant numbers (not necessary if redundant input operands are allowed)

Delayed-carry or half-adder number representation:

  $r_i \in \{0, 1, 2\}$, $c_i, s_i \in \{0, 1\}$,
  $r_i = (c_{i+1}, s_i) = 2 c_{i+1} + s_i$, $c_{i+1} s_i = 0$

  $R = \sum_{i=0}^{n-1} r_i 2^i = C + S$

  1 digit holds sum of 2 bits (no carry-out digit)

  example:

'

00 10)

00 10 01 01 '

10 00)

irredundant representation of 1 [8], since

%

1

0 &

1

1

0

Carry-save number representation:

  $r_i \in \{0, 1, 2, 3\}$, $c_i, s_i \in \{0, 1\}$,
  $r_i = (c_{i+1}, s_i) = 2 c_{i+1} + s_i$

  $R = \sum_{i=0}^{n-1} r_i 2^i = C + S$

  1 digit holds sum of 3 bits or 1 digit + 1 bit (no carry-out digit, i.e. carry is saved)

  standard redundant number system for fast addition

Signed-digit (SD) or redundant digit (RD) number representation:

  $s_i \in \{-1, 0, 1\} \equiv \{\bar{1}, 0, 1\}$, $S = \sum_{i=0}^{n-1} s_i 2^i$

  no carry-propagation in $S = A + B$:
    $r_i = a_i + b_i = (c_{i+1}, u_i) = 2 c_{i+1} + u_i$, $c_{i+1}, u_i \in \{\bar{1}, 0, 1\}$
    $(c_{i+1}, u_i)$ is redundant (e.g. $0 + 1 = 01 = 1\bar{1}$)
    $s_i = u_i + c_i \in \{\bar{1}, 0, 1\}$

  1 digit holds sum of 2 digits (no carry-out digit)

  minimal SD representation: minimal number of non-zero digits (e.g. $011 \ldots 1 \to 10 \ldots 0\bar{1}$)
    applications: sequential multiplication (less cycles), filters with constant coefficients (less hardware)
    example: $7 = (0111 \mid 1\bar{1}11 \mid 10\bar{1}1 \mid 100\bar{1} \text{ (minimal)} \mid 1\bar{1}\bar{1}11 \mid \ldots)$

  canonical SD repres.: minimal SD + not two non-zero digits in sequence (e.g. $01\bar{1} \to 001$)

  SD → binary: carry-propagation necessary (⇒ adder)

  other applications: high-speed multipliers [9]

  similar to carry-save, simple use for signed numbers
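
A canonical signed-digit recoding can be sketched as follows. The notes do not give an algorithm, so this is an assumption: it uses the well-known non-adjacent-form recoding for non-negative integers, and the function name is hypothetical.

```python
def canonical_sd(value):
    """Canonical signed-digit recoding of a non-negative integer.

    Returns digits in {-1, 0, 1}, least-significant digit first,
    with no two non-zero digits next to each other.
    """
    digits = []
    while value != 0:
        if value & 1:
            d = 2 - (value & 3)   # +1 if value mod 4 == 1, -1 if value mod 4 == 3
            value -= d
        else:
            d = 0
        digits.append(d)
        value >>= 1
    return digits

if __name__ == "__main__":
    # 7 is recoded as 1 0 0 -1 (MSB first), i.e. 8 - 1
    print(list(reversed(canonical_sd(7))))
```

For 7 the result has two non-zero digits instead of three, which is what reduces the number of cycles in sequential multiplication or the hardware in constant-coefficient filters.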




3.4 Residue Number Systems (RNS)

Non-binary, irredundant, non-weighted number system [1]

+ Carry-free and fast additions and multiplications

- Complex and slow other arithmetic operations (e.g. comparison, sign and overflow detection) because digits are not weighted ⇒ conversion to weighted mixed-radix or binary system required

Codes for error detection and correction [1]

Possible applications (but hardly used):
  digital filters: fast additions and multiplications
  error detection and correction for arithmetic operations in conventional and residue number systems

Base is $n$-tuple of integers $(m_{n-1}, m_{n-2}, \ldots, m_0)$, residues (or moduli) $m_i$ pairwise relatively prime

  $A = (a_{n-1}, a_{n-2}, \ldots, a_0)_{m_{n-1}, m_{n-2}, \ldots, m_0}$, $a_i \in \{0, 1, \ldots, m_i - 1\}$

Range: $M = \prod_{i=0}^{n-1} m_i$, anywhere in $\mathbb{Z}$

  $a_i = A \bmod m_i = |A|_{m_i}$, $A = \left| \sum_{i=0}^{n-1} a_i w_i \right|_M$, $w_i = (\ldots, 0, 1, 0, \ldots)$ (digit $i$ is 1)


Arithmetic operations (each digit computed separately):

  $z_i = |Z|_{m_i} = |f(A)|_{m_i} = \left| f(|A|_{m_i}) \right|_{m_i}$

  $|A \pm B|_m = \left| |A|_m \pm |B|_m \right|_m$

  $|A \cdot B|_m = \left| |A|_m \cdot |B|_m \right|_m$

  $|a^{m-1}|_m = 1 \Rightarrow |a^{-1}|_m = |a^{m-2}|_m$ (Fermat's theorem, $m$ prime)

Best moduli are $2^k$ and $(2^k - 1)$:
  high storage efficiency with $k$ bits
  simple modular addition:
    $2^k$: $k$-bit adder without $c_{out}$
    $2^k - 1$: $k$-bit adder with end-around carry ($c_{in} = c_{out}$)

Example: $(m_1, m_0) = (3, 2)$, $M = 6$

  $A$     -4 -3 -2 -1  0  1  2  3  4  5  6  7  8
  $a_1$    2  0  1  2  0  1  2  0  1  2  0  1  2
  $a_0$    0  1  0  1  0  1  0  1  0  1  0  1  0

  (any window of 6 consecutive values is a possible range)

  $|5|_6 = (a_1, a_0) = (|5|_3, |5|_2) = (2, 1)$

  $|4 + 5|_6 = (1, 0) + (2, 1) = (|1 + 2|_3, |0 + 1|_2) = (0, 1) = |3|_6$

  $|4 \cdot 5|_6 = (1, 0) \cdot (2, 1) = (|1 \cdot 2|_3, |0 \cdot 1|_2) = (2, 0) = |2|_6$
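
The worked example with moduli (3, 2) can be replayed with a few lines of Python. The helper names are hypothetical, and the back-conversion is a brute-force search that is only sensible for such a tiny range.

```python
MODULI = (3, 2)          # (m_1, m_0) from the example, range M = 6

def to_rns(a):
    """Residue digits (a_1, a_0) of an integer."""
    return tuple(a % m for m in MODULI)

def rns_add(x, y):
    """Digit-wise modular addition, no carries between digits."""
    return tuple((xi + yi) % m for xi, yi, m in zip(x, y, MODULI))

def rns_mul(x, y):
    """Digit-wise modular multiplication."""
    return tuple((xi * yi) % m for xi, yi, m in zip(x, y, MODULI))

def from_rns(r):
    """Back-conversion into the range [0, M) by exhaustive search."""
    big_m = 1
    for m in MODULI:
        big_m *= m
    return next(a for a in range(big_m) if to_rns(a) == tuple(r))

if __name__ == "__main__":
    print(to_rns(5))                                   # (2, 1)
    print(from_rns(rns_add(to_rns(4), to_rns(5))))     # 3, i.e. 9 mod 6
    print(from_rns(rns_mul(to_rns(4), to_rns(5))))     # 2, i.e. 20 mod 6
```

Addition and multiplication touch each digit independently, while the conversion back to a weighted number is the expensive step.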


3.5 Floating-Point Numbers

Larger range, smaller precision than fixed-point representation, inexact, real numbers [1, 2]

Double-number form ⇒ discontinuous precision

  | S | biased exponent E | unsigned norm. mantissa M |

  $F = (-1)^S \cdot 1.M \cdot 2^{E - \mathit{bias}}$

Basic arithmetic operations (mantissas $M_A$, $M_B$ including the leading 1, exponents $E_A \ge E_B$ with bias removed):

  $A \cdot B = (-1)^{S_A \oplus S_B} \cdot (M_A \cdot M_B) \cdot 2^{E_A + E_B}$

  $A \pm B = \left( (-1)^{S_A} M_A \pm (-1)^{S_B} M_B \cdot 2^{-(E_A - E_B)} \right) \cdot 2^{E_A}$

  based on fixed-point add, multiply, and shift operations
  postnormalization required (keeps the mantissa normalized)

Applications:
  processors: real floating-point formats (e.g. IEEE standard), large range due to universal use
  ASICs: usually simplified floating-point formats with small exponents, smaller range, used for range extension of normal fixed-point numbers

IEEE floating-point format:

  precision   n    n_M   n_E   bias   range          precision
  single      32   23    8     127    3.8 · 10^38    10^-7
  double      64   52    11    1023   9 · 10^307     10^-15


3.6 Logarithmic Number System

Alternative representation to floating-point (i.e. mantissa + integer exponent ⇒ only fixed-point exponent) [1]

Single-number form ⇒ continuous precision ⇒ higher accuracy, more reliable

  | S | biased fixed-point exponent E |

  $L = (-1)^S \cdot 2^{E - \mathit{bias}}$ (signed-logarithmic)

Basic arithmetic operations:
  multiplication, division: addition, subtraction of the exponents (additionally consider sign)
  addition, subtraction: by approximation or addition in conventional number system and double conversion

'

1)

%

'

'

1)

'

(

)

0

'

1)

1

'

+ Simpler multiplication/exponent., more complex addition

- Expensive conversion: (anti)logarithms (table look-up)

Applications: real-time digital filters

3.7 Antitetrational Number System

Tetration (t.$(x) = 2^{2^{\cdot^{\cdot^{2}}}}$, a tower of $x$ twos) and antitetration (a.t.$(x)$) [10]

Larger range, smaller precision than logarithmic repres., otherwise analogous (i.e. $2^x$ ⇒ t.$(x)$, $\log x$ ⇒ a.t.$(x)$)




3.8 Composite Arithmetic

Proposal for a new standard of number representations [10]

Scheme for storage and display of exact (primary: integer, secondary: rational) and inexact (primary: logarithmic, secondary: antitetrational) numbers

Secondary forms used for numbers not representable by primary ones (⇒ no over-/underflow handling necessary)

Choice of number representation hidden from user, i.e. software/compiler selects format for highest accuracy

Number representations:

  form               tag   value
  integer            00    2's complement integer
  rational           01    slash | denominator | numerator
  logarithmic        10    log integer | log fraction
  antitetrational    11    a.t. integer | a.t. fraction

Rational numbers: slash position (i.e. size of numerator/denominator) is variable and stored (floating slash)

Storage form sizes: 32-bit (short), 64-bit (normal), 128-bit (long), 256-bit (extended)

Implementation: mixed hardware/software solutions

Hardware proposal: long accumulator (4096 bits) holds any floating-point number in fixed-point format
  ⇒ higher accuracy
  - large hardware/software overhead


3.9 Round-Off Schemes

Intermediate results with $d$ additional lower bits (⇒ higher accuracy):
  $A = (a_{n-1}, \ldots, a_0, a_{-1}, \ldots, a_{-d})$

Rounding: keeping error $\varepsilon$ small during final word length reduction:
  $R = (r_{n-1}, \ldots, r_0) = A - \varepsilon$

Trade-off: numerical accuracy vs. implementation cost

Truncation:
  $R_{\mathrm{trunc}} = (a_{n-1}, \ldots, a_0)$
  $0 \le \varepsilon < 1$, bias $\approx \frac{1}{2}$ (= average error $\varepsilon$)

Round-to-nearest (i.e. normal rounding):
  $R_{\mathrm{round}}$ = truncation of $A + 0.1_2$
  $-\frac{1}{2} \le \varepsilon < \frac{1}{2}$, bias $\approx 0$ (nearly symmetric)
  $+0.1_2$ can often be included in previous operation

Round-to-nearest-even/-odd:
  $R = R_{\mathrm{round}}$ if $(a_{-1}, \ldots, a_{-d}) \ne (1, 0, \ldots, 0)$
  $R = (a_{n-1}, \ldots, a_1, 0/1)$ otherwise
  bias $= 0$ (symmetric)
  mandatory in IEEE floating-point standard

3 guard bits for rounding after floating-point operations: guard bit G (postnormalization), round bit R (round-to-nearest), sticky bit S (round-to-nearest-even)
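
The three rounding schemes can be compared on fixed-point values stored as integers with d fractional bits. This is a sketch under assumptions: the function names are hypothetical, d is at least 1, and Python's arithmetic right shift (rounding towards minus infinity) stands in for dropping the lower bits of a 2's complement number.

```python
def truncate(x, d):
    """Drop the d lower bits."""
    return x >> d

def round_nearest(x, d):
    """Add one half (binary 0.1) and truncate."""
    return (x + (1 << (d - 1))) >> d

def round_nearest_even(x, d):
    """Like round-to-nearest, but ties go to the even neighbour."""
    q = x >> d
    rem = x & ((1 << d) - 1)        # dropped bits a_{-1} ... a_{-d}
    half = 1 << (d - 1)             # pattern 1 0 ... 0
    if rem > half or (rem == half and (q & 1)):
        q += 1
    return q

if __name__ == "__main__":
    d = 2                            # values are x / 4
    for x in (10, 11, 14):           # 2.5, 2.75, 3.5
        print(x / 4, truncate(x, d), round_nearest(x, d), round_nearest_even(x, d))
```

The tie cases 2.5 and 3.5 show the difference: plain round-to-nearest always rounds them up, round-to-nearest-even rounds one down and one up, which removes the bias.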



4 Addition

4.1 Overview

[Figure adders.epsi: overview of adder types. 1-bit adders: HA, FA, (m,k), (m,2). Carry-propagate adders (CPA): RCA, CSKA, CSLA, CIA, CLA, PPA, COSA. Carry-save adders (CSA, 3-operand). Multi-operand adders: adder array, adder tree, array adder, tree adder. Links mark "based on component" and "related component".]

Legend:
  HA: half-adder
  FA: full-adder
  (m,k): (m,k)-counter
  (m,2): (m,2)-compressor
  CPA: carry-propagate adder
  RCA: ripple-carry adder
  CSKA: carry-skip adder
  CSLA: carry-select adder
  CIA: carry-increment adder
  CLA: carry-lookahead adder
  PPA: parallel-prefix adder
  COSA: conditional-sum adder
  CSA: carry-save adder



4.2 1-Bit Adders, (m, k)-Counters

Add up $m$ bits of same magnitude (i.e. 1-bit numbers)

Output sum as $k$-bit number ($k = \lfloor \log m \rfloor + 1$)

or: count 1's at inputs ⇒ (m, k)-counter [3] (combinational counters)

Half-adder (HA), (2, 2)-counter

  $(c_{out}, s) = 2 c_{out} + s = a + b$

  $A = 3$, $T = 2$ $(1)$

  $s = a \oplus b$ (sum)
  $c_{out} = a b$ (carry-out)

  [Figures hasym.epsi, haschema1.epsi (reference), haschema2.epsi: half-adder symbol and two gate-level schematics with inputs $a$, $b$ and outputs $c_{out}$, $s$]



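Full-adder (FA), (3, 2)-counter

  $(c_{out}, s) = 2 c_{out} + s = a + b + c_{in}$

  $A = 7$, $T = 4$ $(2)$

  $g = a b$ (generate)
  $p = a \oplus b$ (propagate)
  $c^0 = a b$, $c^1 = a + b$ (carry-out for $c_{in} = 0$ and $c_{in} = 1$)

  $s = a \oplus b \oplus c_{in} = p \oplus c_{in}$

  $c_{out} = a b + a c_{in} + b c_{in} = a b + (a + b) c_{in} = a b + (a \oplus b) c_{in}$
  $\phantom{c_{out}} = g + p c_{in} = \bar{p} g + p c_{in} = \bar{c}_{in} c^0 + c_{in} c^1$

  [Figures fasymbol.epsi and faschematic2/3/5.epsi: full-adder symbol and gate-level schematics with inputs $a$, $b$, $c_{in}$ and outputs $c_{out}$, $s$: composition from two half-adders with internal signals $g$ and $p$ (one marked as reference), a multiplexer-based version controlled by $p$, and a version selecting between $c^0$ and $c^1$]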



(m, k)-counters

  $(s_{k-1}, \ldots, s_0) = \sum_{j=0}^{k-1} s_j 2^j = \sum_{i=0}^{m-1} a_i$

  [Figure cntsymbol.epsi: (m,k)-counter symbol with inputs $a_{m-1} \ldots a_0$ and outputs $s_{k-1} \ldots s_0$]

Usually built from full-adders

Associativity of addition allows conversion from linear to tree structure ⇒ faster at same number of FAs

7 log &

1

28

7'

log )

4 2

log

4

log3 !

2

log

Example: (7, 3)-counter

  linear structure: $A = 28$, $T = 14$
  [Figure count73ser.epsi: four FAs in linear arrangement, inputs $a_0 \ldots a_6$, outputs $s_0$, $s_1$, $s_2$]

  tree structure: $A = 28$, $T = 10$
  [Figure count73par.epsi: four FAs in tree arrangement, inputs $a_0 \ldots a_6$, outputs $s_0$, $s_1$, $s_2$]
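
A (7, 3)-counter built from four full-adders can be checked exhaustively in Python. The wiring below is one plausible tree arrangement and is an assumption, since the figure itself is not available here; the function names are hypothetical.

```python
from itertools import product

def full_adder(a, b, c):
    """(3, 2)-counter: returns (carry-out, sum)."""
    s = a ^ b ^ c
    c_out = (a & b) | (a & c) | (b & c)
    return c_out, s

def counter_7_3(a):
    """Count the ones among 7 input bits using four full-adders."""
    c1, t1 = full_adder(a[0], a[1], a[2])   # first level, two FAs in parallel
    c2, t2 = full_adder(a[3], a[4], a[5])
    c3, s0 = full_adder(t1, t2, a[6])       # weight-1 bits give s0
    s2, s1 = full_adder(c1, c2, c3)         # weight-2 carries give s1, s2
    return s2, s1, s0

if __name__ == "__main__":
    for bits in product((0, 1), repeat=7):
        s2, s1, s0 = counter_7_3(bits)
        assert 4 * s2 + 2 * s1 + s0 == sum(bits)
    print("all 128 input patterns counted correctly")
```

Four full-adders at 7 gate-equivalents each give the area of 28 quoted for both structures; only the arrangement, and with it the delay, differs.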



4.3 Carry-Propagate Adders (CPA)

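Add two $n$-bit operands $A$ and $B$ and an optional carry-in $c_{in}$ by performing carry-propagation [1, 2, 11]

Sum $(c_{out}, S)$ is irredundant $(n + 1)$-bit number

  $(c_{out}, S) = 2^n c_{out} + S = A + B + c_{in}$

  $2 c_{i+1} + s_i = a_i + b_i + c_i$ ; $i = 0, 1, \ldots, n-1$ ; $c_0 = c_{in}$, $c_{out} = c_n$ (r.m.a.)

  [Figure cpasymbol.epsi: CPA symbol with inputs $A$, $B$, $c_{in}$ and outputs $S$, $c_{out}$]

Ripple-carry adder (RCA)

Serial arrangement of $n$ full-adders

Simplest, smallest, and slowest CPA structure

  $A = 7n$, $T = 2n$, $AT = 14 n^2$

  [Figure rca.epsi: chain of $n$ full-adders with inputs $a_i$, $b_i$, outputs $s_i$, and carries $c_{in}, c_1, c_2, \ldots, c_{n-1}, c_{out}$]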
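
The carry recursion of the ripple-carry adder can be written out bit by bit. This Python sketch is only a behavioural model with hypothetical names; each loop iteration corresponds to one full-adder in the chain.

```python
def ripple_carry_add(a, b, c_in=0, n=8):
    """n-bit ripple-carry addition; returns (c_out, sum)."""
    s, c = 0, c_in
    for i in range(n):                       # one full-adder per bit position
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s |= (ai ^ bi ^ c) << i              # s_i = a_i xor b_i xor c_i
        c = (ai & bi) | ((ai ^ bi) & c)      # c_{i+1} = g_i + p_i c_i
    return c, s

if __name__ == "__main__":
    print(ripple_carry_add(200, 100))        # (1, 44) since 300 = 256 + 44
```

The carry variable is updated sequentially, so the last sum bit depends on all lower positions, which is where the delay proportional to n comes from.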



Carry-propagation speed-up techniques

a) Concatenation of partial CPAs with fast $c_{in} \to c_{out}$

  [Figure speedup1.epsi: concatenated partial CPAs for the bit groups $k-1{:}0$, $i-1{:}k$, ..., $n-1{:}j$, linked by the carries $c_k$, $c_i$, $c_j$]

b) Fast carry look-ahead logic for entire range of bits

  [Figure speedup2.epsi: preprocessing of $a_i$, $b_i$, carry propagation from $c_{in}$ to $c_{out}$, postprocessing to $s_i$]


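Carry-skip adder (CSKA)

Type a): partial CPA with fast $c_k \to c_i$

  $c_i = \bar{P}_{i-1:k} \, c'_i + P_{i-1:k} \, c_k$ (bit group $(a_{i-1}, \ldots, a_k)$, $c'_i$: carry-out of partial CPA)

  $P_{i-1:k} = p_{i-1} p_{i-2} \cdots p_k$ (group propagate)

  1) $P_{i-1:k} = 0$: $c_k$ does not reach $c'_i$, and $c'_i$ is selected ($c'_i \to c_i$)
  2) $P_{i-1:k} = 1$: $c_k$ reaches $c'_i$, but $c'_i$ is skipped ($c_k \to c_i$)

  ⇒ path $c_k \to c'_i \to c_i$ never sensitized ⇒ fast

  ⇒ false path ⇒ inherent logic redundancy ⇒ problems in circuit optimization, timing analysis, and testing

Variable group sizes (faster): larger groups in the middle (minimizes the delays from the LSB groups into the skip chain and from the skip chain into the MSB groups)

Partial CPA typ. is RCA or CSKA (⇒ multilevel CSKA)

Medium speed-up at small hardware overhead (+ AND/bit + MUX/group)

  $A = 8n$, $T = 4 n^{1/2}$, $AT = 32 n^{3/2}$

  [Figure cska.epsi: partial CPAs for the bit groups $k-1{:}0$, $i-1{:}k$, ..., $n-1{:}j$; a multiplexer controlled by $P_{i-1:k}$ selects between $c'_i$ and $c_k$ to form $c_i$]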



Carry-select adder (CSLA)

Type a): partial CPA with fast $c_k \to c_i$ and $c_k \to s_{i-1:k}$

  $s_{i-1:k} = \bar{c}_k \, s^0_{i-1:k} + c_k \, s^1_{i-1:k}$

  $c_i = \bar{c}_k \, c^0_i + c_k \, c^1_i$

Two CPAs compute two possible results ($c_{in} = 0/1$), group carry-in selects correct one afterwards

Variable group sizes (faster): larger groups at end (MSB) (balances the delays of the incoming group carry and of the partial CPAs)

Part. CPA typ. is RCA, CSLA (⇒ multil. CSLA), or CLA

High speed-up at high hardware overhead (+ MUX/bit + (CPA + MUX)/group)

  $A = 14n$, $T = 2.8 n^{1/2}$, $AT = 39 n^{3/2}$

  [Figure csla.epsi: for bit group $i-1{:}k$, two CPAs with carry-in 0 and 1 produce $s^0_{i-1:k}$, $c^0_i$ and $s^1_{i-1:k}$, $c^1_i$; multiplexers controlled by $c_k$ select $s_{i-1:k}$ and $c_i$]
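
The carry-select principle can be modelled behaviourally in Python. This is a sketch under assumptions: the partial CPA is modelled with integer addition, the group sizes are passed in LSB first, and all names are hypothetical.

```python
def partial_cpa(a, b, c_in, size):
    """Behavioural model of a partial CPA over one bit group."""
    total = a + b + c_in
    return total >> size, total & ((1 << size) - 1)

def carry_select_add(a, b, c_in, group_sizes):
    """Carry-select addition; group_sizes lists the group widths, LSB first."""
    s, c, k = 0, c_in, 0
    for size in group_sizes:
        mask = (1 << size) - 1
        ga, gb = (a >> k) & mask, (b >> k) & mask
        c0, s0 = partial_cpa(ga, gb, 0, size)   # result assuming carry-in 0
        c1, s1 = partial_cpa(ga, gb, 1, size)   # result assuming carry-in 1
        c, s_group = (c1, s1) if c else (c0, s0)  # group carry-in selects
        s |= s_group << k
        k += size
    return c, s

if __name__ == "__main__":
    print(carry_select_add(0b10110111, 0b01101001, 1, (2, 3, 3)))
```

In hardware both partial results are available before the group carry arrives, so only one multiplexer delay per group remains on the carry path.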



Carry-increment adder (CIA)

Type a): partial CPA with fast $c_k \to c_i$ and $c_k \to s_{i-1:k}$

  $s_{i-1:k} = s'_{i-1:k} + c_k$

  $c_i = c'_i + P_{i-1:k} \, c_k$

  $P_{i-1:k} = p_{i-1} p_{i-2} \cdots p_k$ (group propagate)

Result is incremented after addition, if $c_k = 1$ [12, 11]

Variable group sizes (faster): larger groups at end (MSB) (balances the delays of the incoming group carry and of the partial CPAs)

Part. CPA typ. is RCA, CIA (⇒ multilevel CIA) or CLA

High speed-up at medium hardware overhead (+ AND/bit + (incrementer + AND-OR)/group)

Logic of CPA and incrementer can be merged [11]

  $A = 10n$, $T = 2.8 n^{1/2}$, $AT = 28 n^{3/2}$

  [Figure cia.epsi: for bit group $i-1{:}k$, a CPA with carry-in 0 produces $s'_{i-1:k}$, $c'_i$ and $P_{i-1:k}$; an incrementer (+1) controlled by $c_k$ produces $s_{i-1:k}$, and AND-OR logic produces $c_i$]



Example: gate-level schematic of carry-increment adder (CIA)

only 2 different logic cells (bit-slices): IHA and IFA

  delay $T$          4   6   10  12  14  16  18  20  22  24  26  28  ...  38
  max. group size            2   3   4   5   6   7   8   9   10  11  ...  16
  word length $n$    1   2   4   7   11  16  22  29  37  46  56  67  ...  137

  [Figure ciagate.epsi: gate-level schematic with inputs $a_i$, $b_i$, outputs $s_i$, carries $c_{in}$, $c_k$, $c_i$, $c_{out}$. Bit 0: IHA; bit 1: IHA; bits 3,2: IFA + IHA; bits 6...4: 2 IFA + IHA; bits $i-1 \ldots k$: $(i-k-1)$ IFA + IHA]


Conditional-sum adder (COSA)

Type a): optimized multilevel CSLA with $(\log n)$ levels (i.e. double CPAs are merged at higher levels)

Correct sum bits ($s^0_{i-1:k}$ or $s^1_{i-1:k}$) are (conditionally) selected through $(\log n)$ levels of multiplexers

Bit groups of size $2^l$ at level $l$

Higher parallelism, more balanced signal paths

Highest speed-up at highest hardware overhead (2 RCA + more than $(\log n)$ MUX/bit)

  $A = 3 n \log n$, $T = 2 \log n$, $AT = 6 n \log^2 n$

  [Figure cosa.epsi: 4-bit conditional-sum adder; pairs of FAs with carry-in 0 and 1 for bits 1 to 3, followed by multiplexer levels 0, 1, 2 selecting $s_0 \ldots s_3$ and $c_{out}$]



Carry-lookahead adder (CLA), traditional

Type b): carries looked ahead before sum bits computed

Typically 4-bit blocks used (e.g. standard IC SN74181)

  $c_1 = g_0 + p_0 c_0$
  $c_2 = g_1 + p_1 g_0 + p_1 p_0 c_0$
  $c_3 = g_2 + p_2 g_1 + p_2 p_1 g_0 + p_2 p_1 p_0 c_0$
  $G'_3 = g_3 + p_3 g_2 + p_3 p_2 g_1 + p_3 p_2 p_1 g_0$
  $P'_3 = p_3 p_2 p_1 p_0$

  [Figure clbsymbol.epsi: carry-lookahead block (CLB) with inputs $(g_3, p_3) \ldots (g_0, p_0)$ and $c_0$, outputs $c_3 \ldots c_0$ and the block signals $(G'_3, P'_3)$]

Hierarchical arrangement using $(\frac{1}{2} \log n)$ levels: $(G'_3, P'_3)$ passed up, $c_0$ passed down between levels

High speed-up at medium hardware overhead

  $A = 14n$, $T = 4 \log n$, $AT = 56 n \log n$

  [Figure cla.epsi: 16-bit CLA from four first-level CLBs (bits 3...0, 7...4, 11...8, 15...12) and one second-level CLB; block signals for bits 3, 7, 11, 15 are passed up, carries $c_0$, $c_4$, $c_8$, $c_{12}$ are passed down]

  + preprocessing: $g_i = a_i b_i$, $p_i = a_i \oplus b_i$

  + postprocessing: $s_i = p_i \oplus c_i$
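
The lookahead equations of a 4-bit block can be checked against ordinary addition with a small Python model. The function names are hypothetical; the preprocessing and postprocessing steps use the usual generate, propagate, and sum definitions.

```python
def cla_block(g, p, c0):
    """4-bit carry-lookahead block; g and p are lists indexed from the LSB."""
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    group_g = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    group_p = p[3] & p[2] & p[1] & p[0]
    return [c0, c1, c2, c3], group_g, group_p

def cla_add4(a, b, c0=0):
    """4-bit addition built around one lookahead block; returns (c4, sum)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]        # preprocessing
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]
    carries, group_g, group_p = cla_block(g, p, c0)
    s = sum((p[i] ^ carries[i]) << i for i in range(4))     # postprocessing
    return group_g | (group_p & c0), s

if __name__ == "__main__":
    print(cla_add4(0b1011, 0b0110, 1))   # (1, 2) since 11 + 6 + 1 = 18
```

All carries are two-level functions of the block inputs, so none of them waits for a neighbouring carry; the group signals are what a second-level block would consume.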



Parallel-prefix adders (PPA)

Type b): universal adder architecture comprising RCA, CIA, CLA, and more (i.e. entire range of area-delay trade-offs from slowest RCA to fastest CLA)

Preprocessing, carry-lookahead, and postprocessing step

Carries calculated using parallel-prefix algorithms

+ High regularity: suitable for synthesis and layout

+ High flexibility: special adders, other arithmetic operations, exchangeable prefix algorithms (i.e. speeds)

+ High performance: smallest and fastest adders

  $A = 5n + 3 A_\bullet$, $T = 4 + 2 T_\bullet$

  [Figure add.epsi: parallel-prefix adder structure with inputs $a_{n-1} \ldots a_0$, $b_{n-1} \ldots b_0$, $c_{in}$ and outputs $s_{n-1} \ldots s_0$, $c_{out}$.
   preprocessing: $(g_i, p_i)$ from $a_i$, $b_i$;
   carry-lookahead: prefix algorithm producing $c_1 \ldots c_n$ from $(g_{n-1}, p_{n-1}) \ldots (g_0, p_0)$ and $c_0$;
   postprocessing: $s_i$ from $p_i$ and $c_i$]



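Prefix problem

Inputs $(x_{n-1}, \ldots, x_0)$, outputs $(y_{n-1}, \ldots, y_0)$, associative binary operator $\bullet$ [11, 13]

  $(y_{n-1}, \ldots, y_0) = (x_{n-1} \bullet \cdots \bullet x_0, \; \ldots, \; x_1 \bullet x_0, \; x_0)$
  or
  $y_0 = x_0$, $y_i = x_i \bullet y_{i-1}$ ; $i = 1, \ldots, n-1$ (r.m.a.)

Associativity of $\bullet$ ⇒ tree structures for evaluation:

  $x_3 \bullet (x_2 \bullet (x_1 \bullet x_0))$ with $Y^1_{1:0} = x_1 \bullet x_0$, $Y^2_{2:0} = x_2 \bullet Y^1_{1:0}$, $Y^3_{3:0} = x_3 \bullet Y^2_{2:0}$
  $= (x_3 \bullet x_2) \bullet (x_1 \bullet x_0)$ with $Y^1_{3:2} = x_3 \bullet x_2$, $Y^1_{1:0} = x_1 \bullet x_0$, $Y^2_{3:0} = Y^1_{3:2} \bullet Y^1_{1:0}$, but $y_2$ ?

Group variables $Y^l_{i:k}$: covers bits $(x_k, \ldots, x_i)$ at level $l$

Carry-propagation is prefix problem: $Y^l_{i:k} = (G^l_{i:k}, P^l_{i:k})$

  $(G^0_{i:i}, P^0_{i:i}) = (g_i, p_i)$

  $(G^l_{i:k}, P^l_{i:k}) = (G^{l-1}_{i:j+1}, P^{l-1}_{i:j+1}) \bullet (G^{l-1}_{j:k}, P^{l-1}_{j:k})$ ; $k \le j < i$
  $\phantom{(G^l_{i:k}, P^l_{i:k})} = (G^{l-1}_{i:j+1} + P^{l-1}_{i:j+1} G^{l-1}_{j:k}, \; P^{l-1}_{i:j+1} P^{l-1}_{j:k})$

  $c_{i+1} = G^m_{i:0}$ ; $i = 0, \ldots, n-1$ (for $c_0 = 0$, or with $c_0$ merged into $g_0$)

Parallel-prefix algorithms [11]:
  multi-tree structures ($A = O(n^2)$, $T = O(\log n)$)
  ⇒ sharing subtrees ($A = O(n \log n)$, $T = O(\log n)$)
  different algorithms trading area vs. delay (influences also from wiring and maximum fan-out $FO_{max}$)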
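
The prefix operator on (generate, propagate) pairs and its associativity can be made concrete in Python. This is a sketch with hypothetical names; it evaluates the carries serially, and assumes a carry-in of 0 for simplicity.

```python
from itertools import product

def dot(hi, lo):
    """Prefix operator: combine the (G, P) pair of the upper group with the lower one."""
    (g_hi, p_hi), (g_lo, p_lo) = hi, lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)

def carries_serial(a, b, n):
    """Carries c_0 ... c_n of a + b (carry-in 0) via a serial prefix evaluation."""
    gp = [((a >> i) & (b >> i) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)]
    acc = gp[0]
    carries = [0, acc[0]]                # c_0 = 0, c_1 = G_{0:0}
    for i in range(1, n):
        acc = dot(gp[i], acc)            # (G_{i:0}, P_{i:0})
        carries.append(acc[0])           # c_{i+1} = G_{i:0}
    return carries

if __name__ == "__main__":
    pairs = list(product((0, 1), repeat=2))
    assert all(dot(x, dot(y, z)) == dot(dot(x, y), z)
               for x in pairs for y in pairs for z in pairs)
    print(carries_serial(0b1011, 0b0110, 4))   # [0, 0, 1, 1, 1]
```

Because the operator is associative, the same group results may be combined in any bracketing, which is what allows the tree-shaped evaluations of the parallel-prefix algorithms.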





Prefix algorithms

Algorithms visualized by directed acyclic graphs (DAG) with array structure ($n$ bits × $m$ levels)

Graph vertex symbols:
  black node (•): computes $(G^l_{i:k}, P^l_{i:k})$ from $(G^{l-1}_{i:j+1}, P^{l-1}_{i:j+1})$ and $(G^{l-1}_{j:k}, P^{l-1}_{j:k})$ (contains logic for $\bullet$)
  white node: passes $(G^{l-1}_{i:k}, P^{l-1}_{i:k})$ on as $(G^l_{i:k}, P^l_{i:k})$ (contains no logic)

Performance measures:
  $A_\bullet$: graph size (number of black nodes)
  $T_\bullet$: graph depth (number of black nodes on critical path)

Serial-prefix algorithm (≡ RCA)

  $A_\bullet = n - 1$, $T_\bullet = n - 1$, $FO_{max} = 2$

  [Figure ser.epsi: serial-prefix graph for 16 bits (15 ... 0), levels 0 to 15]



Sklansky parallel-prefix algorithm (⇒ PPA-SK)

Tree-like collection, parallel redistribution of carries

  $A_\bullet = \frac{1}{2} n \log n$, $T_\bullet = \lceil \log n \rceil$, $FO_{max} = \frac{1}{2} n$

  [Figure sk.epsi: Sklansky prefix graph for 16 bits (15 ... 0), levels 0 to 4]

Brent-Kung parallel-prefix algorithm (⇒ PPA-BK)

Traditional CLA is PPA-BK with 4-bit groups

Tree-like redistribution of carries (fan-out tree)

  $A_\bullet = 2n - \lceil \log n \rceil - 2$, $T_\bullet = 2 \lceil \log n \rceil - 2$, $FO_{max} = \log n$

  [Figure bk.epsi: Brent-Kung prefix graph for 16 bits (15 ... 0), levels 0 to 6]



Kogge-Stone parallel-prefix algorithm (⇒ PPA-KS)

very high wiring requirements

  $A_\bullet = n \log n - n + 1$, $T_\bullet = \lceil \log n \rceil$, $FO_{max} = 2$

  [Figure ks.epsi: Kogge-Stone prefix graph for 16 bits (15 ... 0), levels 0 to 4]
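
A Kogge-Stone carry network can be modelled level by level in Python. The sketch below is an illustration with hypothetical names and a carry-in of 0; it also counts the prefix operations so that the graph size n log n - n + 1 and depth log n can be checked for a power-of-two word length.

```python
def kogge_stone_add(a, b, n):
    """n-bit addition (carry-in 0) with a Kogge-Stone prefix network.

    Returns (c_out, sum, number of prefix operations, number of levels).
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]        # preprocessing
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    big_g, big_p = g[:], p[:]
    dist, nodes, levels = 1, 0, 0
    while dist < n:
        new_g, new_p = big_g[:], big_p[:]
        for i in range(dist, n):                            # one black node each
            new_g[i] = big_g[i] | (big_p[i] & big_g[i - dist])
            new_p[i] = big_p[i] & big_p[i - dist]
            nodes += 1
        big_g, big_p = new_g, new_p
        dist *= 2
        levels += 1
    carries = [0] + big_g                                   # c_{i+1} = G_{i:0}
    s = sum((p[i] ^ carries[i]) << i for i in range(n))     # postprocessing
    return carries[n], s, nodes, levels

if __name__ == "__main__":
    print(kogge_stone_add(40000, 30000, 16))
```

Every bit position combines with a partner at distance 1, 2, 4, ... per level, so each node drives at most two successors, but the long connections are the source of the high wiring requirements.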

Carry-increment parallel-prefix algorithm (≡ CIA)

  $A_\bullet = 2n - 1.4 n^{1/2}$, $T_\bullet = 1.4 n^{1/2}$, $FO_{max} = 1.4 n^{1/2}$

  [Figure cia.epsi: carry-increment prefix graph for 16 bits (15 ... 0), levels 0 to 5]



Mixed serial/parallel-prefix algorithm (⇒ RCA + PPA)

linear size-depth trade-off using parameter $k$:
  $0 \le k \le n - 2 \lceil \log n \rceil + 2$
  $k = 0$: serial-prefix graph
  $k$ at its upper bound: Brent-Kung parallel-prefix graph

fills gap between RCA and PPA-BK (i.e. CLA) in steps of single $\bullet$-operations

  $A_\bullet = n - 1 + k$, $T_\bullet = n - 1 - k$, $FO_{max}$ = var.

  [Figure var.epsi: mixed serial/parallel-prefix graph for 16 bits (15 ... 0), levels 0 to 10]


Example: 4-bit parallel-prefix adder (PPA-SK)

efficient AND-OR-prefix circuit for the generate and AND-prefix circuit for the propagate signals

optimization: alternatingly AOI-/OAI- resp. NAND-/NOR-gates (inverting gates are smaller and faster)

can also be realized using two MUX-prefix circuits

  [Figure askgate.epsi: gate-level schematic with inputs $a_3 \ldots a_0$, $b_3 \ldots b_0$, $c_{in}$ and outputs $s_3 \ldots s_0$, $c_{out}$, $P_{n-1:0}$]



Prefix adder synthesis

Local prefix graph transformation:

  [Figures unfact.epsi and fact.epsi: a local 4-bit prefix graph before and after transformation; the depth-decreasing transform and the size-decreasing transform convert one form into the other]

Repeated (local) prefix transformations result in overall minimization of graph depth or size ⇒ which sequence?

Goal: minimal size (area) at given depth (delay)

Simple algorithm for sequence of applied transforms:

  Step 1: prefix graph compression (depth minimization): depth-decr. transforms in right-to-left bottom-up order

  Step 2: prefix graph expansion (size minimization): size-decreasing transforms in left-to-right top-down order, if allowed depth not exceeded

Prefix adder synthesis: 1) generate serial-prefix graph, 2) graph compression, 3) depth-controlled graph expansion, 4) generate pre-/postprocessing and prefix logic

+ Generates all previous prefix graphs (except PPA-KS)

+ Universal adder synthesis algorithm: generates area-optimal adders for any given timing constraints [11] (including non-uniform signal arrival times)



Multilevel adders

Multilevel versions of adders of type a) possible (CSKA, CSLA, and CIA; notation: 2-level CIA = CIA-2L)

+ Delay is $O(n^{1/(m+1)})$ for $m$ levels

- Area increase small for CSKA and CIA, high for CSLA (⇒ COSA)

- Difficult computation of optimal group sizes

Hybrid adders

Arbitrary combinations of speed-up techniques possible ⇒ hybrid/mixed adder architectures

Often used combinations: CLA and CSLA [14]

- Pure architectures usually perform best (at gate-level)

Transistor-level adders

Influence of logic styles (e.g. dynamic logic, pass-transistor logic ⇒ faster)

+ Efficient transistor-level implementation of ripple-carry chains (Manchester chain) [14]

+ Combinations of speed-up techniques make sense

- Much higher design effort

Many efficient implementations exist and have been published



Self-timed adders

Average carry-propagation length: $\log n$

+ RCA is fast in average case ($T = O(\log n)$), slow in worst case ⇒ suitable for self-timed asynchronous designs [15]

- Completion detection is not trivial

Adder performance comparisons

Standard-cell implementations, 0.8 µm process

  [Figure addperf.ps: area [lambda^2] versus delay [ns] on logarithmic axes for RCA, CSKA-2L, CIA-1L, CIA-2L, PPA-SK, PPA-BK, CLA, and COSA at word lengths of 8, 16, 32, 64, and 128 bits, with a line of constant AT for reference]


Complexity comparison under the unit-gate model

  adder      A              T              AT                 opt. (1)   syn. (2)
  RCA        7n             2n             14n^2              aaa        yes
  CSKA-1L    8n             4n^(1/2)       32n^(3/2)          aat        (3)
  CSKA-2L    8n             x n^(1/3) (4)  x n^(4/3) (4)
  CSLA-1L    14n            2.8n^(1/2)     39n^(3/2)
  CIA-1L     10n            2.8n^(1/2)     28n^(3/2)          att        yes
  CIA-2L     10n            3.6n^(1/3)     36n^(4/3)          att        yes
  CIA-3L     10n            4.4n^(1/4)     44n^(5/4)                     yes
  PPA-SK     3/2 n log n    2 log n        3n log^2 n         ttt        yes
  PPA-BK     10n            4 log n        40n log n          att        yes
  PPA-KS     3n log n       2 log n        6n log^2 n
  CLA (5)    14n            4 log n        56n log n                     (yes)
  COSA       3n log n       2 log n        6n log^2 n

  (1) optimality regarding area and delay
      aaa: smallest area, longest delay
      aat: small area, medium delay
      att: medium area, short delay
      ttt: large area, shortest delay
      no entry: not optimal
  (2) obtained from prefix adder synthesis
  (3) automatic logic optimization not possible (redundancy)
  (4) exact factors not calculated
  (5) corresponds to 4-bit PPA-BK


4 Addition 4.4 Carry-Save Adder (CSA)

4.4 Carry-Save Adder (CSA)

a) Adds three $n$-bit operands $A_0$, $A_1$, $A_2$ performing no carry-propagation (i.e. carries are saved) [1]

$(C, S) = A_0 + A_1 + A_2$

$2c_{i+1} + s_i = a_{0,i} + a_{1,i} + a_{2,i}$ ; $i = 0, 1, \ldots, n-1$ (n.)

[Figure csasymbol: CSA symbol with inputs A0, A1, A2 and outputs C, S]

b) Adds one $n$-bit operand to an $n$-digit carry-save operand

$(C', S') = (C, S) + A$

Result is in redundant carry-save format ($n$ digits), represented by two $n$-bit numbers $S$ (sum bits) and $C$ (carry bits)

+ Parallel arrangement of $n$ full-adders, constant delay

$A = 7n$, $T = 4$

[Figure csa: row of n full-adders; slice i takes a_{0,i}, a_{1,i}, a_{2,i} and produces s_i and c_{i+1}]

Multi-operand carry-save adders ($m > 3$) ⇒ adder array (linear arrangement), adder tree (tree arr.)
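The carry-save relation $2c_{i+1} + s_i = a_{0,i} + a_{1,i} + a_{2,i}$ can be checked with a small word-level sketch. The function name `carry_save_add` and the use of Python integers as bit vectors are illustrative assumptions, not part of the notes.

```python
def carry_save_add(a0: int, a1: int, a2: int, n: int) -> tuple[int, int]:
    """Return (C, S) such that C + S == a0 + a1 + a2, without carry propagation.

    Each bit position acts as an independent full-adder: the sum bit stays
    at weight 2^i, the carry bit is stored at weight 2^(i+1).
    """
    mask = (1 << n) - 1
    a0, a1, a2 = a0 & mask, a1 & mask, a2 & mask
    s = a0 ^ a1 ^ a2                                # sum bits s_i
    c = ((a0 & a1) | (a0 & a2) | (a1 & a2)) << 1    # carry bits c_{i+1}
    return c, s
```

A final carry-propagate addition `C + S` converts the carry-save pair back to an irredundant number.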



4.5 Multi-Operand Adders

Add three or more ($m > 2$) $n$-bit operands, yield $(n + \lceil \log m \rceil)$-bit result in irredundant number rep. [1, 2]

Array adders

Realization by array adders : (see figures below)

a) linear arrangement of CPAs
b) linear arr. of CSAs (adder array) and final CPA

a) and b) differ in bit arrival times at final CPA :
if CPA = RCA : a) and b) have same overall delay
if fast final CPA : uniform bit arrival times required ⇒ CSA array (b)

Fast implementation : CSA array + fast final CPA
(note: array of fast CPAs not efficient/necessary)

$A = A_{CSA}(m-2) + A_{CPA}$, $T = T_{CSA}(m-2) + T_{CPA}$

CPA = RCA : $A = O(mn)$, $T = O(m + n)$

Fast CPA : $A = O(mn + n\log n)$, $T = O(m + \log n)$

[Figure mopadd: multi-operand adder as a linear chain of CSAs for operands A0 ... Am-1, followed by a final CPA producing S]

a) 4-operand CPA (RCA) array :

[Figure cparray: three rows of ripple-carry adders (FA/HA cells) adding the bit slices of a_0 ... a_3, outputs s_n ... s_0]

b) 4-operand CSA array with final CPA (RCA) :

[Figure csarray: two CSA rows (FA/HA cells) followed by a ripple-carry CPA, outputs s_n ... s_0]





(m, 2)-compressors

$2\left(c + \sum_{l=0}^{m-4} c_{out}^{l}\right) + s = \sum_{i=0}^{m-1} a_i + \sum_{l=0}^{m-4} c_{in}^{l}$

[Figure cprsymbol: (m,2)-compressor symbol with inputs a_0 ... a_{m-1}, carry inputs c_in^0 ... c_in^{m-4}, carry outputs c_out^0 ... c_out^{m-4}, and outputs c, s]

1-bit adders (similar to (m,k)-counters) [16]
Compresses $m$ bits down to 2 by forwarding $(m - 3)$ intermediate carries to next higher bit position
Is bit-slice of multi-operand CSA array (see above)

+ No horizontal carry-propagation (i.e. $c_{in}^{k} \rightarrow c_{out}^{l}$ only for $k < l$)

Built from full-adders (= (3,2)-compressor) or (4,2)-compressors arranged in linear or tree structures

Example : 4-operand adder using (4,2)-compressors

[Figure cpradd: row of (4,2)-compressors forming the CSA part for operands a_0 ... a_3, followed by a ripple CPA (FA/HA cells), outputs s_{n+1} ... s_0]

$A = 7(m-2)$, $T = 4(m-2)$ (linear structure), $T = 6(\lceil \log m \rceil - 1)$ (tree structure)

Optimized (4,2)-compressor :

2 full-adders merged and optimized (i.e. XORs arranged in tree structure)

with full-adders : $A = 14$, $T = 8$

[Figure cpr42fa: (4,2)-compressor built from two cascaded FAs, inputs a_0 ... a_3 and c_in, outputs c_out, c, s]

optimized : $A = 14$, $T = 6$

[Figure cpr42opt: optimized (4,2)-compressor with XOR tree and multiplexers, inputs a_0 ... a_3 and c_in, outputs c_out, c, s]

+ same area, 25% shorter delay

SD-FA (signed-digit full-adder) is similar to (4,2)-compressor regarding structure and complexity
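As a hedged illustration of the compressor equation for $m = 4$, the sketch below builds a (4,2)-compressor from two cascaded full-adders (the non-optimized variant) and checks $2(c + c_{out}) + s = a_0 + a_1 + a_2 + a_3 + c_{in}$ exhaustively. The function names are illustrative assumptions.

```python
from itertools import product


def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (carry, sum) of three bits."""
    return (a & b) | (a & c) | (b & c), a ^ b ^ c


def compressor_4_2(a0: int, a1: int, a2: int, a3: int, c_in: int) -> tuple[int, int, int]:
    """Return (c_out, c, s) of a (4,2)-compressor made of two full-adders."""
    c_out, t = full_adder(a0, a1, a2)   # c_out does not depend on c_in
    c, s = full_adder(t, a3, c_in)
    return c_out, c, s


def compressor_is_consistent() -> bool:
    """Check the weight equation for all 32 input combinations."""
    for a0, a1, a2, a3, c_in in product((0, 1), repeat=5):
        c_out, c, s = compressor_4_2(a0, a1, a2, a3, c_in)
        if 2 * (c + c_out) + s != a0 + a1 + a2 + a3 + c_in:
            return False
    return True
```

Because `c_out` is computed before `c_in` is consumed, a row of such cells has no horizontal carry ripple.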



Advantages of (4,2)-compressors over FAs for realizing (m,2)-compressors :

higher compression rate (4:2 instead of 3:2)
less deep and more regular trees

tree depth           0  1  2   3   4   5    6   7   8   9  10
# operands (FA)      2  3  4   6   9  13   19  28  42  63  94
# operands ((4,2))   2  4  8  16  32  64  128

Example : (8,2)-compressor

full-adder tree : $A = 42$, $T = 16$

[Figure cpr82fa: (8,2)-compressor as a tree of six FAs, inputs a_0 ... a_7, carry inputs c_in^0 ... c_in^4, carry outputs c_out^0 ... c_out^4, outputs c, s]

(4,2)-compressor tree : $A = 42$, $T = 12$

[Figure cpr82cpr42: (8,2)-compressor as a tree of three (4,2)-compressors, same inputs and outputs]



Tree adders (Wallace tree)

Adder tree : $n$-bit $m$-operand carry-save adder composed of $n$ tree-structured (m,2)-compressors [1, 17]

Tree adders : fastest multi-operand adders using an adder tree and a fast final CPA

$A = A_{(m,2)}\, n + A_{CPA} = O(mn + n\log n)$

$T = T_{(m,2)} + T_{CPA} = O(\log m + \log n)$

Adder arrays and adder trees revisited

Some FA can often be replaced by HA or eliminated (i.e. redundant due to constant inputs)

Number of (irredundant) FA does not depend on adder structure, but number of HA does

An $m$-operand adder accommodates $(m - 1)$ carry inputs

Adder trees ($T = O(\log m)$) are faster than adder arrays ($T = O(m)$) at same amount of gates ($A = O(mn)$)

Adder trees are less regular and have more complex routing than adder arrays ⇒ larger area, difficult layout (i.e. limited use in layout generators)
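The logarithmic depth of an adder tree can be made concrete with a small sketch that reduces $m$ operands level by level with 3:2 carry-save adders and counts the levels. This is a simplified word-level model under stated assumptions (non-negative Python integers as operands, Python's `+` standing in for the final CPA); `tree_add` is a hypothetical name.

```python
def csa(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 carry-save step: return (sum word, carry word)."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def tree_add(operands) -> tuple[int, int]:
    """Add non-negative integers with a tree of CSAs.

    Returns (result, number of CSA levels used before the final CPA).
    """
    ops = list(operands)
    levels = 0
    while len(ops) > 2:
        nxt = []
        # Compress as many groups of three operands as possible in parallel.
        for k in range(0, len(ops) - 2, 3):
            nxt.extend(csa(ops[k], ops[k + 1], ops[k + 2]))
        # Operands that do not fill a group pass through to the next level.
        nxt.extend(ops[len(ops) - len(ops) % 3:])
        ops = nxt
        levels += 1
    return sum(ops), levels   # final carry-propagate addition
```

With 9 operands the sketch needs 4 levels and with 28 operands 7 levels, matching the FA row of the tree-depth table.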




4.6 Sequential Adders

Bit-serial adder : sequential $n$-bit adder, $n$ clock cycles per addition

[Figure bitseradd: bit-serial adder built from one FA with inputs a_i, b_i and output s_i]

Accumulators : sequential $m$-operand adders

With CPA

[Figure accucpa: accumulator with CPA, input A, output S]

With CSA and final CPA
Allows higher clock rates
Final CPA too slow ⇒ pipelining or multiple cycles for evaluation

[Figure accucsa: accumulator with CSA in the loop and final CPA, input A, output S]

Mixed CSA/CPA : CSA with partial CPAs (i.e. fewer carries saved), trade-off between speed and register size



5 Simple / Addition-Based Operations

5.1 Complement and Subtraction

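2's complementer (negation)

$-A = \bar{A} + 1$

[Figure neg: negator built from inverted input A and incrementer (+1), output Z]

2's complement subtractor

$A - B = A + \bar{B} + 1$

[Figure sub: CPA with inputs A and inverted B, carry-in 1, outputs S and c_out]

2's complement adder/subtractor

$A \pm B = A + (-1)^{sub} B = A + (B \oplus sub) + sub$

[Figure addsub: CPA with inputs A and B XOR sub, carry-in sub, outputs S and c_out]

1's complement adder

$A + B \pmod{2^n - 1}$ (end-around carry)

[Figure addmod: CPA with c_out fed back to c_in, inputs A, B, output S]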
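The adder/subtractor and end-around-carry formulas can be sketched at word level as follows. The function names and the unsigned bit-vector encoding are illustrative assumptions.

```python
def add_sub(a: int, b: int, sub: int, n: int) -> tuple[int, int]:
    """2's complement adder/subtractor: A + (B xor sub) + sub.

    Returns (c_out, S) with S an n-bit vector.
    """
    mask = (1 << n) - 1
    b_x = (b ^ (mask if sub else 0)) & mask   # invert B when subtracting
    total = (a & mask) + b_x + sub            # sub doubles as carry-in
    return total >> n, total & mask


def ones_complement_add(a: int, b: int, n: int) -> int:
    """Addition modulo 2^n - 1 using an end-around carry."""
    mask = (1 << n) - 1
    total = (a & mask) + (b & mask)
    # Feed the carry-out back into the carry-in.
    return ((total & mask) + (total >> n)) & mask
```

For example, `add_sub(3, 5, 1, 4)` returns `(0, 14)`, and 14 is the 4-bit 2's complement encoding of -2. In the 1's complement adder the all-ones word is a second representation of zero.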



5.2 Increment / Decrement

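Incrementer

Adds a single bit $c_{in}$ to an $n$-bit operand $A$

$(c_{out}, Z) = c_{out} 2^n + Z = A + c_{in}$

$z_i = a_i \oplus c_i$, $c_{i+1} = a_i c_i$ ; $i = 0, 1, \ldots, n-1$ ; $c_0 = c_{in}$, $c_n = c_{out}$ (r.m.a.)

[Figure incsymbol: incrementer symbol (+1) with input A, output Z, c_in and c_out]

Corresponds to addition with $B = 0$ (⇒ FA → HA)

Example : Ripple-carry incrementer using half-adders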

3 2 2 1

3 2 2

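[Figure incfa: ripple-carry incrementer as a chain of half-adders, inputs a_0 ... a_{n-1}, outputs z_0 ... z_{n-1}, carries c_1 ... c_{n-1}, c_in and c_out]

or using incrementer slices (= half-adder)

[Figure inc: ripple-carry incrementer built from incrementer slices, inputs a_0 ... a_{n-1}, outputs z_0 ... z_{n-1}, c_in and c_out]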



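Prefix problem : $P_{i:k} = P_{i:j+1} P_{j:k}$ ⇒ AND-prefix struct.

$A = \frac{1}{2} n \log n + 2n$, $T = \lceil \log n \rceil + 2$, $AT \approx \frac{1}{2} n \log^2 n$

Decrementer

$(c_{out}, Z) = A - c_{in}$

[Figure dec: ripple-carry decrementer, inputs a_0 ... a_{n-1}, outputs z_0 ... z_{n-1}, c_in and c_out]

Incrementer-decrementer

$(c_{out}, Z) = A \pm c_{in} = A + (-1)^{dec} c_{in}$

[Figure incdec: ripple-carry incrementer-decrementer with control input dec, inputs a_0 ... a_{n-1}, outputs z_0 ... z_{n-1}, c_in and c_out]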



Fast incrementers

4-bit incrementer using multi-input gates :

[Figure inccg: 4-bit incrementer with multi-input AND gates, inputs a_3 ... a_0 and c_in, outputs z_3 ... z_0 and c_out]

8-bit parallel-prefix incrementer (Sklansky AND-prefix structure) :

[Figure incpp: 8-bit parallel-prefix incrementer, inputs a_7 ... a_0 and c_in, outputs z_7 ... z_0 and c_out]



Gray incrementer

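Increments in Gray number system

$c_0 = a_{n-1} \oplus a_{n-2} \oplus \cdots \oplus a_0$ (parity)

$c_{i+1} = \bar{a}_i c_i$ ; $i = 0, 1, \ldots, n-3$ (r.m.a.)

$z_0 = a_0 \oplus \bar{c}_0$

$z_i = a_i \oplus a_{i-1} c_{i-1}$ ; $i = 1, \ldots, n-2$

$z_{n-1} = a_{n-1} \oplus c_{n-2}$

Prefix problem ⇒ AND-prefix structure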
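The Gray incrementer equations can be exercised bit by bit. The sketch below is an illustration under stated assumptions: bit lists are LSB first, $n \ge 2$, and the helper `to_gray_bits` (binary-reflected Gray code) is only there to provide reference values.

```python
def gray_increment(a: list[int]) -> list[int]:
    """Increment an n-bit Gray code word (a[i] is bit i, n >= 2)."""
    n = len(a)
    c = [0] * (n - 1)
    for bit in a:                       # c_0 = parity of all bits
        c[0] ^= bit
    for i in range(n - 2):              # c_{i+1} = not(a_i) and c_i
        c[i + 1] = (1 - a[i]) & c[i]
    z = [0] * n
    z[0] = a[0] ^ (1 - c[0])            # flip LSB when parity is even
    for i in range(1, n - 1):
        z[i] = a[i] ^ (a[i - 1] & c[i - 1])
    z[n - 1] = a[n - 1] ^ c[n - 2]
    return z


def to_gray_bits(x: int, n: int) -> list[int]:
    """Binary-reflected Gray code of x as an LSB-first bit list."""
    g = x ^ (x >> 1)
    return [(g >> i) & 1 for i in range(n)]
```

Stepping through all $2^n$ code words with `gray_increment` visits the Gray sequence in order and wraps around from the last word to zero.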



5.3 Counting

Count clock cycles ⇒ counter, divide clock frequency ⇒ frequency divider ($c_{out}$)

Binary counter

Sequential in-/decrementer
Incrementer speed-up techniques applicable
Down- and up-down-counters using decrementers / incrementer-decrementers

[Figure cntblock: counter block with incrementer (+1) and register Q, signals c_in, c_out, clk]

Example : Ripple-carry up-counter using counter slices (= HA + FF), $c_{in}$ is count enable

[Figure cntripple: ripple-carry up-counter, outputs q_{n-1} ... q_0, c_in and c_out]

Asynchronous counter using toggle-flip-flops (lower toggle rate ⇒ lower power)

[Figure cntasync: chain of toggle flip-flops clocked by clk, outputs q_{n-1} ... q_0]



Fast divider ($T = O(1)$) using delayed-carry numbers (irredundant carry-save representation of 1 allows using fast carry-save incrementer) [8]

Gray counter

Counter using Gray incrementer

Ring counters

Shift register connected to ring :

[Figure cntring: shift register q_0 ... q_{n-1} with output fed back to input]

State is not encoded ⇒ $n$ FF for counting $n$ states
Must be initialized correctly (e.g. 00...01)
Applications :
fast dividers (no logic between FF)
state counter for one-hot coded FSMs

Johnson / twisted-ring counter (inverted feed-back) :

[Figure cntjohnson: shift register q_0 ... q_{n-1} with inverted output fed back to input]

$n$ FF for counting $2n$ states
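The state counts of the two ring-type counters can be reproduced with a short simulation. The list-based register model and the function names are illustrative assumptions.

```python
def ring_counter_states(n: int) -> list[tuple[int, ...]]:
    """States of an n-bit ring counter initialized to a one-hot pattern."""
    state = [1] + [0] * (n - 1)
    states = []
    for _ in range(n):
        states.append(tuple(state))
        state = [state[-1]] + state[:-1]        # plain feedback
    return states


def johnson_counter_states(n: int) -> list[tuple[int, ...]]:
    """States of an n-bit Johnson (twisted-ring) counter starting at all zeroes."""
    state = [0] * n
    states = []
    for _ in range(2 * n):
        states.append(tuple(state))
        state = [1 - state[-1]] + state[:-1]    # inverted feedback
    return states
```

With 4 flip-flops the ring counter cycles through 4 one-hot states, the Johnson counter through 8 states.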




5.4 Comparison, Coding, Detection

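Comparison operations

$EQ = (A = B)$ (equal)
$NE = (A \neq B) = \overline{EQ}$ (not equal)
$GE = (A \geq B)$ (greater or equal)
$LT = (A < B) = \overline{GE}$ (less than)
$GT = (A > B) = GE \cdot \overline{EQ}$ (greater than)
$LE = (A \leq B) = \overline{GE} + EQ$ (less or equal)

Equality comparison

$EQ = (A = B)$

$eq_{i+1} = (a_i = b_i)\, eq_i = \overline{(a_i \oplus b_i)}\, eq_i$ ; $i = 0, 1, \ldots, n-1$ ; $eq_0 = 1$, $EQ = eq_n$ (r.s.a.)

[Figure cmpeq: equality comparator with inputs a_{n-1} ... a_0, b_{n-1} ... b_0 and output EQ]

Magnitude comparison

$GE = (A \geq B)$

$ge_{i+1} = (a_i > b_i) + (a_i = b_i)\, ge_i = a_i \bar{b}_i + \overline{(a_i \oplus b_i)}\, ge_i$ ; $i = 0, 1, \ldots, n-1$ ; $ge_0 = 1$, $GE = ge_n$ (r.s.a.)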
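The two recurrences and the derived comparison flags can be checked with a ripple-style sketch that walks from the LSB to the MSB. The function name and the dictionary of flags are illustrative assumptions; operands are unsigned.

```python
def compare_flags(a: int, b: int, n: int) -> dict[str, int]:
    """Evaluate eq_{i+1} and ge_{i+1} bit by bit for unsigned n-bit operands."""
    eq, ge = 1, 1                                   # eq_0 = ge_0 = 1
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        same = 1 ^ (ai ^ bi)                        # a_i = b_i
        ge = (ai & (1 ^ bi)) | (same & ge)          # a_i > b_i, or equal and ge_i
        eq = same & eq
    return {
        "EQ": eq,
        "NE": 1 ^ eq,
        "GE": ge,
        "LT": 1 ^ ge,
        "GT": ge & (1 ^ eq),
        "LE": (1 ^ ge) | eq,
    }
```

Only `EQ` and `GE` need dedicated logic; the other four flags are simple combinations of them.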



Comparators

Subtractor ($A - B$) : $GE = c_{out}$, $EQ = P_{n-1:0}$ (for free in PPA)

$A = 7n$, $T = 2n$ or $A \approx \frac{3}{2} n \log n$, $T \approx 2\log n$

[Figure cmpsub: CPA with inputs A and inverted B, carry-in 1; c_out = GE, P_{n-1:0} = EQ]

Optimized comparator : removing redundancies in subtractor (unused $S$) ⇒ single-tree structure ⇒ speed-up at no cost :

$A = 6n$, $T = 2n$ or $T = 2\log n$

example : ripple comparator using comparator slices

[Figure cmpripple: ripple comparator with inputs a_{n-1} ... a_0, b_{n-1} ... b_0 and outputs EQ, GE; slice variants for equality, magnitude, and equality & magnitude]



Decoder

Decodes binary number $A_{n-1:0}$ to vector $Z_{m-1:0}$ ($m = 2^n$)

$z_i = 1$ if $A = i$, $0$ else ; $i = 0, 1, \ldots, m-1$

[Figure decodersym: decoder symbol with input A and output Z]

[Figure decoder: 3-to-8 decoder with inputs a_2 ... a_0 and outputs z_7 ... z_0]

'

2

1)

27

log2

!

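Encoder

Encodes vector $A_{m-1:0}$ to binary number $Z_{n-1:0}$ ($m = 2^n$)
(condition: $\exists k\ \forall i$ : if $i = k$ then $a_i = 1$ else $a_i = 0$, i.e. exactly one bit of $A$ is set)

$Z = i$ if $a_i = 1$ ; $i = 0, 1, \ldots, m-1$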

log 2

[Figure encodersym: encoder symbol with input A and output Z]

2

'

27

8 1 1

)

2

1

[Figure encoder: 8-to-3 encoder with inputs a_7 ... a_0 and outputs z_2 ... z_0 (note: connections according to PPA-SK)]



Detection operations

All-zeroes detection : $z = \overline{a_{n-1} + a_{n-2} + \cdots + a_0}$

All-ones detection : $z = a_{n-1} a_{n-2} \cdots a_0$ (r.s.a.)

$A = n$, $T = \lceil \log n \rceil$

Leading-zeroes detection (LZD) : for scaling, normalization, priority encoding

a) non-encoded output : $z_i = 1$ only at the position of the most significant 1 of $A$ (e.g. 000101 → 000100)

[Figure lzdnenc: ripple LZD with inputs a_{n-1} ... a_0 and outputs z_{n-1} ... z_0]

⇒ prefix problem (r.m.a.) ⇒ AND-prefix structure

b) encoded output : + encoder

signed numbers : + leading-ones detector (LOZ)
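A ripple version of the non-encoded leading-zeroes detector can be sketched on bit strings (MSB first), which reproduces the 000101 → 000100 example. The function names are illustrative assumptions; the encoded variant simply reports the position found.

```python
def leading_one_detect(a: str) -> str:
    """Non-encoded LZD: keep only the most significant 1 of the bit string a."""
    seen = False                  # becomes True once a 1 has been passed
    out = []
    for ch in a:                  # scan from MSB to LSB
        bit = ch == "1"
        out.append("1" if bit and not seen else "0")
        seen = seen or bit
    return "".join(out)


def leading_zero_count(a: str) -> int:
    """Encoded output: number of leading zeroes (len(a) if a is all zeroes)."""
    z = leading_one_detect(a)
    return z.index("1") if "1" in z else len(a)
```

The running `seen` flag is the serial form of the prefix computation that the AND-prefix structure evaluates in parallel.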




5.5 Shift, Extension, Saturation

Shift : a) shift $n$-bit vector by $k$ bit positions
        b) select $n$ out of more bits at position $k$
        also: logical (= unsigned), arithmetic (= signed)

Rotation by $k$ bit positions, $n$ constant (logic operation)

Extension of word lengths by $k$ bits ($n \rightarrow n + k$) (i.e. sign-extension for signed numbers)

Saturation to highest/lowest value after over-/underflow

Bit patterns for $k = 1$ :

shift a)   unsigned  l.   $a_{n-2}, \ldots, a_0, 0$                      sll
           unsigned  r.   $0, a_{n-1}, \ldots, a_1$                      srl
           signed    l.   $a_{n-1}, a_{n-3}, \ldots, a_0, 0$             sla
           signed    r.   $a_{n-1}, a_{n-1}, a_{n-2}, \ldots, a_1$       sra
shift b)   unsigned       $a_{n+k-1}, \ldots, a_k$
           signed         $a_{2n-1}, a_{n+k-2}, \ldots, a_k$
rotate               l.   $a_{n-2}, \ldots, a_0, a_{n-1}$                rol
                     r.   $a_0, a_{n-1}, \ldots, a_1$                    ror
extend     unsigned  l.   $0, a_{n-1}, \ldots, a_0$
           unsigned  r.   $a_{n-1}, \ldots, a_0, 0$
           signed    l.   $a_{n-1}, a_{n-1}, a_{n-2}, \ldots, a_0$
           signed    r.   $a_{n-1}, a_{n-2}, \ldots, a_0, 0$

saturate unsigned 7 8 1 1 1 1 7 8 1signed 7 8 1 7 8 1 1 1 1 7 8 1



Applications :

adaptation of magnitude (shift a)) or word length (extension) of operands (e.g. for addition)
multiplication/division by powers of 2 (shift)
logic bit/byte operations (shift, rotation)
scaling of numbers for word-length reduction (i.e. ignore leading zeroes, shift b)) or normalization (e.g. of floating-point numbers, shift a)) using LZD
reducing error after over-/underflow (saturation)

Implementation of shift/extension/rotation by

constant values : hard-wired
variable values : multiplexers
$n$ possible values : $n$-by-$n$ barrel-shifter/rotator

Example : 4-by-4 barrel-rotator

$A = O(n^2)$, $T = O(\log n)$

[Figure muxshift: 4-by-4 barrel-rotator built from multiplexers, inputs a_3 ... a_0, select s_1 s_0, outputs z_3 ... z_0]

[Figure barshift: 4-by-4 barrel-rotator built from tristate buffers, inputs a_3 ... a_0, select s_1 s_0, outputs z_3 ... z_0]
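One common way to obtain the logarithmic delay is to cascade stages that rotate by 1, 2, 4, ... positions, each controlled by one bit of the rotation amount. The sketch below models that multiplexer-stage variant; it is an assumption for illustration and not necessarily the structure drawn in the figures. Bit lists are LSB first and the length is assumed to be a power of two.

```python
def barrel_rotate_left(a: list[int], k: int) -> list[int]:
    """Rotate the bit list a (a[i] is bit i) left by k positions in log2(n) stages."""
    n = len(a)
    stage = 0
    while (1 << stage) < n:
        if (k >> stage) & 1:                 # select input s_stage
            sh = 1 << stage
            # Each output bit i is a 2:1 multiplexer choosing bit i - sh.
            a = [a[(i - sh) % n] for i in range(n)]
        stage += 1
    return a
```

For a 4-bit word, `barrel_rotate_left([1, 0, 0, 0], 3)` moves the set bit from position 0 to position 3.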



5.6 Addition Flags

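flag   formula                        description
$C$    $c_n$                          carry flag
$V$    $c_n \oplus c_{n-1}$           signed overflow flag
$Z$    $\forall i : s_i = 0$          zero flag
$N$    $s_{n-1}$                      negative flag, sign

Implementation of adder with flags

$C$, $N$ : for free

$V$ : fast $c_n$, $c_{n-1}$ computed by e.g. PPA ⇒ very cheap

$Z$ : a) $c_{in} = 1$ (subtract.) : $Z = (A = B) = P_{n-1:0}$ (of PPA)

      b) $c_{in} = 0/1$ :

      1) $Z = \overline{s_{n-1} + s_{n-2} + \cdots + s_0}$ (r.s.a.)

         $A = n$, $T = \lceil \log n \rceil$

      2) faster without final sum (i.e. carry prop.) [18]

         example : $01001100 + 10110100 + 0 = 00000000 \pmod{2^8}$

         $z_0 = \overline{(a_0 \oplus b_0) \oplus c_{in}}$

         $z_i = \overline{(a_i \oplus b_i) \oplus (a_{i-1} + b_{i-1})}$ ; $i = 1, \ldots, n-1$

         $Z = z_{n-1} z_{n-2} \cdots z_0$ (r.s.a.)

         $A = 3n$, $T = 4 + \lceil \log n \rceil$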
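The flag definitions and the carry-free zero test can be compared against each other with a word-level sketch. Function names and the dictionary layout are illustrative assumptions; operands are $n$-bit vectors held in Python integers.

```python
def add_with_flags(a: int, b: int, c_in: int, n: int) -> tuple[int, dict[str, int]]:
    """n-bit addition returning the sum and the flags C, V, Z, N."""
    mask = (1 << n) - 1
    a, b = a & mask, b & mask
    total = a + b + c_in
    s = total & mask
    c_n = total >> n                                          # carry out of MSB
    low = mask >> 1
    c_n1 = ((a & low) + (b & low) + c_in) >> (n - 1)          # carry into MSB
    flags = {"C": c_n, "V": c_n ^ c_n1, "Z": int(s == 0), "N": s >> (n - 1)}
    return s, flags


def zero_flag_without_sum(a: int, b: int, c_in: int, n: int) -> int:
    """Zero flag from the operands alone, without propagating carries."""
    mask = (1 << n) - 1
    p = (a ^ b) & mask                          # a_i xor b_i
    k = (((a | b) << 1) | c_in) & mask          # a_{i-1} or b_{i-1}, c_in at bit 0
    return int((p ^ k) & mask == 0)             # every z_i must be 1
```

For the 8-bit example above, `zero_flag_without_sum(0b01001100, 0b10110100, 0, 8)` yields 1, in agreement with the `Z` flag of the full addition.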



Basic and derived condition flags

condition flag, formula (unsigned), formula (signed)

operation:

( ) or

( )

0 zero

0 negative

0 positive

( overow

( )

0

2 underow

( )

operation:

$

$

$

'

)

6

$

Unsigned and signed addition/subtraction only differ with respect to the condition flags




5.7 Arithmetic Logic Unit (ALU)

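[Figure alusymbol: ALU symbol with inputs A, B, c_in and op, outputs Z, c_out and flags]

ALU operations

arithmetic      add    $A + B + c_{in}$        sub    $A - B - c_{in}$
                inc    $A + 1$                 dec    $A - 1$
                pass   $A$                     neg    $-A$
logic           and    $A \cdot B$             nand   $\overline{A \cdot B}$
                or     $A + B$                 nor    $\overline{A + B}$
                xor    $A \oplus B$            xnor   $\overline{A \oplus B}$
                pass   $A$                     not    $\bar{A}$
shift/rotate    sll    $A \ll 1$               srl    $A \gg 1$
                sla    $A \ll 1$               sra    $A \gg 1$
                rol    $A \ll 1$               ror    $A \gg 1$

s/ro : shift/rotate ; l/r : left/right ; l/a : logic (unsigned) / arithmetic (signed)

Logic of adder/subtractor can partly be shared with logic operations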



6 Multiplication

6.1 Multiplication Basics

Multiplies two $n$-bit operands $A$ and $B$ [1, 2]

Product $P$ is $(2n)$-bit unsigned number or $(2n - 1)$-bit signed number

Example : unsigned multiplication

$P = A \cdot B = \sum_{i=0}^{n-1} a_i 2^i \cdot \sum_{j=0}^{n-1} b_j 2^j = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_i b_j 2^{i+j}$

or $P = \sum_{i=0}^{n-1} a_i B\, 2^i$ ; $i = 0, 1, \ldots, n-1$ (r.s.a.)

Algorithm

1) Generation of $n$ partial products

2) Adding up partial products :
   a) sequentially (sequential shift-and-add),
   b) serially (combinational shift-and-add), or
   c) in parallel

Speed-up techniques

Reduce number of partial products
Accelerate addition of partial products



Sequential multipliers : partial products generated and added sequentially (using accumulator)

$A = O(n)$, $T = O(\log n)$, $L = n$

[Figure mulseq: sequential multiplier with CPA in an accumulator loop]

Array multipliers : partial products generated and added simultaneously in linear array (using array adder)

$A = O(n^2)$, $T = O(n)$

[Figure mularr: array multiplier as a linear chain of CSAs followed by a CPA]

Parallel multipliers : partial products generated in parallel and added subsequently in multi-operand adder (using tree adder)

$A = O(n^2)$, $T = O(\log n)$

[Figure mulpar: parallel multiplier with CSA tree followed by a CPA]

Signed multipliers :
a) complement operands before and result after multiplication ⇒ unsigned multiplication
b) direct implementation (dedicated multiplier structure)


6.2 Unsigned Array Multiplier

Braun multiplier : array multiplier for unsigned numbers

$P = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_i b_j 2^{i+j}$

$A = 8n^2 - 11n$, $T = 6n - 9$

[Partial-product matrix for n = 4: rows a_3 b_j, a_2 b_j, a_1 b_j, a_0 b_j for j = 0 ... 3, each shifted by j positions and summed to p_7 ... p_0]

[Figure mulbraun: 4-by-4 Braun multiplier, CSA array of FA/HA cells with inputs a_3 ... a_0 and b_3 ... b_0, final CPA row, outputs p_7 ... p_0; regions marked 1, 2, 3]




6.3 Signed Array Multipliers

Modified Braun multiplier

Subtract bits with negative weight ⇒ special FAs [1]

1 neg. bit : $-a + b + c_{in} = 2c_{out} - s$

2 neg. bits : $a - b - c_{in} = -2c_{out} + s$

Replace FAs in regions 1, 2, and 3 by the special FAs (input at mark)

[Figure: special FA symbols for regions 1, 2, and 3]

Otherwise exactly same structure and complexity as Braun multiplier ⇒ efficient and flexible

Baugh-Wooley multiplier

Arithmetic transformations yield the following partial products (two additional ones) :

[Partial-product matrix for n = 4: rows of a_i b_j terms plus two additional partial products, summed to p_7 ... p_0]

Less efficient and regular than modified Braun multiplier



6.4 Booth Recoding

Speed-up technique : reduction of partial products

Sequential multiplication

Minimal (or canonical) signed-digit (SD) represent. of $A$
+ One cycle per non-zero partial product (i.e. $\forall a_i \neq 0$)
- Negative partial products
- Data-dependent reduction of partial products and latency

Combinational multiplication

Only fixed reduction of partial products possible

Radix-4 modified Booth recoding : 2 bits recoded to one multiplier digit ⇒ $n/2$ partial products

$P = \sum_{i=0}^{n/2-1} (a_{2i-1} + a_{2i} - 2a_{2i+1})\, 2^{2i}\, B$ ; $a_{-1} = 0$

$a_{2i+1}$  $a_{2i}$  $a_{2i-1}$   partial product
0           0         0            $0$
0           0         1            $+B$
0           1         0            $+B$
0           1         1            $+2B$
1           0         0            $-2B$
1           0         1            $-B$
1           1         0            $-B$
1           1         1            $0$

[Figure mulbooth: multiplier with Booth recoding block, CSA array/tree and final CPA]



Applicable to sequential, array, and parallel multipliers

- additional recoding logic and more complex partial product generation (MUX for shift, XOR for negation)

: 8 2

: 7

+ adder array/tree cut in half ⇒ considerably smaller (array and tree)

: 2

much faster for adder arrays : 2

slightly or not faster for adder trees : 0

Negative partial products (avoid sign-extension) :

[Figure: partial-product matrix for n = 4 with extended sign bits, and the equivalent matrix without sign-extension, both summed to p_6 ... p_0]

Suited for signed multiplication (incl. Booth recod.)

Extend $A$ for unsigned multiplication : $a_n = 0$

Radix-8 (3-bit recoding) and higher radices : precomputing $3B$, ... ⇒ larger overhead
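Radix-4 modified Booth recoding can be sketched in a few lines: overlapping bit triples of the multiplier are mapped to digits in $\{-2, \ldots, 2\}$, and each digit selects one partial product. The function names are illustrative assumptions; the multiplier `a` is taken as a signed value that fits in `n` bits (2's complement), with `n` even.

```python
def booth_radix4_digits(a: int, n: int) -> list[int]:
    """Recode an n-bit 2's complement multiplier into n/2 digits in {-2, ..., 2}.

    Digit i equals a_{2i-1} + a_{2i} - 2*a_{2i+1}, with a_{-1} = 0.
    """
    bits = [(a >> i) & 1 for i in range(n)]   # >> is arithmetic for negative ints
    digits = []
    prev = 0                                  # a_{-1}
    for i in range(0, n, 2):
        digits.append(prev + bits[i] - 2 * bits[i + 1])
        prev = bits[i + 1]
    return digits


def booth_multiply(a: int, b: int, n: int) -> int:
    """Sum the n/2 recoded partial products digit * B * 4^i."""
    return sum(d * b * 4 ** i for i, d in enumerate(booth_radix4_digits(a, n)))
```

For example, the 4-bit value 7 recodes to the digits `[-1, 2]` (LSB first), i.e. $7 = -1 + 2 \cdot 4$, so only two partial products are needed instead of four.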



6.5 Wallace Tree Addition

Speed-up technique : fast partial product addition

$A = O(n^2)$, $T = O(\log n)$

Applicable to parallel multipliers : parallel partial product generation (normal or Booth recoded)

Irregular adder tree (Wallace tree) due to different number of bits per column
⇒ irregular wiring and/or layout
⇒ non-uniform bit arrival times at final adder

6.6 Multiplier Implementations

Sequential multipliers :
low performance, small area, resource sharing (adder)

Braun or Baugh-Wooley multiplier (array multiplier) :
medium performance, high area, high regularity
layout generators ⇒ data paths and macro-cells
simple pipelining, faster CPA ⇒ higher speed

Booth-Wallace multiplier (parallel multiplier) [9] :
high performance, high area, low regularity
custom multipliers, netlist generators
often pipelined (e.g. register between CSA-tree and CPA)

Signed-unsigned multiplier : signed multiplier with operands extended by 1 bit ($a_n = a_{n-1}/0$, $b_n = b_{n-1}/0$)




6.7 Composition from Smaller Multipliers

$(2n \times 2n)$-bit multiplier can be composed from 4 $(n \times n)$-bit multipliers (can be repeated recursively)

$A \cdot B = (A_H 2^n + A_L)(B_H 2^n + B_L) = A_H B_H 2^{2n} + (A_H B_L + A_L B_H)\, 2^n + A_L B_L$

4 $(n \times n)$-bit multipliers + $(2n)$-bit CSA + $(3n)$-bit CPA

less efficient (area and speed)
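The decomposition can be written out directly. In this sketch `mul_n` stands in for an existing $(n \times n)$-bit unsigned multiplier block and is a hypothetical helper; the three shifted terms are what the CSA and CPA would add up in hardware.

```python
def mul_n(a: int, b: int, n: int) -> int:
    """Stand-in for an (n x n)-bit unsigned multiplier block."""
    assert 0 <= a < (1 << n) and 0 <= b < (1 << n)
    return a * b


def mul_2n(a: int, b: int, n: int) -> int:
    """(2n x 2n)-bit unsigned multiplication from four (n x n)-bit multiplications."""
    mask = (1 << n) - 1
    a_h, a_l = a >> n, a & mask
    b_h, b_l = b >> n, b & mask
    high = mul_n(a_h, b_h, n) << (2 * n)
    middle = (mul_n(a_h, b_l, n) + mul_n(a_l, b_h, n)) << n
    low = mul_n(a_l, b_l, n)
    return high + middle + low
```

Applying the same split inside `mul_n` would give the recursive composition mentioned above.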

6.8 Squaring

$P = A^2 = A \cdot A$ : multiplier optimizations possible

[Partial-product matrix for n = 4: symmetric terms a_i a_j and a_j a_i merged into single terms of doubled weight, diagonal terms a_i a_i reduced to a_i, summed to p_7 ... p_0]

+ ⇒ $n/2 + 1$ partial products (if no Booth recoding used)
⇒ optimized squarer more efficient than multiplier

Table look-up (ROM) less efficient for every $n$



7 Division / Square Root Extraction

7.1 Division Basics

$Q = A / B$ ; $R = A$ rem $B$ (remainder)

$A = Q \cdot B + R$

$A \in [0, 2^{2n} - 1]$ ; $B, Q, R \in [0, 2^n - 1]$

$A < 2^n B$, otherwise overflow ⇒ normalize $B$ before division ($B \in [2^{n-1}, 2^n - 1]$)

Algorithms (radix-2)

Subtract-and-shift : partial remainders $R_i$ [1, 2]

Sequential algorithm : recursive, non-associative

$R_i = R_{i+1} - q_i B\, 2^i$ ; $i = n-1, \ldots, 0$ ; $R_n = A$, $R = R_0$ (r.m.n.)

Basic algorithm : compare and conditionally subtract ⇒ expensive comparison and CPA

Restoring division : subtract and conditionally restore (adder or multiplexer) ⇒ expensive CPA and restoring

Non-restoring division : detect sign, subtract/add, and correct by next steps ⇒ expensive CPA

SRT division : estimate range, subtract/add (CSA), and correct by next steps ⇒ inexpensive CSA



7.2 Restoring Division

$q_i = 1$ if $R_{i+1} - B\, 2^i \geq 0$
$q_i = 0$ if $R_{i+1} - B\, 2^i < 0$

$R'_i = R_{i+1} - B\, 2^i$

$R_i = R'_i$ if $R'_i \geq 0$, otherwise $R_i = R_{i+1}$ (restored)

7.3 Non-Restoring Division

$q_i = 1$ if $R_{i+1} \geq 0$
$q_i = \bar{1}$ if $R_{i+1} < 0$

$R_i = R_{i+1} - q_i B\, 2^i$

One subtraction/addition (CPA) per step

Final correction step for $R$ (additional CPA)

Simple quotient digit conversion : (note: $q_i \in \{\bar{1}, 1\}$ irredundant)

$q_i \in \{\bar{1}, 1\} \rightarrow q'_i \in \{0, 1\}$ : $q'_i = \frac{1}{2}(q_i + 1)$

$Q = (\bar{q}'_{n-1}, q'_{n-2}, q'_{n-3}, \ldots, q'_0, 1)$

$A = (n + 1)\, A_{CPA} = O(n^2)$ or $O(n^2 \log n)$

$T = (n + 1)\, T_{CPA} = O(n^2)$ or $O(n \log n)$

[Figure divnr: non-restoring divider as a cascade of add/subtract CPA stages with inputs A, B and outputs Q, R]
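Both subtract-and-shift schemes can be compared with a small integer sketch. It assumes unsigned operands with $0 < B < 2^n$ and $A < 2^n B$; the function names are illustrative, and the digit set $\{\bar{1}, 1\}$ is modelled with the Python values -1 and +1.

```python
def restoring_divide(a: int, b: int, n: int) -> tuple[int, int]:
    """Restoring division: trial subtraction, kept only if the result is >= 0."""
    assert 0 < b < (1 << n) and 0 <= a < (b << n)
    r, q = a, 0
    for i in range(n - 1, -1, -1):
        trial = r - (b << i)
        if trial >= 0:            # q_i = 1, keep the new remainder
            r = trial
            q |= 1 << i
        # otherwise q_i = 0 and r stays unchanged (restored)
    return q, r


def nonrestoring_divide(a: int, b: int, n: int) -> tuple[int, int]:
    """Non-restoring division with quotient digits in {-1, +1}."""
    assert 0 < b < (1 << n) and 0 <= a < (b << n)
    r, q = a, 0
    for i in range(n - 1, -1, -1):
        q_i = 1 if r >= 0 else -1     # sign detection selects subtract or add
        r -= q_i * (b << i)
        q += q_i << i
    if r < 0:                         # final correction step
        r += b
        q -= 1
    return q, r
```

For 13 / 3 with $n = 3$ the non-restoring digits are (+1, +1, -1), giving 5 with remainder -2, and the final correction yields quotient 4 and remainder 1.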



7.4 Signed Division

$q_i = 1$ if $R_{i+1}$, $B$ same sign
$q_i = \bar{1}$ if $R_{i+1}$, $B$ opposite sign

Example : signed non-restoring array divider (simplifications: $B > 0$, final correction of $R$ omitted)

$A = 9n^2$, $T = 2n^2 + 4n$

[Figure divarray: 4-by-4 array divider, four rows of four FAs with controlled add/subtract, inputs a_6 ... a_0 and b_3 ... b_0, outputs q_3 ... q_0 and r_3 ... r_0]




7.5 SRT Division (Sweeney, Robertson, Tocher)

$q_i = 1$ if $B\, 2^i \leq R_{i+1}$
$q_i = 0$ if $-B\, 2^i \leq R_{i+1} < B\, 2^i$
$q_i = \bar{1}$ if $R_{i+1} < -B\, 2^i$

$q_i \in \{\bar{1}, 0, 1\}$, i.e. $Q$ is SD number

If $2^{n-1} \leq B < 2^n$, i.e. $B$ is normalized :

$-B\, 2^i \leq -2^{n+i-1} \leq R_{i+1} < 2^{n+i-1} \leq B\, 2^i$

$q_i = 1$ if $2^{n+i-1} \leq R_{i+1}$
$q_i = 0$ if $-2^{n+i-1} \leq R_{i+1} < 2^{n+i-1}$
$q_i = \bar{1}$ if $R_{i+1} < -2^{n+i-1}$

+ Only 3 MSB are compared ⇒ $R_i$ are estimated ⇒ CSA instead of CPA can be used (precise enough) [19]

- Correction in following steps (+ final correction step)

- Redundant representation of $Q$ (SD representation) ⇒ final conversion necessary (CPA)

+ Highly regular and fast ($T = O(n)$) SRT array dividers ⇒ only slightly slower/larger than array multipliers

$A = O(n^2)$, $T = O(n)$

[Figure divsrt: SRT divider as a cascade of add/subtract CSA stages with final CPAs, inputs A, B and outputs Q, R]



7.6 High-Radix Division

Radix $\beta = 2^m$, $q_i \in \{-(\beta - 1), \ldots, \bar{1}, 0, 1, \ldots, \beta - 1\}$

$m$ quotient bits per step ⇒ fewer, but more complex steps

+ Suitable for SRT algorithm ⇒ faster

- Complex comparisons (more bits) and decisions ⇒ table look-up (⇒ Pentium bug!)

7.7 Division by Multiplication

Division by convergence

0

1

&

8 1

0

1

&

8 1

1!

1!

1resp.

27

%

1

27

'

1 )

'

1 )

27

'

1 2)

27

1

28

7

2

28

7

1 (signed)

Algorithm :

%

1

%

1

1 ; 0 0 1 1 1 1

0

0

& (r.s.n.)

Quadratic convergence :

log 2 !



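Division by reciprocation

$Q = A / B = A \cdot \frac{1}{B}$

Newton-Raphson iteration method : find $f(X) = 0$ by recursion

$X_{i+1} = X_i - \frac{f(X_i)}{f'(X_i)}$

$f(X) = \frac{1}{X} - B$, $f'(X) = -\frac{1}{X^2}$, $f\left(\frac{1}{B}\right) = 0$

Algorithm :

$X_{i+1} = X_i\, (2 - B X_i)$ ; $i = 0, 1, \ldots, m-1$ (r.s.n.)

Quadratic convergence : $L = O(\log n)$

Speed-up : first approximation $X_0$ from table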
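The Newton-Raphson recursion $X_{i+1} = X_i (2 - B X_i)$ is easy to try in floating point. This is a minimal sketch, assuming a divisor normalized to $0.5 \leq B < 1$; the linear starting value $3 - 2B$ is a hypothetical stand-in for the table look-up and is not taken from the notes.

```python
def reciprocal(b: float, iterations: int = 5) -> float:
    """Approximate 1/b for 0.5 <= b < 1 by Newton-Raphson iteration."""
    assert 0.5 <= b < 1.0
    x = 3.0 - 2.0 * b             # assumed first approximation, error |1 - b*x| <= 0.125
    for _ in range(iterations):
        x = x * (2.0 - b * x)     # the error 1 - b*x is squared in every step
    return x


def divide(a: float, b: float) -> float:
    """Division by reciprocation: Q = A * (1/B)."""
    return a * reciprocal(b)
```

Since the relative error is squared per step, the number of correct bits roughly doubles each iteration, which is the quadratic convergence stated above.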

7.8 Remainder / Modulus

Remainder (rem) : signed remainder of a division

$R = A$ rem $B$, sign($R$) = sign($A$)

Modulus (mod) : positive remainder of a division

$M = A$ mod $B$, $M \geq 0$

$M = R$ if $R \geq 0$, $R + B$ else
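The difference between the two definitions only shows for negative dividends. The sketch below assumes a positive divisor $B$; the function names are illustrative.

```python
def rem(a: int, b: int) -> int:
    """Signed remainder: magnitude from |a| mod b, sign follows the dividend a."""
    assert b > 0
    r = abs(a) % b
    return -r if a < 0 else r


def mod(a: int, b: int) -> int:
    """Positive remainder: add b whenever the signed remainder is negative."""
    r = rem(a, b)
    return r if r >= 0 else r + b
```

For instance, -7 rem 3 gives -1 while -7 mod 3 gives 2; for non-negative dividends both agree.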



7.9 Divider Implementations

Iterative dividers (through multiplication) :

resource sharing of existing components (multiplier)
medium performance, medium area
high efficiency if components are shared

Sequential dividers (restoring, non-restoring, SRT) :

resource sharing of existing components (e.g. adder)
low performance, low area

Array dividers (restoring, non-restoring, SRT) :

dedicated hardware component
high performance, high area
high regularity ⇒ layout generators, pipelining
square root extraction possible by minor changes
combination with multiplication or/and square root

No parallel dividers exist, as compared to parallel multipliers (sequential nature of division)




7.10 Square Root Extraction0

2

0 227

1

0 27

1

Algorithm Subtract-and-shift : partial remainders

and quotients

%

1

2

'

7

8 1 1 1 1
