
An IEEE 754-2008 Decimal Parallel and Pipelined FPGA Floating-Point Multiplier

Malte Baesler, Sven-Ole Voigt, Thomas Teufel

Institute for Reliable Computing, Hamburg University of Technology

September 1st, 2010


Agenda

1. Introduction

a) Why Decimal Floating-Point Arithmetic?

b) What are the Requirements on the Decimal Multiplier?

2. Decimal Fixed-Point Multiplier

3. Decimal Floating-Point Multiplier

4. Post Place & Route Results

a) Fixed-Point Multiplier

b) Floating-Point Multiplier


Introduction


Why decimal floating-point arithmetic?

● avoid conversion errors
● human-centric applications
● required for commercial applications, e.g. interest calculation

IEEE Standard 754-2008 for Floating-Point Arithmetic

● published in August 2008
● replaces IEEE 754-1985 and IEEE 854-1987
● binary and decimal floating-point arithmetic
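The conversion errors named in the motivation can be made concrete with a minimal Python sketch. It uses Python's standard `decimal` module purely as an illustration of decimal versus binary arithmetic; it is not the arithmetic implemented on the FPGA, and the interest figures are made-up example values.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly, so the sum
# carries a conversion error and does not compare equal to 0.3.
binary_sum = 0.1 + 0.2
binary_exact = (binary_sum == 0.3)

# Decimal arithmetic represents the same literals exactly.
decimal_sum = Decimal("0.1") + Decimal("0.2")
decimal_exact = (decimal_sum == Decimal("0.3"))

# Illustrative interest calculation: 5 % on an amount of 0.70.
interest = Decimal("0.70") * Decimal("0.05")
```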


Floating-Point Arithmetic

IEEE 754-2008 Floating-Point Arithmetic

decimal64 data format
● radix b = 10
● significand precision p = 16
● exponent range q_min = −398, q_max = 369
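As a cross-check of these parameters, the following sketch sets up a decimal64-like context in Python's `decimal` module. The values emin = −383 and emax = 384 come from IEEE 754-2008; the name `DECIMAL64` is only an illustrative choice. The module's `Etiny()` and `Etop()` correspond to the smallest and largest exponent q of an integer significand.

```python
from decimal import Context, ROUND_HALF_EVEN

# decimal64: p = 16 digits, emin = -383, emax = 384.
# traps=[] makes exceptions set flags instead of raising.
DECIMAL64 = Context(prec=16, Emin=-383, Emax=384,
                    rounding=ROUND_HALF_EVEN, traps=[])

q_min = DECIMAL64.Etiny()  # Emin - prec + 1
q_max = DECIMAL64.Etop()   # Emax - prec + 1
```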


Requirements on the multiplier

● fast
● low resource usage
● IEEE 754-2008 compliant
● fully combinational → pipelined due to reuse in accurate scalar product
● optimized for FPGA architecture (Virtex-5)
– internal fast carry chain
– DSP48E slices



Decimal Fixed-Point Multiplier

Fixed-Point Multiplier

How does multiplication work? School method:

● partial product generation
● accumulation of partial products

1234 ⋅ 5678 = 5000⋅1234 + 600⋅1234 + 70⋅1234 + 8⋅1234
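The two steps of the school method can be sketched in a few lines of Python. The function name `school_multiply` is hypothetical; the sketch works on plain integers and only mirrors the structure (partial product generation, then accumulation), not the hardware.

```python
def school_multiply(a, b):
    """Multiply non-negative integers with the school method."""
    partial_products = []
    position = 0
    while b > 0:
        digit = b % 10                                      # next multiplier digit
        partial_products.append(digit * a * 10**position)   # generation
        b //= 10
        position += 1
    return partial_products, sum(partial_products)          # accumulation

parts, product = school_multiply(1234, 5678)
```

For 1234 ⋅ 5678 the partial products are 8⋅1234, 70⋅1234, 600⋅1234 and 5000⋅1234, listed from the least significant multiplier digit upward.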



● based on concepts of A. Vazquez, E. Antelo, P. Montuschi [1]
● fully combinational
● BCD recoding schemes
● fast partial product generation
● fast BCD-4221 carry save adder reduction tree

[1] "A new family of high-performance parallel decimal multipliers", 18th IEEE Symposium on Computer Arithmetic, June 2007



[Figure: block diagram of the decimal fixed-point multiplier. Inputs A and B (BCD-8421, p digits each); partial products P0, P1, ..., Pp+1 (BCD-4221); adder tree outputs S_s and S_w (BCD-4221, 2p digits each); result S (BCD-8421, 2p digits).]

DRec: Decimal Recoding Unit
PPGen: Partial Product Generator
CSAT: Carry Save Adder Tree
CPA: Carry Propagation Adder



Decimal Recoding

● transforms the multiplier's digit set {0, ..., 9} into {−5, ..., 5}
● reduces number of multiplicand multiples to A×1, A×2, A×3, A×4, A×5
● very fast operation, no ripple carry
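A minimal sketch of such a recoding, assuming the usual signed-digit radix-10 scheme: a digit of 5 or more sends a carry of 1 to the next position and is reduced by 10. The slides do not spell out the exact scheme, so the function `recode_signed_digits` and its details are illustrative assumptions. Each carry depends only on its own digit, which is why no ripple carry arises.

```python
def recode_signed_digits(digits):
    """Recode digits 0..9 (least significant first) into digits -5..5.

    Returns len(digits) + 1 signed digits with the same value.
    """
    recoded = []
    carry_in = 0
    for d in digits:
        carry_out = 1 if d >= 5 else 0      # depends on this digit only
        recoded.append(d + carry_in - 10 * carry_out)
        carry_in = carry_out
    recoded.append(carry_in)                # extra most significant digit
    return recoded

def value(signed_digits):
    return sum(d * 10**i for i, d in enumerate(signed_digits))

recoded_5678 = recode_signed_digits([8, 7, 6, 5])   # digits of 5678, LSD first
```

A p-digit multiplier yields p+1 recoded digits, and each recoded digit selects one of the multiples 0, ±A×1, ..., ±A×5.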



Partial Product Generator

● calculates multiples A×1, A×2, A×3, A×4, A×5
– exploits correlation between shift operation and constant-value multiplication
  ● $X_{5421} \ll 1 = (X \cdot 2)_{8421}$
  ● $X_{8421} \ll 3 = (X \cdot 5)_{5421}$
– BCD recoding is fast
– fixed-value shift operation is for free
– only A×3 requires one carry propagate adder
● generates partial products P0 ... Pp+1 by selection of A×1 – A×5
● 10's complement for $B_k < 0$: $-X_n \ldots X_0 = \overline{X_n} \ldots \overline{X_0} + 1$
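The two shift identities can be checked with a small bit-level sketch. It packs decimal digits into 4-bit codes with the given bit weights, shifts the whole word, and reads the result back with the other code's weights. All function names here are hypothetical and the model is software only; in hardware the fixed shift is pure wiring.

```python
W8421 = (8, 4, 2, 1)   # bit weights, most significant bit first
W5421 = (5, 4, 2, 1)

def encode(n, weights, ndigits):
    """Pack the decimal digits of n into 4-bit codes with the given weights."""
    word = 0
    for i in range(ndigits):
        d = n % 10
        n //= 10
        code = 0
        for w in weights:              # greedy choice from the largest weight
            code <<= 1
            if d >= w:
                code |= 1
                d -= w
        word |= code << (4 * i)
    return word

def decode(word, weights, ndigits):
    """Read a packed word back as an integer using the given bit weights."""
    total = 0
    for i in range(ndigits):
        code = (word >> (4 * i)) & 0xF
        digit = sum(w for k, w in enumerate(weights) if code & (8 >> k))
        total += digit * 10**i
    return total

def times2(n, p):
    # X in BCD-5421, shifted left by 1 bit, read as BCD-8421
    return decode(encode(n, W5421, p) << 1, W8421, p + 1)

def times5(n, p):
    # X in BCD-8421, shifted left by 3 bits, read as BCD-5421
    return decode(encode(n, W8421, p) << 3, W5421, p + 1)
```

The remaining multiples follow from these: A×4 is a doubled A×2, and A×3 = A×1 + A×2 is the one multiple that needs a carry propagate adder.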



BCD-4221 Carry Save Adder Tree

[Figure: carry save adder tree that sums up the p+1 partial products P1, P2, P3, ..., Pp+1]

[Figure: CSA tree with respect to decimal recoding: partial products P1 ... Pp+1 with terms C1, C2, ..., Cp and one sign extension per partial product]

[Figure: improved CSA tree with respect to decimal recoding: partial products P1 ... Pp+1 with terms C1, C2, ..., Cp and a single improved sign extension]
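Why BCD-4221 suits carry save addition can be sketched on a single digit column. Every 4-bit pattern is a valid BCD-4221 digit (the weights sum to 9), so a plain binary 3:2 carry save adder applied bitwise yields a valid sum digit and a valid carry digit with a + b + c = s + 2·h. The helper names are hypothetical, and the doubling of the carry word that a full tree needs between levels is only noted, not modeled.

```python
WEIGHTS_4221 = (4, 2, 2, 1)

def to_4221(digit):
    """One possible BCD-4221 encoding of a digit 0..9 (greedy choice)."""
    bits = []
    for w in WEIGHTS_4221:
        if digit >= w:
            bits.append(1)
            digit -= w
        else:
            bits.append(0)
    return bits

def from_4221(bits):
    return sum(w * b for w, b in zip(WEIGHTS_4221, bits))

def csa_digit(a, b, c):
    """Bitwise 3:2 carry save addition of three decimal digits.

    Returns (s, h) with a + b + c == s + 2 * h, both in 0..9.
    """
    x, y, z = to_4221(a), to_4221(b), to_4221(c)
    s = [i ^ j ^ k for i, j, k in zip(x, y, z)]                   # sum bits
    h = [(i & j) | (i & k) | (j & k) for i, j, k in zip(x, y, z)] # carry bits
    return from_4221(s), from_4221(h)
```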


Improved Sign Extension

● adding several words composed of leading nines and trailing zeros always yields a word composed of 0, 8, and 9. For example:

  999999990000 + 999900000000 + 990000000000 = x989899990000

● position of 0, 8, and 9 can be calculated very fast by means of the FPGA's fast carry chain

$$X_k^{NegDC} = \begin{cases} 9 & \text{for } c_k^{in} = 0 \wedge sign_k = 1 \\ 8 & \text{for } c_k^{in} = 1 \wedge sign_k = 1 \\ 0 & \text{else} \end{cases}$$

$$c_k^{out} = c_{k+1}^{in} = \begin{cases} 1 & \text{for } sign_k = 1 \\ c_k^{in} & \text{else} \end{cases}$$
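The first claim can be checked by brute force. The sketch below assumes that every word has a different number of trailing zeros, as is the case when each partial product sits at its own digit position; with repeated positions the property would not hold. It checks the arithmetic property only and does not model the carry chain formula. Function names are illustrative.

```python
from itertools import combinations

def sign_extension_word(zeros, width):
    """Word of `width` digits: leading nines followed by `zeros` zeros."""
    return 10**width - 10**zeros

def sum_sign_extensions(zero_counts, width):
    """Sum the words modulo 10**width and return the digits as a string."""
    total = sum(sign_extension_word(z, width) for z in zero_counts)
    return str(total % 10**width).zfill(width)

# The example from the slide: 4, 8 and 10 trailing zeros in 12-digit words.
example = sum_sign_extensions([4, 8, 10], 12)

# Exhaustive check for all selections of distinct positions in 8-digit words.
only_0_8_9 = all(
    set(sum_sign_extensions(sel, 8)) <= set("089")
    for r in range(9)
    for sel in combinations(range(8), r)
)
```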




Decimal Floating-Point Multiplier

● additional units for rounding, exponent computation and data format encoding/decoding
● based on M. Erle, B. Hickmann, M. Schulte [2]
● early estimation of shift left amount
● fully IEEE 754-2008 compliant
● support for gradual underflow and all rounding modes
● adapted to FPGA technology

[2] "Decimal Floating-Point Multiplication", IEEE Transactions on Computers, Vol. 58, No. 7, July 2009



[Figure: block diagram of the decimal floating-point multiplier. Inputs X and Y; outputs X•Y and exception signals. Units shown:]

● Densely Packed Decimal (DPD) Decoder
● Leading Zeros Count / Shift Left Amount Computation
● Decimal Fixed-Point Multiplier
● Left Shift Register
● Carry Propagate Adder
● Overflow / Underflow Correction
● Rounding Unit
● Round-Up Detection
● Exception Unit
● DPD Encoder
● Exponent Computation

Example operands: X = 0x03C80000534B9C1E, Y = 0x0250000277CB0D10
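What the DPD decoder does with the two example operands can be sketched in software. The code below follows the decimal64 interchange format of IEEE 754-2008 (1 sign bit, 5-bit combination field, 8-bit exponent continuation, five 10-bit declets, bias 398). Function names are hypothetical, and infinities and NaNs are rejected rather than handled.

```python
def declet_to_int(declet):
    """Decode one 10-bit densely packed decimal declet into 0..999."""
    b = [(declet >> i) & 1 for i in range(10)]   # b[0] is the LSB

    def small(hi, mid, lo):                      # digit 0..7 from three bits
        return 4 * hi + 2 * mid + lo

    if b[3] == 0:                                # all three digits are 0..7
        d = small(b[9], b[8], b[7]), small(b[6], b[5], b[4]), small(b[2], b[1], b[0])
    elif (b[2], b[1]) == (0, 0):                 # only the low digit is 8 or 9
        d = small(b[9], b[8], b[7]), small(b[6], b[5], b[4]), 8 + b[0]
    elif (b[2], b[1]) == (0, 1):                 # only the middle digit is 8 or 9
        d = small(b[9], b[8], b[7]), 8 + b[4], small(b[6], b[5], b[0])
    elif (b[2], b[1]) == (1, 0):                 # only the high digit is 8 or 9
        d = 8 + b[7], small(b[6], b[5], b[4]), small(b[9], b[8], b[0])
    elif (b[6], b[5]) == (0, 0):                 # high and middle are 8 or 9
        d = 8 + b[7], 8 + b[4], small(b[9], b[8], b[0])
    elif (b[6], b[5]) == (0, 1):                 # high and low are 8 or 9
        d = 8 + b[7], small(b[9], b[8], b[4]), 8 + b[0]
    elif (b[6], b[5]) == (1, 0):                 # middle and low are 8 or 9
        d = small(b[9], b[8], b[7]), 8 + b[4], 8 + b[0]
    else:                                        # all three are 8 or 9
        d = 8 + b[7], 8 + b[4], 8 + b[0]
    return 100 * d[0] + 10 * d[1] + d[2]

def decode_decimal64(word):
    """Return (sign, coefficient, exponent q) of a finite decimal64 word."""
    sign = word >> 63
    comb = (word >> 58) & 0x1F
    if comb >= 0b11110:
        raise ValueError("infinity or NaN is not handled in this sketch")
    if comb >> 3 == 0b11:                        # leading digit is 8 or 9
        exp_msbs, leading = (comb >> 1) & 0b11, 8 + (comb & 1)
    else:                                        # leading digit is 0..7
        exp_msbs, leading = comb >> 3, comb & 0b111
    biased = (exp_msbs << 8) | ((word >> 50) & 0xFF)
    coefficient = leading
    for i in range(4, -1, -1):                   # five declets, high to low
        coefficient = coefficient * 1000 + declet_to_int((word >> (10 * i)) & 0x3FF)
    return sign, coefficient, biased - 398

x_fields = decode_decimal64(0x03C80000534B9C1E)
y_fields = decode_decimal64(0x0250000277CB0D10)
```

Decoding shows that both example exponents are negative: X = 1234567890 × 10^−156 and Y = 9876543210 × 10^−250.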





X = +0000001234567890 EXP −156
Y = +0000009876543210 EXP −250
X•Y = +12193263111263526900 EXP −406


Z = significand(X•Y)
Z = 00000000000012193263111263526900
Zs = 66888846846688648888664609006600
Zc = 33111153153323544414446654520300



LZ(X) = 6, LZ(Y) = 6, SLA = min(6+6, p) = 12
Z = 1219326311126352.690000000000000
Zs = 8864888866460900.660000000000000
Zc = 2354441444665452.030000000000000




Z' = 1219326311126352, G = 6, R = 9, sb = '0'
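The shift left amount and the split into Z', guard digit G, round digit R and sticky bit sb can be reproduced with integer arithmetic. This is a behavioral sketch only: the names `leading_zeros` and `shift_and_split` are hypothetical, and the hardware operates on the redundant words Zs and Zc rather than on a single integer.

```python
P = 16  # decimal64 precision

def leading_zeros(coefficient, p=P):
    """Number of leading zero digits of a p-digit coefficient."""
    return p if coefficient == 0 else p - len(str(coefficient))

def shift_and_split(cx, cy, p=P):
    """Multiply, shift left by SLA, and split the 2p-digit result."""
    sla = min(leading_zeros(cx, p) + leading_zeros(cy, p), p)
    digits = str(cx * cy * 10**sla).zfill(2 * p)   # 2p-digit shifted product
    upper = int(digits[:p])                        # Z'
    guard = int(digits[p])                         # G
    round_digit = int(digits[p + 1])               # R
    sticky = int(any(ch != "0" for ch in digits[p + 2:]))
    return sla, upper, guard, round_digit, sticky

step = shift_and_split(1234567890, 9876543210)
```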




exponent = −406 + p − SLA = −402




Underflow correction: the exponent is below q_min = −398, so the significand is shifted right by 4 digits.

Z'' = 0000121932631112, G = 6, R = 3, sb = '1'
exponent = −398
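The exponent computation and the right shift for gradual underflow can be sketched as follows. The function `underflow_correct` is a hypothetical name, overflow handling is left out, and the digit-string approach is chosen for readability rather than as a model of the hardware shifter.

```python
def underflow_correct(upper, guard, round_digit, sticky, exponent,
                      p=16, q_min=-398):
    """Shift right until the exponent reaches q_min (gradual underflow)."""
    shift = max(0, q_min - exponent)
    # Z', G and R as one digit string, shifted right by prepending zeros.
    digits = "0" * shift + f"{upper:0{p}d}{guard}{round_digit}"
    new_upper = int(digits[:p])
    new_guard = int(digits[p])
    new_round = int(digits[p + 1])
    # Digits shifted out below R are folded into the sticky bit.
    new_sticky = int(bool(sticky) or any(ch != "0" for ch in digits[p + 2:]))
    return new_upper, new_guard, new_round, new_sticky, exponent + shift

exponent = -156 + -250 + 16 - 12        # qx + qy + p - SLA
corrected = underflow_correct(1219326311126352, 6, 9, 0, exponent)
```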




round up → Z''' = 0000121932631113 EXP −398
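A possible round-up decision for the five IEEE 754-2008 rounding directions is sketched below. The slides do not give the detection logic, so the function `round_up`, its mode names, and the way G, R and sb are combined are assumptions; "round up" here means incrementing the magnitude of the significand.

```python
def round_up(lsd, guard, round_digit, sticky, mode="half_even", negative=False):
    """Decide whether the truncated significand must be incremented.

    lsd is the least significant kept digit; guard is the first dropped
    digit; round_digit and sticky describe everything below the guard.
    """
    below_guard = round_digit != 0 or sticky != 0
    inexact = guard != 0 or below_guard
    if mode == "half_even":
        return guard > 5 or (guard == 5 and (below_guard or lsd % 2 == 1))
    if mode == "half_away":
        return guard >= 5
    if mode == "toward_zero":
        return False
    if mode == "toward_positive":
        return inexact and not negative
    if mode == "toward_negative":
        return inexact and negative
    raise ValueError(f"unknown rounding mode: {mode}")

# Example from the slides: Z'' ends in 2, G = 6, R = 3, sb = 1.
increment = round_up(2, 6, 3, 1)
rounded = 121932631112 + int(increment)
```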





Z = 0x000000285BCCC493
exception signals: invalid, inexact, overflow, underflow
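The whole worked example can be cross-checked against Python's `decimal` module with a decimal64-like context. This is an independent software reference under the assumption of round-half-even, not the FPGA design; the variable names are illustrative.

```python
from decimal import Context, Decimal, Inexact, Underflow, ROUND_HALF_EVEN

# decimal64-like context; traps=[] records exceptions as flags only.
ctx = Context(prec=16, Emin=-383, Emax=384,
              rounding=ROUND_HALF_EVEN, traps=[])

x = Decimal("1234567890E-156")
y = Decimal("9876543210E-250")
z = ctx.multiply(x, y)

coefficient = int("".join(map(str, z.as_tuple().digits)))
exponent = z.as_tuple().exponent
raised = {"inexact": bool(ctx.flags[Inexact]),
          "underflow": bool(ctx.flags[Underflow])}
```

The reference result has the same significand and exponent as Z''' above, and the software context reports the result as inexact and underflowed.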



                                 type1               type2               type3
fixed-point multiplier output    redundant           redundant           nonredundant
                                 (delayed CPA)       (delayed CPA)
CPA length (digits)              p+2 = 18            p+2 = 18            2·p = 32
shift register                   multiplier based    multiplexer based   multiplexer based

[Figure: redundant variant (delayed CPA). Blocks and signals: decimal fixed-point multiplier, Ps, Pc, shift register, Qsu, Qsl, Qcu, Qcl, CPA (p+2), CPA (p−2), OR, product, R, G, sticky bit]

[Figure: nonredundant variant. Blocks and signals: decimal fixed-point multiplier, Ps, Pc, CPA (2·p), shift register, OR, product, R, G, sticky bit]







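shifting through multiplication:

● $X \ll n \equiv X \cdot 2^n$
● requires two DSP48Es per 32-bit shift
● saves LUTs

[Figure: 32-bit shifter built from two DSP48E slices: X(31:16) and X(15:0) are each multiplied (MUL) by the shift factor 2^k, and the products are added (ADD) to form Y(31:16) and Y(15:0)]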
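The identity behind multiplier-based shifting can be sketched in software by splitting a 32-bit word into two 16-bit halves, multiplying each by 2^n, and adding the aligned products. The function name and the restriction to shift amounts of 0 to 16 are assumptions of this sketch; it does not model the DSP48E operand widths or how larger shifts would be staged.

```python
MASK32 = 0xFFFFFFFF

def shift_left_via_multiply(x, n):
    """Shift a 32-bit word left by n bits using two multiplications."""
    if not (0 <= x <= MASK32 and 0 <= n <= 16):
        raise ValueError("sketch supports 32-bit words and shifts of 0..16")
    factor = 1 << n                    # shift amount encoded as 2^n
    low = (x & 0xFFFF) * factor        # first multiplier: X(15:0) * 2^n
    high = (x >> 16) * factor          # second multiplier: X(31:16) * 2^n
    return (low + (high << 16)) & MASK32   # adder combines both products

shifted = shift_left_via_multiply(0x00001234, 8)
```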


Post Place & Route Results

Decimal Fixed-Point Multiplier with CPA output

● Xilinx Virtex-5, speed grade 2
● up to 13 pipeline registers, configurable via VHDL generics
● 5350 – 6500 LUTs, 0 – 4900 FFs
● 5500 – 7600 combined LUTs and FFs






                 Type1                  Type2                  Type3
                 mul-based shifting,    mux-based shifting,    mux-based shifting,
                 delayed CPA            delayed CPA            no delayed CPA
#LUTs            6300 – 8400            7900 – 9400            7500 – 9400
#FFs             0 – 4100               0 – 4500               0 – 4400
#(LUT + FFs)     6500 – 8400            8300 – 9300            7600 – 9600
#DSP48E          17                     0                      0

● approx. 70% of the LUTs are used by the fixed-point multiplier (for Type2 and Type3)
● medium Virtex-5 XC5VLX110T: 8000 – 9000 LUTs ~ 11.5% – 13%


Comparison to binary floating-point multiplier

● 64-bit binary floating-point multiplier generated with CoreGen
● no DSP48E
● Type2 decimal vs. CoreGen binary multiplier

[Figure: number of LUTs vs. number of pipeline registers (0 to 9), decimal vs. binary; y-axis 0 to 10000]

[Figure: max. frequency (MHz) vs. number of pipeline registers (0 to 9), decimal vs. binary; y-axis 0 to 400]

● decimal multiplier: 3.2 – 3.5 times more LUTs
● binary multiplier: 1.6 – 2.2 times faster




Summary

● decimal fixed-point multiplier
– parallel, fully combinational
– configurable number of pipeline stages
● decimal floating-point multiplier
– configurable number of pipeline stages
– three different implementations
– tradeoff: area vs. speed
● future work: fully IEEE 754-2008 compliant coprocessor

Thank you for your attention!!!
