An Efficient Softcore Multiplier Architecture for Xilinx FPGAs

22nd IEEE Symposium on Computer Arithmetic

Martin Kumm, Shahid Abbas and Peter Zipf

University of Kassel, Germany

CONTENTS

1. State-of-the-art

2. Proposed multiplier

3. Results

WHY FPGA SOFTCORE MULTIPLIERS?

The need for efficient multipliers forced FPGA vendors to embed hard multiplier blocks

FPGA softcore multipliers are still required:

Small word sizes (worse mapping for embedded mults)

Large word sizes ("fill gaps")

Replace embedded mults on small/low-cost FPGAs

Research on efficient multipliers has been ongoing for more than 50 years

Multipliers that are efficient in terms of gates may not be efficient on FPGAs

FPGA-optimized structures are relatively rare

WHY ARE THEY DIFFERENT?


[Figure: Xilinx slice, 6/7 series]

PREVIOUS WORK

[Figure: four LUTs feeding the slice carry logic; one LUT together with its carry-logic stage acts as a full adder]

A Baugh-Wooley-like multiplier was proposed in [Parandeh-Afshar 2011]

Two partial products are generated and added using the carry chain

A compression tree is still necessary to sum the already reduced partial products (PPs)
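The reduction scheme described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: operands are unsigned (the Baugh-Wooley-like sign handling of the original design is omitted), and the function name `pairwise_pp_multiply` and its parameters are hypothetical.

```python
def pairwise_pp_multiply(a: int, b: int, m_bits: int) -> int:
    """Unsigned sketch: add partial products two at a time, then sum the reduced rows."""
    reduced_rows = []
    for i in range(0, m_bits, 2):
        pp_lo = a * ((b >> i) & 1)        # partial product for bit i of b
        pp_hi = a * ((b >> (i + 1)) & 1)  # partial product for bit i+1 of b
        # One carry-chain addition per pair; the row carries weight 2**i.
        reduced_rows.append((pp_lo + (pp_hi << 1)) << i)
    # Stand-in for the compression tree that adds the reduced rows.
    return sum(reduced_rows)


print(pairwise_pp_multiply(13, 11, 4))
```

The pairwise addition halves the number of rows, but the remaining rows still have to be summed, which is the compression tree mentioned above.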

PREVIOUS WORK

Another idea was discussed in [Brunie 2013]:

Decompose multiplication into small multipliers that fit into single LUTs, e.g., 3x3, 2x3, 1x4

Use a compression tree to add partial results

$$
\begin{aligned}
p = {} & M_1 + 2^3 M_2 + 2^6 M_3 + \dots \\
& \dots + 2^3 M_4 + 2^6 M_5 + 2^9 M_6 + \dots \\
& \dots + 2^6 M_7 + 2^9 M_8 + 2^{12} M_9
\end{aligned}
$$
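As a rough illustration of this decomposition, the following Python sketch splits a 9x9-bit unsigned multiplication into nine 3x3-bit products and adds them with the weights from the sum above. The function name `tiled_multiply` and the fixed tile size are assumptions made for the example; the plain integer addition stands in for the compression tree.

```python
def tiled_multiply(a: int, b: int, width: int = 9, tile: int = 3) -> int:
    """Multiply two unsigned `width`-bit numbers from tile x tile sub-products."""
    mask = (1 << tile) - 1
    total = 0
    for i in range(0, width, tile):      # bit offset of the chunk taken from a
        for j in range(0, width, tile):  # bit offset of the chunk taken from b
            # One small multiplier (M1 ... M9 for width=9, tile=3).
            small = ((a >> i) & mask) * ((b >> j) & mask)
            # Weight 2**(i+j): 1, 2**3, 2**6, 2**3, 2**6, 2**9, 2**6, 2**9, 2**12.
            total += small << (i + j)
    return total


print(tiled_multiply(300, 77))
```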

BOOTH RECODING

$$
a \cdot b = \sum_{\substack{m=0 \\ m\ \text{even}}}^{M} a \cdot BE_m 2^m
$$

b_{m+1}  b_m  b_{m-1}   BE_m   z_m  c_m  s_m
   0      0      0        0     1    0    0
   0      0      1        1     0    0    0
   0      1      0        1     0    0    0
   0      1      1        2     0    0    1
   1      0      0       -2     0    1    1
   1      0      1       -1     0    1    0
   1      1      0       -1     0    1    0
   1      1      1        0     1    0    0
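The recoding table can be reproduced with a short Python sketch. It assumes an unsigned multiplier b with b_{-1} = 0 and zero bits above the word size, and it reads z_m, c_m and s_m as "zero", "complement" and "shift" flags, which is an interpretation inferred from the table values rather than stated in the slides. The function names are hypothetical.

```python
def booth_digits(b: int, m_bits: int):
    """Return (m, BE_m, z_m, c_m, s_m) for every even m from 0 to m_bits."""
    rows = []
    for m in range(0, m_bits + 1, 2):
        b_prev = (b >> (m - 1)) & 1 if m > 0 else 0  # b_{m-1}, with b_{-1} = 0
        b_m = (b >> m) & 1
        b_next = (b >> (m + 1)) & 1                  # b_{m+1}
        be = -2 * b_next + b_m + b_prev              # Booth digit in {-2, ..., 2}
        z = int(be == 0)       # digit is zero
        c = int(be < 0)        # digit is negative
        s = int(abs(be) == 2)  # digit has magnitude two
        rows.append((m, be, z, c, s))
    return rows


def booth_multiply(a: int, b: int, m_bits: int) -> int:
    """Sum a * BE_m * 2**m over all even m."""
    return sum(a * be * (1 << m) for m, be, _z, _c, _s in booth_digits(b, m_bits))


print(booth_digits(6, 4))
print(booth_multiply(7, 13, 4))
```

For b = 6 the digits are BE_0 = -2, BE_2 = 2 and BE_4 = 0, and -2 + 2 * 4 = 6 recovers the original value.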

BOOTH MULTIPLIER

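[Figure: Booth multiplier partial-product array, drawn from the LSB to the MSB of b. In the first version each row is sign-extended by repeating its c_m bit (c0, c2, c4, c6); in the second version the repeated sign bits are replaced by a few c_m bits and constant 1s]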

PROPOSED ARCHITECTURE

[Figure: proposed slice mapping with LUTs and multiplexers driving the carry logic; one LUT together with its carry-logic stage is highlighted as a full adder]

RESULTS

The number of slices can be precisely predicted:

$$
\#\text{slices}(M,N) = \underbrace{\lceil N/4 + 1 \rceil}_{\text{slices per row}} \cdot \underbrace{\lfloor M/2 + 1 \rfloor}_{\text{no. of rows}}
$$

The design was implemented as generic VHDL

A pipelined multiplier can be obtained by using the (otherwise unused) slice FFs without much additional cost

Reference circuits (Parandeh-Afshar & LUT-based) were designed with the FloPoCo library [de Dinechin 2012]

Xilinx Coregen was used as a commercial reference
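The slice-count formula is easy to evaluate for a given word size. The helper below is a minimal sketch; the name `predicted_slices` is hypothetical, and integer arithmetic is used for the ceiling and floor to avoid floating-point rounding.

```python
def predicted_slices(m_bits: int, n_bits: int) -> int:
    """Evaluate #slices(M, N) = ceil(N/4 + 1) * floor(M/2 + 1)."""
    slices_per_row = -(-n_bits // 4) + 1  # ceil(N/4) + 1
    rows = m_bits // 2 + 1                # floor(M/2) + 1
    return slices_per_row * rows


for size in (8, 16, 32, 64):
    print(size, predicted_slices(size, size))
```

For square multipliers this gives 15, 45, 153 and 561 slices for word sizes of 8, 16, 32 and 64 bits.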

RESULTS VIRTEX 6 COMBINATORIAL, SLICES

[Plot: #slices (0 to 2,000) vs. input word size N (8 to 64). Curves: 1x4 LUT multiplier, 3x2 LUT multiplier, 3x3 LUT multiplier, Parandeh-Afshar multiplier, Coregen (area), Coregen (speed), proposed]

RESULTS VIRTEX 6 COMBINATORIAL, SLICE RED.

[Plot: slice reduction in % (0 to 80) vs. input word size N (8 to 64). Curves: 1x4 LUT multiplier, 3x2 LUT multiplier, 3x3 LUT multiplier, Coregen (area), Coregen (speed)]

RESULTS VIRTEX 6 COMBINATORIAL, FREQ.

[Plot: frequency in MHz (0 to 700) vs. input word size N (8 to 64). Curves: 1x4 LUT multiplier, 3x2 LUT multiplier, 3x3 LUT multiplier, Parandeh-Afshar multiplier, Coregen (area), Coregen (speed), proposed]

RESULTS VIRTEX 6 PIPELINED, SLICES

[Plot: #slices (0 to 2,000) vs. input word size N (8 to 64). Curves: 1x4 LUT multiplier, 3x2 LUT multiplier, 3x3 LUT multiplier, Coregen (area), Coregen (speed), proposed]

RESULTS VIRTEX 6 PIPELINED, SLICE RED.

[Plot: slice reduction in % (-10 to 80) vs. input word size N (8 to 64). Curves: 1x4 LUT multiplier, 3x2 LUT multiplier, 3x3 LUT multiplier, Coregen (area), Coregen (speed)]

RESULTS VIRTEX 6 PIPELINED, FREQ.

[Plot: frequency in MHz (0 to 700) vs. input word size N (8 to 64). Curves: 1x4 LUT multiplier, 3x2 LUT multiplier, 3x3 LUT multiplier, Parandeh-Afshar multiplier, Coregen (area), Coregen (speed), proposed]

UNFORTUNATELY NOT POSSIBLE ON ALTERA FPGAS

[Figure: Altera ALM]

MAYBE POSSIBLE NEXT?

CONCLUSION

Compared to the best known design:

Up to 50% of the slices can be saved for the combinatorial multiplier

Up to 30% of the slices can be saved for the pipelined multiplier

Portable to FPGAs providing a 5-input LUT at one full adder input

"Free addition" supports multiply-accumulate (MAC) operation

LITERATURE

[Parandeh-Afshar 2011]: Parandeh-Afshar & Ienne, "Measuring and Reducing the Performance Gap between Embedded and Soft Multipliers on FPGAs," FPL 2011

[Brunie 2013]: Brunie, de Dinechin, Istoan, Sergent, Illyes & Popa, "Arithmetic Core Generation Using Bit Heaps," FPL 2013

[de Dinechin 2012]: de Dinechin & Pasca, "Designing Custom Arithmetic Data Paths with FloPoCo," IEEE Design & Test of Computers, 2012

THANK YOU!

BOOTH RECODING

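$$
\begin{aligned}
b &= b_{M-1}2^{M-1} + \dots + b_2 2^2 + b_1 2^1 + b_0 \\
&= b_{M-1}2^{M-1} + \dots + b_2 2^2 + 2b_1 2^1 + \underbrace{-b_1 2^1 + b_0}_{BE_0 = -2b_1 + b_0} \\
&= b_{M-1}2^{M-1} + \dots + 2b_3 2^3 + \underbrace{-b_3 2^3 + b_2 2^2 + 2b_1 2^1}_{BE_2 2^2 = (-2b_3 + b_2 + b_1)2^2} + BE_0 \\
&= \sum_{\substack{m=0 \\ m\ \text{even}}}^{M} BE_m 2^m \quad \text{with } BE_m = -2b_{m+1} + b_m + b_{m-1}
\end{aligned}
$$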
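A small worked example may help to check the derivation. It assumes an unsigned 4-bit value b = 13 = (1101)_2, with b_{-1} = 0 and all bits above b_3 equal to zero; these boundary values are assumptions of the example, not taken from the slides.

```latex
\begin{aligned}
BE_0 &= -2b_1 + b_0 + b_{-1} = -2 \cdot 0 + 1 + 0 = 1 \\
BE_2 &= -2b_3 + b_2 + b_1 = -2 \cdot 1 + 1 + 0 = -1 \\
BE_4 &= -2b_5 + b_4 + b_3 = -2 \cdot 0 + 0 + 1 = 1 \\
\sum_{m \in \{0,2,4\}} BE_m 2^m &= 1 \cdot 2^0 - 1 \cdot 2^2 + 1 \cdot 2^4 = 1 - 4 + 16 = 13
\end{aligned}
```

The three Booth digits reproduce b, so multiplying a by each digit and adding the shifted results yields a times b with about half as many rows as there are bits in b.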

[Figure: Altera ALM]

[Figure: slice flip-flop/latch (FF/LAT) with D, CE, CK and SR inputs, Q output, and INIT0/INIT1/SRLO/SRHI settings]
