An Efficient Softcore Multiplier Architecture for Xilinx FPGAs
22nd IEEE Symposium on Computer Arithmetic
Martin Kumm, Shahid Abbas and Peter Zipf
University of Kassel, Germany
CONTENTS
1. State-of-the-art
2. Proposed multiplier
3. Results
WHY FPGA SOFTCORE MULTIPLIERS?
The need for efficient multipliers forced FPGA vendors to embed hard multiplier blocks
FPGA softcore multipliers are still required:
Small word sizes (poor mapping to embedded multipliers)
Large word sizes ("fill gaps")
Replace embedded multipliers on small/low-cost FPGAs
WHY THEY ARE DIFFERENT?
Research on efficient multipliers has been ongoing for more than 50 years
Multipliers that are efficient in terms of gates may not be efficient on FPGAs
FPGA-optimized structures are relatively rare
[Figure: Xilinx slice, 6/7 series]
PREVIOUS WORK
[Figure: four LUTs and the carry logic of a slice, with a full adder marked]
A Baugh-Wooley-like multiplier was proposed in [Parandeh-Afshar 2011]
Two partial products (PPs) are generated and added using the carry chain
A compression tree for the already reduced PPs is still necessary
PREVIOUS WORK
Another idea was discussed in [Brunie 2013]:
Decompose the multiplication into small multipliers that fit into single LUTs, e.g., 3x3, 2x3, 1x4
Use a compression tree to add partial results
\[
\begin{aligned}
p ={} & M_1 + 2^3 M_2 + 2^6 M_3 + \dots \\
      & \dots + 2^3 M_4 + 2^6 M_5 + 2^9 M_6 + \dots \\
      & \dots + 2^6 M_7 + 2^9 M_8 + 2^{12} M_9
\end{aligned}
\]
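The decomposition above can be sketched in a few lines of Python. This is only an illustration, assuming unsigned operands and square 3x3 tiles; the function name `tiled_multiply` and its parameters are hypothetical and do not come from [Brunie 2013]. The plain `+=` stands in for the compression tree.

```python
def tiled_multiply(a: int, b: int, width: int = 9, tile: int = 3) -> int:
    """Multiply two unsigned `width`-bit integers from tile x tile sub-products.

    Assumption: a and b are non-negative and fit into `width` bits;
    higher bits would simply be ignored.
    """
    mask = (1 << tile) - 1
    p = 0
    for i in range(0, width, tile):        # chunk of a with weight 2**i
        a_chunk = (a >> i) & mask
        for j in range(0, width, tile):    # chunk of b with weight 2**j
            b_chunk = (b >> j) & mask
            m = a_chunk * b_chunk          # one small multiplier (M1, M2, ...)
            p += m << (i + j)              # weights 2**0, 2**3, 2**6, ..., 2**12
    return p
```

With `width=9` and `tile=3` the loop produces nine sub-products whose weights match the nine terms of the equation.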
BOOTH RECODING
\[
a \cdot b = \sum_{\substack{m=0 \\ m\ \text{even}}}^{M} a \cdot BE_m \, 2^m
\]
b_{m+1}  b_m  b_{m-1} | BE_m | z_m  c_m  s_m
   0      0      0    |   0  |  1    0    0
   0      0      1    |   1  |  0    0    0
   0      1      0    |   1  |  0    0    0
   0      1      1    |   2  |  0    0    1
   1      0      0    |  -2  |  0    1    1
   1      0      1    |  -1  |  0    1    0
   1      1      0    |  -1  |  0    1    0
   1      1      1    |   0  |  1    0    0
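The recoding table can be reproduced with a short Python sketch. It assumes an unsigned multiplier b with M bits, b_{-1} = 0, and zero bits above b_{M-1}. The reading of the flags (z_m = digit is zero, c_m = digit is negative, s_m = digit has magnitude 2) is inferred from the table rows, and all function names are hypothetical.

```python
def booth_digits(b: int, m_bits: int) -> list[int]:
    """Radix-4 Booth digits BE_m for m = 0, 2, ..., m_bits.

    Assumption: 0 <= b < 2**m_bits, so b_{-1} and all bits above b_{M-1} are 0.
    """
    def bit(k: int) -> int:
        return (b >> k) & 1 if k >= 0 else 0

    return [-2 * bit(m + 1) + bit(m) + bit(m - 1)
            for m in range(0, m_bits + 1, 2)]


def booth_flags(be: int) -> tuple[int, int, int]:
    """(z, c, s) as read off the table: zero, negative, magnitude two."""
    return (int(be == 0), int(be < 0), int(abs(be) == 2))


def booth_multiply(a: int, b: int, m_bits: int) -> int:
    """a * b as the sum over even m of a * BE_m * 2**m."""
    digits = booth_digits(b, m_bits)
    return sum(a * be * (1 << (2 * i)) for i, be in enumerate(digits))
```

For example, `booth_digits(13, 4)` yields the digits for m = 0, 2, 4, and `booth_multiply(a, b, 8)` agrees with `a * b` for 8-bit unsigned `b`.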
BOOTH MULTIPLIER
[Figure: Booth multiplier partial-product array with replicated sign bits c0, c2, c4, c6; labels: LSB, MSB, b]
BOOTH MULTIPLIER
[Figure: Booth multiplier partial-product array with compacted sign extension (constant 1s and the bits c0, c2, c4, c6); labels: LSB, MSB, b]
PROPOSED ARCHITECTURE
[Figure: proposed mapping onto LUTs, multiplexers and the carry logic, with a full adder marked]
RESULTS
The number of slices can be precisely predicted:
\[
\#\text{slices}(M,N) = \underbrace{\lceil N/4 + 1 \rceil}_{\text{slices per row}} \cdot \underbrace{\lfloor M/2 + 1 \rfloor}_{\text{no. of rows}}
\]
Design was implemented as generic VHDL
A pipelined multiplier can be obtained by using the (otherwise unused) slice FFs without much additional cost
Reference circuits (Parandeh-Afshar & LUT-based) were designed with the FloPoCo library [de Dinechin 2012]
Xilinx Coregen was used as a commercial reference
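The slice-count formula on this slide can be evaluated directly. The following is a minimal sketch using integer arithmetic only; the function name `predicted_slices` is hypothetical, and the values it returns are evaluations of the formula, not measured results.

```python
def predicted_slices(m: int, n: int) -> int:
    """Evaluate #slices(M, N) = ceil(N/4 + 1) * floor(M/2 + 1).

    Assumption: m and n are positive word sizes in bits.
    """
    slices_per_row = -(-n // 4) + 1   # ceil(N/4) + 1, same as ceil(N/4 + 1)
    rows = m // 2 + 1                 # floor(M/2) + 1, same as floor(M/2 + 1)
    return slices_per_row * rows
```

For a 16x16 multiplier this gives 5 slices per row and 9 rows.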
RESULTS VIRTEX 6 COMBINATORIAL, SLICES
[Figure: #Slices (0 to 2,000) vs. input word size N (8 to 64); curves: 1x4 LUT Multiplier, 3x2 LUT Multiplier, 3x3 LUT Multiplier, Parandeh-Afshar Multiplier, Coregen (area), Coregen (speed), proposed]
RESULTS VIRTEX 6 COMBINATORIAL, SLICE RED.
[Figure: slice reduction in % (0 to 80) vs. input word size N (8 to 64); curves: 1x4 LUT Multiplier, 3x2 LUT Multiplier, 3x3 LUT Multiplier, Parandeh-Afshar Multiplier, Coregen (area), Coregen (speed)]
RESULTS VIRTEX 6 COMBINATORIAL, FREQ.
[Figure: frequency in MHz (0 to 700) vs. input word size N (8 to 64); curves: 1x4 LUT Multiplier, 3x2 LUT Multiplier, 3x3 LUT Multiplier, Parandeh-Afshar Multiplier, Coregen (area), Coregen (speed), proposed]
RESULTS VIRTEX 6 PIPELINED, SLICES
[Figure: #Slices (0 to 2,000) vs. input word size N (8 to 64); curves: 1x4 LUT Multiplier, 3x2 LUT Multiplier, 3x3 LUT Multiplier, Parandeh-Afshar Multiplier, Coregen (area), Coregen (speed), proposed]
RESULTS VIRTEX 6 PIPELINED, SLICE RED.
[Figure: slice reduction in % (-10 to 80) vs. input word size N (8 to 64); curves: 1x4 LUT Multiplier, 3x2 LUT Multiplier, 3x3 LUT Multiplier, Parandeh-Afshar Multiplier, Coregen (area), Coregen (speed)]
RESULTS VIRTEX 6 PIPELINED, FREQ.
[Figure: frequency in MHz (0 to 700) vs. input word size N (8 to 64); curves: 1x4 LUT Multiplier, 3x2 LUT Multiplier, 3x3 LUT Multiplier, Parandeh-Afshar Multiplier, Coregen (area), Coregen (speed), proposed]
UNFORTUNATELY NOT POSSIBLE ON ALTERA FPGAS
[Figure: Altera ALM]
MAYBE POSSIBLE NEXT?
CONCLUSION
Compared to the best known design:
Up to 50% of the slices can be saved for the combinatorial multiplier
Up to 30% of the slices can be saved for the pipelined multiplier
Portable to FPGAs providing a 5-input LUT at one full adder input
"Free addition" supports multiply-accumulate (MAC) operation
LITERATURE
[Parandeh-Afshar 2011]: Parandeh-Afshar & Ienne, "Measuring and Reducing the Performance Gap between Embedded and Soft Multipliers on FPGAs", FPL 2011
[Brunie 2013]: Brunie, de Dinechin, Istoan, Sergent, Illyes & Popa, "Arithmetic Core Generation Using Bit Heaps", FPL 2013
[de Dinechin 2012]: de Dinechin & Pasca, "Designing Custom Arithmetic Data Paths with FloPoCo", IEEE Design & Test of Computers, 2012
THANK YOU!
BOOTH RECODING
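\[
\begin{aligned}
b &= b_{M-1} 2^{M-1} + \dots + b_2 2^2 + b_1 2^1 + b_0 \\
  &= b_{M-1} 2^{M-1} + \dots + b_2 2^2 + 2 b_1 2^1 \underbrace{{}- b_1 2^1 + b_0}_{BE_0 = -2 b_1 + b_0} \\
  &= b_{M-1} 2^{M-1} + \dots + 2 b_3 2^3 \underbrace{{}- b_3 2^3 + b_2 2^2 + 2 b_1 2^1}_{BE_2 \cdot 2^2 = (-2 b_3 + b_2 + b_1)\, 2^2} + BE_0 \\
  &= \sum_{\substack{m=0 \\ m\ \text{even}}}^{M} BE_m 2^m
     \quad \text{with } BE_m = -2 b_{m+1} + b_m + b_{m-1}
\end{aligned}
\]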
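As a small worked check of this recoding, take the 4-bit example b = 13 = 1101_2 (chosen here only for illustration). The example assumes an unsigned b, so b_{-1} = 0 and b_4 = b_5 = 0:

\[
\begin{aligned}
BE_0 &= -2 b_1 + b_0 + b_{-1} = -2 \cdot 0 + 1 + 0 = 1 \\
BE_2 &= -2 b_3 + b_2 + b_1 = -2 \cdot 1 + 1 + 0 = -1 \\
BE_4 &= -2 b_5 + b_4 + b_3 = -2 \cdot 0 + 0 + 1 = 1 \\
\sum_{\substack{m=0 \\ m\ \text{even}}}^{4} BE_m 2^m &= 1 \cdot 2^0 - 1 \cdot 2^2 + 1 \cdot 2^4 = 1 - 4 + 16 = 13
\end{aligned}
\]

Under these assumptions the digit at m = M = 4 is needed to recover the value, which is consistent with the sum running up to M.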
WHY THEY ARE DIFFERENT?
[Figure: Altera ALM]
[Figure: slice storage element (FF/LAT) with ports D, CE, CK, SR, Q and options INIT0, INIT1, SRLO, SRHI; LUT inputs D6:1]