  • An Efficient Softcore Multiplier Architecture for Xilinx FPGAs

    22nd IEEE Symposium on Computer Arithmetic

    Martin Kumm, Shahid Abbas and Peter Zipf
    University of Kassel, Germany

  • CONTENTS

    1. State-of-the-art

    2. Proposed multiplier

    3. Results

  • WHY FPGA SOFTCORE MULTIPLIERS?

    The need for efficient multipliers forced FPGA vendors to embed hard multiplier blocks

    FPGA softcore multipliers are still required for:

    Small word sizes (poor mapping to embedded multipliers)

    Large word sizes ("fill gaps")

    Replacing embedded multipliers on small/low-cost FPGAs

  • WHY ARE THEY DIFFERENT?

    Research on efficient multipliers has been ongoing for more than 50 years

    Multipliers that are efficient in terms of gates may not be efficient on FPGAs

    FPGA-optimized structures are relatively rare

  • WHY ARE THEY DIFFERENT?

    [Figure: Xilinx slice, 6/7 series]

  • PREVIOUS WORK

    [Figure: slice with four LUTs feeding the carry logic; one full adder is highlighted]

    A Baugh-Wooley-like multiplier was proposed in [Parandeh-Afshar 2011]

    Two partial products are generated and added using the carry chain

    A compression tree for the already reduced partial products is still necessary
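    The two-stage structure described on this slide can be sketched in software. This is only an illustration under simplifying assumptions: it handles unsigned operands (the Baugh-Wooley sign handling is left out), and the function name `paired_row_multiply` is hypothetical, not taken from the cited paper.

```python
def paired_row_multiply(a, b, m_bits):
    """Unsigned a * b where partial-product rows are first added in pairs."""
    # One partial product per bit of b: (a AND b_i), shifted to weight 2^i.
    rows = [(a if (b >> i) & 1 else 0) << i for i in range(m_bits)]
    # Stage 1: every adder row (LUTs plus carry chain on the slide)
    # sums two neighbouring partial products.
    reduced = [rows[i] + (rows[i + 1] if i + 1 < m_bits else 0)
               for i in range(0, m_bits, 2)]
    # Stage 2: the already reduced rows still have to be summed;
    # this stands in for the compression tree.
    return sum(reduced), len(reduced)


product, n_rows = paired_row_multiply(13, 11, 4)
print(product, n_rows)  # 143 2
```

    The pairing halves the number of rows, but a second summation stage remains, which is the cost the slide points out.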

  • PREVIOUS WORK

    Another idea was discussed in [Brunie 2013]:

    Decompose the multiplication into small multipliers that each fit into a single LUT, e.g., 3x3, 2x3, 1x4

    Use a compression tree to add the partial results

    $$\begin{aligned}
    p = {} & M_1 + 2^3 M_2 + 2^6 M_3 + \ldots \\
           & \ldots + 2^3 M_4 + 2^6 M_5 + 2^9 M_6 + \ldots \\
           & \ldots + 2^6 M_7 + 2^9 M_8 + 2^{12} M_9
    \end{aligned}$$
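    A minimal sketch of this decomposition, assuming unsigned operands and square 3x3 tiles only. The name `tiled_multiply` is hypothetical, and the plain running sum stands in for the compression tree.

```python
def tiled_multiply(a, b, n_bits, tile=3):
    """Unsigned a * b built from tile x tile sub-multipliers."""
    mask = (1 << tile) - 1
    total = 0
    for i in range(0, n_bits, tile):          # chunk of a at weight 2^i
        for j in range(0, n_bits, tile):      # chunk of b at weight 2^j
            # Small product M_k of two 3-bit chunks.
            small = ((a >> i) & mask) * ((b >> j) & mask)
            # Weights 2^0, 2^3, 2^6, ..., 2^12 for a 9 x 9 bit product.
            total += small << (i + j)
    return total


print(tiled_multiply(500, 300, 9))  # 150000
```

    For 9-bit operands this produces nine small products with the same weights as the terms M1 to M9 of the sum on the slide.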

  • BOOTH RECODING

    $$a \cdot b = \sum_{\substack{m=0 \\ m\ \text{even}}}^{M} a \cdot BE_m 2^m$$

    b_{m+1}  b_m  b_{m-1} | BE_m | z_m  c_m  s_m
       0      0      0    |   0  |  1    0    0
       0      0      1    |   1  |  0    0    0
       0      1      0    |   1  |  0    0    0
       0      1      1    |   2  |  0    0    1
       1      0      0    |  -2  |  0    1    1
       1      0      1    |  -1  |  0    1    0
       1      1      0    |  -1  |  0    1    0
       1      1      1    |   0  |  1    0    0
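    The recoding table can be reproduced with a few lines of Python. This is a sketch under stated assumptions: b is treated as unsigned with b_{-1} = 0 and zero bits above the MSB, and the flag meanings (z = digit is zero, c = digit is negative, s = digit has magnitude two) are read off the table rows rather than stated on the slide. The name `booth_recode` is hypothetical.

```python
def booth_recode(b, m_bits):
    """Radix-4 Booth digits of an unsigned m_bits-wide b, one per even m."""
    def bit(i):
        # b_{-1} = 0; bits above the MSB are 0 because b is unsigned.
        return (b >> i) & 1 if i >= 0 else 0

    digits = []
    for m in range(0, m_bits + 1, 2):
        be = -2 * bit(m + 1) + bit(m) + bit(m - 1)
        z = int(be == 0)        # zero flag (assumed meaning)
        c = int(be < 0)         # complement flag (assumed meaning)
        s = int(abs(be) == 2)   # shift flag (assumed meaning)
        digits.append((m, be, z, c, s))
    return digits


digits = booth_recode(0b0110, 4)
print([be for _, be, _, _, _ in digits])           # [-2, 2, 0]
print(sum(be << m for m, be, _, _, _ in digits))   # 6
```

    Summing the digits with their weights 2^m gives back b, which is the identity behind the product formula on this slide.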

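  • BOOTH MULTIPLIER

    [Figure: partial product array of the Booth multiplier from LSB to MSB; each row is sign-extended with repeated copies of its c bit (c0, c2, c4, c6)]

  • BOOTH MULTIPLIER

    [Figure: the same partial product array with the repeated sign-extension bits replaced by constant 1s and fewer c bits per row]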
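    As a behavioural illustration of the partial product array, the sketch below accumulates one row per Booth digit. It assumes unsigned operands and leaves the sign extension of negative rows to Python's integers, so it shows the arithmetic only and not the bit-level array of the figures. The name `booth_multiply` is hypothetical.

```python
def booth_multiply(a, b, m_bits):
    """Unsigned a * b as a sum of Booth-recoded rows a * BE_m * 2^m."""
    def bit(i):
        return (b >> i) & 1 if i >= 0 else 0

    rows = []
    for m in range(0, m_bits + 1, 2):
        be = -2 * bit(m + 1) + bit(m) + bit(m - 1)   # digit in {-2, -1, 0, 1, 2}
        # A negative row is what requires sign extension in hardware.
        rows.append((a * be) << m)
    return sum(rows), len(rows)


print(booth_multiply(200, 100, 8))  # (20000, 5)
```

    An 8-bit multiplier operand yields five rows here, i.e. roughly half as many rows as bits plus one.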

  • PROPOSED ARCHITECTURE

    [Figure: slice mapping of the proposed multiplier; LUTs with multiplexers feed the carry logic, and one full adder is highlighted]

  • RESULTS

    The number of slices can be precisely predicted:

    $$\#\text{slices}(M,N) = \underbrace{\lceil N/4 + 1 \rceil}_{\text{slices per row}} \cdot \underbrace{\lfloor M/2 + 1 \rfloor}_{\text{no. of rows}}$$

    The design was implemented as generic VHDL

    A pipelined multiplier can be obtained by using the (otherwise unused) slice FFs without much additional cost

    Reference circuits (Parandeh-Afshar and LUT-based) were designed with the FloPoCo library [de Dinechin 2012]

    Xilinx Coregen was used as a commercial reference
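    The slice formula is easy to evaluate for a few word sizes. The helper below is a direct transcription using integer arithmetic; the name `predicted_slices` is hypothetical, and the printed values are only what the formula yields, not measured results.

```python
def predicted_slices(m_bits, n_bits):
    """Evaluate #slices(M, N) = ceil(N/4 + 1) * floor(M/2 + 1)."""
    slices_per_row = -(-n_bits // 4) + 1   # ceil(N/4) + 1 equals ceil(N/4 + 1)
    rows = m_bits // 2 + 1                 # floor(M/2) + 1 equals floor(M/2 + 1)
    return slices_per_row * rows


for n in (8, 16, 32, 64):
    print(n, predicted_slices(n, n))
# 8 15
# 16 45
# 32 153
# 64 561
```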

  • RESULTS VIRTEX 6 COMBINATORIAL, SLICES

    [Figure: number of slices (0 to 2,000) versus input word size N (8 to 64). Legend: 1x4 LUT multiplier, 3x2 LUT multiplier, 3x3 LUT multiplier, Parandeh-Afshar multiplier, Coregen (area), Coregen (speed), proposed]

  • RESULTS VIRTEX 6 COMBINATORIAL, SLICE RED.

    [Figure: slice reduction in % (0 to 80) versus input word size N (8 to 64). Legend: 1x4 LUT multiplier, 3x2 LUT multiplier, 3x3 LUT multiplier, Parandeh-Afshar multiplier, Coregen (area), Coregen (speed)]

  • RESULTS VIRTEX 6 COMBINATORIAL, FREQ.

    [Figure: frequency in MHz (0 to 700) versus input word size N (8 to 64). Legend: 1x4 LUT multiplier, 3x2 LUT multiplier, 3x3 LUT multiplier, Parandeh-Afshar multiplier, Coregen (area), Coregen (speed), proposed]

  • RESULTS VIRTEX 6 PIPELINED, SLICES

    [Figure: number of slices (0 to 2,000) versus input word size N (8 to 64). Legend: 1x4 LUT multiplier, 3x2 LUT multiplier, 3x3 LUT multiplier, Parandeh-Afshar multiplier, Coregen (area), Coregen (speed), proposed]

  • RESULTS VIRTEX 6 PIPELINED, SLICE RED.

    [Figure: slice reduction in % (-10 to 80) versus input word size N (8 to 64). Legend: 1x4 LUT multiplier, 3x2 LUT multiplier, 3x3 LUT multiplier, Parandeh-Afshar multiplier, Coregen (area), Coregen (speed)]

  • RESULTS VIRTEX 6 PIPELINED, FREQ.

    [Figure: frequency in MHz (0 to 700) versus input word size N (8 to 64). Legend: 1x4 LUT multiplier, 3x2 LUT multiplier, 3x3 LUT multiplier, Parandeh-Afshar multiplier, Coregen (area), Coregen (speed), proposed]

  • UNFORTUNATELY NOT POSSIBLE ON ALTERA FPGAS

    [Figure: Altera ALM]

  • MAYBE POSSIBLE NEXT?

  • CONCLUSION

    Compared to the best known design, the proposed multiplier saves up to

    50% of the slices for the combinatorial multiplier

    30% of the slices for the pipelined multiplier

    Portable to FPGAs providing a 5-input LUT at one full adder input

    "Free addition" supports multiply-accumulate (MAC) operation


  • LITERATURE

    [Parandeh-Afshar 2011]: Parandeh-Afshar & Ienne, "Measuring and Reducing the Performance Gap between Embedded and Soft Multipliers on FPGAs", FPL 2011

    [Brunie 2013]: Brunie, de Dinechin, Istoan, Sergent, Illyes & Popa, "Arithmetic Core Generation Using Bit Heaps", FPL 2013

    [de Dinechin 2012]: de Dinechin & Pasca, "Designing Custom Arithmetic Data Paths with FloPoCo", IEEE Design & Test of Computers, 2012

    THANK YOU!

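  • BOOTH RECODING

    $$\begin{aligned}
    b &= b_{M-1} 2^{M-1} + \ldots + b_2 2^2 + b_1 2^1 + b_0 \\
      &= b_{M-1} 2^{M-1} + \ldots + b_2 2^2 + 2 b_1 2^1 + \underbrace{-b_1 2^1 + b_0}_{BE_0 = -2 b_1 + b_0} \\
      &= b_{M-1} 2^{M-1} + \ldots + 2 b_3 2^3 + \underbrace{-b_3 2^3 + b_2 2^2 + 2 b_1 2^1}_{BE_2 2^2 = (-2 b_3 + b_2 + b_1) 2^2} + BE_0 \\
      &= \sum_{\substack{m=0 \\ m\ \text{even}}}^{M} BE_m 2^m
         \quad \text{with} \quad BE_m = -2 b_{m+1} + b_m + b_{m-1}
    \end{aligned}$$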

  • WHY ARE THEY DIFFERENT?

    [Figure: Altera ALM]

  • WHY ARE THEY DIFFERENT?

    [Figure: Xilinx slice detail showing the storage elements (FF/LAT, with INIT1, INIT0, SRHI and SRLO options) and their D, CE, CK, SR and Q ports]

