IEEE Globecom-2006, NXG-02: Broadband Access. © Copyright 2005-2006. All Rights Reserved.
FPGA based Acceleration of Linear Algebra Computations.
B.Y. Vinay Kumar, Siddharth Joshi, Sumedh Attarde
Prof. Sachin Patkar, Prof. H. Narayanan
Outline
Double Precision Dense Matrix-Matrix Multiplication.
  Motivation
  Related Work
  Algorithm
  Design
  Results
  Conclusions
Double Precision Sparse Matrix-Vector Multiplication.
  Introduction
  Prasanna
  DeLorimier
  David Gregg et al.
  What can we do?
FPGA based Double Precision Dense Matrix-Matrix Multiplication.
Motivation
FPGAs have been making inroads into HiPC.
BLAS-3 is accelerated by accelerating matrix multiplication.
Modern FPGAs provide an abundance of resources; we must capitalise upon these.
Related Work{1/2}
The two main works are by Dou and Prasanna. Both are based on linear arrays, both use memory switching, and both sustain their peak performance.
Dou:
Optimised for a large Virtex-II Pro device (Xilinx).
Created his own MAC (not fully compliant).
Sub-block dimensions must be powers of 2.
Optimised for low I/O bandwidth.
Related Work{2/2}
Prasanna:
Scaling results in a speed degradation of about 35% (2 PEs to 20 PEs).
2.1 GFLOPS on a Cray XD1 with Virtex-II Pros (XC2VP50).
For the design-only results (XC2VP125) they report 15% clock degradation from 2 to 24 PEs.
» They state that they have not made any platform-specific optimisations for the implemented design.
Algorithm
1. Broadcast ‘A’, keep a unique ‘B’ per PE.
2. Multiply, feeding the operands into the multiplier pipeline.
3. The output is fed directly to the Adder+RAM (accumulator).
4. When the updated C values are ready, take them out.
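The four steps above can be sketched in software as a sum of rank-one updates. This is only an illustration under stated assumptions, not the hardware design: the function name `rank_one_matmul`, the `num_pes` parameter, and the round-robin assignment of B and C columns to PEs are hypothetical choices made for the example.

```python
def rank_one_matmul(A, B, num_pes):
    """Compute C = A * B as a sum of rank-one updates, one per index k."""
    n, m, p = len(A), len(B), len(B[0])
    # Assumption: columns of B (and of C) are dealt round-robin to the PEs,
    # so every PE holds a part of B that no other PE holds.
    owned_cols = [list(range(pe, p, num_pes)) for pe in range(num_pes)]
    C = [[0.0] * p for _ in range(n)]
    for k in range(m):
        # Step 1: column k of A is broadcast to all PEs.
        a_col = [A[i][k] for i in range(n)]
        for pe in range(num_pes):
            for j in owned_cols[pe]:
                b = B[k][j]  # this PE's private element of B
                for i in range(n):
                    # Steps 2-3: multiply, then accumulate into the stored C.
                    C[i][j] += a_col[i] * b
    # Step 4: C is read out once every update has been applied.
    return C


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(rank_one_matmul(A, B, num_pes=2))  # [[19.0, 22.0], [43.0, 50.0]]
```

In this sketch the PEs run one after another; in hardware they would work in parallel on the same broadcast column of A.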
Design-I
Design-II
FPGA Synthesis/PAR data{1/2}
Table: Resource Utilisation for SX95T and SX240T (post PAR)
PEs           DSP48Es   FIFOs   Block RAMs   Slice Regs   Slice LUTs
1             16        1       2            2511         1374
4             64        4       8            10377        5451
8             128       8       16           20865        10886
16            256       16      32           41841        21750
20 (SX240T)   320       20      40           52329        27176
40 (SX240T)   640       40      80           103335       53914

Table: Clock Speed in MHz for the overall design for different numbers of PEs
Device / PEs   1     4     8     16    19    20    40
SX95T-3        377   374   373   373   372   201   -
SX240T-2       374   373   344   -     -     372   371.7
FPGA Synthesis/PAR data{2/2}
Table: Resource Utilisation for Virtex-II Pro XC2VP100 (post PAR)
            15 PEs          20 PEs
MULT18x18   240 (54%)       304 (68%)
RAMB16s     90 (20%)        114 (26%)
Slices      30218 (68%)     37023 (83%)
Speed       133.94 MHz      133.79 MHz
Conclusions
We propose a variation of the rank-one update algorithm for matrix multiplication.
We introduce a scalable processing element for this algorithm, targeted at a Virtex-5 SX240T FPGA.
The two designs clearly show the effect of local storage on I/O bandwidth.
The designs achieved a design speed of 373 MHz with 40 PEs and a sustained performance of 29.8 GFLOPS on a single FPGA. We also provide 5.3 GFLOPS on an XC2VP100.
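The quoted GFLOPS figures are consistent with a simple throughput estimate. The sketch below assumes each PE completes one multiply and one add per clock cycle (2 floating-point operations per cycle); that assumption and the helper name `peak_gflops` are illustrative and not stated on the slides. The 20-PE count and 133.79 MHz clock for the XC2VP100 are taken from the resource table.

```python
def peak_gflops(num_pes, clock_mhz, flops_per_pe_per_cycle=2):
    """Estimated throughput in GFLOPS.

    Assumption: every PE finishes one multiply and one add per cycle.
    """
    return num_pes * flops_per_pe_per_cycle * clock_mhz / 1000.0


virtex5 = peak_gflops(40, 373.0)      # Virtex-5 SX240T, 40 PEs at 373 MHz
virtex2pro = peak_gflops(20, 133.79)  # XC2VP100, 20 PEs at 133.79 MHz
print(f"{virtex5:.2f} {virtex2pro:.2f}")  # 29.84 5.35
```

Under this assumption the estimates come out at about 29.8 and 5.3 GFLOPS, matching the sustained figures reported above.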
FPGA based Double Precision Sparse Matrix-Vector Multiplication.
Introduction
There are three main papers we will be looking at:
Viktor Prasanna: hybrid method using HLL + S/W + HDL
Michael DeLorimier: maximum performance, but unrealistic
David Gregg et al.: most realistic assumptions with respect to DRAM
Prasanna
Use of pre-existing IP cores, specifically for an iterative solver (CG).
A 4-input reduction circuit does the dot product, producing partial sums as output.
An adder loop with an array sums the dot-product partial sums; it is created using an HLL.
A reduction circuit at the end uses a B-Tree to create the final value.
IPs are available.
DRAM is looked at, but not realistically.
The order of the matrices is small.
DRAM is the bottleneck.
With their IPs they have a good architecture; however, one can change the IP and modify the datapath, e.g. Dou's MAC.
DeLorimier
Use BRAMs for everything.
Used for an iterative solver, specifically CG.
The MAC requires interleaving.
They do load balancing in their partitioner, which requires a communication stage; this is very matrix/partitioner dependent.
Communication is the bottleneck.
Performance: 750 MFLOPS per processor
16 Virtex-II 6000s
Each has 5 PEs + 1 CE
David Gregg et al. (SPAR)
They only report the use of the SPAR architecture for FPGAs.
They use very pessimistic DRAM access times, with an emphasis on cache-miss removal.
They do not use their Block RAMs well; maybe something interesting can be done here.
128 MFLOPS for 3 parallel SPAR units, but with cache misses removed we get a peak of 570 MFLOPS.
What can we do?
Both use CSR, which is not required; why not modify the representation?
Two approaches (we can try both simultaneously):
Prasanna: split across dot products (same row, many PEs)
DeLorimier: split across rows (many rows, one PE)
Use data from SPAR (a viable approach): both do zero multiplies; we get away with one zero multiply per column.
Minimise communication or overlap it: we can do interleaving for this, so that while one stage computes, the previous one communicates.
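The two ways of splitting the work named above can be compared on a CSR matrix-vector product. This is a minimal software sketch, not either paper's architecture: the function names, the `num_pes` parameter, and the round-robin assignment of rows and non-zeros to PEs are assumptions made for illustration.

```python
def spmv_split_rows(values, col_idx, row_ptr, x, num_pes):
    """Many rows, one PE: each PE computes whole rows on its own."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for pe in range(num_pes):
        for row in range(pe, n, num_pes):  # assumed round-robin row assignment
            acc = 0.0
            for k in range(row_ptr[row], row_ptr[row + 1]):
                acc += values[k] * x[col_idx[k]]
            y[row] = acc
    return y


def spmv_split_dot_products(values, col_idx, row_ptr, x, num_pes):
    """Same row, many PEs: one row's dot product is spread over all PEs."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for row in range(n):
        start, end = row_ptr[row], row_ptr[row + 1]
        partial = [0.0] * num_pes
        for pe in range(num_pes):
            # Assumed round-robin assignment of the row's non-zeros to PEs.
            for k in range(start + pe, end, num_pes):
                partial[pe] += values[k] * x[col_idx[k]]
        y[row] = sum(partial)  # reduction of the per-PE partial sums
    return y


# CSR form of [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]
print(spmv_split_rows(values, col_idx, row_ptr, x, num_pes=2))          # [7.0, 6.0, 17.0]
print(spmv_split_dot_products(values, col_idx, row_ptr, x, num_pes=2))  # [7.0, 6.0, 17.0]
```

Splitting by rows needs no final reduction but depends on how evenly the rows are distributed; splitting each dot product balances the work within a row at the cost of a reduction step per row.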
Questions?
THANK YOU