BIPS
Computational Research Division
Tuning Sparse Matrix Vector Multiplication for multi-core SMPs
(details in paper at SC07)
Samuel Williams(1,2), Richard Vuduc(3), Leonid Oliker(1,2), John Shalf(2), Katherine Yelick(1,2), James Demmel(1,2)
(1) University of California, Berkeley  (2) Lawrence Berkeley National Laboratory  (3) Georgia Institute of Technology
Overview
- Multicore is the de facto performance solution for the next decade
- Examined the Sparse Matrix Vector Multiplication (SpMV) kernel
  - Important HPC kernel
  - Memory intensive
  - Challenging for multicore
- Present two auto-tuned threaded implementations:
  - Pthread, cache-based implementation
  - Cell local store-based implementation
- Benchmarked performance across 4 diverse multicore architectures:
  - Intel Xeon (Clovertown)
  - AMD Opteron
  - Sun Niagara2
  - IBM Cell Broadband Engine
- Compared with the leading MPI implementation (PETSc) using an auto-tuned serial kernel (OSKI)
- Show Cell delivers good performance and efficiency, while Niagara2 delivers good performance and productivity
Sparse Matrix Vector Multiplication
- Sparse matrix: most entries are 0.0
  - Performance advantage in only storing/operating on the nonzeros
  - Requires significant meta data
- Evaluate y = Ax
  - A is a sparse matrix; x and y are dense vectors
- Challenges:
  - Difficult to exploit ILP (bad for superscalar)
  - Difficult to exploit DLP (bad for SIMD)
  - Irregular memory access to the source vector
  - Difficult to load balance
  - Very low computational intensity (often >6 bytes/flop)

(Figure: y = A·x schematic)
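The low computational intensity is concrete in the kernel itself. A minimal CSR-format SpMV sketch (illustrative names, not the tuned code from the talk): each nonzero costs one multiply-add but requires an 8-byte value, a 4-byte column index, and an irregular read of x.

```c
/* Minimal CSR (compressed sparse row) SpMV: y = A*x.
 * row_ptr[r]..row_ptr[r+1] delimits the nonzeros of row r;
 * col_idx[] and val[] hold their column indices and values. */
void spmv_csr(int nrows,
              const int *row_ptr, const int *col_idx, const double *val,
              const double *x, double *y)
{
    for (int r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; k++)
            sum += val[k] * x[col_idx[k]];   /* irregular access to x */
        y[r] = sum;
    }
}
```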
Test Suite
- Dataset (Matrices)
- Multicore SMPs
Matrices Used
- Pruned the original SPARSITY suite down to 14; none should fit in cache
- Subdivided them into 4 categories
- Rank ranges from 2K to 1M

Matrices: Dense; Protein; FEM/Spheres; FEM/Cantilever; Wind Tunnel; FEM/Harbor; QCD; FEM/Ship; Economics; Epidemiology; FEM/Accelerator; Circuit; webbase; LP
Categories:
- Dense: 2K x 2K dense matrix stored in sparse format
- Well structured (sorted by nonzeros/row)
- Poorly structured (hodgepodge)
- Extreme aspect ratio (linear programming)
Multicore SMP Systems

(Figure: block diagrams of the four systems)
- Intel Clovertown: two quad-core sockets; each pair of Core2 cores shares a 4MB L2; two FSBs (10.6 GB/s each) into a chipset with 4x64b controllers to Fully Buffered DRAM (21.3 GB/s read, 10.6 GB/s write)
- AMD Opteron: two dual-core sockets; 1MB victim cache per core; on-chip memory controller / HT to DDR2 DRAM (10.6 GB/s per socket; 4 GB/s each direction between sockets)
- Sun Niagara2: eight multithreaded UltraSparc cores (8K D$ + FPU each) behind a crossbar switch to a 4MB shared L2 (16 way; 179 GB/s fill, 90 GB/s write-thru); 4x128b FBDIMM memory controllers (42.7 GB/s read, 21.3 GB/s write)
- IBM Cell Blade: two sockets; each has a PPE (512K L2) and eight SPEs (256K local store + MFC each) on an EIB ring network (<<20 GB/s each direction); XDR DRAM at 25.6 GB/s per socket; BIF links the sockets
Multicore SMP Systems (memory hierarchy)

(Figure: the four system block diagrams again, annotated: Clovertown, Opteron, and Niagara2 use a conventional cache-based memory hierarchy; the Cell blade uses a disjoint local store memory hierarchy.)
Multicore SMP Systems (2 implementations)

(Figure: the four system block diagrams again, annotated: the Cache+Pthreads SpMV implementation targets Clovertown, Opteron, and Niagara2; the Local Store SpMV implementation targets Cell.)
Multicore SMP Systems (cache)

(Figure: the four system block diagrams again, annotated with aggregate on-chip capacity: Clovertown 16MB — the vectors fit; Opteron 4MB; Niagara2 4MB; Cell 4MB of local store.)
Multicore SMP Systems (peak flops)

(Figure: the four system block diagrams again, annotated with peak double-precision rates: Clovertown 75 Gflop/s (TLP + ILP + SIMD); Opteron 17 Gflop/s (TLP + ILP); Niagara2 11 Gflop/s (TLP); Cell 29 Gflop/s (TLP + ILP + SIMD).)
Multicore SMP Systems (peak read bandwidth)

(Figure: the four system block diagrams again, annotated with peak read bandwidth: Clovertown 21 GB/s; Opteron 21 GB/s; Niagara2 43 GB/s; Cell 51 GB/s.)
Multicore SMP Systems (NUMA)

(Figure: the four system block diagrams again, annotated: Clovertown and Niagara2 are Uniform Memory Access; Opteron and Cell are Non-Uniform Memory Access.)
Naïve Implementation for Cache-based Machines
- Performance across the suite
- Also included a median performance number
vanilla C Performance

(Figure: GFlop/s for each matrix in the suite, plus the median, on Intel Clovertown, AMD Opteron, and Sun Niagara2; y-axis 0.0-1.0 GFlop/s)
- Vanilla C implementation
- Matrix stored in CSR (compressed sparse row)
- Explored compiler options; only the best is presented here
- An x86 core delivers >10x the performance of a Niagara2 thread
Pthread Implementation for Cache-Based Machines
- SPMD, strong scaling
- Optimized for multicore/threading
- A variety of shared-memory programming models are acceptable (not just Pthreads)
- More colors = more optimizations = more work
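The row-parallel SPMD scheme can be sketched as follows. This is a simplified illustration, not the tuned implementation: it splits rows evenly (the talk balances by nonzeros), fixes the thread count, and uses thread join as the trailing barrier; all names are ours.

```c
#include <pthread.h>

#define NTHREADS 2  /* illustrative; the tuned code explores powers of 2 */

struct csr { int nrows; const int *row_ptr, *col_idx; const double *val; };

struct task {
    const struct csr *A;
    const double *x;
    double *y;
    int row_begin, row_end;   /* this thread's contiguous row block */
};

static void *spmv_worker(void *arg)
{
    struct task *t = arg;
    for (int r = t->row_begin; r < t->row_end; r++) {
        double sum = 0.0;
        for (int k = t->A->row_ptr[r]; k < t->A->row_ptr[r + 1]; k++)
            sum += t->A->val[k] * t->x[t->A->col_idx[k]];
        t->y[r] = sum;   /* row blocks are disjoint: no locking needed */
    }
    return 0;
}

/* Launch one thread per row block; the joins act as the trailing barrier. */
void spmv_pthreads(const struct csr *A, const double *x, double *y)
{
    pthread_t tid[NTHREADS];
    struct task tasks[NTHREADS];
    int rows_per = (A->nrows + NTHREADS - 1) / NTHREADS;
    for (int i = 0; i < NTHREADS; i++) {
        tasks[i].A = A; tasks[i].x = x; tasks[i].y = y;
        tasks[i].row_begin = i * rows_per;
        tasks[i].row_end = (i + 1) * rows_per < A->nrows
                         ? (i + 1) * rows_per : A->nrows;
        pthread_create(&tid[i], 0, spmv_worker, &tasks[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], 0);
}
```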
Naïve Parallel Performance

(Figure: GFlop/s per matrix plus median on Clovertown, Opteron, and Niagara2; y-axis 0.0-3.0; bars compare Naïve Pthreads against Naïve Single Thread)
- Simple row parallelization
- Pthreads (didn't have to be)
- 1P Niagara2 > 2x-5x the 2P x86 machines
Naïve Parallel Performance (scaling)

(Same figure, annotated with scaling: Clovertown: 8x cores = 1.9x performance; Opteron: 4x cores = 1.5x performance; Niagara2: 64x threads = 41x performance.)
- Simple row parallelization
- Pthreads (didn't have to be)
- 1P Niagara2 > 2x-5x the 2P x86 machines
Naïve Parallel Performance (efficiency)

(Same figure, annotated with efficiency: Clovertown: 1.4% of peak flops, 29% of bandwidth; Opteron: 4% of peak flops, 20% of bandwidth; Niagara2: 25% of peak flops, 39% of bandwidth.)
- Simple row parallelization
- Pthreads (didn't have to be)
- 1P Niagara2 > 2x-5x the 2P x86 machines
Case for Auto-tuning
- How do we deliver good performance across all these architectures and all matrices without exhaustively optimizing every combination?
- Auto-tuning (in general):
  - Write a Perl script that generates all possible optimizations
  - Heuristically, or exhaustively, search the optimizations
  - Existing SpMV solution: OSKI (developed at UCB)
- This work:
  - Tuning geared toward multi-core/-threading
  - Generates SSE/SIMD intrinsics, prefetching, loop transformations, alternate data structures, etc.
  - A "prototype for parallel OSKI"
Performance (+NUMA & SW Prefetching)

(Figure: GFlop/s per matrix plus median on Clovertown, Opteron, and Niagara2; y-axis 0.0-3.0; bars for Naïve Single Thread, Naïve Pthreads, +NUMA/Affinity, and +Software Prefetching)
- NUMA-aware allocation
- Tag prefetches with the appropriate temporal locality
- Tune for the optimal distance
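Software prefetching in the tuned kernels amounts to issuing prefetches a tuned distance ahead of the current nonzero. A minimal sketch using the GCC/Clang `__builtin_prefetch` builtin; the distance `PF_DIST` and the temporal-locality hint are the tuning knobs, and all names here are ours, not the talk's:

```c
/* CSR SpMV with software prefetch of the value/index streams.
 * PF_DIST (in nonzeros) is the distance an auto-tuner would sweep. */
#define PF_DIST 16

void spmv_csr_pf(int nrows,
                 const int *row_ptr, const int *col_idx, const double *val,
                 const double *x, double *y)
{
    for (int r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; k++) {
            /* hints: read (0), low temporal locality (0) for streamed data */
            __builtin_prefetch(&val[k + PF_DIST], 0, 0);
            __builtin_prefetch(&col_idx[k + PF_DIST], 0, 0);
            sum += val[k] * x[col_idx[k]];
        }
        y[r] = sum;
    }
}
```

Prefetches of addresses past the end of the arrays are harmless hints on the architectures discussed here, which is why a fixed distance is practical.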
Matrix Compression
- For memory-bound kernels, minimizing memory traffic should maximize performance
- Compress the meta data
- Exploit structure to eliminate meta data
- Heuristic: select the compression that minimizes the matrix size:
  - power-of-2 register blocking
  - CSR/COO format
  - 16b/32b indices
  - etc.
- Side effect: the matrix may shrink to the point where it fits entirely in cache
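The size-minimizing heuristic can be sketched as a footprint model: for each candidate register block size and index width, estimate total bytes and keep the smallest. This is an illustrative model under our own assumptions (the real tuner accounts for the actual blocked matrix, including fill-in, and more formats); `fill` is the ratio of stored entries, including explicit zeros, to true nonzeros.

```c
/* Estimated bytes for an r-by-c register-blocked CSR matrix:
 * each block holds r*c 8-byte values and one column index;
 * one row pointer per block row. */
static long bcsr_bytes(long nnz, long nrows, int r, int c,
                       double fill, int idx_bytes)
{
    long nblocks = (long)(nnz * fill / (r * c) + 0.5);
    long block_rows = (nrows + r - 1) / r;
    return nblocks * (8L * r * c)        /* values */
         + nblocks * idx_bytes           /* block column indices */
         + (block_rows + 1) * 4L;        /* row pointers */
}

/* Pick the (r, c, index width) minimizing the footprint over
 * power-of-2 blockings; fill[i][j] is the fill ratio per candidate. */
long min_footprint(long nnz, long nrows,
                   double fill[3][3], int *best_r, int *best_c)
{
    static const int sizes[3] = {1, 2, 4};
    long best = -1;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            for (int idx = 2; idx <= 4; idx += 2) {
                long b = bcsr_bytes(nnz, nrows, sizes[i], sizes[j],
                                    fill[i][j], idx);
                if (best < 0 || b < best) {
                    best = b; *best_r = sizes[i]; *best_c = sizes[j];
                }
            }
    return best;
}
```

With no fill penalty (a dense block structure), larger blocks win because they amortize index storage; with realistic fill ratios the trade-off flips matrix by matrix, which is exactly why the heuristic searches.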
Performance (+matrix compression)

(Figure: same layout; y-axis now 0.0-6.0; bars add +Matrix Compression)
- Very significant on some matrices; missed the boat on others
Performance (+cache & TLB blocking)

(Figure: same layout; bars add +Cache/TLB Blocking)
- Cache blocking geared toward sparse accesses
Performance (more DIMMs, firmware fix, …)

(Figure: same layout; bars add +More DIMMs, Rank configuration, etc.)
Performance (more DIMMs, firmware fix, …) (efficiency)

(Same figure, annotated with efficiency: Clovertown: 4% of peak flops, 52% of bandwidth; Opteron: 20% of peak flops, 66% of bandwidth; Niagara2: 52% of peak flops, 54% of bandwidth.)
Performance (more DIMMs, firmware fix, …) (essential optimizations)

(Same figure, annotated: Clovertown: 3 essential optimizations: Parallelize, Prefetch, Compress. Opteron: 4 essential optimizations: Parallelize, NUMA, Prefetch, Compress. Niagara2: 2 essential optimizations: Parallelize, Compress.)
Cell Implementation
- Comments
- Performance
Cell Implementation
- No vanilla C implementation (aside from the PPE)
- In some cases, what were optional optimizations on cache-based machines are requirements for correctness on Cell
- Even SIMDized double precision is extremely weak
  - Scalar double precision is unbearable
  - Minimum register blocking is 2x1 (SIMDizable)
  - Can increase memory traffic by 66%
- Optional cache blocking becomes required local store blocking
  - Spatial and temporal locality is captured by software when the matrix is optimized
  - In essence, the high bits of column indices are grouped into DMA lists
- Branchless implementation
- Despite the following performance numbers, Cell was still handicapped by double precision
Performance (all)

(Figure: GFlop/s per matrix plus median on Clovertown, Opteron, Niagara2, and the IBM Cell Broadband Engine; y-axis 0.0-12.0; Cell shown with 1, 8, and 16 SPEs)
Hardware Efficiency

(Same figure, annotated: Clovertown: 4% of peak flops, 52% of bandwidth. Opteron: 20% of peak flops, 66% of bandwidth. Niagara2: 52% of peak flops, 54% of bandwidth. Cell: 39% of peak flops, 89% of bandwidth.)
Productivity

(Same figure, annotated: Clovertown: 3 essential optimizations: Parallelize, Prefetch, Compress. Opteron: 4 essential optimizations: Parallelize, NUMA, Prefetch, Compress. Niagara2: 2 essential optimizations: Parallelize, Compress. Cell: 5 essential optimizations: Parallelize, NUMA, DMA, Compress, Cache(LS) Block.)
Parallel Efficiency

(Figure: parallel efficiency vs. core/thread count for each system: Clovertown over 1-8 cores, Opteron over 1-4 cores, Niagara2 over 0-64 threads, Cell over 1-16 SPEs)
Parallel Efficiency (breakdown)

(Same figure; each curve's scaling regions are labeled multicore and multisocket, and Niagara2's is labeled MT.)
Multicore MPI Implementation
- This is the default approach to programming multicore
Multicore MPI Implementation
- Used PETSc with shared-memory MPICH
- Used OSKI (developed @ UCB) to optimize each thread
- = a good auto-tuned shared-memory MPI implementation

(Figure: GFlop/s per matrix plus median on Intel Clovertown and AMD Opteron; y-axis 0.0-4.0; bars for MPI (autotuned), Pthreads (autotuned), and Naïve Single Thread)
Summary
Median Performance & Efficiency

(Figure, left: Median Performance in GFlop/s, 0.0-7.0, for Intel Clovertown, AMD Opteron, Sun Niagara2, and IBM Cell; bars for Naïve, PETSc+OSKI (autotuned), and Pthreads (autotuned). Right: Median Power Efficiency in MFlop/s/Watt, 0.0-24.0, same systems and bars.)
- 1P Niagara2 was consistently better than the 2P x86 machines (more potential bandwidth)
- Cell delivered by far the best performance (better utilization)
- Used a digital power meter to measure sustained system power
- FBDIMM is power hungry (12W/DIMM)
- System power: Clovertown 330W, Niagara2 350W
Summary
- Paradoxically, the most complex/advanced architectures required the most tuning and delivered the lowest performance
- Niagara2 delivered both very good performance and productivity
- Cell delivered very good performance and efficiency
  - 90% of memory bandwidth (most get ~50%)
  - High power efficiency
  - Easily understood performance
  - Extra traffic = lower performance (future work can address this)
- Our multicore-specific auto-tuned implementation significantly outperformed an auto-tuned MPI implementation; it exploited:
  - Matrix compression geared toward multicore, rather than single core
  - NUMA
  - Prefetching
Acknowledgments
- UC Berkeley: RADLab cluster (Opterons), PSI cluster (Clovertowns)
- Sun Microsystems: Niagara2 access
- Forschungszentrum Jülich: Cell blade cluster access
Questions?
Backup Slides
Parallelization
- Matrix partitioned by rows and balanced by the number of nonzeros
- SPMD-like approach
- A barrier() is called before and after the SpMV kernel
- Each sub-matrix is stored separately in CSR
- Load balancing can be challenging
- # of threads explored in powers of 2 (in paper)

(Figure: A, x, and y partitioned into per-thread row blocks)
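Balancing by nonzeros rather than rows can be sketched as a scan over the CSR row pointer: each thread's block ends at the first row where the running nonzero count crosses the next share boundary. An illustrative sketch (names are ours):

```c
/* Split nrows rows into nthreads contiguous blocks with roughly equal
 * nonzero counts. bounds[t]..bounds[t+1] is thread t's row range. */
void partition_by_nnz(int nrows, const int *row_ptr,
                      int nthreads, int *bounds)
{
    long nnz = row_ptr[nrows];   /* total nonzeros */
    int r = 0;
    bounds[0] = 0;
    for (int t = 1; t < nthreads; t++) {
        long target = nnz * t / nthreads;   /* t-th share boundary */
        while (r < nrows && row_ptr[r] < target)
            r++;
        bounds[t] = r;
    }
    bounds[nthreads] = nrows;
}
```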
Exploiting NUMA, Affinity
- Bandwidth on the Opteron (and Cell) can vary substantially based on the placement of data
- Bind each sub-matrix and the thread that processes it together
- Explored libnuma, Linux, and Solaris routines
- Adjacent blocks bound to adjacent cores

(Figure: Opteron sockets and their DDR2 DRAM under three schemes: single thread; multiple threads on one memory controller; multiple threads across both memory controllers)
Cache and TLB Blocking
- Accesses to the matrix and destination vector are streaming
- But access to the source vector can be random
- Reorganize the matrix (and thus the access pattern) to maximize reuse
- Applies equally to TLB blocking (caching PTEs)
- Heuristic: block the destination, then keep adding more columns as long as the number of source-vector cache lines (or pages) touched is less than the cache (or TLB). Apply all previous optimizations individually to each cache block.
- Search: neither, cache, cache&TLB
- Better locality at the expense of confusing the hardware prefetchers

(Figure: A, x, and y with cache blocks outlined)
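The column-growing heuristic can be sketched with a touched-line counter: keep widening the column block while the number of distinct source-vector cache lines stays under the budget. This is a simplified scalar model under our own assumptions (the real code also blocks rows and treats TLB pages the same way, with page size in place of line size):

```c
/* Given the sorted, distinct column indices touched by a row block,
 * return how many of them fit before the source-vector footprint
 * exceeds 'max_lines' cache lines of doubles_per_line x[] entries. */
int widest_block(const int *cols, int ncols,
                 int doubles_per_line, int max_lines)
{
    int lines = 0, prev_line = -1;
    for (int i = 0; i < ncols; i++) {
        int line = cols[i] / doubles_per_line;
        if (line != prev_line) {       /* cols sorted: a new distinct line */
            if (++lines > max_lines)
                return i;              /* budget exceeded: stop here */
            prev_line = line;
        }
    }
    return ncols;                      /* the whole range fits */
}
```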
Banks, Ranks, and DIMMs
- In this SPMD approach, as the number of threads increases, so does the number of concurrent streams to memory
- Most memory controllers have a finite capability to reorder requests (DMA can avoid or minimize this)
- Addressing/bank conflicts become increasingly likely
- Adding more DIMMs or changing the rank configuration can help
- The Clovertown system was already fully populated