Alleviating memory-bandwidth limitations for scalability and energy efficiency
Lessons learned from the optimization of SpMxV

Georgios Goumas ([email protected])
Computing Systems Laboratory, National Technical University of Athens

Oct 3, 2013
Parallel Processing for Energy Efficiency (PP4EE)
Outline
1. Compression as an approach to scale up memory-bound applications
2. Sparse Matrices and SpMxV
3. CSX: A new storage format for sparse matrices
4. Conclusions – Areas of future research – Discussion
Application classes (based on their performance on shared memory systems)

[Diagram: four cores, each with a private L1; pairs of cores share an L2; all cores share the main memory (or an off-chip cache).]

Compute-bound applications:
✓ good scalability
✓ temporal locality
✓ no synchronization
✓ load balance

Applications with intensive memory accesses:
✗ (very) poor temporal locality
✗ high memory-to-computation ratio
✗ limited scalability due to contention on memory
Applications with intensive memory accesses (Example: memcomp benchmark)

memcomp: 1 LOAD and k FP (double) ADDs per element, unrolled

[Plot: speedup vs. cores utilized (1–8), one curve per k = 1 … 8.]

A minimal sketch of such a kernel follows below.
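The memcomp source is not shown in the slides; this minimal C sketch (names and structure are assumptions) illustrates the idea: one load per element and k FP additions, so k directly sets the memory-to-computation ratio.

#include <stddef.h>

/* Hypothetical reconstruction of the memcomp kernel: 1 LOAD and
 * k double-precision ADDs per array element. */
double memcomp_kernel(const double *a, size_t n, int k)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {    /* 1 LOAD per iteration */
        double v = a[i];
        for (int j = 0; j < k; j++)     /* k FP ADDs */
            sum += v;
    }
    return sum;
}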
Improving performance using compression

Exchange memory cycles for CPU cycles.

[Diagram: serial execution = one compute phase (c) followed by one memory phase (m); on 4 cores the compute phases run concurrently but the memory phases contend; with compression each phase becomes c′ + m′, where m′ < m at the price of a decompression cost added to c′; across many cores that cost is amortized.]

A rough cost model follows below.
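A back-of-envelope model of the figure (the symbols are mine, not from the slides): let t_c and t_m be the compute and memory cycles of the kernel. Compute scales with the core count p, contended memory traffic does not:

\[
T(p) \approx \frac{t_c}{p} + t_m, \qquad
T'(p) \approx \frac{t_c + t_d}{p} + t'_m \quad (t'_m < t_m),
\]

where t_d is the decompression cost and t'_m the reduced memory traffic. At p = 1 compression may lose (t_d is pure overhead), but as p grows the t_d/p term is amortized while the saving t_m − t'_m remains.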
Outline
1. Compression as an approach to scale up memory-bound applications
2. Sparse Matrices and SpMxV
   - Storage formats
   - Sparse-matrix vector multiplication: SpMxV
   - SpMxV performance
3. CSX: A new storage format for sparse matrices
4. Conclusions – Areas of future research – Discussion
Sparse Matrices

- Dominated by zeroes
- Applications: PDEs, graphs, linear programming
- Efficient representation (space and computation):
  - non-zero values (value data)
  - structure (index data)
- Sparse storage formats:
  - COO: basic
  - CSR: most common, baseline
  - BCSR: state of the art
CSR (Compressed Sparse Row)

Example matrix:

    5.4  1.1  0    0    0    0
    0    6.3  0    7.7  0    8.8
    0    0    1.1  0    0    0
    0    0    2.9  0    3.7  2.9
    9.0  0    0    1.1  4.5  0
    1.1  0    2.9  3.7  0    1.1

row_ptr (nrows+1 entries): 0 2 5 6 9 12 16
col_ind (nnz entries):     0 1 1 3 5 2 2 4 5 0 3 4 0 2 3 5
values  (nnz entries):     5.4 1.1 6.3 7.7 8.8 1.1 2.9 3.7 2.9 9.0 1.1 4.5 1.1 2.9 3.7 1.1

A minimal C declaration of these arrays follows below.
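In code, CSR is just three arrays; a sketch of a container struct (the slides fix no concrete names or integer types):

typedef struct {
    int     nrows;    /* number of rows */
    int     nnz;      /* number of stored non-zeros */
    int    *row_ptr;  /* nrows+1 entries: row i spans [row_ptr[i], row_ptr[i+1]) */
    int    *col_ind;  /* nnz entries: column index of each non-zero */
    double *values;   /* nnz entries: the non-zero values */
} csr_t;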
SpMxV (sparse-matrix vector multiplication)

y = A · x, A is sparse

Important computational kernel:
- solving PDEs (GMRES, CG) for CFD, economic modeling
- graphs (PageRank)
- abundant amount of research work

Each output element is the inner product of a matrix row with x:

\[
y_i = \sum_j a_{ij} \cdot x_j
\]

When A is sparse, only the stored non-zeros contribute, e.g. for a row holding only a_21 and a_24: y_2 = a_21·x_1 + a_24·x_4.
CSR SpMxV

/* For each row i: read the row limits from row_ptr, then accumulate
 * into y[i] the products of the row's non-zeros (values[j]) with the
 * x entries selected indirectly through col_ind[j]. */
for (i = 0; i < N; i++)
    for (j = row_ptr[i]; j < row_ptr[i+1]; j++)
        y[i] += values[j] * x[col_ind[j]];

row_ptr: 0 2 5 6 9 12 16
col_ind: 0 1 1 3 5 2 2 4 5 0 3 4 0 2 3 5
values:  5.4 1.1 6.3 7.7 8.8 1.1 2.9 3.7 2.9 9.0 1.1 4.5 1.1 2.9 3.7 1.1
x: x0 x1 x2 x3 x4 x5
y: y0 y1 y2 y3 y4 y5

Per non-zero: one indirect access into x, one multiply (∗), one accumulation (∑). A parallel sketch follows below.
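The multithreaded CSR baseline used in the evaluation partitions work across threads; a minimal OpenMP sketch (the slides do not show the parallel code, so the static row partitioning is an assumption):

#include <omp.h>

void spmv_csr_mt(int N, const int *row_ptr, const int *col_ind,
                 const double *values, const double *x, double *y)
{
    /* Each thread gets a contiguous chunk of rows, so writes to y are
     * conflict-free; all threads contend for memory bandwidth on
     * values, col_ind and x. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i+1]; j++)
            sum += values[j] * x[col_ind[j]];
        y[i] = sum;
    }
}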
Traditional SpMxV optimization methods

- traditional goal: optimizing computation
- specialized sparse storage formats (exploitation of “regularities”)
- examples (regularity ↔ format):
  - 2D blocks of constant size ↔ BCSR [Im and Yelick ’01]
  - 1D blocks of variable size ↔ VBL [Pinar and Heath ’99]
  - diagonals ↔ DIAG
Traditional SpMxV optimization: BCSR [Im and Yelick ’01]

- CSR extension: r×c blocks instead of elements ⇒ per-block index information
- optimize computation (register blocking) ⇒ specialized SpMxV versions per r×c
- padding may be required

Example (2×2 blocks):

    A = | 4.6  9.3  0    0    0    0    2.4  5.6 |
        | 8.6  8.2  0    0    0    0    5.3  1.6 |
        | 0    0    0    0    1.9  7.9  0    0   |
        | 0    0    0    0    7.1  0    0    0   |
        | 0    0    8.6  1.7  2.4  7.6  0    0   |
        | 0    0    3.9  2.2  3.0  3.3  0    0   |
        | 0    0    0    0    1.8  0    7.9  1.2 |
        | 0    0    0    0    0    7.8  1.0  5.3 |

brow_ptr: 0 2 3 5 7
bcol_ind: 0 6 4 2 4 4 6
bval: ( 4.6 9.3 8.6 8.2 | 2.4 5.6 5.3 1.6 | 1.9 7.9 7.1 0.0 | … )
(note the explicitly stored 0.0 padding in the third block)

A sketch of the unrolled 2×2 kernel follows below.
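For a fixed r×c = 2×2 the register-blocked kernel is fully unrolled; a sketch (array names follow the slide; the surrounding code is an assumption):

void spmv_bcsr_2x2(int nbrows, const int *brow_ptr, const int *bcol_ind,
                   const double *bval, const double *x, double *y)
{
    for (int bi = 0; bi < nbrows; bi++) {
        double y0 = 0.0, y1 = 0.0;          /* the two rows of this block row */
        for (int b = brow_ptr[bi]; b < brow_ptr[bi+1]; b++) {
            const double *blk = bval + 4*b; /* 2x2 = 4 values per block */
            int c = bcol_ind[b];
            y0 += blk[0]*x[c] + blk[1]*x[c+1];
            y1 += blk[2]*x[c] + blk[3]*x[c+1];
        }
        y[2*bi]     += y0;
        y[2*bi + 1] += y1;
    }
}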
SpMxV performance (CSR)

- related work → several performance issues
- performance evaluation on 100 matrices [Goumas et al. ’09]
- memory bandwidth is the bottleneck (optimization attempts that only improve computation are misguided)

[Plot: CSR SpMxV speedup (0.8–3.2) on 1, 2, 4, 8 cores; all matrices and their average.]

- compression for improving SpMxV performance (reduce the working set)
- for matrices larger than the aggregate cache
CSR SpMxV working set (nnz ≫ N)

s_idx: column index size
s_val: value size
nnz: number of non-zero values
ws: working set size

\[
ws = \underbrace{nnz \cdot s_{idx}}_{\text{index data}} + \underbrace{nnz \cdot s_{val}}_{\text{value data}}
\]

With s_idx = 32 bit and s_val = 64 bit, index data account for 1/3 of the working set and value data for 2/3 (a worked example follows below).
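Plugging in the sizes above for an illustrative matrix with nnz = 10^8 (the figure is mine, not from the slides):

\[
ws = 10^8 \cdot 4\,\mathrm{B} + 10^8 \cdot 8\,\mathrm{B} \approx 1.12\,\mathrm{GiB},
\]

so eliminating the index data entirely would cut SpMxV memory traffic by at most 33%.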
Outline
1. Compression as an approach to scale up memory-bound applications
2. Sparse Matrices and SpMxV
3. CSX: A new storage format for sparse matrices
   - Index compression to optimize SpMxV
   - CSX: The Compressed Sparse eXtended storage format
   - CSX substructures
   - CSX implementation
   - Experimental evaluation
4. Conclusions – Areas of future research – Discussion
Index compression

Initial remarks:
- SpMxV is a memory-bound kernel
- data compression can be a viable approach
- index data seem a good target for compression (they include a lot of redundancy)

Specialized storage formats:
- may indirectly lead to index compression
- typically exploit regularities: e.g., 2D blocks, diagonals, etc.
- e.g., BCSR stores one column index per block; its original goal is register blocking, but it may increase the data due to padding

Storage formats that explicitly target index compression:
- delta encoding (DCSR [Willcock and Lumsdaine ’06], CSR-DU)
- CSX (a generalization of CSR-DU)
First step towards index compression: Delta Encoding (applied in each matrix row)

- index data → column indices
- delta encoding for column indices [Willcock and Lumsdaine ’06]
- store the delta distance from the previous index, not the absolute value
- instead of ci_i, store δ_i = ci_i − ci_{i−1}; since δ_i ≤ ci_i, each index (potentially) needs less space

A minimal encoding sketch follows below.
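A sketch of per-row delta encoding in C (function and variable names are assumptions; real formats such as DCSR/CSR-DU additionally pack unit headers and flags):

/* Delta-encode one row's column indices and return the smallest
 * integer width (in bits) that holds every delta of the row. */
int delta_encode_row(const int *col_ind, int len, int *delta)
{
    int prev = 0, max_d = 0;
    for (int i = 0; i < len; i++) {
        delta[i] = col_ind[i] - prev;   /* distance from previous index */
        prev = col_ind[i];
        if (delta[i] > max_d)
            max_d = delta[i];
    }
    return max_d < 256 ? 8 : (max_d < 65536 ? 16 : 32);
}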
CSX motivation: current approaches are not aggressive enough – there are many more regularities to exploit

Regularities and sparse storage formats: BCSR, VBL [Pinar and Heath ’99], DIAG

[Figure: two 10×10 example sparsity patterns exhibiting several regularities at once.]

- multiple regularities ↔ composite formats [Agarwal et al. ’92]
- multiple sub-matrices, each in a different format: A·x = (A0 + A1)·x = A0·x + A1·x
CSX: Compressed Sparse eXtended – In a nutshell

Objectives:
- target the memory-bandwidth limitations of SpMxV (implied: large matrices)
- adapt to the matrix structure and the architecture
- generate efficient code

Approach:
- apply aggressive index compression
- exploit a wide set of matrix “regularities”
- employ code generation to produce efficient code tailored per matrix
- 3 preprocessing phases: detection of regularities, matrix encoding, code generation
- drastically reduce preprocessing times
CSX substructures (regularities supported by CSX)

Horizontal (delta run-length encoding – drle):
- sequential elements with a constant difference δ
- (y, x + i·δ) → (y, x), (y, x + δ), (y, x + 2·δ), …
- e.g., δ = 1 matches column indices 1, 2, 3, 4, 5; δ = 2 matches 2, 4, 6, 8, 10

Other 1D directions:
- Vertical: (y + i·δ, x)
- Diagonal: (y + i·δ, x + i·δ)
- Anti-diagonal: (y − i·δ, x + i·δ)

2D blocks: (x + i)×(y + j) (double nested loop)

Detection reduces to run-length encoding of the delta sequence; see the sketch below.
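A simplified detection sketch (CSX’s actual detector scans several directions and applies thresholds not shown here):

/* Length of the run of equal deltas starting at position i; a long
 * enough run with common delta d becomes an h(d) substructure. */
int delta_run(const int *delta, int len, int i)
{
    int run = 1;
    while (i + run < len && delta[i + run] == delta[i])
        run++;
    return run;
}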
CSX substructures on matrices

[Plot: percentage of non-zero elements covered by each substructure type, per matrix of the 30-matrix suite; legend includes delta units d8/d16/d32, horizontal h(δ), vertical v(δ), diagonal d(δ), and block units br(r,c)/bc(r,c).]
CSX Encoding

CTL: the compressed index byte stream, one head + body per unit:
- head: nr (1 bit), rjmp (1 bit), id (6 bits), size (8 bits)
- body: ujmp (variable int), ucol (variable int), deltas (fixed size: 8, 16, or 32 bits each), …

[Figure: a 10×10 example matrix annotated with its detected substructures — h(1), v(1), d(2), ad(1), bc(4,2), bc(3,2) — and the resulting encoding.]

Encoded units (fields: nr rjmp id size [ujmp]:ucol):

1 0 0 4 0      h(1)
0 0 1 4 5      ad(1)
1 1 2 8 1 0    bc(4,2)
0 0 3 4 9      v(1)
1 0 4 4 3      d(2)
1 1 2 8 2 2    bc(4,2)
1 0 5 6 2      bc(3,2)

A head-decoding sketch follows below.
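Decoding the two head bytes takes a couple of bit operations; a sketch based on the layout above (the bit ordering inside the first byte is an assumption):

#include <stdint.h>

/* CTL head: byte 0 packs nr (1 bit), rjmp (1 bit), id (6 bits);
 * byte 1 holds size. Body fields (ujmp, ucol, deltas) follow. */
static void ctl_decode_head(const uint8_t *ctl,
                            int *nr, int *rjmp, int *id, int *size)
{
    *nr   = (ctl[0] >> 7) & 1;  /* unit starts a new row */
    *rjmp = (ctl[0] >> 6) & 1;  /* an explicit row jump follows */
    *id   =  ctl[0] & 0x3f;     /* substructure id: dispatch target */
    *size =  ctl[1];            /* number of elements in the unit */
}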
CSX Code generation

[Flow: encoded matrix + SpMV source templates → C code generator → SpMV source → Clang front-end → LLVM module → LLVM ExecutionEngine → native SpMV code, invoked through a function pointer.]

- top-level SpMxV template: a big case statement based on the substructure
- code for each substructure present in the matrix (a skeleton sketch follows below)
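Schematically, the generated kernel walks the ctl stream and dispatches each unit on its id; a hand-written C sketch of the shape of that code (not the generator’s actual output — only the h(1) case is filled in, and ujmp/ucol decoding is omitted):

#include <stdint.h>

void spmv_csx_sketch(const uint8_t *ctl, const uint8_t *ctl_end,
                     const double *values, const double *x, double *y)
{
    uint64_t row = 0, col = 0;
    while (ctl < ctl_end) {
        int nr   = (ctl[0] >> 7) & 1;
        int id   =  ctl[0] & 0x3f;
        int size =  ctl[1];
        ctl += 2;                       /* past the head; ucol decoding omitted */
        if (nr) { row++; col = 0; }     /* new row (rjmp handling omitted) */
        switch (id) {
        case 0:                         /* h(1): 'size' contiguous elements */
            for (int i = 0; i < size; i++)
                y[row] += *values++ * x[col + i];
            col += size;
            break;
        /* ... one case per substructure instantiated for this matrix ... */
        }
    }
}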
CSX preprocessing cost

What about the preprocessing (compression) cost?
- it depends on the application
- frequently, the matrix is reused across numerous SpMxV runs
- with sufficient repetitions the overhead is amortized (see the condition below)
- methods to reduce the preprocessing cost:
  - reduce the number of substructures scanned
  - sample the matrix for substructures
  - parallelize the preprocessing
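The amortization condition can be made explicit (a back-of-envelope formulation, not from the slides): with one-off preprocessing cost T_pre and per-iteration times t_CSR and t_CSX, CSX pays off after n SpMxV iterations when

\[
n \cdot \left(t_{\mathrm{CSR}} - t_{\mathrm{CSX}}\right) > T_{\mathrm{pre}}
\quad\Longleftrightarrow\quad
n > \frac{T_{\mathrm{pre}}}{t_{\mathrm{CSR}} - t_{\mathrm{CSX}}},
\]

which is what the “serial CSR SpMV operations” axis in the preprocessing-cost plots measures.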
Experimental preliminaries

Matrix suite:
- 30 matrices from the University of Florida sparse matrix collection [Davis and Hu, 2011]
- real-world problems, large variety of applications
- including problems without an underlying 2D/3D geometry
- do not fit into the aggregate cache

Compared against:
- CSR
- BCSR (we always select the best-performing block)
- VBL (1D variable-length blocks)

double (64-bit) floating-point values
CSX compression ratio

[Plot: compression ratio (%) of the working set, −80% to 40%, for BCSR, VBL, CSX, and the achievable maximum, per matrix of the suite.]

- maximum: only consider the values (all index data eliminated)
- CSX is always the best option
- CSX never has negative compression
CSX: Evaluation on SMP

Harpertown:
- 2 × 4 = 8 cores
- L2: 6 MiB, shared per 2 cores

Dunnington:
- 4 × 6 = 24 cores
- L2: 3 MiB, shared per 2 cores
- L3: 16 MiB

[Diagrams: cache topology of the two platforms.]
CSX: SMP: Average speedup over serial CSR (share-all core filling policy)

Harpertown: [Plot: speedup over serial CSR for 1–8 threads; CSR, BCSR, VBL, CSX.]
Improvement over multithreaded CSR at 8 threads: CSX 26.4%, VBL 18.5%, BCSR 4.1%

Dunnington: [Plot: speedup over serial CSR for 1–24 threads; CSR, BCSR, VBL, CSX.]
Improvement over multithreaded CSR at 24 threads: CSX 61%, VBL 28.8%, BCSR 6.3%
CSX: Preprocessing cost

[Plots: performance improvement over multithreaded CSR as a function of the number of serial CSR SpMV operations needed to amortize preprocessing, for CSX-delta, CSX-horiz, CSX-sampling, and CSX-full; left: Dunnington (8–512 operations), right: Gainestown (NUMA, 16–1024 operations).]
Energy and power measurements

[Left plot: total energy breakdown (core, uncore, DRAM; idle cores included) for matrix af_5_k101, CSR vs. CSX at 1–32 threads, together with the speedup of each over single-threaded CSR. Right plot: average power (W) split into core, uncore, and DRAM for a dense matrix at 1–32 threads.]
More info on CSX (papers)

CSX details:
- K. Kourtis, V. Karakasis, G. Goumas, and N. Koziris. “CSX: an extended compression format for SpMV on shared memory systems,” 16th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP ’11), ACM, New York, NY, USA, 247–256.
- V. Karakasis, T. Gkountouvas, K. Kourtis, G. Goumas, N. Koziris, “An Extended Compression Format for the Optimization of Sparse Matrix-Vector Multiplication,” IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 10, pp. 1930–1940, Oct. 2013.

CSX for symmetric matrices:
- T. Gkountouvas, V. Karakasis, K. Kourtis, G. Goumas, and N. Koziris. “Improving the performance of the symmetric sparse matrix-vector multiplication in multicore.” 27th IEEE International Parallel & Distributed Processing Symposium (IPDPS ’13), Boston, MA, USA, 2013.

CSX integrated with the Elmer multiphysics simulation software:
- V. Karakasis, G. Goumas, K. Nikas, N. Koziris, J. Ruokolainen, and P. Råback. “Using State-of-the-Art Sparse Matrix Optimizations for Accelerating the Performance of Multiphysics Simulations.” PARA 2012: Workshop on State-of-the-Art in Scientific and Parallel Computing, Helsinki, Finland, 2012. Springer.

Value compression for SpMxV:
- K. Kourtis, G. Goumas and N. Koziris, “Exploiting Compression Opportunities to Improve SpMxV Performance on Shared Memory Systems,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 7, no. 3, December 2011.

The energy profile of CSX and CSR:
- J. C. Meyer, V. Karakasis, J. Cebrián, L. Natvig, D. Siakavaras, and K. Nikas. “Energy-efficient sparse matrix autotuning with CSX – A trade-off study.” Ninth Workshop on High-Performance, Power-Aware Computing (HPPAC ’13), IPDPS ’13, Boston, MA, USA, 2013.
More info on CSX

Download the code:
- http://www.cslab.ece.ntua.gr/csx/
- https://github.com/cslab-ntua/csx

Current status:
- working on the release of an API and library
- support tools (disk representation, file format converters)
Outline
1. Compression as an approach to scale up memory-bound applications
2. Sparse Matrices and SpMxV
3. CSX: A new storage format for sparse matrices
4. Conclusions – Areas of future research – Discussion
Conclusions

- Compression can improve SpMxV performance
- CSX applies aggressive index data compression to optimize SpMxV:
  - supports arbitrary regularities
  - tunable preprocessing cost
  - yet, preprocessing can be a concern
  - outperforms the baseline and state-of-the-art alternatives
Areas of current and future research, and (hopefully) opportunities for collaboration

Relevant to CSX, SpMxV and compression:
- apply compression to the floating-point values of the matrix (recall: these consume 2/3 of the data!)
- generalize to other applications
- investigate opportunities for hardware support (scalability, space, energy)

Contention-aware scheduling:
- time and space scheduling of resource-hungry applications (for homogeneous and heterogeneous CMPs)
- performance prediction

Energy-aware computing:
- power- and energy-aware algorithms and techniques
- predict execution behavior based on power consumption snapshots
EOF
Thank you! Questions?