
Alleviating memory-bandwidth limitations for scalability and energy efficiency

Lessons learned from the optimization of SpMxV

Georgios Goumas (goumas@cslab.ece.ntua.gr)

Computing Systems Laboratory, National Technical University of Athens

Oct 3, 2013, Parallel Processing for Energy Efficiency

Parallel Processing for Energy Efficiency (PP4EE) 1

Outline

1. Compression as an approach to scale up memory-bound applications
2. Sparse Matrices and SpMxV
3. CSX: A new storage format for sparse matrices
4. Conclusions – Areas of future research – Discussion


Application classes (based on their performance on shared memory systems)

[Figure: four cores with private L1 caches, pairs of cores sharing an L2, connected to main memory (or off-chip cache)]

Scalable applications:
✓ good scalability
✓ temporal locality
✓ no synchronization
✓ load balance

Memory-bound applications:
✗ intensive memory accesses
✗ (very) poor temporal locality
✗ high memory-to-computation ratio
✗ limited scalability due to contention on memory

Applications with intensive memory accesses (Example: memcomp benchmark)

memcomp: an unrolled synthetic kernel performing 1 LOAD followed by k ADDs on FP (double) data.

[Figure: speedup vs. cores utilized (1 to 8) for k = 1 ... 8]

Improving performance using compression

Exchange memory cycles for CPU cycles.

[Figure: timeline diagrams. Serial execution alternates compute (c) and memory (m) phases. In parallel (4 cores), the memory phases contend and dominate. With compression, the memory phases shrink to m′ at the price of a decompression cost added to the compute phases (c′); in parallel this decompression cost is amortized.]

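The exchange of memory cycles for CPU cycles can be put into a back-of-envelope model. The sketch below is an illustration, not something from the talk: it assumes compute phases overlap perfectly across cores while memory transfers contend and effectively serialize, so execution time is roughly max(compute/p, mem); all parameter values are made up.

```c
/* Back-of-envelope sketch (assumption, not from the slides): with p
 * cores, compute overlaps across cores but memory transfers serialize,
 * so time ~ max(compute/p, mem). Compression scales the memory term by
 * a ratio r < 1 and adds a decompression overhead d to compute. */
double t_plain(double c, double m, int p) {
    double cc = c / p;
    return cc > m ? cc : m;
}

double t_compressed(double c, double m, int p, double r, double d) {
    double cc = (c + d) / p;   /* decompression costs extra CPU cycles */
    double mm = r * m;         /* ...but fewer bytes cross the bus     */
    return cc > mm ? cc : mm;
}
```

For example, with c = 1, m = 4 on 4 cores, halving the memory traffic (r = 0.5) at decompression cost d = 1 cuts the modeled time from 4 to 2 units: compression pays off exactly when the reduced memory term still dominates the inflated compute term.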

Outline

1. Compression as an approach to scale up memory-bound applications
2. Sparse Matrices and SpMxV
   - Storage formats
   - Sparse-matrix vector multiplication: SpMxV
   - SpMxV performance
3. CSX: A new storage format for sparse matrices
4. Conclusions – Areas of future research – Discussion

Sparse Matrices

Dominated by zeroes.

Applications: PDEs, Graphs, Linear Programming

Efficient representation (space and computation):
- non-zero values (value data)
- structure (index data)

Sparse storage formats:
- COO: basic
- CSR: most common, baseline
- BCSR: state of the art

CSR (Compressed Sparse Row)

A =
    5.4  1.1  0    0    0    0
    0    6.3  0    7.7  0    8.8
    0    0    1.1  0    0    0
    0    0    2.9  0    3.7  2.9
    9.0  0    0    1.1  4.5  0
    1.1  0    2.9  3.7  0    1.1

row_ptr: 0 2 5 6 9 12 16                                                  (nrows + 1 entries)
col_ind: 0 1 1 3 5 2 2 4 5 0 3 4 0 2 3 5                                  (nnz entries)
values:  5.4 1.1 6.3 7.7 8.8 1.1 2.9 3.7 2.9 9.0 1.1 4.5 1.1 2.9 3.7 1.1  (nnz entries)


SpMxV (sparse-matrix vector multiplication)

y = A · x, where A is sparse

Important computational kernel:
- solving PDEs (GMRES, CG) for CFD, economic modeling
- graphs (PageRank)
- abundant amount of research work

Each output element is y_i = Σ_j a_ij · x_j. With a sparse A, only the non-zero elements contribute: e.g., a row holding only a_21 and a_24 yields y_2 = a_21 · x_1 + a_24 · x_4.


CSR SpMxV

for (i = 0; i < N; i++)
    for (j = row_ptr[i]; j < row_ptr[i+1]; j++)
        y[i] += values[j] * x[col_ind[j]];

row_ptr: 0 2 5 6 9 12 16
col_ind: 0 1 1 3 5 2 2 4 5 0 3 4 0 2 3 5
x:       x0 x1 x2 x3 x4 x5
values:  5.4 1.1 6.3 7.7 8.8 1.1 2.9 3.7 2.9 9.0 1.1 4.5 1.1 2.9 3.7 1.1
y:       y0 y1 y2 y3 y4 y5

For each row i, row_ptr gives the row limits, col_ind drives an indirect access into x, and each product (∗) is accumulated (Σ) into y[i].

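The kernel above is directly runnable; a self-contained sketch using the slide's 6×6 example matrix (array contents copied from the slide), checked against the row sums for x = (1, ..., 1):

```c
#include <math.h>

/* The CSR SpMxV kernel from the slide, wrapped in a function. */
void csr_spmv(int n, const int *row_ptr, const int *col_ind,
              const double *values, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        y[i] = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            y[i] += values[j] * x[col_ind[j]];   /* indirect access into x */
    }
}

/* Runs the kernel on the slide's example; with x = all-ones,
 * each y[i] must equal the sum of row i's non-zeros. */
int csr_demo_ok(void) {
    const int row_ptr[] = {0, 2, 5, 6, 9, 12, 16};
    const int col_ind[] = {0,1, 1,3,5, 2, 2,4,5, 0,3,4, 0,2,3,5};
    const double values[] = {5.4,1.1, 6.3,7.7,8.8, 1.1, 2.9,3.7,2.9,
                             9.0,1.1,4.5, 1.1,2.9,3.7,1.1};
    const double x[6] = {1, 1, 1, 1, 1, 1};
    const double expect[6] = {6.5, 22.8, 1.1, 9.5, 14.6, 8.8};
    double y[6];
    csr_spmv(6, row_ptr, col_ind, values, x, y);
    for (int i = 0; i < 6; i++)
        if (fabs(y[i] - expect[i]) > 1e-9) return 0;
    return 1;
}
```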

Traditional SpMxV optimization methods

Traditional goal: optimizing computation.

Specialized sparse storage formats (exploitation of "regularities").

Examples (regularity ↔ format):
- 2D blocks of constant size ↔ BCSR [Im and Yelick '01]
- 1D blocks of variable size ↔ VBL [Pinar and Heath '99]
- Diagonals ↔ DIAG

Traditional SpMxV optimization: BCSR [Im and Yelick '01]

- CSR extension: r×c blocks instead of elements ⇒ per-block index information
- optimize computation (register blocking) ⇒ specialized SpMxV versions per r×c
- padding may be required

A =
    4.6  9.3  0    0    0    0    2.4  5.6
    8.6  8.2  0    0    0    0    5.3  1.6
    0    0    0    0    1.9  7.9  0    0
    0    0    0    0    7.1  0    0    0
    0    0    8.6  1.7  2.4  7.6  0    0
    0    0    3.9  2.2  3.0  3.3  0    0
    0    0    0    0    1.8  0    7.9  1.2
    0    0    0    0    0    7.8  1.0  5.3

brow_ptr: 0 2 3 5 7
bcol_ind: 0 6 4 2 4 4 6

blocks (2×2):
    (4.6 9.3 / 8.6 8.2), (2.4 5.6 / 5.3 1.6), (1.9 7.9 / 7.1 0),
    (8.6 1.7 / 3.9 2.2), (2.4 7.6 / 3.0 3.3), (1.8 0 / 0 7.8), (7.9 1.2 / 1.0 5.3)

bval: 4.6 9.3 8.6 8.2 2.4 5.6 5.3 1.6 1.9 7.9 7.1 0.0 ...

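The padding cost mentioned above is easy to quantify. A small sketch, using the counts from the example matrix (7 blocks of 2×2 holding 25 actual non-zeros):

```c
/* BCSR stores every r x c block in full, so any block that is not
 * completely dense carries explicit zero padding. In the example
 * above: 7 blocks of 2x2 = 28 stored slots for 25 non-zeros,
 * i.e. 3 padded zeros (such as the 0.0 visible in bval). */
long bcsr_padded_zeros(long nblocks, int r, int c, long nnz) {
    return nblocks * (long)r * c - nnz;
}
```

Padding grows both the value and the compute footprint, which is why BCSR can end up with negative compression on matrices whose non-zeros do not cluster into dense blocks.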

SpMxV performance (CSR)

- related work → several performance issues
- performance evaluation on 100 matrices [Goumas et al. '09]
- memory bandwidth is the bottleneck (optimization attempts to improve computations are misguided)

[Figure: speedup over serial CSR at 1, 2, 4, 8 cores, per matrix and on average, ranging roughly from 0.8 to 3.2]

Compression for improving SpMxV performance (reduce the working set), for matrices larger than the cache.


CSR SpMxV working set (nnz ≫ N)

- s_idx: column index size
- s_val: value size
- nnz: non-zero values
- ws: working set size

ws = nnz · s_idx (index data) + nnz · s_val (value data)

With s_idx = 32 bit and s_val = 64 bit, index data is 1/3 and value data 2/3 of the working set.
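The working-set formula above can be sketched directly; the function names are illustrative:

```c
/* ws = nnz * (s_idx + s_val): total CSR working set in bytes, plus
 * the fraction of it spent on index data. With 32-bit indices and
 * 64-bit values the index data is one third of the traffic, which
 * bounds what pure index compression can save. */
double csr_ws_bytes(long nnz, int sidx_bits, int sval_bits) {
    return nnz * (sidx_bits + sval_bits) / 8.0;
}

double index_fraction(int sidx_bits, int sval_bits) {
    return (double)sidx_bits / (sidx_bits + sval_bits);
}
```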

Outline

1. Compression as an approach to scale up memory-bound applications
2. Sparse Matrices and SpMxV
3. CSX: A new storage format for sparse matrices
   - Index compression to optimize SpMxV
   - CSX: The Compressed Sparse eXtended storage format
   - CSX substructures
   - CSX implementation
   - Experimental Evaluation
4. Conclusions – Areas of future research – Discussion

Index compression

Initial remarks:
- SpMxV is a memory-bound kernel
- data compression can be a viable approach
- index data seem a good target for compression (they include a lot of redundancy)

Specialized storage formats:
- may indirectly lead to index compression
- typically exploit regularities: e.g., 2D blocks, diagonals, etc.
- e.g., BCSR: one column index per block; original goal: register blocking; but may lead to data increase due to padding

Storage formats that explicitly target index compression:
- delta encoding (DCSR [Willcock and Lumsdaine '06], CSR-DU)
- CSX (generalization of CSR-DU)


First step towards index compression: Delta Encoding (applied in each matrix row)

index data → column indices

Delta encoding for column indices [Willcock and Lumsdaine '06]:
- store the delta distance from the previous index, not the absolute value
- instead of c_i, store δ_i = c_i − c_{i−1} ⇒ δ_i ≤ c_i ⇒ (potentially) less space per index

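A minimal sketch of the idea: delta-encode one row's column indices and count the bytes a narrow-width encoding would need. This is a simplified per-delta variant for illustration; the actual CSR-DU/CSX units pick one fixed width (8, 16 or 32 bits) per run of deltas.

```c
#include <stdint.h>
#include <stddef.h>

/* Bytes needed to store a sorted row of column indices as deltas,
 * using 1, 2 or 4 bytes per delta. Absolute 32-bit indices would
 * need 4*n bytes. The first "delta" is the index itself. */
size_t delta_bytes(const int32_t *col, size_t n) {
    size_t bytes = 0;
    int32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t d = col[i] - prev;
        prev = col[i];
        if (d < 256) bytes += 1;
        else if (d < 65536) bytes += 2;
        else bytes += 4;
    }
    return bytes;
}

/* Demo row with column indices 0, 1, 3, 5 -> deltas 0, 1, 2, 2,
 * all fitting in one byte: 4 bytes instead of 16. */
size_t delta_demo(void) {
    const int32_t col[] = {0, 1, 3, 5};
    return delta_bytes(col, 4);
}
```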

CSX motivation: current approaches are not aggressive enough – there are many more regularities to exploit

Regularities and sparse storage formats: BCSR, VBL [Pinar and Heath '99], DIAG

[Figure: two 10×10 example matrices whose non-zero patterns contain horizontal, vertical, diagonal and block regularities]

Multiple regularities ↔ composite formats [Agarwal et al. '92]: multiple sub-matrices, each in a different format:
A·x = (A0 + A1)·x = A0·x + A1·x

CSX: Compressed Sparse eXtended, in a nutshell

Objectives:
- target the memory-bandwidth limitations of SpMxV (implied: large matrices)
- adapt to matrix structure and architecture
- generate efficient code

Approach:
- apply aggressive index compression
- exploit a wide set of matrix "regularities"
- employ code generation to produce efficient code tailored per matrix
- 3 preprocessing phases: detection of regularities, matrix encoding, code generation
- drastically reduce preprocessing times

CSX substructures (regularities supported by CSX)

Horizontal (delta run-length encoding, drle):
- sequential elements with a constant difference δ: (y, x + i·δ) → (y, x), (y, x + δ), (y, x + 2·δ), ...
- e.g. δ = 1: column indices 1, 2, 3, 4, 5; δ = 2: column indices 2, 4, 6, 8, 10

Other 1D directions:
- Vertical: (y + i·δ, x)
- Diagonal: (y + i·δ, x + i·δ)
- Anti-diagonal: (y − i·δ, x + i·δ)

2D blocks: (x + i) × (y + j) (double nested loop)


CSX substructures on matrices

[Figure: for each matrix of the 30-matrix suite (xenon2, ASIC_680k, torso3, Chebyshev4, ..., inline_1, ldoor, boneS10), the percentage of non-zero elements covered by each detected substructure: deltas d8/d16/d32, horizontal h(δ), vertical v(δ), diagonal d(δ), anti-diagonal rd(δ), and blocks br(r,c) and bc(r,c)]

CSX Encoding

The encoded matrix is a byte sequence (CTL) of units, each with a head and a body:
- Head: nr (new-row flag, 1 bit), rjmp (row-jump flag, 1 bit), id (substructure id, 6 bits), size (8 bits)
- Body: ujmp and ucol as variable-size integers, deltas fixed-size (8, 16 or 32 bits)

[Figure: a 10×10 example matrix encoded as the unit sequence h(1), ad(1), bc(4,2), v(1), d(2), bc(4,2), bc(3,2), each unit carrying the fields nr rjmp id size [ujmp]:ucol]

CSX Code generation

[Figure: pipeline: the encoded matrix feeds a C-code generator built on SpMV source templates; the generated source goes through the Clang front-end into an LLVM module, the LLVM execution engine emits native SpMV code, and the caller receives a function pointer]

- top-level SpMxV template: big case statement based on the substructure
- code for each substructure in the matrix

CSX preprocessing cost

What about the preprocessing (compression) cost?
- depends on the application
- frequently, the matrix is used across numerous SpMxV runs
- with sufficient repetitions, the overhead will be amortized

Methods to reduce preprocessing cost:
- reduce the number of substructures scanned
- sample the matrix for substructures
- parallelize preprocessing
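The amortization argument can be sketched in one line. All numbers below are illustrative:

```c
/* If preprocessing costs P (measured in units of one serial CSR SpMxV)
 * and the optimized format saves (t_base - t_opt) per iteration, the
 * break-even point is P / (t_base - t_opt) iterations; beyond it,
 * every further SpMxV run is pure gain. Assumes t_opt < t_base. */
double breakeven_iterations(double preprocess, double t_base, double t_opt) {
    return preprocess / (t_base - t_opt);
}
```

For example, a preprocessing cost worth 100 serial SpMxV runs and a format that halves the per-iteration time break even after 200 iterations, which is plausible for iterative solvers such as CG that run thousands of SpMxV operations on the same matrix.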

Experimental preliminaries

Matrix suite: 30 matrices
- University of Florida sparse matrix collection [Davis and Hu, 2011]
- real-world applications, large variety of applications
- including problems without an underlying 2D/3D geometry
- do not fit into the aggregate cache

Compare against:
- CSR
- BCSR (we always select the best performing block)
- VBL (1D variable-length blocks)

double (64-bit) floating-point values

CSX Compression ratio

[Figure: per-matrix compression ratio (%) achieved by BCSR, VBL and CSX over CSR on the 30-matrix suite, together with the maximum achievable ratio; values range from about −80% to +40%]

- maximum: only consider values
- CSX is always the best option
- CSX never has negative compression

CSX: Evaluation on SMP

Harpertown:
- 2 × 4 = 8 cores
- L2: 6 MiB, per 2 cores
[Figure: two quad-core CPUs; private L1 per core, core pairs share an L2]

Dunnington:
- 4 × 6 = 24 cores
- L2: 3 MiB, per 2 cores
- L3: 16 MiB
[Figure: four six-core CPUs; private L1 per core, core pairs share an L2, one L3 per CPU]

CSX: SMP: Average speedup over serial CSR (share-all core filling policy)

Harpertown:
[Figure: average speedup over serial CSR for CSR, BCSR, VBL and CSX at 1, 2, 4, 8 threads]
- improvement over multithreaded CSR at 8 threads: CSX 26.4%, VBL 18.5%, BCSR 4.1%

Dunnington:
[Figure: average speedup over serial CSR for CSR, BCSR, VBL and CSX at 1, 2, 6, 12, 24 threads]
- improvement over multithreaded CSR at 24 threads: CSX 61%, VBL 28.8%, BCSR 6.3%

CSX: Preprocessing cost

[Figure: performance improvement over multithreaded CSR as a function of the number of serial CSR SpMV operations spent on preprocessing, for CSX-delta, CSX-horiz, CSX-sampling and CSX-full; on Dunnington (8 to 512 operations) and on Gainestown, a NUMA system (16 to 1024 operations)]

Energy and power measurements

Total energy (idle cores included), matrix af_5_k101:
[Figure: total energy (J) broken down into core, uncore and DRAM, plus speedup w.r.t. single-threaded CSR, for CSR and CSX at 1 to 32 threads]

Power, dense matrix:
[Figure: average power (W) broken down into core, uncore and DRAM at 1 to 32 threads]

More info on CSX (papers)

CSX details:
- K. Kourtis, V. Karakasis, G. Goumas, and N. Koziris. "CSX: an extended compression format for SpMV on shared memory systems." 16th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP '11), ACM, New York, NY, USA, 247–256.
- V. Karakasis, T. Gkountouvas, K. Kourtis, G. Goumas, N. Koziris. "An Extended Compression Format for the Optimization of Sparse Matrix-Vector Multiplication." IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 10, pp. 1930–1940, Oct. 2013.

CSX for symmetric matrices:
- T. Gkountouvas, V. Karakasis, K. Kourtis, G. Goumas, and N. Koziris. "Improving the performance of the symmetric sparse matrix-vector multiplication in multicore." 27th IEEE International Parallel & Distributed Processing Symposium (IPDPS '13), Boston, MA, USA, 2013.

CSX integrated with the Elmer multiphysics simulation software:
- V. Karakasis, G. Goumas, K. Nikas, N. Koziris, J. Ruokolainen, and P. Råback. "Using State-of-the-Art Sparse Matrix Optimizations for Accelerating the Performance of Multiphysics Simulations." PARA 2012: Workshop on State-of-the-Art in Scientific and Parallel Computing, Helsinki, Finland, 2012. Springer.

Value compression for SpMxV:
- K. Kourtis, G. Goumas and N. Koziris. "Exploiting Compression Opportunities to Improve SpMxV Performance on Shared Memory Systems." ACM Transactions on Architecture and Code Optimization (TACO), vol. 7, no. 3, December 2011.

The energy profile of CSX and CSR:
- J. C. Meyer, V. Karakasis, J. Cebrián, L. Natvig, D. Siakavaras, and K. Nikas. "Energy-efficient sparse matrix autotuning with CSX – A trade-off study." Ninth Workshop on High-Performance, Power-Aware Computing (HPPAC '13), IPDPS '13, Boston, MA, USA, 2013.

More info on CSX

Download code:
- http://www.cslab.ece.ntua.gr/csx/
- https://github.com/cslab-ntua/csx

Current status:
- working on the release of an API and library
- support tools (disk representation, file format converters)

Outline

1. Compression as an approach to scale up memory-bound applications
2. Sparse Matrices and SpMxV
3. CSX: A new storage format for sparse matrices
4. Conclusions – Areas of future research – Discussion

Conclusions

Compression can improve SpMxV performance.

CSX applies aggressive index-data compression to optimize SpMxV:
- supports arbitrary regularities
- tunable preprocessing cost (yet, preprocessing can be a concern)
- outperforms baseline and state-of-the-art alternatives

Areas of current and future research (and, hopefully, opportunities for collaboration)

Relevant to CSX, SpMxV and compression:
- apply compression to the floating-point values of the matrix (recall: these consume 2/3 of the data!)
- generalize to other applications
- investigate opportunities for hardware support (scalability, space, energy)

Contention-aware scheduling:
- time and space scheduling of resource-hungry applications (for homogeneous and heterogeneous CMPs)
- performance prediction

Energy-aware computing:
- power- and energy-aware algorithms and techniques
- predict execution behavior based on power-consumption snapshots


EOF

Thank you! Questions?