
The Spatter Benchmark · 2019-03-03


The Spatter Benchmark

Or: Benchmarking and Modeling Sparse Memory Accesses for Heterogeneous Systems

Patrick Lavin, Jeffrey Young, Jason Riedy, Rich Vuduc


Purpose

• Dense memory access is well understood, but it is difficult to predict how a memory system will respond in irregular scenarios

• Indirection, poor spatial and temporal locality

• Spatter allows us to view performance changes across architectures, so that we can better understand their differences

• You can’t understand what you can’t measure


Spatter

• Memory benchmark based on scatter/gather kernels

• Scatter Y[j[:]] = X[:]

• Gather Y[:] = X[i[:]]

• SG Y[j[:]] = X[i[:]]

• Designed to model sparse data movement in applications like SuperLU and kernels like SpGEMM

• Includes effects of indirection and random or sparse access
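
The three kernel forms listed above can be sketched in a few lines of plain Python. This is only an illustration of the index semantics, not Spatter's implementation; the function names and the example buffers are assumptions made for this sketch.

```python
def gather(src, idx):
    """Gather: dest[k] = src[idx[k]] (the slide's Y[:] = X[i[:]])."""
    return [src[i] for i in idx]


def scatter(dest, idx, src):
    """Scatter: dest[idx[k]] = src[k] (the slide's Y[j[:]] = X[:]).

    If idx contains repeats, the last write wins in this serial sketch.
    """
    for k, j in enumerate(idx):
        dest[j] = src[k]
    return dest


def scatter_gather(dest, j_idx, src, i_idx):
    """SG: dest[j_idx[k]] = src[i_idx[k]] (the slide's Y[j[:]] = X[i[:]])."""
    for j, i in zip(j_idx, i_idx):
        dest[j] = src[i]
    return dest


# Hypothetical example data, chosen only for illustration.
src = [10, 11, 12, 13, 14, 15, 16, 17, 18]
print(gather(src, [1, 5, 2, 8]))  # [11, 15, 12, 18]
```

Each kernel performs one indirect load or store per element, which is the access pattern the benchmark is meant to stress.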

[Figure: Gather diagram with SRC, idx, and DEST buffers]

Configuration

• Backend - OpenMP, OpenCL, or CUDA

• Work per thread (work item)

• CUDA (or OpenCL) block size

• Buffer sizes, cache-ability, access stride


Examples


Examples - CSR SpMV

• Gather elements of x, then do a dot product with data in A.

[Figure: y = A x]

for (i in range(nrows)):
    indices ← row[i] : row[i+1]
    gather(tmp, x, col[indices])
    y[i] = dot_prod(val[indices], tmp)
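
A runnable version of the pseudocode above might look like the following. It is a minimal sketch that assumes the usual CSR arrays (row pointers `row`, column indices `col`, values `val`); the helper names and the small example matrix are assumptions for illustration.

```python
def gather(src, idx):
    """Return [src[i] for each i in idx]."""
    return [src[i] for i in idx]


def csr_spmv(row, col, val, x):
    """Compute y = A x for A stored in CSR form."""
    nrows = len(row) - 1
    y = [0.0] * nrows
    for i in range(nrows):
        lo, hi = row[i], row[i + 1]   # non-zeros of row i live in [lo, hi)
        tmp = gather(x, col[lo:hi])   # indirect reads from x
        y[i] = sum(a * b for a, b in zip(val[lo:hi], tmp))
    return y


# Hypothetical 2x3 example: A = [[1, 0, 2],
#                                [0, 3, 0]]
row = [0, 2, 3]
col = [0, 2, 1]
val = [1.0, 2.0, 3.0]
print(csr_spmv(row, col, val, [1.0, 1.0, 1.0]))  # [3.0, 3.0]
```

The only irregular access is the gather from `x`; `val` and `col` are read with unit stride.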


Examples - CSC SpMV

• Scale a column of A by the corresponding value in x, then scatter-accumulate into y.

for (i in range(ncols)):
    indices ← col[i] : col[i+1]
    tmp ← vector_scale(val[indices], x[i])
    scatter_accum(y, row[indices], tmp)

[Figure: y = A x]
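
The CSC variant can be sketched the same way. This is an illustrative serial version that assumes the usual CSC arrays (column pointers `col`, row indices `row`, values `val`); the explicit `nrows` argument and the example matrix are assumptions for this sketch.

```python
def scatter_accum(dest, idx, src):
    """dest[idx[k]] += src[k] for every k."""
    for k, j in enumerate(idx):
        dest[j] += src[k]


def csc_spmv(col, row, val, x, nrows):
    """Compute y = A x for A stored in CSC form."""
    y = [0.0] * nrows
    for i in range(len(col) - 1):
        lo, hi = col[i], col[i + 1]            # non-zeros of column i
        tmp = [v * x[i] for v in val[lo:hi]]   # scale the column by x[i]
        scatter_accum(y, row[lo:hi], tmp)      # indirect writes into y
    return y


# Hypothetical 2x3 example: A = [[1, 0, 2],
#                                [0, 3, 0]]
col = [0, 1, 2, 3]
row = [0, 1, 0]
val = [1.0, 3.0, 2.0]
print(csc_spmv(col, row, val, [1.0, 2.0, 3.0], nrows=2))  # [7.0, 6.0]
```

Here the irregular access is on the write side, so a parallel version would also need to handle collisions when two columns update the same entry of `y`.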


Examples - SpGEMM

• Scatter-accumulate columns of A corresponding to non-zero entries in a column of B into a dense SPA buffer. Gather SPA into C.

[Figure: C = A B]

Algorithm from Buluç and Gilbert: Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments https://doi.org/10.1137/110848244

for (j in range(ncols)):
    SPA = 0  // dense accumulation buffer
    for non-zero B(k,j):
        scatter_accum(SPA, A(:,k)*B(k,j))
    gather(C.val, SPA)
    gather(C.row, which(SPA))
    C.col[j+1] = C.col[j] + nnz(SPA)
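
The column-by-column SPA loop above can be written out as a small serial Python sketch. It assumes A, B, and C are all stored in CSC form as (column pointers, row indices, values); the `occupied` flag array stands in for `which(SPA)` and is an assumption of this sketch, not something taken from the slides or the cited paper.

```python
def spgemm_spa(a_col, a_row, a_val, b_col, b_row, b_val, nrows):
    """Compute C = A B in CSC form using a dense sparse accumulator (SPA)."""
    ncols = len(b_col) - 1
    c_col, c_row, c_val = [0], [], []
    for j in range(ncols):
        spa = [0.0] * nrows          # dense accumulation buffer
        occupied = [False] * nrows   # rows of SPA touched for this column
        for p in range(b_col[j], b_col[j + 1]):
            k, b_kj = b_row[p], b_val[p]
            # scatter-accumulate A(:,k) * B(k,j) into SPA
            for q in range(a_col[k], a_col[k + 1]):
                spa[a_row[q]] += a_val[q] * b_kj
                occupied[a_row[q]] = True
        # gather the touched entries of SPA into column j of C
        rows = [r for r in range(nrows) if occupied[r]]
        c_row.extend(rows)
        c_val.extend(spa[r] for r in rows)
        c_col.append(c_col[-1] + len(rows))
    return c_col, c_row, c_val


# Hypothetical example: A = [[1, 2], [0, 3]], B = [[4, 0], [5, 6]]
c = spgemm_spa([0, 1, 3], [0, 0, 1], [1.0, 2.0, 3.0],
               [0, 2, 3], [0, 1, 1], [4.0, 5.0, 6.0], nrows=2)
print(c)  # ([0, 2, 4], [0, 1, 0, 1], [14.0, 15.0, 12.0, 18.0])
```

Tracking touched rows separately keeps entries that cancel to zero in the structure of C; filtering on non-zero values instead would be an equally reasonable reading of `which(SPA)`.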


Examples - Vectorization

• Some forms of vectorization may naturally lead to Gather/Scatter operations

for (j in range(N)):
    for (i in range(4)):
        out[j] += data[i,j]

Column-Major

for (j = 0; j < N; j+=8):
    for (i in range(4)):
        gather_accum_stride(temp, j+i, 8, 4)  // gather 8 elements, gap size 4
    out[j:j+8] = temp
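
To make the strided access concrete, here is a small Python sketch of the vectorized loop. It assumes `data` is a 4 x N matrix stored column-major in a flat list, so element (i, j) sits at offset i + 4*j. The signature of `gather_accum_stride` is extended with an explicit `data` argument and a flat start offset, which is an assumption of this sketch rather than the slide's exact interface.

```python
def gather_accum_stride(temp, data, start, count, stride):
    """temp[k] += data[start + k * stride] for k in range(count)."""
    for k in range(count):
        temp[k] += data[start + k * stride]


def column_sums(data, n, rows=4, width=8):
    """Sum each column of a rows x n column-major matrix, width columns at a time."""
    out = [0] * n
    for j in range(0, n, width):
        w = min(width, n - j)   # the final chunk may be shorter than width
        temp = [0] * w
        for i in range(rows):
            # row i of columns j .. j+w-1: offsets i + rows*j, i + rows*(j+1), ...
            gather_accum_stride(temp, data, i + rows * j, w, rows)
        out[j:j + w] = temp
    return out


# Hypothetical data: column j holds the values 4j, 4j+1, 4j+2, 4j+3.
print(column_sums(list(range(32)), 8))
```

Each inner call reads `width` elements spaced `rows` apart, which is the uniform-stride gather pattern that the benchmark measures.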


Example - SuperLU

• SuperLU spends a large portion of its runtime on just scattering data

[Figure: Normalized execution time (0 to 100) for the matrices ND24K and BBMATH2O, broken down into SCATTER, DGEMM, and REST]

Chart credits: Piyush Sao

Benchmark Output


Performance Exploration

[Figure: Gather Bandwidth and Scatter Bandwidth, Uniform Stride. x-axis: Work per thread (1 to 64); y-axis: Eff. Bandwidth (% of BabelStream, 0 to 70); legend: Sparsity (1, 2, 4, 8, 16, 32, 64, 128). Panels for each kernel: Tesla P100, CUDA Backend; BDW, OpenMP Backend; Power8, OpenMP Backend]

Uniform Stride Access

[Figure: Impact of Access Sparsity, Linear Access. Two panels: GATHER kernel and SCATTER kernel. x-axis: Sparsity (1 to 128); y-axis: Effective Bandwidth (% of BabelStream, 0 to 80). Device-Backend: BDW-OMP, GV100-CUDA, K40c-CUDA, KNL-OMP, P100-CUDA, Power8-OMP, SNB-OMP, ThunderX2-OMP, Titan Xp-CUDA, grouped into CPU and GPU]
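
As a rough illustration of what the uniform-stride plots measure, the sketch below builds a strided index buffer and converts a timing into a percentage of a STREAM-style bandwidth. It assumes that a sparsity of s means every s-th element is accessed, and that effective bandwidth counts only the bytes actually gathered or scattered; both are assumptions of this sketch, and the function names are hypothetical.

```python
def uniform_stride_indices(count, stride):
    """Index buffer touching every stride-th element: 0, stride, 2*stride, ..."""
    return [k * stride for k in range(count)]


def percent_of_stream(bytes_moved, seconds, stream_bytes_per_second):
    """Effective bandwidth as a percentage of a reference STREAM-style bandwidth."""
    return 100.0 * (bytes_moved / seconds) / stream_bytes_per_second


print(uniform_stride_indices(4, 8))  # [0, 8, 16, 24]
```

With stride 1 the pattern is dense, so larger strides show how much bandwidth is lost as fewer of the elements in each fetched cache line or memory transaction are used.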

Random Access

Gather, Uniform

[Figure: Impact of Access Sparsity, GATHER kernel, Random Access. x-axis: Sparsity (1 to 128); y-axis: Effective Bandwidth (% of BabelStream, 0 to 30). GPU panel: K40c-CUDA, P100-CUDA, Titan Xp-CUDA. CPU panel: BDW-OMP, KNL-OMP, Power8-OMP, SNB-OMP, ThunderX2-OMP]

Energy Efficiency

[Figure: Impact of Access Sparsity, Linear Access, Uniform Stride. Two panels: GATHER kernel and SCATTER kernel. x-axis: Sparsity (1 to 128); y-axis: Efficiency (Bytes/Joule, 0 to 1500). Device: BDW-OMP, GV100-CUDA, K40c-CUDA, KNL-OMP, P100-CUDA, Power8-OMP, SNB-OMP, ThunderX2-OMP, Titan Xp-CUDA, grouped into CPU and GPU]

What’s Next?

• Partner with industry to run on upcoming systems

• Evaluation of slightly more complex synthetic traces

• Mostly stride-1

• Write collisions

• Gather/Scatter traces from real (DOE) mini-apps

• Measure impact of vector length (SVE and AVX) on generated code (and therefore cache performance)

• CILK backend for EMU, FPGA-specific OpenCL Backend

• More general kernels, with accumulation and a length buffer

• Simplify to present a STREAM-like result


More Info


• Spatter.io

• Documentation

• Guide to easily plot your GPU against ours

• ArXiv Pre-print

• Spatter: A Benchmark Suite for Evaluating Sparse Access Patterns

• https://arxiv.org/abs/1811.03743

• Code

• https://github.com/hpcgarage/spatter


Acknowledgements

This material is based upon work supported by the National Science Foundation under Award #1710371.


The Spatter Benchmark (spatter.io)

Or: Benchmarking and Modeling Sparse Memory Accesses for Heterogeneous Systems

Patrick Lavin, Jeffrey Young, Jason Riedy, Rich Vuduc

