Transcript
Page 1

BIPS
Computational Research Division

Tuning Sparse Matrix Vector Multiplication for multi-core SMPs

(details in paper at SC07)

Samuel Williams (1,2), Richard Vuduc (3), Leonid Oliker (1,2), John Shalf (2), Katherine Yelick (1,2), James Demmel (1,2)

(1) University of California Berkeley, (2) Lawrence Berkeley National Laboratory, (3) Georgia Institute of Technology

[email protected]

Page 2

Overview

Multicore is the de facto performance solution for the next decade.

Examined the Sparse Matrix Vector Multiplication (SpMV) kernel: an important HPC kernel, memory intensive, and challenging for multicore.

Present two auto-tuned threaded implementations: a Pthreads, cache-based implementation and a Cell local-store-based implementation.

Benchmarked performance across 4 diverse multicore architectures: Intel Xeon (Clovertown), AMD Opteron, Sun Niagara2, IBM Cell Broadband Engine.

Compare with the leading MPI implementation (PETSc) using an auto-tuned serial kernel (OSKI).

Show that Cell delivers good performance and efficiency, while Niagara2 delivers good performance and productivity.

Page 3

Sparse Matrix Vector Multiplication

Sparse matrix: most entries are 0.0. There is a performance advantage in only storing/operating on the nonzeros, but this requires significant meta data.

Evaluate y = Ax, where A is a sparse matrix and x and y are dense vectors.

Challenges: difficult to exploit ILP (bad for superscalar); difficult to exploit DLP (bad for SIMD); irregular memory access to the source vector; difficult to load balance; very low computational intensity (often >6 bytes/flop).

[Figure: schematic of y = Ax]

Page 4

Test Suite
- Dataset (Matrices)
- Multicore SMPs

Page 5

Matrices Used

Pruned the original SPARSITY suite down to 14 matrices (none should fit in cache). Subdivided them into 4 categories. Rank ranges from 2K to 1M.

Dense, Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology, FEM/Accelerator, Circuit, webbase, LP

Categories:
- 2K x 2K Dense matrix stored in sparse format
- Well Structured (sorted by nonzeros/row)
- Poorly Structured (hodgepodge)
- Extreme Aspect Ratio (linear programming)

Page 6

Multicore SMP Systems

[Block diagrams of the four test systems:
- Intel Clovertown: two sockets, each with two pairs of Core2 cores sharing a 4MB L2; FSBs at 10.6GB/s into a chipset with 4x64b controllers; Fully Buffered DRAM at 21.3 GB/s (read), 10.6 GB/s (write)
- AMD Opteron: two sockets, each with two Opteron cores (1MB victim cache each) and a Memory Controller / HT; DDR2 DRAM at 10.6GB/s per socket; 4GB/s HT link (each direction)
- Sun Niagara2: eight MT UltraSparc cores (8K D$ and FPU each) through a crossbar switch to a 4MB shared L2 (16 way, 179 GB/s fill, 90 GB/s write-thru); 4x128b FBDIMM memory controllers at 42.7GB/s (read), 21.3 GB/s (write)
- IBM Cell Blade: two Cell chips joined by the BIF, each with a PPE (512K L2) and eight SPEs (256K local store + MFC each) on the EIB (ring network, <<20GB/s each direction); XDR DRAM at 25.6GB/s per chip]

Page 7

Multicore SMP Systems (memory hierarchy)

[Same four system diagrams, annotated: Clovertown, Opteron, and Niagara2 have a conventional cache-based memory hierarchy; the Cell blade has a disjoint local store memory hierarchy.]

Page 8

Multicore SMP Systems (2 implementations)

[Same four system diagrams, annotated: the Cache+Pthreads SpMV implementation covers Clovertown, Opteron, and Niagara2; the Local Store SpMV implementation covers Cell.]

Page 9

Multicore SMP Systems (cache)

[Same four system diagrams, annotated with aggregate on-chip capacity: Clovertown 16MB (vectors fit), Opteron 4MB, Niagara2 4MB, Cell 4MB (local store).]

Page 10

Multicore SMP Systems (peak flops)

[Same four system diagrams, annotated with peak flops: Clovertown 75 Gflop/s (TLP + ILP + SIMD), Opteron 17 Gflop/s (TLP + ILP), Niagara2 11 Gflop/s (TLP), Cell 29 Gflop/s (TLP + ILP + SIMD).]

Page 11

Multicore SMP Systems (peak read bandwidth)

[Same four system diagrams, annotated with peak read bandwidth: Clovertown 21 GB/s, Opteron 21 GB/s, Niagara2 43 GB/s, Cell 51 GB/s.]

Page 12

Multicore SMP Systems (NUMA)

[Same four system diagrams, annotated: Clovertown and Niagara2 are uniform memory access; the Opteron and Cell blades are non-uniform memory access.]

Page 13

Naïve Implementation for Cache-based Machines

Performance across the suite; also included a median performance number.

Page 14

vanilla C Performance

[Bar charts: GFlop/s (0.0 to 1.0) across the matrix suite (Dense through LP, plus Median) for Intel Clovertown, AMD Opteron, and Sun Niagara2.]

Vanilla C implementation; matrix stored in CSR (compressed sparse row). Explored compiler options; only the best is presented here. An x86 core delivers >10x the performance of a Niagara2 thread.

Page 15

Pthread Implementation for Cache-Based Machines

SPMD, strong scaling. Optimized for multicore/threading. A variety of shared-memory programming models are acceptable (not just Pthreads). More colors = more optimizations = more work.

Page 16

Naïve Parallel Performance

[Bar charts: GFlop/s (0.0 to 3.0) across the matrix suite for Intel Clovertown, AMD Opteron, and Sun Niagara2, comparing Naïve Pthreads against Naïve Single Thread.]

Simple row parallelization, implemented with Pthreads (though it didn't have to be). A 1P Niagara2 is roughly 2x to 5x faster than the 2P x86 machines.

Page 17

Naïve Parallel Performance

[Bar charts: GFlop/s (0.0 to 3.0) across the matrix suite for Intel Clovertown, AMD Opteron, and Sun Niagara2, comparing Naïve Pthreads against Naïve Single Thread.]

Scaling: Clovertown, 8x cores = 1.9x performance; Opteron, 4x cores = 1.5x performance; Niagara2, 64x threads = 41x performance.

Simple row parallelization, implemented with Pthreads (though it didn't have to be). A 1P Niagara2 is roughly 2x to 5x faster than the 2P x86 machines.

Page 18

Naïve Parallel Performance

[Bar charts: GFlop/s (0.0 to 3.0) across the matrix suite for Intel Clovertown, AMD Opteron, and Sun Niagara2, comparing Naïve Pthreads against Naïve Single Thread.]

Efficiency: Clovertown, 1.4% of peak flops, 29% of bandwidth; Opteron, 4% of peak flops, 20% of bandwidth; Niagara2, 25% of peak flops, 39% of bandwidth.

Simple row parallelization, implemented with Pthreads (though it didn't have to be). A 1P Niagara2 is roughly 2x to 5x faster than the 2P x86 machines.

Page 19

Case for Auto-tuning

How do we deliver good performance across all these architectures and all these matrices without exhaustively optimizing every combination?

Auto-tuning (in general): write a Perl script that generates all possible optimizations, then heuristically or exhaustively search them. Existing SpMV solution: OSKI (developed at UCB).

This work: tuning geared toward multi-core/-threading; generates SSE/SIMD intrinsics, prefetching, loop transformations, alternate data structures, etc.; a "prototype for parallel OSKI".

Page 20

Performance (+NUMA & SW Prefetching)

Stacked optimizations shown: Naïve Single Thread, Naïve Pthreads, +NUMA/Affinity, +Software Prefetching.

[Bar charts: GFlop/s (0.0 to 3.0) across the matrix suite for Intel Clovertown, AMD Opteron, and Sun Niagara2.]

NUMA-aware allocation. Tag prefetches with the appropriate temporal locality. Tune for the optimal distance.

Page 21

Matrix Compression

For memory-bound kernels, minimizing memory traffic should maximize performance. Compress the meta data; exploit structure to eliminate meta data.

Heuristic: select the compression that minimizes the matrix size: power-of-2 register blocking, CSR/COO format, 16b/32b indices, etc.

Side effect: the matrix may be minimized to the point where it fits entirely in cache.

Page 22

Performance (+matrix compression)

Stacked optimizations shown: Naïve Single Thread, Naïve Pthreads, +NUMA/Affinity, +Software Prefetching, +Matrix Compression.

[Bar charts: GFlop/s (0.0 to 6.0) across the matrix suite for Intel Clovertown, AMD Opteron, and Sun Niagara2.]

Very significant on some matrices; missed the boat on others.

Page 23

Performance (+cache & TLB blocking)

Stacked optimizations shown: Naïve Single Thread, Naïve Pthreads, +NUMA/Affinity, +Software Prefetching, +Matrix Compression, +Cache/TLB Blocking.

[Bar charts: GFlop/s (0.0 to 6.0) across the matrix suite for Intel Clovertown, AMD Opteron, and Sun Niagara2.]

Cache blocking geared towards sparse accesses.

Page 24

Performance (more DIMMs, firmware fix, etc.)

Stacked optimizations shown: Naïve Single Thread, Naïve Pthreads, +NUMA/Affinity, +Software Prefetching, +Matrix Compression, +Cache/TLB Blocking, +More DIMMs, rank configuration, etc.

[Bar charts: GFlop/s (0.0 to 6.0) across the matrix suite for Intel Clovertown, AMD Opteron, and Sun Niagara2.]

Page 25

Performance (more DIMMs, firmware fix, etc.)

[Bar charts: GFlop/s (0.0 to 6.0) across the matrix suite for Intel Clovertown, AMD Opteron, and Sun Niagara2, with all optimizations stacked.]

Efficiency after tuning: Clovertown, 4% of peak flops, 52% of bandwidth; Opteron, 20% of peak flops, 66% of bandwidth; Niagara2, 52% of peak flops, 54% of bandwidth.

Page 26

Performance (more DIMMs, firmware fix, etc.)

[Bar charts: GFlop/s (0.0 to 6.0) across the matrix suite for Intel Clovertown, AMD Opteron, and Sun Niagara2, with all optimizations stacked.]

Essential optimizations per platform: Clovertown, 3 (Parallelize, Prefetch, Compress); Opteron, 4 (Parallelize, NUMA, Prefetch, Compress); Niagara2, 2 (Parallelize, Compress).

Page 27

Cell Implementation
- Comments
- Performance

Page 28

Cell Implementation

No vanilla C implementation (aside from the PPE). In some cases, what were optional optimizations on cache-based machines are requirements for correctness on Cell.

Even SIMDized double precision is extremely weak, and scalar double precision is unbearable. The minimum register blocking is 2x1 (SIMDizable), which can increase memory traffic by 66%.

Optional cache blocking becomes required local store blocking. Spatial and temporal locality is captured by software when the matrix is optimized; in essence, the high bits of column indices are grouped into DMA lists.

Branchless implementation.

Despite the following performance numbers, Cell was still handicapped by double precision.

Page 29

Performance (all)

[Bar charts: GFlop/s (0.0 to 12.0) across the matrix suite for Intel Clovertown, AMD Opteron, Sun Niagara2, and the IBM Cell Broadband Engine (1, 8, and 16 SPEs).]

Page 30

Hardware Efficiency

[Bar charts: GFlop/s (0.0 to 12.0) across the matrix suite for Intel Clovertown, AMD Opteron, Sun Niagara2, and the IBM Cell Broadband Engine.]

Clovertown: 4% of peak flops, 52% of bandwidth. Opteron: 20% of peak flops, 66% of bandwidth. Niagara2: 52% of peak flops, 54% of bandwidth. Cell: 39% of peak flops, 89% of bandwidth.


BIPSBIPS

Productivity

[Figure: SpMV performance (GFlop/s) for each matrix on Intel Clovertown, AMD Opteron, Sun Niagara2, and IBM Cell Broadband Engine (Cell shown with 1, 8, and 16 SPEs)]

Intel Clovertown needed 3 essential optimizations: Parallelize, Prefetch, Compress
AMD Opteron needed 4 essential optimizations: Parallelize, NUMA, Prefetch, Compress
Sun Niagara2 needed 2 essential optimizations: Parallelize, Compress
IBM Cell Broadband Engine needed 5 essential optimizations: Parallelize, NUMA, DMA, Compress, Cache (local store) Block


BIPSBIPS

Parallel Efficiency

[Figure: parallel efficiency vs. concurrency for Intel Clovertown (1-8 cores), AMD Opteron (1-4 cores), Sun Niagara2 (up to 64 threads), and IBM Cell Broadband Engine (1-16 SPEs)]


BIPSBIPS

Parallel Efficiency (breakdown)

[Figure: parallel efficiency vs. concurrency for Intel Clovertown, AMD Opteron, Sun Niagara2, and IBM Cell Broadband Engine, annotated with the multicore, multisocket, and multithreading (MT) scaling regions]


BIPSBIPS

C O M P U T A T I O N A L R E S E A R C H D I V I S I O N

Multicore MPI Implementation
This is the default approach to programming multicore.


BIPSBIPS Multicore MPI Implementation

Used PETSc with shared-memory MPICH
Used OSKI (developed @ UCB) to optimize each thread
= a good autotuned shared-memory MPI implementation

[Figure: SpMV performance (GFlop/s) for each matrix on Intel Clovertown and AMD Opteron, comparing MPI (autotuned), Pthreads (autotuned), and naïve single thread]


BIPSBIPS

C O M P U T A T I O N A L R E S E A R C H D I V I S I O N

Summary


BIPSBIPS

Median Performance

[Figure: median GFlop/s on Intel Clovertown, AMD Opteron, Sun Niagara2, and IBM Cell for naïve, PETSc+OSKI (autotuned), and Pthreads (autotuned) implementations]

Median Power Efficiency

[Figure: median MFlop/s/Watt on the same four systems and implementations]

Median Performance & Efficiency

1P Niagara2 was consistently better than the 2P x86 machines (more potential bandwidth).
Cell delivered by far the best performance (better utilization).

Used a digital power meter to measure sustained system power.
FBDIMM is power hungry (12W/DIMM).

Clovertown (330W) and Niagara2 (350W) draw comparable system power.


BIPSBIPS Summary

Paradoxically, the most complex/advanced architectures required the most tuning and delivered the lowest performance.

Niagara2 delivered both very good performance and productivity.
Cell delivered very good performance and efficiency:

90% of memory bandwidth (most get ~50%)
High power efficiency
Easily understood performance
Extra traffic = lower performance (future work can address this)

Our multicore-specific autotuned implementation significantly outperformed an autotuned MPI implementation. It exploited:

Matrix compression geared towards multicore rather than single core
NUMA
Prefetching
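The compression exploited above can be sketched as follows. This is a hypothetical illustration in Python (the actual implementation is in C, and its exact compression scheme is not reproduced here): once the matrix is cache-blocked, a block spanning fewer than 2^16 columns can store its column indices relative to the block start in 16 bits instead of 32, halving index traffic.

```python
# Hypothetical sketch of multicore-oriented index compression: rebase column
# indices to the cache block's first column so each fits in 16 bits.
from array import array

def compress_block_indices(cols, block_start):
    """Rebase 32-bit column indices to 16-bit block-relative indices."""
    rel = [c - block_start for c in cols]
    if any(r < 0 or r >= 1 << 16 for r in rel):
        raise ValueError("column span too wide for 16-bit indices")
    return array('H', rel)  # 'H' = unsigned 16-bit: half the traffic of 'I'
```

Since SpMV is memory bound, shrinking index storage translates almost directly into higher flop rates.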


BIPSBIPS Acknowledgments

UC Berkeley: RADLab cluster (Opterons), PSI cluster (Clovertowns)

Sun Microsystems: Niagara2 access

Forschungszentrum Jülich: Cell blade cluster access


BIPSBIPS

C O M P U T A T I O N A L R E S E A R C H D I V I S I O N

Questions?


BIPSBIPS

C O M P U T A T I O N A L R E S E A R C H D I V I S I O N

Backup Slides


BIPSBIPS Parallelization

Matrix partitioned by rows and balanced by the number of nonzeros
SPMD-like approach
A barrier() is called before and after the SpMV kernel
Each sub-matrix stored separately in CSR
Load balancing can be challenging

# of threads explored in powers of 2 (in paper)

[Figure: y = A·x with A partitioned into row blocks, one per thread]
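The partitioning above can be sketched as follows. This is a hypothetical Python illustration (not the authors' pthreads/C code): rows are split so each thread owns roughly equal nonzeros, and each thread then computes its own slice of y = A·x from a CSR sub-matrix.

```python
# Hypothetical sketch: nnz-balanced row partitioning plus one thread's
# share of a CSR SpMV (the real code runs spmv_slice per pthread between
# barriers).
import bisect

def partition_rows_by_nnz(rowptr, nthreads):
    """Row boundaries giving each thread roughly nnz/nthreads nonzeros."""
    nnz = rowptr[-1]
    bounds = [0]
    for t in range(1, nthreads):
        # first row whose cumulative nnz reaches this thread's target
        bounds.append(bisect.bisect_left(rowptr, t * nnz // nthreads))
    bounds.append(len(rowptr) - 1)
    return bounds

def spmv_slice(rowptr, col, val, x, y, r0, r1):
    """One thread's share: y[r0:r1] = A[r0:r1, :] * x (CSR storage)."""
    for i in range(r0, r1):
        acc = 0.0
        for k in range(rowptr[i], rowptr[i + 1]):
            acc += val[k] * x[col[k]]
        y[i] = acc
```

Because threads write disjoint slices of y, no locking is needed; only the barriers before and after the kernel synchronize them.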


BIPSBIPS Exploiting NUMA, Affinity

Bandwidth on the Opteron (and Cell) can vary substantially based on placement of data.

Bind each sub-matrix and the thread that processes it together
Explored libnuma, Linux, and Solaris routines
Adjacent blocks bound to adjacent cores

[Figure: dual-socket Opteron with DDR2 DRAM attached to each socket: a single thread; multiple threads using one memory controller; multiple threads using both memory controllers]
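The binding idea can be sketched as follows; this is a hypothetical Linux-only illustration (the paper's code used libnuma and Solaris routines). Pinning the caller before it first touches its sub-matrix means first-touch page allocation places the data in that socket's local memory.

```python
# Hypothetical sketch of processor affinity on Linux: pin the caller to one
# CPU so the sub-matrix it allocates stays local to that socket.
import os

def pin_to_cpu(cpu):
    """Bind the calling process/thread to a single CPU; return the old mask."""
    prev = os.sched_getaffinity(0)
    os.sched_setaffinity(0, {cpu})
    return prev
```

A thread would call pin_to_cpu before allocating and initializing its block, then restore the previous mask if needed.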


BIPSBIPS Cache and TLB Blocking

Accesses to the matrix and destination vector are streaming
But access to the source vector can be random
Reorganize the matrix (and thus the access pattern) to maximize reuse
Applies equally to TLB blocking (caching PTEs)

Heuristic: block the destination, then keep adding more columns as long as the number of source-vector cache lines (or pages) touched is less than the cache (or TLB) capacity. Apply all previous optimizations individually to each cache block.

Search: neither, cache, cache & TLB

Better locality comes at the expense of confusing the hardware prefetchers.

[Figure: y = A·x with A reorganized into cache blocks]
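The greedy heuristic above can be sketched as follows; this is a hypothetical Python illustration with toy sizes, not the tuner's actual code. A cache block is grown column-by-column until the set of distinct source-vector cache lines it touches would exceed the budget, at which point the block is closed and a new one started (the same logic applies to TLB blocking with pages instead of lines).

```python
# Hypothetical sketch of the column-blocking heuristic: extend a block while
# the distinct source-vector cache lines it touches fit in a line budget.
def choose_column_blocks(touched_cols, ncols, line_words=8, budget=2):
    """Split [0, ncols) so each block touches <= budget source-vector lines.
    touched_cols: set of columns with nonzeros in this row block.
    line_words: doubles per cache line (8 for 64-byte lines, an assumption)."""
    blocks, start, lines = [], 0, set()
    for c in range(ncols):
        if c in touched_cols:
            line = c // line_words  # cache line holding x[c]
            if line not in lines and len(lines) == budget:
                blocks.append((start, c))  # close the current cache block
                start, lines = c, {line}
            else:
                lines.add(line)
    blocks.append((start, ncols))
    return blocks
```

In the real tuner the budget is the cache (or TLB) size, and each resulting block is then optimized individually as the slide describes.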


BIPSBIPS Banks, Ranks, and DIMMs

In this SPMD approach, as the number of threads increases, so too does the number of concurrent streams to memory
Most memory controllers have a finite capability to reorder requests (DMA can avoid or minimize this)
Addressing/bank conflicts become increasingly likely

Adding more DIMMs and configuring ranks can help
The Clovertown system was already fully populated

