CS 267 Sparse Matrices:
Sparse Matrix-Vector Multiply for Iterative Solvers
Kathy Yelick
www.cs.berkeley.edu/~yelick/cs267_sp07
Phillip Colella's "Seven Dwarfs"
• High-end simulation in the physical sciences = 7 numerical methods:
1. Structured Grids (including locally structured grids, e.g. AMR)
2. Unstructured Grids
3. Fast Fourier Transform
4. Dense Linear Algebra
5. Sparse Linear Algebra
6. Particles
7. Monte Carlo
• Well-defined targets from algorithmic, software, and architecture standpoints
• Add 4 for embedded:
8. Search/Sort
9. Finite State Machine
10. Filter
11. Combinational logic
• Then covers all 41 EEMBC benchmarks
• Revise 1 for SPEC: 7. Monte Carlo => Easily parallel (to add ray tracing)
• Then covers 26 SPEC benchmarks
Slide from "Defining Software Requirements for Scientific Computing", Phillip Colella, 2004
ODEs and Sparse Matrices
• All these problems reduce to sparse matrix problems
• Explicit: sparse matrix-vector multiplication (SpMV)
• Implicit: solve a sparse linear system
  • direct solvers (Gaussian elimination)
  • iterative solvers (use sparse matrix-vector multiplication)
• Eigenvalue/vector algorithms may also be explicit or implicit
• Conclusion: SpMV is key to many ODE problems
  • Relatively simple algorithm to study in detail
  • Two key problems: locality and load balance
SpMV in Compressed Sparse Row (CSR) Format
• Matrix-vector multiply kernel: y(i) = y(i) + A(i,j) · x(j)

for each row i
    for k = ptr[i] to ptr[i+1] - 1 do
        y[i] = y[i] + val[k] * x[ind[k]]

[Figure: y = A·x, with the CSR representation of A (val, ind, and ptr arrays).]
• CSR format is one of many possibilities
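As a concrete rendering of the kernel above, here is a minimal, self-contained C sketch (the function name and signature are illustrative, not from the lecture):

/* Sketch of CSR SpMV: y = y + A*x for an n-row matrix.
 * ptr has n+1 entries; row i's nonzeros are val[ptr[i] .. ptr[i+1]-1],
 * with matching column indices in ind. */
void spmv_csr(int n, const int *ptr, const int *ind,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double yi = y[i];
        for (int k = ptr[i]; k < ptr[i + 1]; k++)
            yi += val[k] * x[ind[k]];
        y[i] = yi;
    }
}

An iterative solver calls this once per iteration; note the irregular, index-driven access to x, which is the source of the locality problem mentioned above.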
Motivation for Automatic Performance Tuning of SpMV
• Historical trends
  • Sparse matrix-vector multiply (SpMV): 10% of peak or less
• Performance depends on machine, kernel, matrix
  • Matrix known only at run-time
  • Best data structure + implementation can be surprising
• Our approach: empirical performance modeling and algorithm search
SpMV Historical Trends: Fraction of Peak
Example: The Difficulty of Tuning
• n = 21200
• nnz = 1.5 M
• kernel: SpMV
• Source: NASA structural analysis problem
Example: The Difficulty of Tuning
• n = 21200
• nnz = 1.5 M
• kernel: SpMV
• Source: NASA structural analysis problem
• 8x8 dense substructure
Taking advantage of block structure in SpMV
• Bottleneck is time to get the matrix from memory
  • Only 2 flops for each nonzero in the matrix
• Don't store each nonzero with its index; instead store each nonzero r-by-c block with one index
  • Storage drops by up to 2x if r·c >> 1 (all 32-bit quantities)
  • Time to fetch the matrix from memory decreases
• Change both data structure and algorithm
  • Need to pick r and c
  • Need to change the algorithm accordingly (see the BCSR sketch below)
• In the example, is r = c = 8 the best choice?
  • Minimizes storage, so it looks like a good idea…
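To make the blocked data structure concrete, here is a minimal C sketch of SpMV in block CSR (BCSR) format, hard-wired to 2x2 blocks; the array names (b_ptr, b_ind, b_val) are illustrative, and each block is assumed stored contiguously in row-major order:

/* Sketch: SpMV for BCSR with fixed 2x2 blocks (r = c = 2).
 * b_ptr/b_ind index block rows and block columns; b_val stores each
 * 2x2 block contiguously. Assumes the dimension is divisible by 2. */
void spmv_bcsr_2x2(int n_block_rows, const int *b_ptr, const int *b_ind,
                   const double *b_val, const double *x, double *y)
{
    for (int I = 0; I < n_block_rows; I++) {
        double y0 = y[2*I], y1 = y[2*I + 1];
        for (int k = b_ptr[I]; k < b_ptr[I + 1]; k++) {
            const double *b  = &b_val[4*k];       /* one 2x2 block */
            const double *xp = &x[2*b_ind[k]];    /* matching piece of x */
            y0 += b[0]*xp[0] + b[1]*xp[1];        /* unrolled block multiply */
            y1 += b[2]*xp[0] + b[3]*xp[1];
        }
        y[2*I] = y0;
        y[2*I + 1] = y1;
    }
}

One index (b_ind[k]) now covers four values, which is where the storage saving comes from; the unrolled inner block also exposes register reuse of x.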
Speedups on Itanium 2: The Need for Search
[Figure: reference implementation vs. best register blocking (4x2), performance in Mflop/s.]
Register Profile: Itanium 2
[Figure: Mflop/s for all r x c block sizes; performance ranges from 190 Mflop/s to 1190 Mflop/s.]
SpMV Performance (Matrix #2): Generation 2
[Figure: register profiles with best fraction of peak per platform: Ultra 2i (9%), Ultra 3 (5%), Pentium III-M (15%), Pentium III (19%); best/reference rates span 35 to 120 Mflop/s.]
Register Profiles: Sun and Intel x86
[Figure: best fraction of peak per platform: Ultra 2i (11%), Ultra 3 (5%), Pentium III-M (15%), Pentium III (21%); best/reference rates span 35 to 122 Mflop/s.]
SpMV Performance (Matrix #2): Generation 1
[Figure: best fraction of peak per platform: Power3 (13%), Power4 (14%), Itanium 1 (7%), Itanium 2 (31%); best/reference rates span 100 Mflop/s to 1.1 Gflop/s.]
Register Profiles: IBM and Intel IA-64
[Figure: best fraction of peak per platform: Power3 (17%), Power4 (16%), Itanium 1 (8%), Itanium 2 (33%); best/reference rates span 107 Mflop/s to 1.2 Gflop/s.]
Another example of tuning challenges
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1M
Zoom in to top corner
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1M
3x3 blocks look natural, but…
• More complicated non-zero structure in general
• Example: 3x3 blocking
  • Logical grid of 3x3 cells
• But would lead to lots of "fill-in"
Extra Work Can Improve Efficiency!
• More complicated non-zero structure in general
• Example: 3x3 blocking
  • Logical grid of 3x3 cells
  • Fill in explicit zeros
  • Unroll 3x3 block multiplies
  • "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  • Actual Mflop rate is 1.5² = 2.25x higher
Automatic Register Block Size Selection
• Selecting the r x c block size
• Off-line benchmark
  • Precompute Mflops(r,c) using a dense matrix A, for each r x c
  • Once per machine/architecture
• Run-time "search"
  • Sample A to estimate Fill(r,c) for each r x c
• Run-time heuristic model
  • Choose r, c to minimize time ~ Fill(r,c) / Mflops(r,c) (see the sketch below)
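A minimal sketch of that heuristic, assuming the off-line Mflops(r,c) table and a run-time fill estimator are available (both names here are placeholders):

/* Sketch: pick the register block size r x c that minimizes
 * estimated time ~ Fill(r,c) / Mflops(r,c).
 * mflops[r][c] comes from the off-line dense-matrix benchmark;
 * estimate_fill() samples the user's matrix at run time. */
#define MAX_B 8

extern double mflops[MAX_B + 1][MAX_B + 1];  /* off-line benchmark data */
double estimate_fill(int r, int c);          /* hypothetical estimator */

void choose_block_size(int *best_r, int *best_c)
{
    double best_time = 1e300;
    for (int r = 1; r <= MAX_B; r++)
        for (int c = 1; c <= MAX_B; c++) {
            double t = estimate_fill(r, c) / mflops[r][c];
            if (t < best_time) {
                best_time = t;
                *best_r = r;
                *best_c = c;
            }
        }
}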
Accurate and Efficient Adaptive Fill Estimation
• Idea: sample the matrix
  • Fraction of matrix to sample: s ∈ [0,1]
  • Cost ~ O(s * nnz)
  • Control cost by controlling s
• Search at run-time: the constant matters!
  • Control s automatically by computing statistical confidence intervals
  • Idea: monitor variance
• Cost of tuning (a sampling sketch follows)
  • Lower bound: convert matrix in 5 to 40 unblocked SpMVs
  • Heuristic: 1 to 11 SpMVs
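One plausible sampling estimator in the spirit of this slide (the production OSKI estimator differs in its details; this sketch scans a random fraction s of the block rows and extrapolates):

#include <stdlib.h>

/* Sketch: estimate Fill(r,c) for a CSR matrix (n rows; ptr, ind) by
 * sampling a fraction s of the block rows. Fill = (entries stored
 * after blocking) / nnz >= 1. */
double estimate_fill_sampled(int n, const int *ptr, const int *ind,
                             int r, int c, double s)
{
    long sampled_nnz = 0, sampled_blocks = 0;
    char *seen = calloc((size_t)n / c + 2, 1);   /* block-column flags */
    for (int I = 0; I + r <= n; I += r) {
        if ((double)rand() / RAND_MAX >= s) continue;  /* skip this block row */
        for (int i = I; i < I + r; i++)          /* mark block columns used */
            for (int k = ptr[i]; k < ptr[i + 1]; k++) {
                sampled_nnz++;
                int bc = ind[k] / c;
                if (!seen[bc]) { seen[bc] = 1; sampled_blocks++; }
            }
        for (int i = I; i < I + r; i++)          /* reset flags for next block row */
            for (int k = ptr[i]; k < ptr[i + 1]; k++)
                seen[ind[k] / c] = 0;
    }
    free(seen);
    return sampled_nnz ? (double)(sampled_blocks * r * c) / sampled_nnz : 1.0;
}

Monitoring the variance of the per-block-row fill values would then let the library grow s only until the confidence interval is tight enough.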
Accuracy of the Tuning Heuristics (1/4)
NOTE: "fair" flops used (ops on explicit zeros not counted as "work"). See p. 375 of Vuduc's thesis for the list of matrices.
Accuracy of the Tuning Heuristics (2/4)
Accuracy of the Tuning Heuristics (3/4)
Accuracy of the Tuning Heuristics (4/4): DGEMV
Upper Bounds on Performance for blocked SpMV
• P = (flops) / (time)
  • Flops = 2 * nnz(A)
• Lower bound on time: two main assumptions
  1. Count memory ops only (streaming)
  2. Count only compulsory and capacity misses: ignore conflicts
• Account for line sizes
• Account for matrix size and nnz
• Charge a minimum access "latency" α_i at each cache level L_i, and α_mem at memory
  • e.g., as measured by the Saavedra-Barrera and PMaC MAPS benchmarks

Time = Σ_{i=1}^{κ} α_i · Hits_i + α_mem · Hits_mem,
where Hits_1 = Loads − Misses_1, Hits_i = Misses_{i−1} − Misses_i for 1 < i ≤ κ, and Hits_mem = Misses_κ.
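As a worked instance of this bound, a small C sketch that evaluates it from measured load and miss counts (the array layout here is an assumption for illustration):

/* Sketch: latency-based lower bound on time.
 * counts[0] = Loads; counts[i] = Misses at cache level i, i = 1..K.
 * alpha[i] = access latency at level i (i = 1..K); alpha[K+1] = memory. */
double time_lower_bound(int K, const double *alpha, const long *counts)
{
    double t = 0.0;
    for (int i = 1; i <= K; i++)
        t += alpha[i] * (counts[i - 1] - counts[i]);  /* hits at level i */
    t += alpha[K + 1] * counts[K];                    /* memory hits */
    return t;
}

The performance upper bound is then P = 2 * nnz(A) / time_lower_bound(...).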
Example: L2 Misses on Itanium 2
Misses measured using PAPI [Browne ’00]
Example: Bounds on Itanium 2
[Figure: measured SpMV performance vs. the upper bound for each block size.]
Summary of Other Performance Optimizations
• Optimizations for SpMV
  • Register blocking (RB): up to 4x over CSR
  • Variable block splitting: 2.1x over CSR, 1.8x over RB
  • Diagonals: 2x over CSR
  • Reordering to create dense structure + splitting: 2x over CSR
  • Symmetry: 2.8x over CSR, 2.6x over RB
  • Cache blocking: 2.8x over CSR
  • Multiple vectors (SpMM): 7x over CSR
  • And combinations…
• Sparse triangular solve
  • Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  • AAᵀ·x, AᵀA·x: 4x over CSR, 1.8x over RB
  • Aᵏ·x: 2x over CSR, 1.5x over RB
SpMV for Shared Memory and Multicore
• Data structure transformations
  • Thread blocking
  • Cache blocking
  • Register blocking
  • Format selection
  • Index size reduction
• Kernel optimizations
  • Prefetching
  • Loop structure
Thread Blocking
• Load balancing
  • Evenly divide the number of nonzeros
• Exploit NUMA memory systems on multi-socket SMPs
  • Must pin threads to cores AND pin data to sockets
Naïve Approach
• R x C processor grid
• Each processor covers the same number of rows and columns
• Potentially unbalanced
Load Balanced Approach
• R x C processor grid
• First, block into rows
  • Same number of nonzeros in each of the R blocked rows (see the sketch below)
• Second, block within each blocked row
  • Not only should each block within a row have ~the same number of nonzeros,
  • but all blocks should have ~the same number of nonzeros
• Third, prune unneeded rows & columns
• Fourth, re-encode the column indices to be relative to each thread block
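A minimal sketch of the first step, splitting CSR rows into R blocked rows of roughly equal nonzero count (function and array names are illustrative):

/* Sketch: partition n CSR rows into R blocked rows with ~equal nnz.
 * On return, block t owns rows row_start[t] .. row_start[t+1]-1. */
void partition_rows_by_nnz(int n, const int *ptr, int R, int *row_start)
{
    long nnz = ptr[n];
    int t = 0;
    row_start[0] = 0;
    for (int i = 0; i < n && t < R - 1; i++) {
        /* close block t once cumulative nnz reaches (t+1)/R of the total */
        if ((long)ptr[i + 1] * R >= nnz * (t + 1))
            row_start[++t] = i + 1;
    }
    while (t < R) row_start[++t] = n;   /* close any remaining blocks */
}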
Memory Optimizations
• Cache blocking
  • Performed for each thread block
  • Chop into blocks so the entire source vector fits in cache
• Prefetching
  • Insert explicit prefetch operations to mask latency to memory
  • Tune prefetch distance/time using search
• Register blocking
  • As in OSKI, but done separately per cache block
  • Simpler heuristic: choose the block size that minimizes total storage
• Index compression (see the sketch below)
  • Use 16-bit ints for indices in blocks less than 64K wide
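The index-compression idea in a minimal C sketch, assuming a cache block whose columns all fall within 64K of a base column (names are illustrative):

#include <stdint.h>

/* Sketch: SpMV over one cache block whose columns span < 64K entries,
 * so column indices are stored as 16-bit offsets from col_base. */
void spmv_block_u16(int row_lo, int row_hi, int col_base,
                    const int *ptr, const uint16_t *ind16,
                    const double *val, const double *x, double *y)
{
    const double *xb = &x[col_base];   /* x restricted to this block */
    for (int i = row_lo; i < row_hi; i++)
        for (int k = ptr[i]; k < ptr[i + 1]; k++)
            y[i] += val[k] * xb[ind16[k]];
}

Halving the index size further cuts the memory traffic per nonzero, which is exactly the bottleneck identified earlier.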
1 Thread Performance (preliminary)
[Figure: naive CSR vs. +register blocking vs. +software prefetch for memplus.rua and raefsky3.rua, on a dual-socket dual-core Opteron @ 2.2GHz and a quad-socket single-core Opteron @ 2.4GHz. Naive rates: 258, 297, 324, and 430 Mflop/s; tuned rates reach up to 1372 Mflop/s, for speedups between 1.4x and 3.2x.]
2 Thread Performance (preliminary)
[Figure: same matrices and machines. Best thread- and register-blocked rates: 495, 1284, 755, and 1639 Mflop/s, i.e. 0.96x, 1.85x, 1.6x, and 1.2x over 1 thread.]
4 Thread Performance (preliminary)
[Figure: same matrices and machines. Best thread- and register-blocked rates: 985, 1911, 1369, and 3148 Mflop/s, i.e. 2.0x, 2.75x, 3.0x, and 2.3x over 1 thread.]
Speedup for the Best Combination of NThreads, Blocking, Prefetching, …
[Figure: same matrices and machines. Best combinations reach 985, 1911, 1369, and 3148 Mflop/s vs. naive single-thread rates of 258, 297, 324, and 430 Mflop/s: speedups of 3.8x, 6.4x, 4.2x, and 7.3x.]
Distributed Memory SpMV
• y = A*x, where A is a sparse n x n matrix
• Questions
  • Which processors store y[i], x[i], and A[i,j]?
  • Which processors compute y[i] = Σ_{j=1..n} A[i,j] * x[j] = (row i of A) · x … a sparse dot product?
• Partitioning
  • Partition the index set {1,…,n} = N1 ∪ N2 ∪ … ∪ Np
  • For all i in Nk, processor k stores y[i], x[i], and row i of A
  • For all i in Nk, processor k computes y[i] = (row i of A) · x
  • "Owner computes" rule: processor k computes the y[i]s it owns
[Figure: x, y, and the rows of A partitioned among processors P1–P4.]
• May require communication (see the sketch below)
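A minimal MPI sketch of the owner-computes rule; for simplicity every rank gathers the whole of x, whereas a real code would communicate only the remote entries its rows touch (names and the equal-rows assumption are illustrative):

#include <mpi.h>
#include <stdlib.h>

/* Sketch: y_local = A_local * x, each rank owning n_local consecutive
 * rows (CSR, global column indices) and its slice x_local of x.
 * Assumes n_local is the same on every rank (n_global = P * n_local). */
void dist_spmv(int n_global, int n_local,
               const int *ptr, const int *ind, const double *val,
               const double *x_local, double *y_local, MPI_Comm comm)
{
    double *x = malloc(n_global * sizeof(double));
    MPI_Allgather(x_local, n_local, MPI_DOUBLE,
                  x, n_local, MPI_DOUBLE, comm);   /* assemble full x */
    for (int i = 0; i < n_local; i++) {            /* owner computes y[i] */
        double yi = 0.0;
        for (int k = ptr[i]; k < ptr[i + 1]; k++)
            yi += val[k] * x[ind[k]];
        y_local[i] = yi;
    }
    free(x);
}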
Two Layouts
• The partitions should be by nonzero counts, not by rows/columns
• 1D partition: most popular, but for algorithms (e.g. NAS CG) that do reductions on y, these scale with log P
• 2D partition: reductions scale with log sqrt(P) = (1/2) log P, but needs ~equal nonzeros per block for load balance
[Figure: 1D (row) and 2D (block) partitions of x, y, and A among P1–P4.]
Summary
• Sparse matrix vector multiply critical to many applications
• Performance limited by memory systems (and perhaps network)
• Cache blocking, register blocking, prefetching are all important
• Autotuning can be used, but it needs the matrix structure
Extra Slides
Including: How to use OSKI
Example: Sparse Triangular Factor
• Raefsky4 (structural problem) + SuperLU + colmmd
• N=19779, nnz=12.6 M
Dense trailing triangle: dim = 2268, 20% of total nonzeros (can be as high as 90+%!)
1.8x over CSR (see the trisolve sketch below)
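A hedged sketch of the hybrid idea: solve the sparse leading rows in CSR, then switch to a dense solve for the trailing triangle. For simplicity this assumes the trailing rows have nonzeros only inside the dense triangle, and that each sparse row stores its diagonal last (all names illustrative):

/* Sketch: lower-triangular solve L*x = b with a dense d x d trailing
 * triangle. Rows 0..n-d-1 are sparse CSR; dense_tri holds rows
 * n-d..n-1 of the triangle, row-major. On entry x = b; on exit x. */
void trisolve_switch_to_dense(int n, int d, const int *ptr, const int *ind,
                              const double *val, const double *dense_tri,
                              double *x)
{
    int ns = n - d;
    for (int i = 0; i < ns; i++) {              /* sparse part */
        double s = x[i];
        for (int k = ptr[i]; k < ptr[i + 1] - 1; k++)
            s -= val[k] * x[ind[k]];
        x[i] = s / val[ptr[i + 1] - 1];         /* diagonal stored last */
    }
    for (int i = 0; i < d; i++) {               /* dense trailing triangle */
        const double *row = &dense_tri[(size_t)i * d];
        double s = x[ns + i];
        for (int j = 0; j < i; j++)
            s -= row[j] * x[ns + j];
        x[ns + i] = s / row[i];
    }
}

The dense loop runs at dense-BLAS-like speed, which is the source of the 1.8x over CSR cited above.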
Cache Optimizations for A·Aᵀ·x
• Cache-level: interleave multiplication by A and Aᵀ
  • Only fetch A from memory once
• Register-level: a_iᵀ to be an r x c block row, or a diagonal row

A·Aᵀ·x = [a_1 ⋯ a_n] [a_1ᵀ; ⋮; a_nᵀ] x = Σ_{i=1}^{n} a_i (a_iᵀ·x)

where each a_iᵀ·x is a dot product and each update by a_i is an "axpy".
Example: Combining Optimizations
• Register blocking, symmetry, multiple (k) vectors
• Three low-level tuning parameters: r, c, v
[Figure: Y += A * X, with r x c register blocks in A and the k vectors of X processed v at a time.]
Example: Combining Optimizations
• Register blocking, symmetry, and multiple vectors [Ben Lee @ UCB]
• Symmetric, blocked, 1 vector
  • Up to 2.6x over nonsymmetric, blocked, 1 vector
• Symmetric, blocked, k vectors
  • Up to 2.1x over nonsymmetric, blocked, k vectors
  • Up to 7.3x over nonsymmetric, nonblocked, 1 vector
• Symmetric storage: up to 64.7% savings
Potential Impact on Applications: T3P
• Application: accelerator design [Ko]
• 80% of time spent in SpMV
• Relevant optimization techniques
  • Symmetric storage
  • Register blocking
• On a single-processor Itanium 2
  • 1.68x speedup
    • 532 Mflop/s, or 15% of 3.6 Gflop/s peak
  • 4.4x speedup with multiple (8) vectors
    • 1380 Mflop/s, or 38% of peak
Potential Impact on Applications: Omega3P
• Application: accelerator cavity design [Ko]
• Relevant optimization techniques
  • Symmetric storage
  • Register blocking
  • Reordering
    • Reverse Cuthill-McKee ordering to reduce bandwidth
    • Traveling Salesman Problem-based ordering to create blocks
      – Nodes = columns of A
      – Weights(u, v) = number of nonzeros u and v have in common
      – Tour = ordering of columns
      – Choose the maximum-weight tour
      – See [Pinar & Heath '97]
• 2.1x speedup on Power4, but SpMV is not dominant
Source: Accelerator Cavity Design Problem (Ko via Husbands)
100x100 Submatrix Along Diagonal
Post-RCM Reordering
"Microscopic" Effect of RCM Reordering
[Figure: before = green + red; after = green + blue.]
"Microscopic" Effect of Combined RCM+TSP Reordering
[Figure: before = green + red; after = green + blue.]
[Figure: reordering results for the Omega3P matrix.]
Optimized Sparse Kernel Interface - OSKI
• Provides sparse kernels automatically tuned for the user's matrix & machine
  • BLAS-style functionality: SpMV (A·x & Aᵀ·y), TrSV
  • Hides the complexity of run-time tuning
  • Includes new, faster locality-aware kernels: AᵀA·x, Aᵏ·x
• Faster than standard implementations
  • Up to 4x faster matvec, 1.8x trisolve, 4x AᵀA·x
• For "advanced" users & solver library writers
  • Available as a stand-alone library (OSKI 1.0.1b, 3/06)
  • Available as a PETSc extension (OSKI-PETSc 0.1d, 3/06)
  • bebop.cs.berkeley.edu/oski
How the OSKI Tunes (Overview)
[Figure: tuning workflow. Library install-time (offline): 1. build for target architecture; 2. benchmark, producing benchmark data and generated code variants. Application run-time: 1. evaluate heuristic models against the user's matrix, the workload from program monitoring, and history; 2. select data structure & code. To the user: a matrix handle for kernel calls.]
Extensibility: advanced users may write & dynamically add "code variants" and "heuristic models" to the system.
How the OSKI Tunes (Overview)
• At library build/install-time
  • Pre-generate and compile code variants into dynamic libraries
  • Collect benchmark data
    • Measures and records speed of the possible sparse data structure and code variants on the target architecture
  • Installation process uses standard, portable GNU AutoTools
• At run-time
  • Library "tunes" using heuristic models
    • Models analyze the user's matrix & benchmark data to choose an optimized data structure and code
  • Non-trivial tuning cost: up to ~40 mat-vecs
    • Library limits the time it spends tuning based on estimated workload, provided by the user or inferred by the library
    • User may reduce cost by saving tuning results for the application on future runs with the same or a similar matrix
Optimizations in the Initial OSKI Release
• Fully automatic heuristics for
  • Sparse matrix-vector multiply
    • Register-level blocking
    • Register-level blocking + symmetry + multiple vectors
    • Cache-level blocking
  • Sparse triangular solve with register-level blocking and the "switch-to-dense" optimization
  • Sparse AᵀA·x with register-level blocking
• User may select other optimizations manually
  • Diagonal storage optimizations, reordering, splitting; tiled matrix powers kernel (Aᵏ·x)
  • All available in dynamic libraries
  • Accessible via a high-level embedded script language
• "Plug-in" extensibility
  • Very advanced users may write their own heuristics, create new data structures/code variants, and dynamically add them to the system
How to Call OSKI: Basic Usage
• May gradually migrate existing apps
  • Step 1: "Wrap" existing data structures
  • Step 2: Make BLAS-like kernel calls

int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
double* x = …, *y = …; /* Let x and y be two dense vectors */

/* Compute y = β·y + α·A·x, 500 times */
for( i = 0; i < 500; i++ )
    my_matmult( ptr, ind, val, α, x, β, y );
How to Call OSKI: Basic Usage
• May gradually migrate existing apps
  • Step 1: "Wrap" existing data structures
  • Step 2: Make BLAS-like kernel calls

int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
double* x = …, *y = …; /* Let x and y be two dense vectors */

/* Step 1: Create OSKI wrappers around this data */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows,
    num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

/* Compute y = β·y + α·A·x, 500 times */
for( i = 0; i < 500; i++ )
    my_matmult( ptr, ind, val, α, x, β, y );
How to Call OSKI: Basic Usage
• May gradually migrate existing apps
  • Step 1: "Wrap" existing data structures
  • Step 2: Make BLAS-like kernel calls

int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
double* x = …, *y = …; /* Let x and y be two dense vectors */

/* Step 1: Create OSKI wrappers around this data */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows,
    num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

/* Compute y = β·y + α·A·x, 500 times */
for( i = 0; i < 500; i++ )
    oski_MatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view); /* Step 2 */
How to Call OSKI: Tune with Explicit Hints
• User calls the "tune" routine
  • May provide explicit tuning hints (OPTIONAL)

oski_matrix_t A_tunable = oski_CreateMatCSR( … );
/* … */

/* Tell OSKI we will call SpMV 500 times (workload hint) */
oski_SetHintMatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view, 500);
/* Tell OSKI we think the matrix has 8x8 blocks (structural hint) */
oski_SetHint(A_tunable, HINT_SINGLE_BLOCKSIZE, 8, 8);

oski_TuneMat(A_tunable); /* Ask OSKI to tune */

for( i = 0; i < 500; i++ )
    oski_MatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view);
How the User Calls OSKI: Implicit Tuning
• Ask the library to infer the workload
  • Library profiles all kernel calls
  • May periodically re-tune

oski_matrix_t A_tunable = oski_CreateMatCSR( … );
/* … */

for( i = 0; i < 500; i++ ) {
    oski_MatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view);
    oski_TuneMat(A_tunable); /* Ask OSKI to tune */
}
Quick-and-dirty Parallelism: OSKI-PETSc
• Extend PETSc's distributed-memory SpMV (MATMPIAIJ)
[Figure: block-row distribution of the matrix over processes p0–p3.]
• PETSc
  • Each process stores the diag (all-local) and off-diag submatrices
• OSKI-PETSc:
  • Add OSKI wrappers
  • Each submatrix tuned independently
OSKI-PETSc Proof-of-Concept Results
• Matrix 1: accelerator cavity design (R. Lee @ SLAC)
  • N ~ 1M, ~40M nonzeros
  • 2x2 dense block substructure
  • Symmetric
• Matrix 2: linear programming (Italian Railways)
  • Short-and-fat: 4K x 1M, ~11M nonzeros
  • Highly unstructured
  • Big speedup from cache-blocking: no native PETSc format
• Evaluation machine: Xeon cluster
  • Peak: 4.8 Gflop/s per node
Accelerator Cavity Matrix
OSKI-PETSc Performance: Accel. Cavity
Linear Programming Matrix
OSKI-PETSc Performance: LP Matrix
Tuning Higher Level Algorithms
• So far we have tuned a single sparse matrix kernel
  • y = AᵀA·x, motivated by a higher-level algorithm (SVD)
• What can we do by extending tuning to a higher level?
• Consider Krylov subspace methods for Ax = b, Ax = λx
  • Conjugate Gradients (CG), GMRES, Lanczos, …
  • Inner loop does y = A·x, dot products, saxpys, scalar ops
  • Inner loop costs at least O(1) messages
  • k iterations cost at least O(k) messages
• Our goal: show how to do k iterations with O(1) messages (see the sketch below)
  • Possible payoff: make Krylov subspace methods much faster on machines with slow networks
  • Memory bandwidth improvements too (not discussed)
  • Obstacles: numerical stability, preconditioning, …
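For reference, the quantity such methods build is the Krylov basis; the naive version below uses k back-to-back SpMVs, hence O(k) messages in parallel, which is exactly what a matrix powers ("Aᵏ·x") kernel reorganizes into O(1) messages. This sketch is the naive baseline, not the communication-avoiding algorithm:

/* Sketch: build the Krylov basis V = [x, A·x, A²·x, …, Aᵏ·x]
 * with k repeated CSR SpMVs. V is (k+1) x n, row-major. */
void krylov_basis(int n, int k, const int *ptr, const int *ind,
                  const double *val, const double *x, double *V)
{
    for (int j = 0; j < n; j++) V[j] = x[j];      /* V row 0 = x */
    for (int s = 1; s <= k; s++) {
        const double *prev = &V[(size_t)(s - 1) * n];
        double *cur = &V[(size_t)s * n];
        for (int i = 0; i < n; i++) {             /* cur = A * prev */
            double yi = 0.0;
            for (int t = ptr[i]; t < ptr[i + 1]; t++)
                yi += val[t] * prev[ind[t]];
            cur[i] = yi;
        }
    }
}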
Parallel Sparse Matrix-Vector Multiplication
• y = A*x, where A is a sparse n x n matrix
• Questions
  • Which processors store y[i], x[i], and A[i,j]?
  • Which processors compute y[i] = Σ_{j=1..n} A[i,j] * x[j] = (row i of A) · x … a sparse dot product?
• Partitioning
  • Partition the index set {1,…,n} = N1 ∪ N2 ∪ … ∪ Np
  • For all i in Nk, processor k stores y[i], x[i], and row i of A
  • For all i in Nk, processor k computes y[i] = (row i of A) · x
  • "Owner computes" rule: processor k computes the y[i]s it owns
[Figure: x, y, and the rows of A partitioned among processors P1–P4.]
• May require communication
Matrix Reordering via Graph Partitioning
• "Ideal" matrix structure for parallelism: block diagonal
  • p (number of processors) blocks, each can be computed locally
  • If there are no nonzeros outside these blocks, no communication is needed
• Can we reorder the rows/columns to get close to this?
  • Most nonzeros in diagonal blocks, few outside
[Figure: y = A·x with a block-diagonal A distributed over P0–P4.]
Goals of Reordering
• Performance goals
  • Balance load (how is load measured?)
    • Approximately equal number of nonzeros (not necessarily rows)
  • Balance storage (how much does each processor store?)
    • Approximately equal number of nonzeros
  • Minimize communication (how much is communicated?)
    • Minimize nonzeros outside the diagonal blocks
    • A related optimization criterion is to move nonzeros near the diagonal
  • Improve register and cache re-use
    • Group nonzeros in small vertical blocks so source (x) elements loaded into cache or registers may be reused (temporal locality)
    • Group nonzeros in small horizontal blocks so nearby source (x) elements in the cache may be used (spatial locality)
• Other algorithms reorder for other reasons
  • Reduce the number of nonzeros in the matrix after Gaussian elimination
  • Improve numerical stability
Graph Partitioning and Sparse Matrices
[Figure: a 6x6 symmetric sparsity pattern (rows/columns 1–6, nonzeros marked 1) and the corresponding graph on nodes 1–6.]
• Relationship between matrix and graph
  • Edges in the graph are nonzeros in the matrix; here the matrix is symmetric (edges are unordered) and the weights are equal (1)
  • If divided over 3 processors, there are 14 nonzeros outside the diagonal blocks, which represent the 7 (bidirectional) edges
Graph Partitioning and Sparse Matrices
[Figure: the same 6x6 symmetric sparsity pattern and its graph on nodes 1–6.]
• Relationship between matrix and graph
  • A "good" partition of the graph has
    • an equal (weighted) number of nodes in each part (load and storage balance)
    • a minimum number of edges crossing between parts (minimize communication)
  • Reorder the rows/columns by putting all nodes of one partition together