S4289: Efficient solution of multiple scalar and block-tridiagonal equations
Endre Laszlo (endre.laszlo [at] oerc.ox.ac.uk)
Oxford e-Research Centre, University of Oxford, UK
Pazmany Peter Catholic University, Budapest, Hungary
Mike Giles (Oxford), Jeremy Appleyard (NVIDIA)
GPU Technology Conference
March 26th, 2014, San Jose
Outline for GPU developers

1. Batch scalar-tridiagonal solvers
   - ADI (Alternating Directions Implicit) method
   - Thomas algorithm
   - Multi-dimensional data structures: access patterns
   - Optimization: local data transposition in shared memory
   - Optimization: local data transposition with shfl()
   - Thomas-PCR hybrid
   - Comparison to CPU, Xeon Phi and the LAPACK tridiagonal solver
2. Batch block-tridiagonal solver
   - Block-tridiagonal data structure: access patterns
   - Work sharing on the GPU
   - Comparison to CPU and the LAPACK banded solver
3. Conclusion
Example: Solving the heat equation with ADI
The heat-diffusion equation is the PDE solved with this method:

$$\frac{\partial u}{\partial t} = \nabla^2 u \qquad (1)$$
ADI (Alternating Directions Implicit) method:
- Classical finite-difference scheme
- Computationally cheaper than Crank-Nicolson
- Relies on approximate factorization
- O(Δt², Δx²): second-order accurate in both time and space
- Unconditionally stable if the parameters are chosen appropriately (positive)
- Introduced by Peaceman and Rachford [1]

[1] D. W. Peaceman and H. H. Rachford, Jr., "The numerical solution of parabolic and elliptic differential equations," Journal of the Society for Industrial and Applied Mathematics, vol. 3, no. 1, pp. 28-41, 1955.
Example: Solving the heat equation with ADI
Three tridiagonal solves, one along each of the X, Y and Z dimensions:

$$\begin{aligned}
\text{preproc:}\quad & u^{(0)} = \lambda\left(\delta_x^2 + \delta_y^2 + \delta_z^2\right) u^n \\
\text{x-dim:}\quad  & \left(1 - \lambda\delta_x^2\right) u^{(1)} = u^{(0)} \\
\text{y-dim:}\quad  & \left(1 - \lambda\delta_y^2\right) u^{(2)} = u^{(1)} \\
\text{z-dim:}\quad  & \left(1 - \lambda\delta_z^2\right) \Delta u = u^{(2)} \\
\text{add:}\quad    & u^{n+1} = u^n + \Delta u
\end{aligned}$$
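Eliminating the intermediate fields $u^{(0)}$, $u^{(1)}$, $u^{(2)}$ makes the approximate factorization explicit; composing the four relations above gives

$$\left(1 - \lambda\delta_x^2\right)\left(1 - \lambda\delta_y^2\right)\left(1 - \lambda\delta_z^2\right) \Delta u = \lambda\left(\delta_x^2 + \delta_y^2 + \delta_z^2\right) u^n,$$

so each sweep only ever inverts a one-dimensional, i.e. tridiagonal, operator.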
The upcoming discussion of tridiagonal solvers is in the context of the ADI method
A tridiagonal system
Storage:
- 3 coefficient arrays
- 1 solution array
- 1 RHS array
- All stored in a cubic data structure

$$\begin{pmatrix}
b_0 & c_0 & 0 & 0 & \cdots & 0 \\
a_1 & b_1 & c_1 & 0 & \cdots & 0 \\
0 & a_2 & b_2 & c_2 & \cdots & 0 \\
\vdots & & \ddots & \ddots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & a_{N-1} & b_{N-1}
\end{pmatrix}
\begin{pmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \\ \vdots \\ u_{N-1} \end{pmatrix}
=
\begin{pmatrix} d_0 \\ d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_{N-1} \end{pmatrix}$$
Solving tridiagonal systems
Assumptions:
- Stems from real-world CFD and financial applications
- Computation domain: structured, multidimensional
- "Hypercubic-ish": $\Omega = \mathbb{R}^{N_0 \times N_1 \times \cdots \times N_D}$, $D = 2..8$
- Large number of systems:
  - N² systems on an N³ cube
  - enough to saturate the GPU
- System sizes on the order of 100s to 1000s
- Each system has its own coefficients and RHS
- No pivoting: diagonal dominance required
Batch scalar-tridiagonal solvers
- cusparse?gtsvStridedBatch():
  - inefficient under the previous assumptions
  - CR-PCR hybrid
  - lacks multidimensional support:
    - a global data transpose is needed in the Y and Z dimensions
  - extra space requirements: 768 MB for a 256³ SP problem
  - uses two kernel calls
- There is enough parallelism in batch problems
- CR/PCR is not necessarily needed
- A Tesla K40 has 12 GB of device memory:
  - multidimensional problem domain N^d
  - N = (3/4 × 10⁹)^(1/d) for dimensions d = 2..8 (roughly 3/4 × 10⁹ grid points fit at 16 bytes per point, e.g. four SP arrays):

     d      N   # parallel systems
     2  27386                27386
     3    908               824464
     4    165              4492125
     5     59             12117361
     6     30             24300000
     7     18             34012224
     8     12             35831808
Thomas algorithm
Algorithm 1 Thomas algorithm
Require: thomas(a, b, c, d)
 1: d*_0 := d_0 / b_0
 2: c*_0 := c_0 / b_0
 3: for i = 1, ..., N-1 do
 4:     r := 1 / (b_i - a_i c*_{i-1})
 5:     d*_i := r (d_i - a_i d*_{i-1})
 6:     c*_i := r c_i
 7: end for
 8: for i = N-2, ..., 0 do
 9:     d_i := d*_i - c*_i d_{i+1}
10: end for
11: return d
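As a concrete illustration, here is a minimal per-thread CUDA sketch of Algorithm 1 (our own, not the talk's library code): each thread solves one system whose elements sit `stride` apart, and c and d are overwritten in place with c* and d* so no per-thread scratch storage is needed. The system-base mapping `tid` matches the Y-solve; the X- and Z-solves differ only in base and stride.

// Minimal sketch: one thread per tridiagonal system (assumes diagonal
// dominance, no pivoting). Elements of one system are `stride` apart;
// here the base index of system `tid` is simply tid, as in the Y-solve.
__global__ void thomas_batch(const float *a, const float *b,
                             float *c, float *d,
                             int N, int stride, int nsys)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= nsys) return;

    int i = tid;
    // Forward pass: c and d are overwritten with c* and d* (lines 1-7).
    float cc = c[i] / b[i];
    float dd = d[i] / b[i];
    c[i] = cc; d[i] = dd;
    for (int n = 1; n < N; n++) {
        i += stride;
        float r = 1.0f / (b[i] - a[i] * cc);
        dd = r * (d[i] - a[i] * dd);
        cc = r * c[i];
        c[i] = cc; d[i] = dd;
    }
    // Backward pass: d ends up holding the solution (lines 8-10).
    for (int n = N - 2; n >= 0; n--) {
        i -= stride;
        dd = d[i] - c[i] * dd;
        d[i] = dd;
    }
}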
Multi-dimensional data structures - access patterns

Data layout: idx = k*NX*NY + j*NX + i
- Performance depends on how the threads are mapped to the domain
- Different efficiency along different dimensions
- Assume a sequential dependence in the algorithm iterating along a dimension:
  - X: stride = 1 → worst performance
  - Y: stride = NX → best performance
  - Z: stride = NX*NY → good performance if a high TLB miss rate is avoided
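In code, the three mappings differ only in a base index and a stride. A small illustration (ours; forward_step() is a hypothetical stand-in for one iteration of the Thomas forward pass):

/* The thread that owns the tridiagonal system identified by (i, j, k)
 * walks the flat array d[] with a dimension-dependent base and stride. */
void walk_system(float *d, int solvedim, int i, int j, int k,
                 int N, int NX, int NY)
{
    int base = 0, stride = 1;
    switch (solvedim) {
    case 0: base = k*NX*NY + j*NX; stride = 1;     break; /* X-solve */
    case 1: base = k*NX*NY + i;    stride = NX;    break; /* Y-solve */
    case 2: base = j*NX + i;       stride = NX*NY; break; /* Z-solve */
    }
    for (int n = 0; n < N; n++)   /* sequential dependence along the dim */
        forward_step(&d[base + n*stride]);
}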
Mapping threads to the domain: X/Y/Z-dimension solves
[Bar chart: time per grid element (ns) for PreProc, X-solve, Y-solve and Z-solve, in SP and DP; the X-solve bar is annotated x16.5.]
- X: offset = 1, stride = NX → 4 byte / 32 byte = 12.5% cache-line utilization in SP
- Y: offset = NX, stride = 1 → perfectly coalesced, 100% utilization
- Z: offset = NX*NY, stride = 1 → perfectly coalesced, 100% utilization + moderate TLB hit rate
Mapping threads to the domain: X/Y/Z-dimension solves

Nvidia Tesla K40 (GK110B): 288 GB/s
[Bar chart: achieved memory throughput (GB/s) of the X-, Y- and Z-solves, in SP and DP.]
TLB (Translation Lookaside Buffer) miss rate
- CUDA uses a Unified Virtual Address Space
- The virtual address space uses memory pages
- Memory page frame pointers are cached in the TLB
- The TLB is a "coarser" cache that:
  - works with the LLC
  - translates an address tag to a frame pointer
  - caches frame pointers from main memory
- On NVIDIA devices the TLB is implemented in hardware and page sizes cannot be changed
- Small page size + long strides → high TLB miss rate
- NVVP reports it implicitly, within the "Global memory replay overhead" counter
- 753-clock latency on a TLB page miss [2]

[2] Measured on GT200 by Wong et al. in "Demystifying GPU microarchitecture through microbenchmarking," Performance Analysis of Systems and Software (ISPASS), 2010.
How to cope with TLB miss rate and coalescence?
- TLB is easy: remap your solver for better locality:
  - change the 2D thread-block mapping into 1D thread blocks,
  - so that threads within a block solve the closest neighbouring set of systems
  - perform cache/register blocking
- Coalesced memory access is more difficult:
  - only a problem in the X dimension
  - cache blocking is needed:
    - local transpose in shared memory, or
    - local transpose with register shuffle (the shfl() intrinsic), or
    - caching a whole system → Thomas-PCR hybrid
Thomas with shared memory transpose
Forward pass:
1. Wrap a warp (32 threads) into 4x8 blocks to perform non-caching (32-byte) loads
2. Load 32x8 tiles into shared memory: 8 steps of 4x8 block loads
3. Transpose the data by moving values into registers: float a[8]; is compiled to 8 registers if the array indexing is known at compile time
4. Perform the computation with the 8 values along the X dimension
5. Repeat from step 2 until the end of the X dimension is reached

Backward pass:
1. Do the same in reverse: transpose + store

A sketch of the tile load and transpose follows the diagram below.
[Diagram: the warp, arranged as 4x8 blocks, fills a 32-row x 8-column shared-memory tile in 32-byte steps; after the transpose, each thread holds one row, i.e. 8 consecutive x-values of its own system, in float reg[8] in the register file.]
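A minimal CUDA sketch of the forward tile load (ours, under the assumptions above: one warp per block, row-major layout with row length NX; the 2014-era Kepler code relied on warp-synchronous execution, here an explicit __syncwarp() is used):

__device__ void load_tile_transposed(const float *d, float reg[8],
                                     int ybase, int xbase, int NX)
{
    __shared__ float tile[32][8 + 1];   // +1 padding avoids bank conflicts
    int lane = threadIdx.x & 31;
    int lx = lane & 7;                  // column within the 4x8 block
    int ly = lane >> 3;                 // row within the 4x8 block

    // 8 steps of 4x8 loads: each 8-lane row reads 8 consecutive floats,
    // i.e. one 32-byte segment, suiting non-caching (32-byte) loads.
    for (int s = 0; s < 8; s++)
        tile[4*s + ly][lx] = d[(ybase + 4*s + ly) * NX + xbase + lx];
    __syncwarp();

    // Transpose: thread `lane` gathers row `lane`, i.e. 8 consecutive
    // x-values of its own system; with full unrolling and inlining the
    // compiler can keep reg[] in registers.
    for (int q = 0; q < 8; q++)
        reg[q] = tile[lane][q];
}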
Thomas with register shuffle transpose

Forward pass:
1. Wrap 32 threads into 8x4 blocks to perform 4 x float4 vector loads
2. Load 32x16 tiles into registers:
   - 4 threads read 4 consecutive float4 vectors = 64 bytes
   - do this 4 times for the rows beneath each other
3. Transpose the data within groups of 4 threads:
   - 4 threads exchange data on a 4x4 2D array with shfl(float4)
   - each element of the 2D array is a float4 vector
4. Perform the computation with the 16 values along the X dimension
5. Repeat from step 2 until the end of the X dimension is reached

Backward pass:
1. Do the same in reverse: transpose + store

A sketch of the float4 shuffle transpose follows the diagram below.
[Diagram: four 8x4-shaped read steps fill a 32-row x 16-column tile, leaving each thread with float reg[16]; groups of 4 threads then transpose a 4x4 array of float4 values entirely in the register file.]
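shfl() moves 32-bit words, so a float4 shuffle has to be composed component-wise. Below is a minimal sketch of the 4x4 float4 transpose across 4 lanes (ours, not the talk's library code; written with the modern __shfl_sync where the 2014 code used __shfl; the pick/put helpers are our own way of indexing a register array without spilling):

__device__ float4 shfl4(float4 v, int src, int width)
{
    v.x = __shfl_sync(0xffffffffu, v.x, src, width);
    v.y = __shfl_sync(0xffffffffu, v.y, src, width);
    v.z = __shfl_sync(0xffffffffu, v.z, src, width);
    v.w = __shfl_sync(0xffffffffu, v.w, src, width);
    return v;
}

__device__ float4 pick(const float4 a[4], int i)   // a[i] without spilling
{
    switch (i & 3) { case 0: return a[0]; case 1: return a[1];
                     case 2: return a[2]; default: return a[3]; }
}

__device__ void put(float4 a[4], int i, float4 v)  // a[i] = v, likewise
{
    switch (i & 3) { case 0: a[0] = v; return; case 1: a[1] = v; return;
                     case 2: a[2] = v; return; default: a[3] = v; }
}

// Before: lane t of each 4-lane group holds row t of a 4x4 float4 array
// in a[0..3]. After: lane t holds column t. Rotation k pairs lane t with
// source lane (t-k)&3, which publishes exactly the element lane t needs.
__device__ void transpose_4x4_float4(float4 a[4])
{
    int t = threadIdx.x & 3;
    float4 b[4];
    for (int k = 0; k < 4; k++) {
        int src = (t - k) & 3;            // group-relative source lane
        float4 v = pick(a, (t + k) & 3);  // value this lane publishes
        put(b, src, shfl4(v, src, 4));    // width=4: per-group shuffle
    }
    for (int k = 0; k < 4; k++) a[k] = b[k];
}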
Performance comparison
[Bar chart: time per grid element (ns) for Trid-X and Trid-Y solves in SP and DP, comparing the naive, shared-transpose, register-shuffle and Thomas-PCR hybrid GPU solvers with cuSPARSE v5.5.22, a 2-socket Xeon, LAPACKE v3.5.0 and Xeon Phi.]
CPU: Intel Xeon E5-2680, 2 sockets, 40 MB cache, 16 cores (32 HT), 102 GB/s
GPU: Nvidia K40m, 288 GB/s
Scalar-tridiagonal library use with OpenACC
int main() {
    int n = NX*NY*NZ;
    float* u  = (float *) malloc(sizeof(float)*n);
    float* ax = (float *) acc_malloc(sizeof(float)*n);
    ...
    #pragma acc data copy(u[0:n]) deviceptr(ax,bx,cx,ay,by,cy,az,bz,cz,du)
    for (it = 0; it < iter; it++) {
        ... // calculate r.h.s. and set tri-diagonal coefficients
        int ndim = 3;
        int dims[3] = {256, 256, 256};
        int pads[3] = {320, 320, 320};
        solvedim = 0; // X-solve
        tridSmtsvStridedBatch(ax, bx, cx, du, u, ndim, solvedim, dims, pads);
        solvedim = 1; // Y-solve
        tridSmtsvStridedBatch(ay, by, cy, du, u, ndim, solvedim, dims, pads);
        ...
    }
}
Batch block-tridiagonal solver
- Motivation for a block solver:
  - state variables in CFD/finance PDEs are inter-dependent
  - block matrices with block sizes of 2-8
- Sub-problems to be solved:
  - inverting and multiplying blocks (matrices) involves branching
  - data storage shortage limits the number of systems on the device
- Optimization strategies on GPUs:
  - data storage, for better data locality
  - work sharing, to increase parallelism:
    - inter-thread communication with shared memory
    - inter-thread communication with register shuffle
Batch block-tridiagonal solver work sharing
- Threads within a warp cooperatively compute:
  - matrix-matrix products
  - matrix-vector products
  - Gauss-Jordan block solves
- A thread stores one column of a block and one scalar element of a vector
- Special attention is needed to help register allocation
- The algorithms are implemented with shared-memory or shfl()-intrinsic communication; a sketch of the shared-memory variant follows below
- In the worst case (M = 7), 4 threads out of 32 are idle
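For instance, a minimal sketch of the shared-memory variant for one MxM matrix-vector product (ours, not the library's code): thread j of an M-thread group holds column j of A in registers plus the scalar x_j, and ends up holding the scalar y_j.

// Sketch: y = A*x for one MxM block, work-shared over M threads.
// A_s and x_s point at this group's slices of the block's shared memory.
template <int M>
__device__ float blk_matvec(const float col[M], float xj,
                            float *A_s /* M*M floats */,
                            float *x_s /* M floats */)
{
    int j = threadIdx.x % M;          // position within the M-thread group
    for (int i = 0; i < M; i++)
        A_s[i * M + j] = col[i];      // stage column j: A_s[i][j] = A(i,j)
    x_s[j] = xj;
    __syncthreads();

    float y = 0.0f;
    for (int i = 0; i < M; i++)
        y += A_s[j * M + i] * x_s[i]; // row j dot x  ->  y_j
    __syncthreads();                  // before the buffers are reused
    return y;
}

The shfl() variant exchanges the same values through the register file instead of shared memory, avoiding the staging traffic at the cost of more intricate lane bookkeeping.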
Batch block-tridiagonal matrix storage
- Blocks are stored in a row-major format
- Blocks of different problems are interleaved for better data locality (superscript = problem index, subscript = block index):

$$A_0^0\; A_0^1\; A_0^2\; \ldots\; A_0^{P-1}\;\; A_1^0\; A_1^1\; \ldots\; A_1^{P-1}\;\; A_2^0\; A_2^1\; A_2^2\; \ldots \qquad (2)$$
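A hypothetical index helper (ours) for this layout, with block index n, problem index p and row-major M x M blocks:

static inline size_t blk_idx(int n, int p, int r, int c, int P, int M)
{
    // Block n of every problem precedes block n+1 of any problem, as in
    // Eq. (2); within a block, elements are row-major.
    return (((size_t)n * P + p) * M + r) * M + c;
}

With this layout, the same block of neighbouring problems sits only M*M elements apart, so warps working on neighbouring systems touch nearby memory.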
Batch block-tridiagonal solver
- Two versions: shared memory and register shuffle
- Registers spill above 8x8 DP block size
- Approx. 8-16k threads saturate the GPU
- Low shared-memory use, 576 (1125) bytes per thread block in SP (DP) → good occupancy
- Profile of the shared-memory SP 8x8 version:
  - shared memory efficiency: 84.5%
  - shared memory load/store throughput: 1700 GB/s
  - L2 hit rate (L1 reads): 51.9%
  - executed IPC: 1.68
  - texture cache hit rate: 50%
- In SP the shuffle version is better
- In DP the shared-memory version is better
Data and compute throughput
[Charts: (a) effective data throughput (GB/s) and (b) compute throughput (GFLOPS) versus block size M = 2..8, in SP and DP.]
CPU: Intel Xeon E5-2690, 2 sockets, 40 MB cache, 16 cores (32 HT), 102 GB/s
GPU: Nvidia K40m, 288 GB/s
Performance comparison
Baseline: the multi-threaded LAPACKE_?gbsv_work() banded solver
[Bar charts: speedup over LAPACKE versus block size M = 2..8 for the CPU solver and the GPU shared-memory and shuffle versions; (a) single precision, (b) double precision.]
Conclusion
Batch tridiagonal solvers:
- Scalar solver:
  - different optimization strategies:
    - Thomas with shared-memory transpose
    - Thomas with register-shuffle transpose
    - Thomas/PCR hybrid
  - a library-quality solution for scalar tridiagonal solvers
- Block solver:
  - high-throughput solver
  - higher performance than:
    - a vectorized, multi-threaded CPU block solver, or
    - the banded LAPACK(E) solver
- The contributions of the NVIDIA-funded summer interns James Whittle and Catherine Hastings are acknowledged.