Multifrontal sparse QR factorization on the GPU
Tim Davis, Sanjay Ranka, Sharanyan Chetlur, Nuri Yeralan
University of Florida
Feb 2012
GPU-based Multifrontal QR factorization
why sparse QR?
multifrontal sparse QR in a nutshell
multi-threaded sparse QR
sparse multifrontal QR on the GPU
our strategy: work in progress
Why multifrontal sparse QR factorization?
wide applicability of QR
numerically stable
better parallelism: independent problems decoupled, unlike LU or Cholesky
Communication-Avoiding QR (CAQR)
orthogonal methods have higher flops per memory reference
QR assembly step is GPU-friendly
related to other direct methods (LU, Cholesky, LDLᵀ)
Multifrontal sparse QR factorization in a nutshell
rows can be operated on in any order
group together rows with left-most nonzeros in the same column
factorize each block of rows independently
each block of rows takes on the same nonzero pattern (a frontal matrix)
merger of frontal matrices: copy, not add (unlike LU, Cholesky)
repeat until the matrix becomes upper triangular
[Sparsity diagram: rows of A sorted by leftmost nonzero; X marks each row's leftmost nonzero, x its other entries, and . the fill implied by the block's union pattern.]
The dots: union of the nonzero patterns of all rows in each block.
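The grouping step above can be sketched in a few lines of NumPy; `group_by_leftmost` is a hypothetical helper name for illustration, not part of SuiteSparseQR:

```python
import numpy as np

def group_by_leftmost(A):
    """Group row indices of A by the column of each row's leftmost
    nonzero -- the first step of the multifrontal ordering above."""
    groups = {}
    for i, row in enumerate(A):
        nz = np.flatnonzero(row)
        if nz.size == 0:
            continue                       # skip empty rows
        groups.setdefault(int(nz[0]), []).append(i)
    return groups

A = np.array([[1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1]])
print(group_by_leftmost(A))   # → {0: [0, 1], 1: [2], 2: [3]}
```

Each group can then be factorized independently, which is the source of the parallelism noted earlier.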
Householder Sparse QR
Sort the rows of A by column of leftmost nonzero, and annihilate
[Diagram: the first block of rows (leftmost nonzero in column 1) is annihilated; its top row becomes a row r of R (pattern 0 ∗ ∗ ∗), and the updated rows, now with leading zeros, join the blocks below.]
Can do the other blocks at the same time.
Householder Sparse QR
Key observation: each block of rows to annihilate has the same nonzero pattern. So place them in a dense submatrix and use dense matrix kernels. For column 1:
[Diagram: the three rows whose leftmost nonzero lies in column 1 touch only columns 1, 2, 4, and 6, so they are packed into a dense 3-by-4 submatrix over those columns.]
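Packing such a block into a dense submatrix over the union of its column patterns might look like the sketch below; the helper `pack_front` and the example matrix are illustrative, not SuiteSparseQR code:

```python
import numpy as np

def pack_front(A, rows):
    """Pack the given rows of A into a dense frontal matrix whose
    columns are the union of the rows' nonzero columns."""
    cols = np.flatnonzero(A[rows].any(axis=0))   # union of patterns
    return A[np.ix_(rows, cols)], cols

A = np.array([[5., 0., 2., 0., 3.],
              [1., 4., 0., 0., 0.],
              [2., 0., 6., 0., 0.]])
front, cols = pack_front(A, [0, 1, 2])
print(cols.tolist())   # → [0, 1, 2, 4]
print(front.shape)     # → (3, 4)
```

The packed `front` is fully dense, so a standard dense QR kernel can be applied to it.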
Multifrontal QR factorization in a nutshell
Group rows of A with nonzero in same leftmost column
. . x . . x .
. . x x . . x
. . x . x . x
. . x . x x .
Apply Householder to reduce each group to upper triangular; one row becomes a row of R
. . r r r r r
. . . x x x x
. . . . x x x
. . . . . x x
Append remainder to the group for the next nonzero column
next: a tree of columns (the column elimination tree)
Lump adjacent columns together if their rows of R have the same nonzero pattern (supernodes)
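One elimination step on a packed group can be sketched with a dense QR; a real implementation uses blocked Householder kernels rather than `numpy.linalg.qr`, and the names here are illustrative:

```python
import numpy as np

# A packed group whose leftmost column is being eliminated.
front = np.array([[3., 1., 2.],
                  [4., 0., 1.],
                  [0., 2., 5.]])
R = np.linalg.qr(front, mode='r')   # dense Householder reduction
row_of_R = R[0]                     # finished row of R
remainder = R[1:, 1:]               # leftmost column eliminated;
                                    # these rows join a later group
print(np.round(R, 3))
```

Because the reduction is orthogonal, it preserves the Gram matrix of the block, which is the numerical-stability point made earlier.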
the column elimination tree
[Figure: a 12-column example showing the matrix A (left), its QR factor R (right), and the column elimination tree connecting the 12 columns; adjacent columns whose rows of R share a pattern form supernodes.]
QR factorization of a leaf frontal matrix
[Figure: the rows of A for leaf front 1 (pattern over columns 1, 2, 6, 8, 11) and the factorized front, with r = entries of R, h = Householder vectors, c = contribution block passed to the parent.]
QR for a non-leaf frontal matrix
[Figure: a non-leaf front assembles contribution blocks from child fronts 1, 2, and 3 (entries marked 1c, 2c, 3c) together with new rows of A.]
Non-leaf frontal matrix: children to assemble
[Figure: the rows of A for front 4 (columns 5, 6, 7, 8, 9, 11, 12) together with the contribution blocks of child fronts 1, 2, and 3, before assembly.]
Assembly: shuffle the above data into a single matrix (next slide)
Frontal matrix assembly: no read-modify-write
[Figure: the child contribution blocks (1c, 2c, 3c) and the rows of A for front 4 are shuffled into the assembled front 4 over columns 5, 6, 7, 8, 9, 11, 12; each entry is written exactly once.]
Frontal matrix: after factorization
[Figure: factorized front 4 over columns 5, 6, 7, 8, 9, 11, 12, with r = the R factor, h = Householder vectors (implicit Q), c = contribution block for the parent.]
Multifrontal QR: assembly step is GPU-friendly
Assembly for multifrontal LU, Cholesky
requires data shuffling and addition
in parallel: two thread blocks would need to synchronize the summation (read-modify-write)
for multifrontal QR
requires just data shuffling; no addition
in parallel: no read-modify-write
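The copy-only assembly can be illustrated with a scatter through index maps; all names and index values here are hypothetical. In multifrontal LU or Cholesky the final line would be a `+=`, which on a GPU forces atomic updates or synchronization:

```python
import numpy as np

parent_cols = [5, 6, 7, 8]     # hypothetical global columns of the parent front
child = np.array([[1., 2.],
                  [0., 3.]])   # one child's contribution block
child_cols = [6, 8]            # global columns of the child block
col_map = [parent_cols.index(c) for c in child_cols]

front = np.zeros((5, 4))       # parent front being assembled
dest_rows = [2, 3]             # rows of the parent reserved for this child
front[np.ix_(dest_rows, col_map)] = child   # pure copy -- never +=
print(front)
```

Because every destination entry has exactly one writer, independent thread blocks can assemble different children concurrently with no read-modify-write hazard.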
Frontal matrix QR factorization on the GPU
Kernel 1: suppose one thread block can factorize one stripe with a fixed maximum number of rows.
If a stripe has too many columns, slice it and apply Q after it is computed.
Frontal matrix QR factorization on the GPU
Kernel 2: take two stripes; annihilate below diagonal. Before:
after:
numerically stable because of orthogonal operations
Frontal matrix QR factorization on the GPU
Combine with more stripes:
Frontal matrix QR factorization on the GPU
Rinse and repeat, always working on pairs of stripes at a time, where two pairs fit in shared memory:
Frontal matrix QR factorization on the GPU
Kernel dependencies for a frontal matrix of 4 stripes and 6 blocks of columns:
Frontal matrix QR factorization on the GPU
Further pipelining for additional parallelism
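The pairwise-stripe step (kernel 2) is the same tree reduction used in CAQR/TSQR: each stripe is already upper triangular after kernel 1, and a QR of two stacked triangles annihilates the lower one. A NumPy sketch, illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
stripe1 = np.triu(rng.standard_normal((4, 4)))   # reduced by kernel 1
stripe2 = np.triu(rng.standard_normal((4, 4)))   # reduced by kernel 1

stacked = np.vstack([stripe1, stripe2])          # two stripes, one block
R = np.linalg.qr(stacked, mode='r')              # kernel 2: annihilate
print(R.shape)                                   # → (4, 4)
```

The surviving triangle can then be paired with another stripe's result, giving the pipelined schedule of kernel launches shown above; every step is orthogonal, so stability is preserved.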
Algorithm outline
symbolic analysis: on the CPU
numeric factorization: on the GPU
Symbolic analysis: on the CPU
Fill-reducing ordering (typically O(|A|) time)
Symbolic analysis (nearly O(|A|), without forming AᵀA):
find the column etree
row counts of R
find relaxed supernodes
sort rows of A
task assignment (subtrees = parallel subtasks)
total time: about O(|A|)
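One ingredient of the symbolic analysis, the column elimination tree (the etree of AᵀA), can be sketched with the textbook ancestor/path-compression algorithm. For brevity this sketch forms the pattern of AᵀA explicitly, which the real near-O(|A|) analysis avoids:

```python
import numpy as np

def etree(S):
    """Elimination tree of a symmetric boolean pattern S:
    parent[i] = first k > i with a path i~>k in the factor."""
    n = S.shape[0]
    parent = [-1] * n
    ancestor = [-1] * n
    for k in range(n):
        for i in np.flatnonzero(S[k, :k]):
            i = int(i)
            # walk toward the root, compressing the path to k
            while ancestor[i] != -1 and ancestor[i] != k:
                nxt = ancestor[i]
                ancestor[i] = k
                i = nxt
            if ancestor[i] == -1:
                ancestor[i] = k
                parent[i] = k
    return parent

A = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1]])
S = (A.T @ A) != 0     # pattern of A'A (the real code never forms it)
print(etree(S))        # → [1, 2, -1]
```

The tree drives everything downstream: supernode detection, task assignment, and the parallel subtree schedule.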
Numerical factorization: on the GPU
numerical factorization of subtrees:
frontal matrix assembly
frontal matrix factorization
contribution block stacked for the parent
the challenge of heterogeneous computations within a subtree
some fronts factorize while others assemble
fronts vary wildly in size, from tiny to huge
the tree is driven by the matrix; not a simple balanced binary tree
staging:
factorize one subtree while transferring another, CPU ↔ GPU
multi-GPU:
each GPU handles independent subtrees
if a front is huge, treat it like a multi-GPU dense QR factorization with blocking/striping
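A valid sequential schedule for the numeric phase is any post-order of the tree (children before parents), and disjoint subtrees are independent parallel tasks. A minimal sketch, assuming `parent[c] = -1` marks a root:

```python
def postorder(parent):
    """Post-order of a forest given as a parent array: every node
    appears after all of its children, so fronts are assembled
    only after their children are factorized."""
    n = len(parent)
    children = [[] for _ in range(n)]
    roots = []
    for c, p in enumerate(parent):
        (roots if p == -1 else children[p]).append(c)
    order = []
    def dfs(v):
        for ch in children[v]:
            dfs(ch)
        order.append(v)
    for r in roots:
        dfs(r)
    return order

# fronts 0,1 feed front 2; fronts 2,3 feed the root front 4
print(postorder([2, 2, 4, 4, -1]))   # → [0, 1, 2, 3, 4]
```

In the multi-GPU setting, each root of a disjoint subtree becomes a unit of work handed to one GPU.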
Performance results: pre-GPU method
Least squares problem: 2 million by 110 thousand
Method          ordering  procs  time
x=A\b           COLMMD    1      ?
x=A\b           AMD       1      11 days
MA49            AMD       1      3.5 hours
SuiteSparseQR   AMD       1      1.5 hours
SuiteSparseQR   METIS     1      45 minutes
SuiteSparseQR   METIS     16     7.3 minutes
Algorithmic speedup vs x=A\b: 375x
Parallel speedup: 5.75x on 16 cores
Total: 2,155x (14 Gflops on 70 Gflops machine)
Single core: 2.5 Gflop peak, same as LAPACK QR
Gflop vs LAPACK (single core)
[Plot: Gflops (0.1 to 4) versus flop count / memory usage in bytes (10⁰ to 10³), comparing SuiteSparseQR with dense QR (DGEQRF) at n = 100, 1000, and 4000.]
Multifrontal QR on the GPU
Tesla C2050
double-precision frontal matrix QR (65 GFlops)
fronts remain on the GPU
frontal matrix assembly
in-progress:
strip-mining scheduling
each node of the tree = one frontal matrix
split each front into a subtree
parallel assembly of some fronts while others are factorized
Multifrontal QR on the GPU: strip-mining
[Figure: an elimination tree of eight fronts, numbered 1 through 8.]
Multifrontal QR on the GPU: strip-mining
[Figure: the same tree with large fronts split into stripes (3a-3e, 4a-4d, 8a-8c) and interleaved assembly tasks (A), scheduled across successive kernel launches (1st, 2nd, 3rd, 4th, 5th, ...).]
Multifrontal Sparse QR on the GPU: Summary
Fast symbolic analysis and fill-reducing ordering (∼ O(|A|))
Dense matrix kernels to exploit tightly-coupled regularparallelism within the GPU
Elimination tree for loosely-coupled irregular parallelism
High performance in pre-GPU version
peak Gflop rate same as LAPACK
ample parallel speedup
appears as the built-in x=A\b and qr in MATLAB R2009a
if A is all nonzero, x=sparse(A)\b can be faster than x=A\b
GPU method in progress
dense QR for frontal matrices: Sharanyan Chetlur
assembly by the GPU: regular memory traffic to/from global memory; all irregular traffic in shared memory within each thread block: Nuri Yeralan
strip-mining scheduling of the expanded frontal matrix tree
staging subtrees to handle large problems (> 6 GB)
Acknowledgements
Postscript
Please send me your matrices!
http://www.cise.ufl.edu/dropbox/www