Multifrontal sparse QR factorization on the GPU
Tim Davis, Sanjay Ranka, Sharanyan Chetlur, Nuri Yeralan
University of Florida
Feb 2012
GPU-based Multifrontal QR factorization
why sparse QR?
multifrontal sparse QR in a nutshell
multi-threaded sparse QR
sparse multifrontal QR on the GPU
our strategy: work in progress
Why multifrontal sparse QR factorization?
wide applicability of QR
numerically stable
better parallelism: independent problems decoupled, unlike LU or Cholesky
Communication-Avoiding QR (CAQR)
orthogonal methods have higher flops per memory reference
QR assembly step is GPU-friendly
related to other direct methods (LU, Cholesky, LDLᵀ)
Multifrontal sparse QR factorization in a nutshell
rows can be operated on in any order
group together rows with left-most nonzeros in the same column
factorize each block of rows independently
each block of rows takes on the same nonzero pattern (a frontal matrix)
merger of frontal matrices: copy, not add (unlike LU, Cholesky)
repeat until the matrix becomes upper triangular
[Sparsity diagram: rows of A sorted by leftmost nonzero; X marks each row's leftmost nonzero, x its other entries, and . the fill implied by the block's union pattern.]
The dots: union of the nonzero patterns of all rows in each block.
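The grouping step above can be sketched in a few lines of NumPy; `group_by_leftmost` is a hypothetical helper name for illustration, not part of SuiteSparseQR:

```python
import numpy as np

def group_by_leftmost(A):
    """Group row indices of A by the column of each row's leftmost
    nonzero -- the first step of the multifrontal ordering above."""
    groups = {}
    for i, row in enumerate(A):
        nz = np.flatnonzero(row)
        if nz.size == 0:
            continue                       # skip empty rows
        groups.setdefault(int(nz[0]), []).append(i)
    return groups

A = np.array([[1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1]])
print(group_by_leftmost(A))   # → {0: [0, 1], 1: [2], 2: [3]}
```

Each group can then be factorized independently, which is the source of the parallelism noted earlier.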
Householder Sparse QR
Sort the rows of A by column of leftmost nonzero, and annihilate
[Diagram: the first block of rows (leftmost nonzero in column 1) is annihilated; its top row becomes a row r of R (pattern 0 ∗ ∗ ∗), and the updated rows, now with leading zeros, join the blocks below.]
Can do the other blocks at the same time.
Householder Sparse QR
Key observation: each block of rows to annihilate has the same nonzero pattern. So place them in a dense submatrix and use dense matrix kernels. For column 1:
[Diagram: the three rows whose leftmost nonzero lies in column 1 touch only columns 1, 2, 4, and 6, so they are packed into a dense 3-by-4 submatrix over those columns.]
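Packing such a block into a dense submatrix over the union of its column patterns might look like the sketch below; the helper `pack_front` and the example matrix are illustrative, not SuiteSparseQR code:

```python
import numpy as np

def pack_front(A, rows):
    """Pack the given rows of A into a dense frontal matrix whose
    columns are the union of the rows' nonzero columns."""
    cols = np.flatnonzero(A[rows].any(axis=0))   # union of patterns
    return A[np.ix_(rows, cols)], cols

A = np.array([[5., 0., 2., 0., 3.],
              [1., 4., 0., 0., 0.],
              [2., 0., 6., 0., 0.]])
front, cols = pack_front(A, [0, 1, 2])
print(cols.tolist())   # → [0, 1, 2, 4]
print(front.shape)     # → (3, 4)
```

The packed `front` is fully dense, so a standard dense QR kernel can be applied to it.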
Multifrontal QR factorization in a nutshell
Group rows of A with nonzero in same leftmost column
. . x . . x .
. . x x . . x
. . x . x . x
. . x . x x .
Apply Householder to reduce each group to upper triangular; one row becomes a row of R
. . r r r r r
. . . x x x x
. . . . x x x
. . . . . x x
Append remainder to the group for the next nonzero column
next: a tree of columns (the column elimination tree)
Lump adjacent columns together if their rows of R have the same nonzero pattern (supernodes)
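One elimination step on a packed group can be sketched with a dense QR; a real implementation uses blocked Householder kernels rather than `numpy.linalg.qr`, and the names here are illustrative:

```python
import numpy as np

# A packed group whose leftmost column is being eliminated.
front = np.array([[3., 1., 2.],
                  [4., 0., 1.],
                  [0., 2., 5.]])
R = np.linalg.qr(front, mode='r')   # dense Householder reduction
row_of_R = R[0]                     # finished row of R
remainder = R[1:, 1:]               # leftmost column eliminated;
                                    # these rows join a later group
print(np.round(R, 3))
```

Because the reduction is orthogonal, it preserves the Gram matrix of the block, which is the numerical-stability point made earlier.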
the column elimination tree
[Figure: a 12-column example showing the matrix A (left), its QR factor R (right), and the column elimination tree connecting the 12 columns; adjacent columns whose rows of R share a pattern form supernodes.]
QR factorization of a leaf frontal matrix
[Figure: the rows of A for leaf front 1 (pattern over columns 1, 2, 6, 8, 11) and the factorized front, with r = entries of R, h = Householder vectors, c = contribution block passed to the parent.]
QR for a non-leaf frontal matrix
[Figure: a non-leaf front assembles contribution blocks from child fronts 1, 2, and 3 (entries marked 1c, 2c, 3c) together with new rows of A.]
Non-leaf frontal matrix: children to assemble
[Figure: the rows of A for front 4 (columns 5, 6, 7, 8, 9, 11, 12) together with the contribution blocks of child fronts 1, 2, and 3, before assembly.]
Assembly: shuffle the above data into a single matrix (next slide)
Frontal matrix assembly: no read-modify-write
[Figure: the child contribution blocks (1c, 2c, 3c) and the rows of A for front 4 are shuffled into the assembled front 4 over columns 5, 6, 7, 8, 9, 11, 12; each entry is written exactly once.]
Frontal matrix: after factorization
[Figure: factorized front 4 over columns 5, 6, 7, 8, 9, 11, 12, with r = the R factor, h = Householder vectors (implicit Q), c = contribution block for the parent.]
Multifrontal QR: assembly step is GPU-friendly
Assembly for multifrontal LU, Cholesky
requires data shuffling and addition
in parallel: two thread blocks would need to synchronize the summation (read-modify-write)
for multifrontal QR
requires just data shuffling; no addition
in parallel: no read-modify-write
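The copy-only assembly can be illustrated with a scatter through index maps; all names and index values here are hypothetical. In multifrontal LU or Cholesky the final line would be a `+=`, which on a GPU forces atomic updates or synchronization:

```python
import numpy as np

parent_cols = [5, 6, 7, 8]     # hypothetical global columns of the parent front
child = np.array([[1., 2.],
                  [0., 3.]])   # one child's contribution block
child_cols = [6, 8]            # global columns of the child block
col_map = [parent_cols.index(c) for c in child_cols]

front = np.zeros((5, 4))       # parent front being assembled
dest_rows = [2, 3]             # rows of the parent reserved for this child
front[np.ix_(dest_rows, col_map)] = child   # pure copy -- never +=
print(front)
```

Because every destination entry has exactly one writer, independent thread blocks can assemble different children concurrently with no read-modify-write hazard.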
Frontal matrix QR factorization on the GPU
Kernel 1: suppose one thread block can factorize one stripe with a fixed maximum number of rows.
If a stripe has too many columns, slice it and apply Q after it is computed.
Frontal matrix QR factorization on the GPU
Kernel 2: take two stripes; annihilate below diagonal. Before:
after:
numerically stable because of orthogonal operations
Frontal matrix QR factorization on the GPU
Combine with more stripes:
Frontal matrix QR factorization on the GPU
Rinse and repeat, always working on pairs of stripes at a time, where two pairs fit in shared memory:
Frontal matrix QR factorization on the GPU
Kernel dependencies for a frontal matrix of 4 stripes and 6 blocks of columns:
Frontal matrix QR factorization on the GPU
Further pipelining for additional parallelism
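The pairwise-stripe step (kernel 2) is the same tree reduction used in CAQR/TSQR: each stripe is already upper triangular after kernel 1, and a QR of two stacked triangles annihilates the lower one. A NumPy sketch, illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
stripe1 = np.triu(rng.standard_normal((4, 4)))   # reduced by kernel 1
stripe2 = np.triu(rng.standard_normal((4, 4)))   # reduced by kernel 1

stacked = np.vstack([stripe1, stripe2])          # two stripes, one block
R = np.linalg.qr(stacked, mode='r')              # kernel 2: annihilate
print(R.shape)                                   # → (4, 4)
```

The surviving triangle can then be paired with another stripe's result, giving the pipelined schedule of kernel launches shown above; every step is orthogonal, so stability is preserved.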
Algorithm outline
symbolic analysis: on the CPU
numeric factorization: on the GPU
Symbolic analysis: on the CPU
Fill-reducing ordering (typically O(|A|) time)
Symbolic analysis (nearly O(|A|), without forming AᵀA):
find the column etree
row counts of R
find relaxed supernodes
sort rows of A
task assignment (subtrees = parallel subtasks)
total time: about O(|A|)
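One ingredient of the symbolic analysis, the column elimination tree (the etree of AᵀA), can be sketched with the textbook ancestor/path-compression algorithm. For brevity this sketch forms the pattern of AᵀA explicitly, which the real near-O(|A|) analysis avoids:

```python
import numpy as np

def etree(S):
    """Elimination tree of a symmetric boolean pattern S:
    parent[i] = first k > i with a path i~>k in the factor."""
    n = S.shape[0]
    parent = [-1] * n
    ancestor = [-1] * n
    for k in range(n):
        for i in np.flatnonzero(S[k, :k]):
            i = int(i)
            # walk toward the root, compressing the path to k
            while ancestor[i] != -1 and ancestor[i] != k:
                nxt = ancestor[i]
                ancestor[i] = k
                i = nxt
            if ancestor[i] == -1:
                ancestor[i] = k
                parent[i] = k
    return parent

A = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1]])
S = (A.T @ A) != 0     # pattern of A'A (the real code never forms it)
print(etree(S))        # → [1, 2, -1]
```

The tree drives everything downstream: supernode detection, task assignment, and the parallel subtree schedule.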
Numerical factorization: on the GPU
numerical factorization of subtrees:
frontal matrix assembly
frontal matrix factorization
contribution block stacked for the parent
the challenge of heterogeneous computations within a subtree
some fronts factorize while others assemble
fronts vary wildly in size, from tiny to huge
the tree is driven by the matrix; not a simple balanced binary tree
staging:
factorize one subtree while transferring another, CPU ↔ GPU
multi-GPU:
each GPU handles independent subtrees
if a front is huge, treat it like a multi-GPU dense QR factorization with blocking/striping
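A valid sequential schedule for the numeric phase is any post-order of the tree (children before parents), and disjoint subtrees are independent parallel tasks. A minimal sketch, assuming `parent[c] = -1` marks a root:

```python
def postorder(parent):
    """Post-order of a forest given as a parent array: every node
    appears after all of its children, so fronts are assembled
    only after their children are factorized."""
    n = len(parent)
    children = [[] for _ in range(n)]
    roots = []
    for c, p in enumerate(parent):
        (roots if p == -1 else children[p]).append(c)
    order = []
    def dfs(v):
        for ch in children[v]:
            dfs(ch)
        order.append(v)
    for r in roots:
        dfs(r)
    return order

# fronts 0,1 feed front 2; fronts 2,3 feed the root front 4
print(postorder([2, 2, 4, 4, -1]))   # → [0, 1, 2, 3, 4]
```

In the multi-GPU setting, each root of a disjoint subtree becomes a unit of work handed to one GPU.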
Performance results: pre-GPU method
Least squares problem: 2 million by 110 thousand
Method          ordering  procs  time
x=A\b           COLMMD    1      ?
x=A\b           AMD       1      11 days
MA49            AMD       1      3.5 hours
SuiteSparseQR   AMD       1      1.5 hours
SuiteSparseQR   METIS     1      45 minutes
SuiteSparseQR   METIS     16     7.3 minutes
Algorithmic speedup vs x=A\b: 375x
Parallel speedup: 5.75x on 16 cores
Total: 2,155x (14 Gflops on 70 Gflops machine)
Single core: 2.5 Gflop peak, same as LAPACK QR
Gflop vs LAPACK (single core)
[Plot: Gflops (0.1 to 4) versus flop count / memory usage in bytes (10⁰ to 10³), comparing SuiteSparseQR with dense QR (DGEQRF) at n = 100, 1000, and 4000.]
Multifrontal QR on the GPU
Tesla C2050
double-precision frontal matrix QR (65 GFlops)
fronts remain on the GPU
frontal matrix assembly
in-progress:
strip-mining scheduling
each node of the tree = one frontal matrix
split each front into a subtree
parallel assembly of some fronts while others are factorized
Multifrontal QR on the GPU: strip-mining
[Figure: an elimination tree of eight fronts, numbered 1 through 8.]
Multifrontal QR on the GPU: strip-mining
[Figure: the same tree with large fronts split into stripes (3a-3e, 4a-4d, 8a-8c) and interleaved assembly tasks (A), scheduled across successive kernel launches (1st, 2nd, 3rd, 4th, 5th, ...).]
Multifrontal Sparse QR on the GPU: Summary
Fast symbolic analysis and fill-reducing ordering (∼ O(|A|))
Dense matrix kernels to exploit tightly-coupled regularparallelism within the GPU
Elimination tree for loosely-coupled irregular parallelism
High performance in pre-GPU version
peak Gflop rate same as LAPACK
ample parallel speedup
appears as the built-in x=A\b and qr in MATLAB R2009a
if A is all nonzero, x=sparse(A)\b can be faster than x=A\b
GPU method in progress
dense QR for frontal matrices: Sharanyan Chetlur
assembly by the GPU: regular memory traffic to/from global memory; all irregular traffic in shared memory within each thread block: Nuri Yeralan
strip-mining scheduling of the expanded frontal matrix tree
staging subtrees to handle large problems (> 6 GB)
Acknowledgements
Postscript
Please send me your matrices!
http://www.cise.ufl.edu/dropbox/www