What is the most important kernel of sparse linear solvers for heterogeneous supercomputers?


23/4/19 SNSCC'12, shengxin.zhu@maths.ox.ac.uk 1

What is the most important kernel of sparse linear

solvers for heterogeneous supercomputers?

Shengxin Zhu, The University of Oxford

Prof. Xingping Liu and Prof. Tongxiang Gu

National Key Laboratory of Computational Physics, Institute of Applied Physics and Computational Mathematics


Outline

- Brief introduction to heterogeneous supercomputers
- Computational kernels of Krylov methods
- Influence of communications
- Case study: GPBiCG(m,l)
- Challenging problems
- Conclusion


Introduction to heterogeneous supercomputers

Dawning 5000A. Nodes: ; Bandwidth: ; Memory:

Dawning 5000A ranking history (Top500):

- 11/2008: 11th
- 06/2009: 15th
- 11/2009: 19th
- 06/2010: 24th
- 11/2010: 35th
- 06/2011: 40th
- 11/2011: 58th

Top500, Nov 2011:

- 1st: K (JP)
- 2nd: NUDT (CN)
- 3rd: Cray (US)
- 4th: Dawning (CN)


Computational kernels of Krylov methods

- Vector update: parallel in nature
- Mat-vec: computation intensive; multi-core technology (CUDA/OpenMP)
- Inner product: communication intensive (CPU/MPI)
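The three kernels above can be sketched in plain Python (an illustrative sketch, not the talk's code; the hand-rolled CSR layout and the 3x3 example matrix are assumptions for demonstration):

```python
# The three Krylov kernels, hand-rolled in plain Python (illustrative sketch only).

def csr_matvec(indptr, indices, data, x):
    # sparse mat-vec: y[i] = sum of A[i, j] * x[j] over stored nonzeros of row i
    y = []
    for i in range(len(indptr) - 1):
        s = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            s += data[k] * x[indices[k]]
        y.append(s)
    return y

def dot(x, y):
    # inner product: becomes a global reduction (one synchronization) in parallel
    return sum(a * b for a, b in zip(x, y))

def axpy(alpha, x, y):
    # vector update: purely local work, no communication
    return [alpha * a + b for a, b in zip(x, y)]

# tridiagonal [[2,-1,0],[-1,2,-1],[0,-1,2]] in CSR form
indptr  = [0, 2, 5, 7]
indices = [0, 1, 0, 1, 2, 1, 2]
data    = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
assert csr_matvec(indptr, indices, data, [1.0, 1.0, 1.0]) == [1.0, 0.0, 1.0]
```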


Influence of communication: a first glance

S. Zhu, MSc Thesis, CAEP, 2010

- Computation: cheap
- Communication: expensive

Based on Aztec by Prof. Tuminaro et al. @ Sandia


The real reason communications are time-consuming (an analogy):

- Small workshop: focused; less preparation time
- Conference: diverse; more preparation time

Modeled kernel times for N unknowns on P processors (n_z: average nonzeros per row; t_fl: time per flop; t_s: latency; t_w: per-word transfer time):

- k inner products, grouped at one synchronization point: 2kN t_fl / P + 2 log2(P) (t_s + k t_w)
- vector update: 2N t_fl / P
- mat-vec: (2n_z − 1) N t_fl / P

Only the inner product carries the log2(P) latency term, so global reductions dominate as P grows.

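The communication model above can be put into a small calculator (a sketch; the machine parameter values are invented for illustration, not measured on Dawning 5000A):

```python
import math

# Reconstructed model: k inner products grouped at one synchronization point cost
#   T(k) = 2*k*N*tfl/P + 2*log2(P)*(ts + k*tw)
def t_dots_grouped(k, N, P, tfl, ts, tw):
    return 2.0 * k * N * tfl / P + 2.0 * math.log2(P) * (ts + k * tw)

def t_dots_separate(k, N, P, tfl, ts, tw):
    # one reduction per inner product: pays the latency term ts a total of k times
    return k * t_dots_grouped(1, N, P, tfl, ts, tw)

# illustrative (invented) machine parameters
N, P = 10**6, 1024
tfl, ts, tw = 1e-9, 1e-5, 1e-8
# grouping the reductions pays the log2(P)*ts latency only once
assert t_dots_grouped(3, N, P, tfl, ts, tw) < t_dots_separate(3, N, P, tfl, ts, tw)
```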

Strategies for minimizing communications

- Replace the dot product by something else (semi-Chebyshev): workshops only, no conferences, if possible. Inner-product-free methods: Gu, Liu, Mo (2002).
- Reorganize the algorithm (reduce the number of conferences and let each conference accept more talks): residual replacement strategies due to van der Vorst (2000s); CA-KSMs, Demmel et al. (2008).
- Overlap communication with computation.


A case study: parallelizing GPBiCG(m,l) (S. Fujino, 2002)

- GPBiCG(1,0) ≡ BiCGSTAB
- GPBiCG(0,1) ≡ GPBiCG
- GPBiCG(1,1) ≡ BiCGSTAB2

The family could also be used to design breakdown-free BiCGSTAB methods.
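Since GPBiCG(1,0) reduces to BiCGSTAB, a minimal serial BiCGSTAB shows the kernel mix involved (a textbook sketch, not Fujino's GPBiCG(m,l) code; a dense matvec stands in for a real sparse one):

```python
# Minimal serial BiCGSTAB (the GPBiCG(1,0) special case); dense matvec for brevity.

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def axpy(a, x, y):  # returns a*x + y
    return [a * xi + yi for xi, yi in zip(x, y)]

def matvec(A, x):
    return [dot(row, x) for row in A]

def bicgstab(A, b, x0, tol=1e-10, maxit=200):
    x = list(x0)
    r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
    r0 = list(r)                       # shadow residual r*_0
    rho = alpha = omega = 1.0
    v = [0.0] * len(b)
    p = [0.0] * len(b)
    for _ in range(maxit):
        rho_new = dot(r0, r)
        beta = (rho_new / rho) * (alpha / omega)
        p = axpy(beta, [pi - omega * vi for pi, vi in zip(p, v)], r)
        v = matvec(A, p)
        alpha = rho_new / dot(r0, v)
        s = axpy(-alpha, v, r)
        t = matvec(A, s)
        omega = dot(t, s) / dot(t, t)
        x = axpy(alpha, p, axpy(omega, s, x))
        r = axpy(-omega, t, s)
        rho = rho_new
        if dot(r, r) ** 0.5 < tol:     # each dot here is a global sync in parallel
            break
    return x

A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0]
x = bicgstab(A, b, [0.0, 0.0, 0.0])
res = [bi - axi for bi, axi in zip(b, matvec(A, x))]
assert dot(res, res) ** 0.5 < 1e-8
```

Counting the calls makes the slide's point concrete: per iteration there are 2 mat-vecs, a handful of vector updates, and 4 inner products, each of which is a global reduction in a distributed implementation.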


GPBiCG(m,l) (S. Fujino, 2002)

(The slide lists the full 21-step GPBiCG(m,l) pseudocode, which is garbled in this transcript. Its recoverable structure: initialize r_0 = b − A x_0; for k = 0, 1, ... while ||r_k|| > tol: form the search direction p_k, compute q = A p_k and the step α_k, then the intermediate residual t = r_k − α_k q and s = A t; if mod(k, m+l) < m or k = 0, take a BiCGSTAB-type step, otherwise a GPBiCG-type step; update u_k, z_k, x_{k+1} and r_{k+1}; end if; end do.)


Algorithm design of the PGPBiCG(m,l) method

(The slide shows the reconstructed 32-step PGPBiCG(m,l) pseudocode, garbled in this transcript. Notation: xy := (x, y); some inner products are computed directly, the rest are obtained indirectly from already-reduced quantities such as rr, rt, fs, fu, fq, fp, so that all of an iteration's global reductions can be grouped together.)

PGPBiCG(m,l) method (reducing the number of global communications)

Algorithm reconstruction: the three global synchronization points per iteration are merged into one!
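The reconstruction idea, packing the partial sums of several inner products and reducing once, can be mimicked in a toy model (a sketch; the `allreduce` helper is a hypothetical stand-in for MPI_Allreduce, and the two-process data is invented):

```python
# Toy model of the reconstruction: each call to `allreduce` represents one global
# synchronization; PGPBiCG packs all partial sums and reduces once.

def allreduce(local_vecs, counter):
    counter[0] += 1                      # one global synchronization
    return [sum(col) for col in zip(*local_vecs)]

# per-process partial sums of three inner products (two "processes" here)
local = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# GPBiCG-style: one reduction per inner product -> 3 synchronizations
n3 = [0]
sep = [allreduce([[v[i]] for v in local], n3)[0] for i in range(3)]

# PGPBiCG-style: pack the three partials and reduce once -> 1 synchronization
n1 = [0]
packed = allreduce(local, n1)

assert sep == packed == [5.0, 7.0, 9.0]  # same results
assert (n3[0], n1[0]) == (3, 1)          # one third of the synchronizations
```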


Performance

Based on Aztec by Prof. R.S. Tuminaro et al. @ Sandia


Convergence analysis

Residual replacements strategies

Backward stability analysis

(The slide compares the recurrence formulas of our method, PGPBiCG, with IBiCGSTAB (Yang): the coefficient updates are expressed through the merged inner-product quantities rr, rq, rt, fs, fu, fq, fr, ft, fp. The exact formulas are garbled in this transcript.)


Challenging problem: computing the dot product accurately

- Why "Mindless" by Kahan: accurately computing the inner product.
- Ogita, Rump and Oishi, Accurate sum and dot product, SIAM J. Sci. Comput., 2005; cited 188 times. (but) ...
- PLASMA team.
- Backward stability analysis of residual replacement methods: Carson and Demmel, A residual replacement strategy for improving the maximum attainable accuracy of communication-avoiding Krylov subspace methods, April 20, 2012.
- Reliable dot computation algorithms.
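One reliable-dot approach in the spirit of the Ogita-Rump paper cited above is a compensated dot product built from error-free transformations (a sketch of the standard TwoSum/TwoProduct construction, not the authors' implementation):

```python
# Error-free transformations (TwoSum, and TwoProduct via Dekker splitting),
# combined into a compensated dot product in the spirit of Ogita-Rump "Dot2".

def two_sum(a, b):             # s + e == a + b exactly
    s = a + b
    bb = s - a
    return s, (a - (s - bb)) + (b - bb)

def split(a):                  # Dekker split: a == hi + lo
    c = 134217729.0 * a        # 2**27 + 1
    hi = c - (c - a)
    return hi, a - hi

def two_prod(a, b):            # p + e == a * b exactly
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    return p, ((ah * bh - p) + ah * bl + al * bh) + al * bl

def dot2(x, y):                # compensated dot product
    s, c = 0.0, 0.0
    for xi, yi in zip(x, y):
        p, pe = two_prod(xi, yi)
        s, se = two_sum(s, p)
        c += se + pe
    return s + c

def naive_dot(x, y):
    s = 0.0
    for xi, yi in zip(x, y):
        s += xi * yi
    return s

x = [1e100, 1.0, -1e100]
y = [1.0, 1.0, 1.0]
assert naive_dot(x, y) == 0.0  # massive cancellation loses the 1.0
assert dot2(x, y) == 1.0       # the compensation term recovers it
```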


Conclusion

- Avoiding communication; reliable computation.
- Inner product computation is very likely to be the most challenging kernel for HHPC, while mat-vec is important for both.
- Software abstraction and thread programming are helpful; together with re-designing algorithms they will do better.
- Math/Algorithm, CS/Performance, Applications; interfaces: Aztec, pOSKI, Hypre, PETSc, Trilinos.
- pOSKI (Parallel Optimized Sparse Kernel Interface Library), v1.0, May 2, 2012.


Thanks!


More than ten thousand processors are connected by a network.

Global communication becomes more and more of a bottleneck.

Initial study on communication complexity.


Methods in the literature

Based on the former two strategies:

- de Sturler and van der Vorst: parallel GMRES(m) and CG methods (1995)
- Bücker and Sauren: parallel QMR method (1997)
- Yang and Brent: improved CGS, BiCG and BiCGSTAB methods (2002-03)
- Gu, Liu et al.: ICR, IBiCR, IBiCGSTAB(2) and PQMRCGSTAB methods (2004-2010)
- Demmel et al.: CA-KSMs (2008-)
- Gu, Liu and Mo: MSD-CG, the multiple search direction conjugate gradient method (2004), replaces the inner product computations by solving linear systems of small size and eliminates global inner products completely. The idea has been generalized to MPCG by Greif and Bridson (2006).


Comparison of computational counts of the two algorithms

(The counts table is garbled in this transcript. Recoverable: per iteration, both GPBiCG(m,l) and PGPBiCG(m,l) use 2 mat-vecs and about 18-19 vector updates, but GPBiCG(m,l) needs 3 synchronization points while PGPBiCG(m,l) needs only 1.)

Kernel time model (N unknowns, P processors; n_z nonzeros per row; t_fl flop time; t_s latency; t_w per-word transfer time):

- vector update: 2N t_fl / P
- mat-vec: (2n_z − 1) N t_fl / P
- k inner products at one synchronization point: 2kN t_fl / P + 2 log2(P) (t_s + k t_w)
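Plugging synchronization counts into the model shows where the reconstruction pays off (a sketch; the machine parameters are invented, and the per-point dot counts 1/2/5 versus one merged group of 8 are an assumed reading of the garbled table):

```python
import math

# Communication time of one iteration: sum over synchronization points, where a
# point with k grouped inner products costs 2*log2(P)*(ts + k*tw).
def iteration_comm(sync_counts, P, ts, tw):
    return sum(2.0 * math.log2(P) * (ts + k * tw) for k in sync_counts)

P, ts, tw = 4096, 1e-5, 1e-8          # invented machine parameters
t_gp  = iteration_comm([1, 2, 5], P, ts, tw)   # GPBiCG(m,l): 3 sync points
t_pgp = iteration_comm([8], P, ts, tw)         # PGPBiCG(m,l): one merged point
assert t_pgp < t_gp
assert t_gp / t_pgp > 2.9             # latency cost shrinks roughly 3x
```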


The time of the inner product operations of GPBiCG(m,l) and PGPBiCG(m,l)

Each synchronization point with k grouped inner products costs 2kN t_fl / P + 2 log2(P) (t_s + k t_w). The recoverable table rows (position, No. k, time):

- H, 1: 2N t_fl / P + 2 log2(P) (t_s + t_w)
- M, 2: 4N t_fl / P + 2 log2(P) (t_s + 2 t_w)
- L, 5: 10N t_fl / P + 2 log2(P) (t_s + 5 t_w)
- T, 2: 4N t_fl / P + 2 log2(P) (t_s + 2 t_w)
- M, 9: 18N t_fl / P + 2 log2(P) (t_s + 9 t_w)
- L, 15: 30N t_fl / P + 2 log2(P) (t_s + 15 t_w)

(The assignment of rows to the two methods is garbled in this transcript.)


Mathematical model of the time consumption

(The formulas T_G(P) for GPBiCG(m,l) and T_PG(P) for PGPBiCG(m,l) are garbled in this transcript. Both combine a flop term proportional to (m + l) N t_fl / P with a communication term proportional to log2(P) (t_s + t_w); merging the three synchronization points into one cuts the modeled global communication time by about 66%.)

Scalability analysis

(Garbled in this transcript. Recoverable: the scaled speedup S = T_G / T_PG approaches 3 as P grows; the isoefficiency analysis, holding the efficiency E fixed, requires N to grow like (E / (1 − E)) P log2 P for both methods, with a smaller constant for PGPBiCG(m,l).)

The optimal number of processors

(Brief proof, garbled in this transcript: substituting the time models and setting dT/dP = 0 for the common form T(P) = A / P + B log2 P gives −A/P^2 + B/(P ln 2) = 0, i.e. P_opt = A ln 2 / B, and T''(P_opt) > 0 confirms a minimum.)
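The optimization behind the brief proof can be illustrated for the model form T(P) = A/P + B log2 P (the coefficients A and B here are invented; the slide's exact values are garbled):

```python
import math

# Model total time: T(P) = A/P + B*log2(P)  (A: flop work, B: per-sync latency).
def T(P, A, B):
    return A / P + B * math.log2(P)

# dT/dP = -A/P**2 + B/(P*ln 2) = 0  gives the unique minimizer:
def p_opt(A, B):
    return A * math.log(2.0) / B

A, B = 1.0e6, 50.0                    # invented coefficients
Po = p_opt(A, B)
# the analytic optimum beats neighboring processor counts
assert T(Po, A, B) <= T(0.5 * Po, A, B)
assert T(Po, A, B) <= T(2.0 * Po, A, B)
```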

Convergence Analysis

Lemma. Let x, y ∈ R^N, let fl(x^T y) be the inner product computed by the computer and x^T y the real value, where u is the machine precision. Then

|fl(x^T y) − x^T y| ≤ 1.01 N u |x|^T |y|.
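The lemma's bound can be checked numerically using exact rational arithmetic (an illustrative sketch; `Fraction` gives the exact value of x^T y for float data, and the random sample is invented):

```python
from fractions import Fraction
import random

def naive_dot(x, y):
    # recursive floating-point summation, as a computer would evaluate x^T y
    s = 0.0
    for xi, yi in zip(x, y):
        s += xi * yi
    return s

def exact_dot(x, y):
    # exact rational value of x^T y for float data
    return sum((Fraction(xi) * Fraction(yi) for xi, yi in zip(x, y)), Fraction(0))

u = 2.0 ** -53                 # unit roundoff of IEEE double precision
random.seed(2012)
N = 1000
x = [random.uniform(-1.0, 1.0) for _ in range(N)]
y = [random.uniform(-1.0, 1.0) for _ in range(N)]

err = abs(Fraction(naive_dot(x, y)) - exact_dot(x, y))
bound = 1.01 * N * u * sum(abs(xi * yi) for xi, yi in zip(x, y))
assert float(err) <= bound     # the lemma's bound holds for this sample
```

Note the factor N u in the bound: it is what makes long inner products on large problems lose accuracy, which motivates the reliable-dot algorithms above.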


Conclusion: when N is very large, |fl(x^T y) − x^T y| might be much larger than u.


Numerical Experiments: timing and improvements

(The test-problem formulas are garbled in this transcript. Recoverable: a 2D convection-diffusion model problem with coefficients a, b, c, d, e and homogeneous Dirichlet boundary condition u|∂Ω = 0; Experiment I: 3600 unknowns per CPU; Experiment II: fixed problem size 960 x 960; parameters (m, l) ∈ {(1,0), (0,1), (1,1), (2,8), (8,2)}; reported quantities: the ratio R = T_comm / T_all and the speedup of PGPBiCG over GPBiCG.)

Numerical Experiments: Speedup


Conclusions

- The PGPBiCG(m,l) method is more scalable and more parallel for solving large sparse nonsymmetric linear systems on distributed parallel architectures.
- Performance analysis, isoefficiency analysis and numerical experiments have been carried out for the PGPBiCG(m,l) and GPBiCG(m,l) methods.
- The parallel communication performance can be improved by a factor larger than 3.
- The PGPBiCG(m,l) method has better parallel speedup than the GPBiCG(m,l) method.
- For further performance improvements: overlap of computation with communication; numerical stability.