23/4/19 SNSCC'12, [email protected] 1
What is the most important kernel of sparse linear solvers for heterogeneous supercomputers?
Shengxin Zhu, The University of Oxford
Prof. Xingping Liu and Prof. Tongxiang Gu, National Key Laboratory of Computational Physics, Institute of Applied Physics and Computational Mathematics
Outline
- Brief introduction to heterogeneous supercomputers
- Computational kernels of Krylov methods
- Influence of communication
- Case study: GPBiCG(m,l)
- Challenging problems
- Conclusion
Introduction to heterogeneous supercomputers
Dawning 5000A
Nodes: | Bandwidth: | Memory:
Dawning 5000: TOP500 ranking history
- 11/2008: 11th
- 06/2009: 15th
- 11/2009: 19th
- 06/2010: 24th
- 11/2010: 35th
- 06/2011: 40th
- 11/2011: 58th

TOP500, November 2011:
- 1st: K computer (JP)
- 2nd: NUDT (CN)
- 3rd: Cray (US)
- 4th: Dawning (CN)
Computational kernels of Krylov methods
- Vector update: parallel in nature.
- Mat-vec: computation-intensive; multi-core technology (CUDA/OpenMP).
- Inner product: communication-intensive (CPU/MPI).
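The communication asymmetry among these kernels can be sketched in a few lines: with row-distributed vectors, a vector update needs no communication at all, while every inner product ends in a global reduction. Below is a minimal simulation in plain Python/NumPy; the `global_sum` helper is an invented stand-in for MPI_Allreduce, not real MPI.

```python
import numpy as np

reductions = 0  # counts simulated global synchronizations

def global_sum(partials):
    """Stand-in for MPI_Allreduce: every rank contributes one partial sum."""
    global reductions
    reductions += 1
    return sum(partials)

def axpy(alpha, x_chunks, y_chunks):
    """Vector update y := y + alpha*x -- purely local, no communication."""
    return [y + alpha * x for x, y in zip(x_chunks, y_chunks)]

def dot(x_chunks, y_chunks):
    """Inner product -- each rank forms a partial sum, then one global reduction."""
    return global_sum([float(x @ y) for x, y in zip(x_chunks, y_chunks)])

rng = np.random.default_rng(0)
x = rng.standard_normal(12)
y = rng.standard_normal(12)
xc = np.split(x, 4)   # 4 simulated ranks, each holding a row block
yc = np.split(y, 4)

z = axpy(2.0, xc, yc)   # triggers no reduction
d = dot(xc, yc)         # triggers exactly one reduction
```

However many vector updates an iteration performs, `reductions` stays at zero; each dot product adds one synchronization, which is what the following slides quantify.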
Influence of communication: a first glance
S. Zhu, MSc Thesis, CAEP, 2010
Computation is cheap; communication is expensive.
Based on Aztec by Prof. Tuminaro et al. at Sandia.
Real reason for time-consuming communication
Analogy: a small workshop is focused and needs little preparation time; a conference is diverse and needs much more preparation time.
Kernel time model (N: problem size, P: processors, n_z: nonzeros per row; t_fl: time per flop, t_s: latency, t_w: per-word bandwidth cost):
- k inner products: t_inn = 2kN t_fl/P + 2k log2(P) (t_s + t_w)
- vector update: t_vec = 2N t_fl/P
- mat-vec: t_mv = (2 n_z - 1) N t_fl/P
Latency: t_s; bandwidth: t_w.
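This cost model can be written down directly. A small sketch under the slide's assumptions (t_fl: per-flop time, t_s: latency, t_w: per-word time; the constants are approximate readings of the slide, not measured values), including a grouped variant in which k dot products share one synchronization:

```python
from math import log2

def t_vector_update(N, P, t_fl):
    """y := y + alpha*x, fully local: 2N flops spread over P processes."""
    return 2 * N * t_fl / P

def t_matvec(N, P, n_z, t_fl):
    """Local part of a sparse mat-vec with n_z nonzeros per row."""
    return (2 * n_z - 1) * N * t_fl / P

def t_inner_products(k, N, P, t_fl, t_s, t_w):
    """k dot products grouped into ONE global synchronization:
    local partial sums plus a log2(P)-depth reduction carrying k words."""
    return 2 * k * N * t_fl / P + 2 * log2(P) * (t_s + k * t_w)

def t_separate(k, N, P, t_fl, t_s, t_w):
    """k dot products, each paying its own latency (k synchronizations)."""
    return sum(t_inner_products(1, N, P, t_fl, t_s, t_w) for _ in range(k))
```

Comparing the two inner-product variants shows that grouping saves exactly (k - 1) * 2 * log2(P) * t_s of pure latency, which is the quantity the later slides attack.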
Strategies for minimizing communication
- Replace the dot products by something else (semi-Chebyshev acceleration): workshops only, no conferences, if possible. Inner-product-free methods: Gu, Liu, Mo (2002).
- Reorganize the algorithm (reduce the number of conferences and let each conference accept more talks): residual replacement strategies due to van der Vorst (2000s); CA-KSMs, Demmel et al. (2008).
- Overlap communication with computation.
A case study: parallelizing GPBiCG(m,l) (S. Fujino, 2002)
- GPBiCG(1,0) = BiCGSTAB
- GPBiCG(0,1) = GPBiCG
- GPBiCG(1,1) = BiCGSTAB2
The framework can also be used to design a breakdown-free BiCGSTAB method.
GPBiCG(m,l) (S. Fujino, 2002)
[Algorithm listing garbled in extraction. Recoverable structure: set r_0 = b - Ax_0; for k = 0, 1, ... while ||r_k|| > tol: compute q_k = Ar_k, the search direction p_k, t_k and s_k = At_k; if mod(k, m+l) < m or k = 0 (steps 7-11), take a BiCGSTAB-type stabilization step using the inner products (s,t) and (s,s); else (steps 12-21), take a GPBiCG-type step using (s,t), (y,t), (y,s), (s,s), (y,y); update x_{k+1} and r_{k+1}; endif; enddo.]
Algorithm Design of PGPBiCG(m,l) Method
[Listing garbled in extraction. Notation: xy := (x, y) denotes an inner product; some inner products are computed directly, the rest are recovered indirectly from recurrences. Recoverable structure: set r_0 = b - Ax_0, p_0 = r_0, q_0 = Ap_0; in each iteration all local partial inner products (st, ss, sy, yt, yy, rt, ry, rs, fs, fy, ft, fh, fq, fp, rr, ...) are formed and combined in a single global reduction, after which the method branches on mod(k, m+l) < m or k = 0 exactly as GPBiCG(m,l) does (steps 1-32), updating x_{k+1}, r_{k+1}, p_{k+1}, q_{k+1}.]
PGPBiCG(m,l) method (reducing the number of global communications)
Algorithm reconstruction: three global synchronizations per iteration are merged into one.
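The reconstruction idea itself is simple to sketch: each rank packs the partial results of several inner products into one buffer and performs a single reduction. A hedged simulation follows; `allreduce` is an invented stand-in for MPI_Allreduce, not real MPI.

```python
import numpy as np

sync_points = 0  # counts simulated global synchronizations

def allreduce(partials):
    """Stand-in for MPI_Allreduce: combine per-rank partial buffers, one sync."""
    global sync_points
    sync_points += 1
    return np.sum(partials, axis=0)

def dots_batched(pairs_per_rank):
    """Several inner products, ONE global synchronization: each rank packs all
    of its partial dot products into a single buffer before reducing."""
    partials = np.array([[float(a @ b) for a, b in pairs]
                         for pairs in pairs_per_rank])
    return allreduce(partials)

rng = np.random.default_rng(1)
x, y, z = (rng.standard_normal(8) for _ in range(3))
xc, yc, zc = np.split(x, 4), np.split(y, 4), np.split(z, 4)

# Each of the 4 simulated ranks holds its chunk of every operand pair.
pairs_per_rank = [[(xc[r], yc[r]), (xc[r], zc[r]), (yc[r], zc[r])]
                  for r in range(4)]
d = dots_batched(pairs_per_rank)   # three dot products, one reduction
```

Done naively, the three dot products would cost three latency-bound reductions; batched, they cost one reduction carrying three words, which is the trade the slide's "three global synchronizations to one" refers to.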
Performance
Based on Aztec by Prof. R.S. Tuminaro et al. at Sandia.
Convergence analysis
- Residual replacement strategies
- Backward stability analysis
[Comparison garbled in extraction: the grouped inner-product recurrences of IBiCGSTAB (Yang) versus PGPBiCG, built from quantities such as rr, rq, rt, fs, fu, fq, fr, ft, fp.]
Challenging problem: accurately computing the dot product
- "Why Mindless?" by Kahan: accurately computing inner products.
- Ogita and Rump et al., Accurate Sum and Dot Product, SIAM J. Sci. Comput., 2005; cited 188 times. (but) ....
- PLASMA team: backward stability analysis of residual replacement methods.
- Carson and Demmel, A residual replacement strategy for improving the maximum attainable accuracy of communication-avoiding Krylov subspace methods, April 20, 2012.
- Reliable dot-product computation algorithms.
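A compensated dot product in the spirit of Ogita, Rump and Oishi can be sketched with error-free transformations: each product and partial sum is split into its rounded result plus an exactly representable error term, roughly doubling the working precision. This is an illustrative Dot2-style sketch, not the authors' code.

```python
def two_sum(a, b):
    """Error-free transformation (Knuth): a + b = s + err exactly."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err

def split(a, factor=134217729.0):   # 2**27 + 1, Veltkamp split for doubles
    c = factor * a
    hi = c - (c - a)
    return hi, a - hi

def two_prod(a, b):
    """Error-free transformation (Dekker): a * b = p + err exactly."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    err = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, err

def dot2(x, y):
    """Compensated dot product: accumulate products and their rounding errors."""
    s, c = 0.0, 0.0
    for a, b in zip(x, y):
        p, ep = two_prod(a, b)
        s, es = two_sum(s, p)
        c += ep + es
    return s + c
```

On an ill-conditioned example such as x = [1e16, 1.0, -1e16], y = [1, 1, 1], the naive loop returns 0.0 while `dot2` recovers the exact value 1.0, which is exactly the failure mode the convergence analysis worries about.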
Conclusion: avoiding communication, reliable computation
- Inner-product computation is very likely to be the most challenging kernel for heterogeneous HPC, while mat-vec is important for both.
- Software abstraction and threaded programming are helpful; together with redesigned algorithms they will do better.
- Math/Algorithm, CS/Performance, Applications interface layers: Aztec; pOSKI (Parallel Optimized Sparse Kernel Interface Library), v1.0, May 2, 2012; Hypre, PETSc, Trilinos.
More than ten thousand processors are connected by a network, so global communication becomes more and more serious; this motivates an initial study of communication complexity.
Methods in the literature
Based on the former two strategies:
- de Sturler and van der Vorst: parallel GMRES(m) and CG methods (1995)
- Bücker and Sauren: parallel QMR method (1997)
- Yang and Brent: improved CGS, BiCG and BiCGSTAB methods (2002-03)
- Gu, Liu et al.: ICR, IBiCR, IBiCGSTAB(2) and PQMRCGSTAB methods (2004-2010)
- Demmel et al.: CA-KSMs (2008-)
- Gu, Liu and Mo: MSD-CG, the multiple search direction conjugate gradient method (2004), which replaces the inner-product computations by solving linear systems with small size and eliminates global inner products completely; the idea was generalized to MPCG by Greif and Bridson (2006).
Comparison of computational counts of the two algorithms

Method | Mat_vec | vect_update | No._inner (by position) | Syn_points
GPBiCG(m,l) | 2 | 18 | 1, 2, 5 | 3
PGPBiCG(m,l) | 2 | 19 | 2, 9, 15 | 1

Computation-kernel time model (N: problem size, P: processors, n_z: nonzeros per row; t_fl: flop time, t_s: latency, t_w: per-word time):
- k inner products: t_inn = 2kN t_fl/P + 2k log2(P) (t_s + t_w)
- vector update: t_vec = 2N t_fl/P
- mat-vec: t_mv = (2 n_z - 1) N t_fl/P
The time of inner-product operations of GPBiCG(m,l) and PGPBiCG(m,l)

Method | position | No. | time
GPBiCG(m,l) | H | 1 | 2N t_fl/P + 2 log2(P) (t_s + t_w)
GPBiCG(m,l) | M | 2 | 4N t_fl/P + 2 log2(P) (t_s + 2 t_w)
GPBiCG(m,l) | L | 5 | 10N t_fl/P + 2 log2(P) (t_s + 5 t_w)
PGPBiCG(m,l) | T | 2 | 4N t_fl/P + 2 log2(P) (t_s + 2 t_w)
PGPBiCG(m,l) | M | 9 | 18N t_fl/P + 2 log2(P) (t_s + 9 t_w)
PGPBiCG(m,l) | L | 15 | 30N t_fl/P + 2 log2(P) (t_s + 15 t_w)
Mathematical model of the time consumption
[Formulas garbled in extraction. Recoverable structure: T_G(P) and T_PG(P) each have the form a N t_fl/P + log2(P) (b t_s + c t_w); the coefficient groups 32(m+l)+46, 6(m+l)+10 and 16 appear for GPBiCG(m,l), and 40(m+l)+60, 2 and 18(m+l)+30 for PGPBiCG(m,l), together with a 2(2 n_z - 1) mat-vec term; the slide quotes 66% for the communication-time saving.]
Scalability analysis
Scaled speedup: S = T_G(P)/T_PG(P), approximately 3.
Isoefficiency analysis (efficiency E fixed, N = f(P, E), E defined from the overhead T_over):
[Details garbled in extraction; recoverable: the problem size must grow like (E/(1-E)) 12 log2(P) P for GPBiCG(m,l) but only like (E/(1-E)) 2 log2(P) P for PGPBiCG(m,l), so the modified method sustains a given efficiency with a six-times-smaller growth rate.]
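The isoefficiency relation can be checked numerically with a generic two-term cost model; the coefficients `a` and `b` below are invented stand-ins, not the slide's values.

```python
from math import log2

# Hedged sketch: with T_par(N, P) = a*N/P + b*log2(P) (a: flop cost per
# unknown, b: per-synchronization communication cost), the efficiency is
#   E(N, P) = a*N / (a*N + b*P*log2(P)),
# so holding E fixed requires N to grow like (E/(1-E))*(b/a)*P*log2(P).

def efficiency(N, P, a, b):
    """Parallel efficiency E = T_seq / (P * T_par) for this model."""
    return a * N / (a * N + b * P * log2(P))

def iso_N(E, P, a, b):
    """Problem size needed at P processors to sustain efficiency E."""
    return (E / (1.0 - E)) * (b / a) * P * log2(P)

a, b, E = 1.0, 50.0, 0.8
```

Plugging `iso_N(E, P, a, b)` back into `efficiency` returns E for every P, confirming the N ~ P log2(P) isoefficiency shape the slide describes; a smaller communication coefficient b shrinks the required growth rate proportionally.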
The optimal number of processors
Applying the time models T_G(P) and T_PG(P) gives an optimal processor count P_opt for each method.
Brief proof: each model has the form f(x) = A/x + B log2(x) + C with A, B > 0 and C constant; then f'(x) = -A/x^2 + B/(x ln 2) = 0 gives x_opt = A ln 2 / B, and f''(x_opt) > 0, so x_opt is the minimizer.
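The brief proof can be verified numerically for a model of the form T(P) = A/P + B*log2(P); the values of A and B below are made-up samples, not the slide's coefficients.

```python
from math import log2, log

# Hedged sketch: computation shrinks like A/P while the global-reduction
# cost grows like B*log2(P).  Setting dT/dP = -A/P**2 + B/(P*ln 2) = 0
# yields the closed-form minimizer P_opt = A*ln(2)/B.

def T(P, A, B):
    """Model runtime on P processors (additive constant omitted)."""
    return A / P + B * log2(P)

def p_opt(A, B):
    """Stationary point of T; the second derivative there is positive."""
    return A * log(2) / B

A, B = 1.0e6, 40.0
P_star = p_opt(A, B)
```

Evaluating T slightly to either side of `P_star` confirms it is a minimum, mirroring the slide's f' = 0, f'' > 0 argument.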
Convergence Analysis
Lemma. Let x, y be vectors in R^n, let fl(x^T y) be the inner product computed in floating-point arithmetic, and let x^T y be the exact value; then
|fl(x^T y) - x^T y| <= 1.01 n u |x|^T |y|,
where u is the machine precision.
Conclusion: when n is very large, |fl(x^T y) - x^T y| might be much larger than u.
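The lemma and its conclusion are easy to check numerically; the example vectors below are invented for illustration, and the exact value is computed with rationals.

```python
from fractions import Fraction

# Hedged numerical check of the lemma: for the naive floating-point dot
# product, |fl(x'y) - x'y| <= 1.01*n*u*|x|'|y| with u = 2**-53 for IEEE
# doubles.  Under heavy cancellation the bound's right side dwarfs |x'y|,
# so the RELATIVE error of fl(x'y) can be far larger than u.

u = 2.0 ** -53

def fl_dot(x, y):
    """Naive left-to-right floating-point dot product."""
    s = 0.0
    for a, b in zip(x, y):
        s += a * b
    return s

def exact_dot(x, y):
    """Exact dot product via rational arithmetic."""
    return sum(Fraction(a) * Fraction(b) for a, b in zip(x, y))

x = [1e16, 3.0, -1e16, 2.0]   # exact dot product is 5, but huge cancellation
y = [1.0, 1.0, 1.0, 1.0]

err = abs(Fraction(fl_dot(x, y)) - exact_dot(x, y))
bound = Fraction(1.01 * len(x) * u) * sum(abs(Fraction(a) * Fraction(b))
                                          for a, b in zip(x, y))
```

Here the absolute error stays within the lemma's bound, yet it exceeds u times the true value by many orders of magnitude, which is exactly the "much larger than u" conclusion above.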
Numerical Experiments: timing and improvements
[Setup garbled in extraction. Recoverable: the test problem is a 2D PDE in u(x, y) with coefficients a, b, c, d, e on (0,1) x (0,1) with u = 0 on the boundary. Experiment I: 3600 unknowns per CPU; Experiment II: fixed problem size 960. Variants tested: GPBiCG(1,0), GPBiCG(0,1), GPBiCG(1,1), GPBiCG(2,8), GPBiCG(8,2). Reported quantities: the timing ratio R_c = T_G/T_IG, the communication fraction comm/all, and the speedup (with a fit).]
Conclusions
- The PGPBiCG(m,l) method is more scalable and parallel for solving large sparse nonsymmetric linear systems on distributed parallel architectures.
- Performance and isoefficiency analysis and numerical experiments have been carried out for the PGPBiCG(m,l) and GPBiCG(m,l) methods.
- The parallel communication performance can be improved by a factor larger than 3.
- The PGPBiCG(m,l) method has better parallel speedup than the GPBiCG(m,l) method.
- For further performance improvements: overlap of computation with communication, numerical stability.