23/4/19 SNSCC'12, [email protected] 1
What is the most important kernel of sparse linear solvers for heterogeneous supercomputers?
Shengxin Zhu, The University of Oxford
Prof. Xingping Liu and Prof. Tongxiang Gu, National Key Laboratory of Computational Physics, Institute of Applied Physics and Computational Mathematics
Outline
- Brief introduction to heterogeneous supercomputers
- Computational kernels of Krylov methods
- Influence of communication
- Case study: GPBiCG(m,l)
- Challenging problems
- Conclusion
Introduction to heterogeneous supercomputers
Dawning 5000A
Nodes: | Bandwidth: | Memory:
Dawning 5000: TOP500 ranking history
- 11/2008: 11th
- 06/2009: 15th
- 11/2009: 19th
- 06/2010: 24th
- 11/2010: 35th
- 06/2011: 40th
- 11/2011: 58th

TOP500, November 2011:
- 1st: K computer (JP)
- 2nd: NUDT (CN)
- 3rd: Cray (US)
- 4th: Dawning (CN)
Computational kernels of Krylov methods
- Vector update: parallel in nature.
- Mat-vec: computation-intensive; multi-core technology (CUDA/OpenMP).
- Inner product: communication-intensive (CPU/MPI).
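The communication asymmetry among these kernels can be sketched in a few lines: with row-distributed vectors, a vector update needs no communication at all, while every inner product ends in a global reduction. Below is a minimal simulation in plain Python/NumPy; the `global_sum` helper is an invented stand-in for MPI_Allreduce, not real MPI.

```python
import numpy as np

reductions = 0  # counts simulated global synchronizations

def global_sum(partials):
    """Stand-in for MPI_Allreduce: every rank contributes one partial sum."""
    global reductions
    reductions += 1
    return sum(partials)

def axpy(alpha, x_chunks, y_chunks):
    """Vector update y := y + alpha*x -- purely local, no communication."""
    return [y + alpha * x for x, y in zip(x_chunks, y_chunks)]

def dot(x_chunks, y_chunks):
    """Inner product -- each rank forms a partial sum, then one global reduction."""
    return global_sum([float(x @ y) for x, y in zip(x_chunks, y_chunks)])

rng = np.random.default_rng(0)
x = rng.standard_normal(12)
y = rng.standard_normal(12)
xc = np.split(x, 4)   # 4 simulated ranks, each holding a row block
yc = np.split(y, 4)

z = axpy(2.0, xc, yc)   # triggers no reduction
d = dot(xc, yc)         # triggers exactly one reduction
```

However many vector updates an iteration performs, `reductions` stays at zero; each dot product adds one synchronization, which is what the following slides quantify.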
Influence of communication: a first glance
S. Zhu, MSc Thesis, CAEP, 2010
Computation is cheap; communication is expensive.
Based on Aztec by Prof. Tuminaro et al. at Sandia.
Real reason for time-consuming communication
Analogy: a small workshop is focused and needs little preparation time; a conference is diverse and needs much more preparation time.
Kernel time model (N: problem size, P: processors, n_z: nonzeros per row; t_fl: time per flop, t_s: latency, t_w: per-word bandwidth cost):
- k inner products: t_inn = 2kN t_fl/P + 2k log2(P) (t_s + t_w)
- vector update: t_vec = 2N t_fl/P
- mat-vec: t_mv = (2 n_z - 1) N t_fl/P
Latency: t_s; bandwidth: t_w.
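This cost model can be written down directly. A small sketch under the slide's assumptions (t_fl: per-flop time, t_s: latency, t_w: per-word time; the constants are approximate readings of the slide, not measured values), including a grouped variant in which k dot products share one synchronization:

```python
from math import log2

def t_vector_update(N, P, t_fl):
    """y := y + alpha*x, fully local: 2N flops spread over P processes."""
    return 2 * N * t_fl / P

def t_matvec(N, P, n_z, t_fl):
    """Local part of a sparse mat-vec with n_z nonzeros per row."""
    return (2 * n_z - 1) * N * t_fl / P

def t_inner_products(k, N, P, t_fl, t_s, t_w):
    """k dot products grouped into ONE global synchronization:
    local partial sums plus a log2(P)-depth reduction carrying k words."""
    return 2 * k * N * t_fl / P + 2 * log2(P) * (t_s + k * t_w)

def t_separate(k, N, P, t_fl, t_s, t_w):
    """k dot products, each paying its own latency (k synchronizations)."""
    return sum(t_inner_products(1, N, P, t_fl, t_s, t_w) for _ in range(k))
```

Comparing the two inner-product variants shows that grouping saves exactly (k - 1) * 2 * log2(P) * t_s of pure latency, which is the quantity the later slides attack.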
Strategies for minimizing communication
- Replace the dot products by something else (semi-Chebyshev acceleration): workshops only, no conferences, if possible. Inner-product-free methods: Gu, Liu, Mo (2002).
- Reorganize the algorithm (reduce the number of conferences and let each conference accept more talks): residual replacement strategies due to van der Vorst (2000s); CA-KSMs, Demmel et al. (2008).
- Overlap communication with computation.
A case study: parallelizing GPBiCG(m,l) (S. Fujino, 2002)
- GPBiCG(1,0) = BiCGSTAB
- GPBiCG(0,1) = GPBiCG
- GPBiCG(1,1) = BiCGSTAB2
The framework can also be used to design a breakdown-free BiCGSTAB method.
GPBiCG(m,l) (S. Fujino, 2002)
[Algorithm listing garbled in extraction. Recoverable structure: set r_0 = b - Ax_0; for k = 0, 1, ... while ||r_k|| > tol: compute q_k = Ar_k, the search direction p_k, t_k and s_k = At_k; if mod(k, m+l) < m or k = 0 (steps 7-11), take a BiCGSTAB-type stabilization step using the inner products (s,t) and (s,s); else (steps 12-21), take a GPBiCG-type step using (s,t), (y,t), (y,s), (s,s), (y,y); update x_{k+1} and r_{k+1}; endif; enddo.]
Algorithm Design of PGPBiCG(m,l) Method
[Listing garbled in extraction. Notation: xy := (x, y) denotes an inner product; some inner products are computed directly, the rest are recovered indirectly from recurrences. Recoverable structure: set r_0 = b - Ax_0, p_0 = r_0, q_0 = Ap_0; in each iteration all local partial inner products (st, ss, sy, yt, yy, rt, ry, rs, fs, fy, ft, fh, fq, fp, rr, ...) are formed and combined in a single global reduction, after which the method branches on mod(k, m+l) < m or k = 0 exactly as GPBiCG(m,l) does (steps 1-32), updating x_{k+1}, r_{k+1}, p_{k+1}, q_{k+1}.]
PGPBiCG(m,l) method (reducing the number of global communications)
Algorithm reconstruction: three global synchronizations per iteration are merged into one.
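The reconstruction idea itself is simple to sketch: each rank packs the partial results of several inner products into one buffer and performs a single reduction. A hedged simulation follows; `allreduce` is an invented stand-in for MPI_Allreduce, not real MPI.

```python
import numpy as np

sync_points = 0  # counts simulated global synchronizations

def allreduce(partials):
    """Stand-in for MPI_Allreduce: combine per-rank partial buffers, one sync."""
    global sync_points
    sync_points += 1
    return np.sum(partials, axis=0)

def dots_batched(pairs_per_rank):
    """Several inner products, ONE global synchronization: each rank packs all
    of its partial dot products into a single buffer before reducing."""
    partials = np.array([[float(a @ b) for a, b in pairs]
                         for pairs in pairs_per_rank])
    return allreduce(partials)

rng = np.random.default_rng(1)
x, y, z = (rng.standard_normal(8) for _ in range(3))
xc, yc, zc = np.split(x, 4), np.split(y, 4), np.split(z, 4)

# Each of the 4 simulated ranks holds its chunk of every operand pair.
pairs_per_rank = [[(xc[r], yc[r]), (xc[r], zc[r]), (yc[r], zc[r])]
                  for r in range(4)]
d = dots_batched(pairs_per_rank)   # three dot products, one reduction
```

Done naively, the three dot products would cost three latency-bound reductions; batched, they cost one reduction carrying three words, which is the trade the slide's "three global synchronizations to one" refers to.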
Performance
Based on Aztec by Prof. R.S. Tuminaro et al. at Sandia.
Convergence analysis
- Residual replacement strategies
- Backward stability analysis
[Comparison garbled in extraction: the grouped inner-product recurrences of IBiCGSTAB (Yang) versus PGPBiCG, built from quantities such as rr, rq, rt, fs, fu, fq, fr, ft, fp.]
Challenging problem: accurately computing the dot product
- "Why Mindless?" by Kahan: accurately computing inner products.
- Ogita and Rump et al., Accurate Sum and Dot Product, SIAM J. Sci. Comput., 2005; cited 188 times. (but) ....
- PLASMA team: backward stability analysis of residual replacement methods.
- Carson and Demmel, A residual replacement strategy for improving the maximum attainable accuracy of communication-avoiding Krylov subspace methods, April 20, 2012.
- Reliable dot-product computation algorithms.
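A compensated dot product in the spirit of Ogita, Rump and Oishi can be sketched with error-free transformations: each product and partial sum is split into its rounded result plus an exactly representable error term, roughly doubling the working precision. This is an illustrative Dot2-style sketch, not the authors' code.

```python
def two_sum(a, b):
    """Error-free transformation (Knuth): a + b = s + err exactly."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err

def split(a, factor=134217729.0):   # 2**27 + 1, Veltkamp split for doubles
    c = factor * a
    hi = c - (c - a)
    return hi, a - hi

def two_prod(a, b):
    """Error-free transformation (Dekker): a * b = p + err exactly."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    err = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, err

def dot2(x, y):
    """Compensated dot product: accumulate products and their rounding errors."""
    s, c = 0.0, 0.0
    for a, b in zip(x, y):
        p, ep = two_prod(a, b)
        s, es = two_sum(s, p)
        c += ep + es
    return s + c
```

On an ill-conditioned example such as x = [1e16, 1.0, -1e16], y = [1, 1, 1], the naive loop returns 0.0 while `dot2` recovers the exact value 1.0, which is exactly the failure mode the convergence analysis worries about.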
Conclusion: avoiding communication, reliable computation
- Inner-product computation is very likely to be the most challenging kernel for heterogeneous HPC, while mat-vec is important for both.
- Software abstraction and threaded programming are helpful; together with redesigned algorithms they will do better.
- Math/Algorithm, CS/Performance, Applications interface layers: Aztec; pOSKI (Parallel Optimized Sparse Kernel Interface Library), v1.0, May 2, 2012; Hypre, PETSc, Trilinos.
More than ten thousand processors are connected by a network, so global communication becomes more and more serious; this motivates an initial study of communication complexity.
Methods in the literature
Based on the former two strategies:
- de Sturler and van der Vorst: parallel GMRES(m) and CG methods (1995)
- Bücker and Sauren: parallel QMR method (1997)
- Yang and Brent: improved CGS, BiCG and BiCGSTAB methods (2002-03)
- Gu, Liu et al.: ICR, IBiCR, IBiCGSTAB(2) and PQMRCGSTAB methods (2004-2010)
- Demmel et al.: CA-KSMs (2008-)
- Gu, Liu and Mo: MSD-CG, the multiple search direction conjugate gradient method (2004), which replaces the inner-product computations by solving linear systems with small size and eliminates global inner products completely; the idea was generalized to MPCG by Greif and Bridson (2006).
Comparison of computational counts of the two algorithms

Method | Mat_vec | vect_update | No._inner (by position) | Syn_points
GPBiCG(m,l) | 2 | 18 | 1, 2, 5 | 3
PGPBiCG(m,l) | 2 | 19 | 2, 9, 15 | 1

Computation-kernel time model (N: problem size, P: processors, n_z: nonzeros per row; t_fl: flop time, t_s: latency, t_w: per-word time):
- k inner products: t_inn = 2kN t_fl/P + 2k log2(P) (t_s + t_w)
- vector update: t_vec = 2N t_fl/P
- mat-vec: t_mv = (2 n_z - 1) N t_fl/P
The time of inner-product operations of GPBiCG(m,l) and PGPBiCG(m,l)

Method | position | No. | time
GPBiCG(m,l) | H | 1 | 2N t_fl/P + 2 log2(P) (t_s + t_w)
GPBiCG(m,l) | M | 2 | 4N t_fl/P + 2 log2(P) (t_s + 2 t_w)
GPBiCG(m,l) | L | 5 | 10N t_fl/P + 2 log2(P) (t_s + 5 t_w)
PGPBiCG(m,l) | T | 2 | 4N t_fl/P + 2 log2(P) (t_s + 2 t_w)
PGPBiCG(m,l) | M | 9 | 18N t_fl/P + 2 log2(P) (t_s + 9 t_w)
PGPBiCG(m,l) | L | 15 | 30N t_fl/P + 2 log2(P) (t_s + 15 t_w)
Mathematical model of the time consumption
[Formulas garbled in extraction. Recoverable structure: T_G(P) and T_PG(P) each have the form a N t_fl/P + log2(P) (b t_s + c t_w); the coefficient groups 32(m+l)+46, 6(m+l)+10 and 16 appear for GPBiCG(m,l), and 40(m+l)+60, 2 and 18(m+l)+30 for PGPBiCG(m,l), together with a 2(2 n_z - 1) mat-vec term; the slide quotes 66% for the communication-time saving.]
Scalability analysis
Scaled speedup: S = T_G(P)/T_PG(P), approximately 3.
Isoefficiency analysis (efficiency E fixed, N = f(P, E), E defined from the overhead T_over):
[Details garbled in extraction; recoverable: the problem size must grow like (E/(1-E)) 12 log2(P) P for GPBiCG(m,l) but only like (E/(1-E)) 2 log2(P) P for PGPBiCG(m,l), so the modified method sustains a given efficiency with a six-times-smaller growth rate.]
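The isoefficiency relation can be checked numerically with a generic two-term cost model; the coefficients `a` and `b` below are invented stand-ins, not the slide's values.

```python
from math import log2

# Hedged sketch: with T_par(N, P) = a*N/P + b*log2(P) (a: flop cost per
# unknown, b: per-synchronization communication cost), the efficiency is
#   E(N, P) = a*N / (a*N + b*P*log2(P)),
# so holding E fixed requires N to grow like (E/(1-E))*(b/a)*P*log2(P).

def efficiency(N, P, a, b):
    """Parallel efficiency E = T_seq / (P * T_par) for this model."""
    return a * N / (a * N + b * P * log2(P))

def iso_N(E, P, a, b):
    """Problem size needed at P processors to sustain efficiency E."""
    return (E / (1.0 - E)) * (b / a) * P * log2(P)

a, b, E = 1.0, 50.0, 0.8
```

Plugging `iso_N(E, P, a, b)` back into `efficiency` returns E for every P, confirming the N ~ P log2(P) isoefficiency shape the slide describes; a smaller communication coefficient b shrinks the required growth rate proportionally.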
The optimal number of processors
Applying the time models T_G(P) and T_PG(P) gives an optimal processor count P_opt for each method.
Brief proof: each model has the form f(x) = A/x + B log2(x) + C with A, B > 0 and C constant; then f'(x) = -A/x^2 + B/(x ln 2) = 0 gives x_opt = A ln 2 / B, and f''(x_opt) > 0, so x_opt is the minimizer.
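The brief proof can be verified numerically for a model of the form T(P) = A/P + B*log2(P); the values of A and B below are made-up samples, not the slide's coefficients.

```python
from math import log2, log

# Hedged sketch: computation shrinks like A/P while the global-reduction
# cost grows like B*log2(P).  Setting dT/dP = -A/P**2 + B/(P*ln 2) = 0
# yields the closed-form minimizer P_opt = A*ln(2)/B.

def T(P, A, B):
    """Model runtime on P processors (additive constant omitted)."""
    return A / P + B * log2(P)

def p_opt(A, B):
    """Stationary point of T; the second derivative there is positive."""
    return A * log(2) / B

A, B = 1.0e6, 40.0
P_star = p_opt(A, B)
```

Evaluating T slightly to either side of `P_star` confirms it is a minimum, mirroring the slide's f' = 0, f'' > 0 argument.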
Convergence Analysis
Lemma. Let x, y be vectors in R^n, let fl(x^T y) be the inner product computed in floating-point arithmetic, and let x^T y be the exact value; then
|fl(x^T y) - x^T y| <= 1.01 n u |x|^T |y|,
where u is the machine precision.
Conclusion: when n is very large, |fl(x^T y) - x^T y| might be much larger than u.
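The lemma and its conclusion are easy to check numerically; the example vectors below are invented for illustration, and the exact value is computed with rationals.

```python
from fractions import Fraction

# Hedged numerical check of the lemma: for the naive floating-point dot
# product, |fl(x'y) - x'y| <= 1.01*n*u*|x|'|y| with u = 2**-53 for IEEE
# doubles.  Under heavy cancellation the bound's right side dwarfs |x'y|,
# so the RELATIVE error of fl(x'y) can be far larger than u.

u = 2.0 ** -53

def fl_dot(x, y):
    """Naive left-to-right floating-point dot product."""
    s = 0.0
    for a, b in zip(x, y):
        s += a * b
    return s

def exact_dot(x, y):
    """Exact dot product via rational arithmetic."""
    return sum(Fraction(a) * Fraction(b) for a, b in zip(x, y))

x = [1e16, 3.0, -1e16, 2.0]   # exact dot product is 5, but huge cancellation
y = [1.0, 1.0, 1.0, 1.0]

err = abs(Fraction(fl_dot(x, y)) - exact_dot(x, y))
bound = Fraction(1.01 * len(x) * u) * sum(abs(Fraction(a) * Fraction(b))
                                          for a, b in zip(x, y))
```

Here the absolute error stays within the lemma's bound, yet it exceeds u times the true value by many orders of magnitude, which is exactly the "much larger than u" conclusion above.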
Numerical Experiments: timing and improvements
[Setup garbled in extraction. Recoverable: the test problem is a 2D PDE in u(x, y) with coefficients a, b, c, d, e on (0,1) x (0,1) with u = 0 on the boundary. Experiment I: 3600 unknowns per CPU; Experiment II: fixed problem size 960. Variants tested: GPBiCG(1,0), GPBiCG(0,1), GPBiCG(1,1), GPBiCG(2,8), GPBiCG(8,2). Reported quantities: the timing ratio R_c = T_G/T_IG, the communication fraction comm/all, and the speedup (with a fit).]
Conclusions
- The PGPBiCG(m,l) method is more scalable and parallel for solving large sparse nonsymmetric linear systems on distributed parallel architectures.
- Performance and isoefficiency analysis and numerical experiments have been carried out for the PGPBiCG(m,l) and GPBiCG(m,l) methods.
- The parallel communication performance can be improved by a factor larger than 3.
- The PGPBiCG(m,l) method has better parallel speedup than the GPBiCG(m,l) method.
- For further performance improvements: overlap of computation with communication, numerical stability.