Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator
Kengo Nakajima ([email protected])
Supercomputing Division, Information Technology Center, The University of Tokyo
Japan Agency for Marine-Earth Science and Technology
April 22, 2008
Overview
• Background
  – Vector/Scalar Processors
  – GeoFEM Project
  – Earth Simulator
  – Preconditioned Iterative Linear Solvers
• Optimization Strategy on the Earth Simulator
  – BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
  – Matrix Assembling
• Summary & Future Works
Large-Scale Computing
Scalar Processor: big gap between clock rate and memory bandwidth; very low sustained/peak performance ratio (<10%).
Scalar Processors: CPU-Cache-Memory Hierarchical Structure
[Figure: memory hierarchy. Registers and cache sit close to the CPU: fast, small capacity (MB), expensive (100M+ transistors). Main memory is far from the CPU: slow, large capacity (GB), cheap.]
Large-Scale Computing
Scalar Processor: big gap between clock rate and memory bandwidth; very low sustained/peak performance ratio (<10%).
Vector Processor: very high sustained/peak performance ratio (e.g. 35% for FEM applications on the Earth Simulator), but this requires very special tuning and sufficiently long loops (= large-scale problem size); suitable for simple computation.
Vector Processors: Vector Register & Fast Memory
[Figure: vector processor with vector registers connected to very fast main memory.]
• Parallel processing of simple DO loops
• Suitable for simple & large computation

  do i= 1, N
    A(i)= B(i) + C(i)
  enddo
Large-Scale Computing
Scalar Processor: big gap between clock rate and memory bandwidth; very low sustained/peak performance ratio (<10%).
Vector Processor: very high sustained/peak performance ratio (e.g. 35% for FEM applications on the Earth Simulator), but this requires very special tuning and sufficiently long loops (= large-scale problem size); suitable for simple computation.
                                          Earth        Hitachi        Hitachi        IBM SP3
                                          Simulator    SR8000/MPP     SR11000/J2     (LBNL)
                                                       (U.Tokyo)      (U.Tokyo)
Peak Performance (GFLOPS)                 8.00         1.80           9.20           1.50
Measured Memory BW, STREAM (GB/sec/PE)    26.6         2.85           8.00           0.623
BYTE/FLOP                                 3.325        1.583          0.870          0.413
Estimated Performance/Core,
  GFLOPS (% of peak)                      2.31-3.24    .291-.347      .880-.973      .072-.076
                                          (28.8-40.5)  (16.1-19.3)    (9.6-10.6)     (4.8-5.1)
Measured Performance/Core,
  GFLOPS (% of peak)                      2.93 (36.6)  .335 (18.6)    1.34 (14.5)    .122 (8.1)
Comm. BW (GB/sec/Node)                    12.3         1.60           12.0           1.00
MPI Latency (μsec)                        5.6-7.7      6-20           4.7 *          16.3

* IBM p595, J.T. Carter et al.
Typical Behavior …
Earth Simulator: performance is good for large-scale problems due to long vector length.
IBM SP3: performance is good for small problems due to the cache effect.
[Figure: GFLOPS vs. DOF (problem size, 1.0E+04 to 1.0E+07); the IBM SP3 reaches about 8% of peak on small problems, the Earth Simulator about 40% of peak on large problems.]
Parallel Computing: Strong Scaling (Fixed Problem Size)
[Figure: performance vs. number of PEs, with ideal-scaling lines, for the Earth Simulator and the IBM SP3.]
Earth Simulator: performance decreases for many PEs due to communication overhead and short vector length.
IBM SP3: super-scalar effect for a small number of PEs; performance decreases for many PEs due to communication overhead.
Improvement of Memory Performance: IBM SP3 ⇒ Hitachi SR11000/J2
[Figure: GFLOPS vs. DOF for both systems (■ Flat MPI/DCRS, □ Hybrid/DCRS).
  IBM SP3 (POWER3): 375 MHz, 1.0 GB/sec memory BW, 8 MB L2 cache/PE.
  Hitachi SR11000/J2 (POWER5+): 2.3 GHz, 8.0 GB/sec memory BW, 18 MB L3 cache/PE.
  Memory performance (bandwidth, latency, etc.) is much improved.]
My History with Vector Computers
• Cray-1S (1985-1988): Mitsubishi Research Institute (MRI)
• Cray Y-MP (1988-1995): MRI, University of Texas at Austin
• Fujitsu VP, VPP series (1985-1999): JAERI, PNC
• NEC SX-4 (1997-2001): CCSE/JAERI
• Hitachi SR2201 (1997-2004): University of Tokyo, CCSE/JAERI
• Hitachi SR8000 (2000-2007): University of Tokyo
• Earth Simulator (2002-)
• Background
  – Vector/Scalar Processors
  – GeoFEM Project
  – Earth Simulator
  – Preconditioned Iterative Linear Solvers
• Optimization Strategy on the Earth Simulator
  – BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
  – Matrix Assembling
• Summary & Future Works
GeoFEM: FY.1998-2002 (http://geofem.tokyo.rist.or.jp/)
• Parallel FEM platform for solid earth simulation
  – parallel I/O, parallel linear solvers, parallel visualization
  – solid earth: earthquake, plate deformation, mantle/core convection, etc.
• Part of a national project by STA/MEXT for large-scale earth science simulations using the Earth Simulator.
• Strong collaboration between the HPC and natural science (solid earth) communities.
System Configuration of GeoFEM
[Diagram: GeoFEM platform with parallel I/O, equation solvers, and visualizer behind common solver/communication/visualization interfaces; utilities (partitioner: one-domain mesh → partitioned mesh; GPPView for visualization data); pluggable analysis modules: structural analysis (static linear, dynamic linear, contact), fluid, wave.]
Results on Solid Earth Simulation (Results by GeoFEM)
[Figures: magnetic field of the Earth (MHD code); complicated plate model around the Japan Islands; simulation of the earthquake generation cycle in southwestern Japan (TSUNAMI); transport by groundwater flow through heterogeneous porous media (h=5.00, h=1.25; T=100 to T=500).]
Features of FEM Applications (1/2)
• Local “element-by-element” operations
  – sparse coefficient matrices
  – suitable for parallel computing
• HUGE “indirect” accesses
  – IRREGULAR sparse matrices
  – memory intensive

  do i= 1, N
    jS= index(i-1)+1
    jE= index(i)
    do j= jS, jE
      in = item(j)
      Y(i)= Y(i) + AMAT(j)*X(in)
    enddo
  enddo
Features of FEM Applications (2/2)
• In parallel computation …
  – communication ONLY with neighbors (except “dot products” etc.)
  – the amount of data in each message is relatively small, because only values on domain boundaries are exchanged
  – communication (MPI) latency is critical
• Background
  – Vector/Scalar Processors
  – GeoFEM Project
  – Earth Simulator
  – Preconditioned Iterative Linear Solvers
• Optimization Strategy on the Earth Simulator
  – BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
  – Matrix Assembling
• Summary & Future Works
Earth Simulator (ES)  http://www.es.jamstec.go.jp/
• 640×8 = 5,120 vector processors
  – SMP cluster-type architecture
  – 8 GFLOPS/PE, 64 GFLOPS/node, 40 TFLOPS/system
• 16 GB memory/node, 10 TB total
• 640×640 crossbar network (16 GB/sec × 2)
• Memory bandwidth: 32 GB/sec (per PE)
• 35.6 TFLOPS for LINPACK (March 2002)
• 26 TFLOPS for AFES (climate simulation)
Motivations
• GeoFEM Project (FY.1998-2002)
• FEM-type applications with complicated unstructured grids (not LINPACK, FDM …) on the Earth Simulator (ES)
  – implicit linear solvers
  – Hybrid vs. Flat MPI parallel programming models
Flat MPI vs. Hybrid
[Diagram: Flat MPI — every PE runs an independent MPI process with its own view of memory. Hybrid — hierarchical structure: threads share memory within each SMP node, MPI is used between nodes.]
• Background
  – Vector/Scalar Processors
  – GeoFEM Project
  – Earth Simulator
  – Preconditioned Iterative Linear Solvers
• Optimization Strategy on the Earth Simulator
  – BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
  – Matrix Assembling
• Summary & Future Works
Direct/Iterative Methods for Linear Equations
• Direct Methods
  – Gaussian elimination / LU factorization: compute A⁻¹ directly
  – Robust for a wide range of applications
  – More expensive than iterative methods (memory, CPU)
  – Not suitable for parallel and vector computation due to their global operations
• Iterative Methods
  – CG, GMRES, BiCGSTAB
  – Less expensive than direct methods, especially in memory
  – Suitable for parallel and vector computing
  – Convergence strongly depends on the problem and boundary conditions (condition number etc.)
  – Preconditioning is required
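Why iterative methods suit parallel and vector computing: each CG iteration needs only a sparse matrix-vector product, dot products, and vector updates, all of which are long, local loops. A minimal, unpreconditioned CG sketch (illustrative only; the subroutine name and stopping test are assumptions, the CRS arrays follow the convention used above):

      ! Minimal unpreconditioned CG sketch for a sparse SPD matrix in CRS format
      ! (index(0:N), item(:), AMAT(:) as in the matrix-vector loop shown earlier).
      subroutine CG_sketch (N, NPL, index, item, AMAT, B, X, EPS, MAXIT)
      implicit REAL*8 (A-H,O-Z)
      integer index(0:N), item(NPL)
      real(kind=8) AMAT(NPL), B(N), X(N)
      real(kind=8) R(N), P(N), Q(N)

      X(1:N)= 0.d0
      R(1:N)= B(1:N)                          ! r0 = b - A*x0, with x0 = 0
      P(1:N)= R(1:N)
      rho= dot_product(R,R)
      do iter= 1, MAXIT
        do i= 1, N                            ! q = A*p (sparse mat-vec)
          W= 0.d0
          do j= index(i-1)+1, index(i)
            W= W + AMAT(j)*P(item(j))
          enddo
          Q(i)= W
        enddo
        alpha= rho / dot_product(P,Q)
        X(1:N)= X(1:N) + alpha*P(1:N)
        R(1:N)= R(1:N) - alpha*Q(1:N)
        rhon= dot_product(R,R)
        if (dsqrt(rhon).lt.EPS) return        ! converged
        P(1:N)= R(1:N) + (rhon/rho)*P(1:N)
        rho= rhon
      enddo
      return
      end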
Preconditioning for Iterative Methods
• The convergence rate of iterative solvers strongly depends on the spectral properties (eigenvalue distribution) of the coefficient matrix A.
  – In “ill-conditioned” problems the condition number (ratio of max/min eigenvalue if A is symmetric) is large.
• A preconditioner M transforms the linear system into one with more favorable spectral properties:
  – M transforms the original equation Ax=b into A'x=b', where A'=M⁻¹A and b'=M⁻¹b.
• ILU (Incomplete LU factorization) and IC (Incomplete Cholesky factorization) are well-known preconditioners.
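As the simplest concrete example of such an M (illustrative only; GeoFEM itself uses Block IC(0), described later), diagonal (Jacobi) scaling takes M = diag(A), so applying M⁻¹ to a residual r is just a division by the stored diagonal D(i):

      ! Jacobi (diagonal-scaling) preconditioning sketch: z = M^-1 r, M = diag(A)
      do i= 1, N
        Z(i)= R(i) / D(i)
      enddo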
Strategy in GeoFEM
• Iterative methods are the ONLY choice for large-scale parallel computing.
• Preconditioning is important:
  – general methods, such as ILU(0)/IC(0), cover a wide range of applications
  – problem-specific methods
• Background
  – Vector/Scalar Processors
  – GeoFEM Project
  – Earth Simulator
  – Preconditioned Iterative Linear Solvers
• Optimization Strategy on the Earth Simulator
  – BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
  – Matrix Assembling
• Summary & Future Works
Block IC(0)-CG Solver on the Earth Simulator
• 3D linear elastic problems (SPD)
• Parallel iterative linear solver
  – node-based local data structure
  – Conjugate Gradient method (CG): SPD
  – localized Block IC(0) preconditioning (Block Jacobi)
    • modified IC(0): off-diagonal components of the original [A] are kept
  – Additive Schwarz Domain Decomposition (ASDD)
  – http://geofem.tokyo.rist.or.jp/
• Hybrid parallel programming model
  – OpenMP + MPI
  – re-ordering for vector/parallel performance
  – comparison with Flat MPI
Flat MPI vs. Hybrid
[Diagram: Flat MPI — every PE runs an independent MPI process with its own view of memory. Hybrid — hierarchical structure: threads share memory within each SMP node, MPI is used between nodes.]
Local Data Structure: Node-based Partitioning
internal nodes – elements – external nodes
[Figure: a 5×5-node mesh partitioned into four domains (PE#0-PE#3). Each domain keeps its internal nodes, the elements that include them, and the external (halo) nodes of neighboring domains, with local node numbering per domain.]
1 SMP node ⇒ 1 domain for the Hybrid programming model
MPI communication among domains
[Diagram: four SMP nodes (Node-0 to Node-3), each with 8 PEs sharing one memory; each node forms one domain, and domains communicate via MPI.]
Basic Strategy for Parallel Programming on the Earth Simulator
• Hypothesis
  – Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP nodes and vector processors.
ILU(0)/IC(0) Factorization

  do i= 2, n
    do k= 1, i-1
      if ((i,k) ∈ NonZero(A)) then
        a(i,k) := a(i,k) / a(k,k)
      endif
      do j= k+1, n
        if ((i,j) ∈ NonZero(A)) then
          a(i,j) := a(i,j) - a(i,k)*a(k,j)
        endif
      enddo
    enddo
  enddo
ILU/IC Preconditioning

  M = (L+D) D⁻¹ (D+U)        (L, D, U taken from A)

We need to solve Mz = r:

  Forward substitution:   (L+D) z₁ = r         ⇒  z₁ = D⁻¹(r − L z₁)
  Backward substitution:  (I + D⁻¹U) z_new = z₁ ⇒  z  = z − D⁻¹U z
ILU/IC Preconditioning
M = (L+D)D⁻¹(D+U); L, D, U: from A

Forward substitution: (L+D)z = r : z = D⁻¹(r − Lz)

  do i= 1, N
    WVAL= R(i)
    do j= 1, INL(i)
      WVAL= WVAL - AL(i,j) * Z(IAL(i,j))
    enddo
    Z(i)= WVAL / D(i)
  enddo

Backward substitution: (I + D⁻¹U)z_new = z_old : z = z − D⁻¹Uz

  do i= N, 1, -1
    SW= 0.0d0
    do j= 1, INU(i)
      SW= SW + AU(i,j) * Z(IAU(i,j))
    enddo
    Z(i)= Z(i) - SW / D(i)
  enddo

Dependency: you need the most recent value of “z” at connected nodes, so vectorization/parallelization is difficult.
Basic Strategy for Parallel Programming on the Earth Simulator
• Hypothesis
  – Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP nodes and vector processors.
• Re-ordering for highly parallel/vector performance
  – local operation and no global dependency
  – continuous memory access
  – sufficiently long loops for vectorization
ILU/IC Preconditioning (same forward/backward substitution as above)
Dependency: you need the most recent value of “z” at connected nodes; vectorization/parallelization is difficult.
Reordering: after reordering, directly connected nodes do not appear on the RHS, so the substitution loops can be vectorized/parallelized.
Basic Strategy for Parallel Programming on the Earth Simulator
• Hypothesis
  – Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP nodes and vector processors.
• Re-ordering for highly parallel/vector performance
  – local operation and no global dependency
  – continuous memory access
  – sufficiently long loops for vectorization
• 3-way parallelism for the hybrid parallel programming model
  – inter-node: MPI
  – intra-node: OpenMP
  – individual PE: vectorization
Re-Ordering Technique for Vector/Parallel Architectures
Cyclic DJDS (RCM+CMC) re-ordering (Doi, Washio, Osoda and Maruyama (NEC)), for SMP-parallel and vector performance:
1. RCM (Reverse Cuthill-McKee)
2. CMC (Cyclic Multicolor)
3. DJDS re-ordering
4. Cyclic DJDS for SMP units
These re-ordering steps can be substituted by traditional multi-coloring (MC).
Reordering = Coloring
• COLOR: unit of independent sets.
• Elements grouped in the same “color” are independent of each other, so parallel/vector operation is possible.
• More colors give faster convergence, but shorter vector length: a trade-off!
[Figure: Red-Black (2 colors), 4-color, and RCM (Reverse Cuthill-McKee) orderings of a small mesh.]
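A minimal greedy multicoloring sketch (illustrative; not GeoFEM's actual MC/CM-RCM implementation): each node gets the smallest color not used by any already-colored neighbor, so nodes of one color form an independent set.

      ! Greedy multicoloring sketch: COLOR(i) = smallest color not used by any
      ! already-colored neighbor of node i. Graph in CRS form (index(0:N), item(:)).
      subroutine multicolor_sketch (N, NPL, index, item, COLOR, NCOLORS)
      integer N, NPL, index(0:N), item(NPL), COLOR(N), NCOLORS
      logical USED(N)

      COLOR(1:N)= 0
      NCOLORS   = 0
      do i= 1, N
        USED(1:N)= .false.
        do j= index(i-1)+1, index(i)          ! mark colors of colored neighbors
          k= item(j)
          if (COLOR(k).gt.0) USED(COLOR(k))= .true.
        enddo
        ic= 1
        do while (USED(ic))                   ! first unused color
          ic= ic + 1
        enddo
        COLOR(i)= ic
        NCOLORS = max(NCOLORS, ic)
      enddo
      return
      end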
Large-Scale Sparse Matrix Storage for Unstructured Grids
• 1D storage (CRS): memory saved, but short vector length
• 2D storage: long vector length, but many ZEROs
Re-Ordering within Each Color according to the Number of Non-Zero Off-Diagonal Components
Elements in the same color are independent, so intra-hyperplane re-ordering does not affect the results.
DJDS: Descending-order Jagged Diagonal Storage
Cyclic DJDS (MC/CM-RCM): Cyclic Re-Ordering for SMP Units
Load balancing among PEs

  do iv= 1, NCOLORS
  !$omp parallel do
    do ip= 1, PEsmpTOT
      iv0= STACKmc(PEsmpTOT*(iv-1)+ip-1)
      do j= 1, NLhyp(iv)
        iS= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip-1)
        iE= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip  )
  !cdir nodep
        do i= iv0+1, iv0+iE-iS
          k = i+iS - iv0
          kk= IAL(k)
          (important computations)
        enddo
      enddo
    enddo
  enddo

  npLX1= NLmax * PEsmpTOT
  INL(0:NLmax*PEsmpTOT*NCOLORS)
Difference between Flat MPI & Hybrid
• Most of the re-ordering effort is for vectorization.
• If you have a long vector, just divide it and distribute the segments to the PEs of an SMP node.
• The source codes for Hybrid and Flat MPI are not so different.
  – Flat MPI corresponds to Hybrid with one PE per SMP node.
  – In other words, the Flat MPI code is already sufficiently complicated.
Cyclic DJDS (MC/CM-RCM) for Forward/Backward Substitution in BILU Factorization

  do iv= 1, NCOLORS
  !$omp parallel do private (iv0,j,iS,iE, ... )     ! SMP parallel
    do ip= 1, PEsmpTOT
      iv0= STACKmc(PEsmpTOT*(iv-1)+ip-1)
      do j= 1, NLhyp(iv)
        iS= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip-1)
        iE= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip  )
  !CDIR NODEP
        do i= iv0+1, iv0+iE-iS                      ! vectorized
          k = i+iS - iv0
          kk= IAL(k)
          X(i)= X(i) - A(k)*X(kk)*DINV(i)           ! etc.
        enddo
      enddo
    enddo
  enddo
Simple 3D Cubic Model
[Figure: (Nx-1)×(Ny-1)×(Nz-1) element cube. Boundary conditions: Uz=0 at z=Zmin, Ux=0 at x=Xmin, Uy=0 at y=Ymin; uniformly distributed force in the z-direction at z=Zmin.]
Effect of Ordering
Effect of Re-Ordering
• PDJDS/CM-RCM: long loops, continuous access
• PDCRS/CM-RCM: short innermost loops, continuous access
• CRS, no re-ordering: short loops, irregular access
Matrix Storage and Loops
• DJDS (Descending-order Jagged Diagonal Storage), with long innermost loops, is suitable for vector processors.
• The reduction-type loop of DCRS is more suitable for cache-based scalar processors because of its localized operations.

DCRS (forward substitution):

  do i= 1, N
    SW= WW(i,Z)
    isL= index_L(i-1)+1
    ieL= index_L(i)
    do j= isL, ieL
      k = item_L(j)
      SW= SW - AL(j)*Z(k)
    enddo
    Z(i)= SW/DD(i)
  enddo

DJDS (forward substitution):

  do iv= 1, NVECT
    iv0= STACKmc(iv-1)
    do j= 1, NLhyp(iv)
      iS= index_L(NL*(iv-1)+ j-1)
      iE= index_L(NL*(iv-1)+ j  )
      do i= iv0+1, iv0+iE-iS
        k = i+iS - iv0
        kk= item_L(k)
        Z(i)= Z(i) - AL(k)*Z(kk)
      enddo
    enddo
    iS= STACKmc(iv-1) + 1
    iE= STACKmc(iv  )
    do i= iS, iE
      Z(i)= Z(i)/DD(i)
    enddo
  enddo
Effect of Re-Ordering: Results on 1 SMP Node
Color #: 99 (fixed). Re-ordering is REALLY required!
[Figure: GFLOPS vs. DOF (1.E+04 to 1.E+07) for ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loops), ▲ CRS without re-ordering. Vector length gives roughly a ×10 improvement, re-ordering roughly ×100; the best case reaches 22 GFLOPS, 34% of peak (ideal performance for a single CPU is 40-45%).]
Effect of Re-Ordering: Results on 1 SMP Node
Color #: 99 (fixed). Re-ordering is REALLY required!
[Figure: GFLOPS vs. DOF for ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loops), ▲ CRS without re-ordering.]
80×80×80 case (1.5M DOF):
● 212 iterations, 11.2 sec.   ■ 212 iterations, 143.6 sec.   ▲ 203 iterations, 674.2 sec.
3D Elastic Simulation: Problem Size vs. GFLOPS
Earth Simulator, 1 SMP node (8 PEs)
[Figure: GFLOPS vs. DOF for ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loops), ▲ CRS without re-ordering.]
• Flat MPI: 23.4 GFLOPS, 36.6% of peak
• Hybrid (OpenMP): 21.9 GFLOPS, 34.3% of peak
Flat MPI is better: intra-node MPI performance is good.
                                          Earth        Hitachi        Hitachi        IBM SP3
                                          Simulator    SR8000/MPP     SR11000/J2     (LBNL)
                                                       (U.Tokyo)      (U.Tokyo)
Peak Performance (GFLOPS)                 8.00         1.80           9.20           1.50
Measured Memory BW, STREAM (GB/sec/PE)    26.6         2.85           8.00           0.623
BYTE/FLOP                                 3.325        1.583          0.870          0.413
Estimated Performance/Core,
  GFLOPS (% of peak)                      2.31-3.24    .291-.347      .880-.973      .072-.076
                                          (28.8-40.5)  (16.1-19.3)    (9.6-10.6)     (4.8-5.1)
Measured Performance/Core,
  GFLOPS (% of peak)                      2.93 (36.6)  .335 (18.6)    1.34 (14.5)    .122 (8.1)
Comm. BW (GB/sec/Node)                    12.3         1.60           12.0           1.00
MPI Latency (μsec)                        5.6-7.7      6-20           4.7 *          16.3

* IBM p595, J.T. Carter et al.
3D Elastic Simulation: Problem Size vs. GFLOPS
Hitachi SR8000/MPP with pseudo-vectorization, 1 SMP node (8 PEs)
[Figure: GFLOPS vs. DOF for ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loops), ▲ CRS without re-ordering.]
• Flat MPI: 2.17 GFLOPS, 15.0% of peak
• Hybrid (OpenMP): 2.68 GFLOPS, 18.6% of peak
Hybrid is better: intra-node MPI performance is low.
3D Elastic Simulation: Problem Size vs. GFLOPS
IBM SP3 (NERSC), 1 SMP node (8 PEs)
[Figure: GFLOPS vs. DOF, Flat MPI and Hybrid (OpenMP), for ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loops), ▲ CRS without re-ordering.]
Cache is well utilized in Flat MPI.
3D Elastic Simulation: Problem Size vs. GFLOPS
Hitachi SR11000/J2 (U.Tokyo), 1 SMP node (8 PEs)
[Figure: GFLOPS vs. DOF, Flat MPI and Hybrid (OpenMP), for ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loops), ▲ CRS without re-ordering.]
Number of SMP nodes: > 10, up to 176 nodes (1,408 PEs)
The problem size per SMP node is fixed (weak scaling). PDJDS/CM-RCM, Color #: 99.
3D Elastic Model (Large Case)
256×128×128 per SMP node, up to 2,214,592,512 DOF
●: Flat MPI, ○: Hybrid
[Figures: GFLOPS rate and parallel work ratio (%) vs. number of nodes (up to 192).]
3.8 TFLOPS for 2.2G DOF on 176 nodes (33.8% of peak).
3D Elastic Model (Small Case)
64×64×64 per SMP node, up to 125,829,120 DOF
●: Flat MPI, ○: Hybrid
[Figures: GFLOPS rate and parallel work ratio (%) vs. number of nodes (up to 192).]
Hybrid outperforms Flat MPI
• when …
  – the number of SMP nodes (PEs) is large
  – the problem size per node is small
• because Flat MPI has …
  – 8 times as many communicating processes
  – twice the communication/computation ratio
• The effect of communication becomes significant when the number of SMP nodes (or PEs) is large.
• Performance estimation by D. Kerbyson (LANL), LA-UR-02-5222: relatively large communication latency of the ES.
Flat MPI and Hybrid
(N = number of FEM nodes in one direction of the cube geometry assigned to each PE)

                                               Flat MPI     Hybrid
Problem size per MPI process                   3N³          3·8N³
Message size per surface shared with each
  neighboring domain                           3N²          3·4N²
Communication/computation ratio                1/N          1/(2N)

[Figure: an N×N×N sub-cube per PE for Flat MPI vs. an 8-PE SMP-node domain for Hybrid.]
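Illustrative example (numbers derived from the table above, not from the talk): for N = 64, Flat MPI handles 3·64³ ≈ 0.79M DOF per MPI process with a communication/computation ratio of about 1/64, while Hybrid on an 8-PE SMP node handles 3·8·64³ ≈ 6.3M DOF per process with a ratio of about 1/128, i.e. half the relative communication cost.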
                                          Earth        Hitachi        Hitachi        IBM SP3
                                          Simulator    SR8000/MPP     SR11000/J2     (LBNL)
                                                       (U.Tokyo)      (U.Tokyo)
Peak Performance (GFLOPS)                 8.00         1.80           9.20           1.50
Measured Memory BW, STREAM (GB/sec/PE)    26.6         2.85           8.00           0.623
BYTE/FLOP                                 3.325        1.583          0.870          0.413
Estimated Performance/Core,
  GFLOPS (% of peak)                      2.31-3.24    .291-.347      .880-.973      .072-.076
                                          (28.8-40.5)  (16.1-19.3)    (9.6-10.6)     (4.8-5.1)
Measured Performance/Core,
  GFLOPS (% of peak)                      2.93 (36.6)  .335 (18.6)    1.34 (14.5)    .122 (8.1)
Comm. BW (GB/sec/Node)                    12.3         1.60           12.0           1.00
MPI Latency (μsec)                        5.6-7.7      6-20           4.7 *          16.3

* IBM p595, J.T. Carter et al.
Why Communication Overhead?
• latency of the network
• finite bandwidth of the network
• synchronization at SEND/RECV, ALLREDUCE etc.
• memory performance in boundary communications (memory copy)
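A rough cost model behind these points (an illustrative sketch, not a formula from the talk, all names hypothetical): the time for one boundary exchange can be estimated as latency plus message size over network bandwidth, plus the memory-copy time for packing/unpacking the send/receive buffers.

      ! Illustrative cost model for one boundary exchange:
      !   t = latency + nbytes/net_bw + 2*nbytes/mem_bw
      ! latency [sec], net_bw and mem_bw [bytes/sec], nbytes [bytes];
      ! the factor 2 covers copying into the send buffer and out of the
      ! receive buffer.
      real(kind=8) function comm_time (nbytes, latency, net_bw, mem_bw)
      real(kind=8), intent(in) :: nbytes, latency, net_bw, mem_bw
      comm_time= latency + nbytes/net_bw + 2.d0*nbytes/mem_bw
      end function comm_time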
Domain-to-Domain Communication: Exchange Boundary Information (SEND/RECV)

      subroutine SOLVER_SEND_RECV                                     &
     &          (N, NEIBPETOT, NEIBPE,                                &
     &           IMPORT_INDEX, IMPORT_NODE,                           &
     &           EXPORT_INDEX, EXPORT_NODE,                           &
     &           WS, WR, X, SOLVER_COMM, my_rank)
      implicit REAL*8 (A-H,O-Z)
      include  'mpif.h'
      parameter (KREAL= 8)
      integer IMPORT_INDEX(0:NEIBPETOT), IMPORT_NODE(N)
      integer EXPORT_INDEX(0:NEIBPETOT), EXPORT_NODE(N)
      integer SOLVER_COMM, my_rank
      integer NEIBPE(NEIBPETOT)
      integer req1(NEIBPETOT), req2(NEIBPETOT)
      integer sta1(MPI_STATUS_SIZE, NEIBPETOT)
      integer sta2(MPI_STATUS_SIZE, NEIBPETOT)
      real(kind=KREAL) X(N), WS(N), WR(N)

!C-- SEND
      do neib= 1, NEIBPETOT
        istart= EXPORT_INDEX(neib-1)
        inum  = EXPORT_INDEX(neib  ) - istart
        do k= istart+1, istart+inum
          WS(k)= X(EXPORT_NODE(k))
        enddo
        call MPI_ISEND (WS(istart+1), inum, MPI_DOUBLE_PRECISION,     &
     &                  NEIBPE(neib), 0, SOLVER_COMM,                 &
     &                  req1(neib), ierr)
      enddo

!C-- RECEIVE
      do neib= 1, NEIBPETOT
        istart= IMPORT_INDEX(neib-1)
        inum  = IMPORT_INDEX(neib  ) - istart
        call MPI_IRECV (WR(istart+1), inum, MPI_DOUBLE_PRECISION,     &
     &                  NEIBPE(neib), 0, SOLVER_COMM,                 &
     &                  req2(neib), ierr)
      enddo

      call MPI_WAITALL (NEIBPETOT, req2, sta2, ierr)

      do neib= 1, NEIBPETOT
        istart= IMPORT_INDEX(neib-1)
        inum  = IMPORT_INDEX(neib  ) - istart
        do k= istart+1, istart+inum
          X(IMPORT_NODE(k))= WR(k)
        enddo
      enddo

      call MPI_WAITALL (NEIBPETOT, req1, sta1, ierr)

      return
      end
Communication Overhead
• Components: memory copy, communication bandwidth, communication latency.
  – The memory-copy and bandwidth terms depend on the message size.
• Earth Simulator: communication latency is the dominant component.
• Hitachi SR11000, IBM SP3, etc.: memory copy and communication bandwidth are the dominant components.
• Communication overhead also acts as synchronization overhead (memory copy, communication latency, communication bandwidth).
Communication Overhead, Weak Scaling: Earth Simulator
[Figure: communication overhead (sec.) vs. PE# (10 to 10,000); ●○ 3×50³ DOF/PE, ▲△ 3×32³ DOF/PE; ●▲ Flat MPI, ○△ Hybrid.]
The effect of message size is small; the effect of latency is large. Memory copy is very fast.
Communication Overhead, Weak Scaling: IBM SP3
[Figure: communication overhead (sec.) vs. PE# (10 to 10,000); ●○ 3×50³ DOF/PE, ▲△ 3×32³ DOF/PE; ●▲ Flat MPI, ○△ Hybrid.]
The effect of message size is more significant.
Communication Overhead, Weak Scaling: Hitachi SR11000/J2 (8 cores/node)
[Figure: communication overhead (sec.) vs. number of cores (10 to 1,000); ●○ 3×50³ DOF/PE, ▲△ 3×32³ DOF/PE; ●▲ Flat MPI, ○△ Hybrid.]
Summary
• Hybrid parallel programming model on an SMP cluster architecture with vector processors, for unstructured grids.
• Nice parallel performance for both inter- and intra-SMP-node parallelism on the ES: 3.8 TFLOPS for 2.2G DOF on 176 nodes (33.8% of peak) in a 3D linear-elastic problem using the BIC(0)-CG method.
  – N. Kushida (a student of Prof. Okuda) attained >10 TFLOPS using 512 nodes for a >3G DOF problem.
• Re-ordering is really required.
Summary (cont.)
• Hybrid vs. Flat MPI
  – Flat MPI is better for a small number of SMP nodes.
  – Hybrid is better for a large number of SMP nodes, especially when the problem size is rather small.
  – Flat MPI is bound by communication, Hybrid by memory.
  – The outcome depends on the application, problem size, etc.
  – Hybrid is much more sensitive to the number of colors than Flat MPI, due to the synchronization overhead of OpenMP.
  – In matrix-vector operations, the difference is not so significant.
• Background
  – Vector/Scalar Processors
  – GeoFEM Project
  – Earth Simulator
  – Preconditioned Iterative Linear Solvers
• Optimization Strategy on the Earth Simulator
  – BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
  – Matrix Assembling
• Summary & Future Works
“CUBE” Benchmark
• 3D linear elastic applications on cubes for a wide range of problem sizes.
• Hardware (single CPU):
  – Earth Simulator
  – AMD Opteron (1.8 GHz)
[Figure: (Nx-1)×(Ny-1)×(Nz-1) element cube with Uz=0 at z=Zmin, Ux=0 at x=Xmin, Uy=0 at y=Ymin, and a uniformly distributed force in the z-direction at z=Zmin.]
Time for 3×64³ = 786,432 DOF: sec. (MFLOPS)

              ES, 8.0 GFLOPS peak              Opteron 1.8 GHz, 3.6 GFLOPS peak
              DJDS original    DCRS            DJDS original    DCRS
Matrix        34.2 (240)       28.6 (291)      12.4 (663)       10.2 (818)
Solver        21.7 (3246)      360  (171)      271  (260)       225  (275)
Total         55.9             389             283              235
Matrix + Solver
[Figure: computation time (sec.) vs. DOF (41,472 to 786,432), split into Matrix and Solver parts, for DJDS (original) on the ES and DJDS (original) on the Opteron.]
Computation Time vs. Problem Size
[Figure: total time (sec.) vs. DOF (41,472 to 786,432), split into Matrix and Solver parts, for ES (DJDS original), Opteron (DJDS original), and Opteron (DCRS).]
Matrix assembling/formation is rather expensive
• This part should also be optimized for vector processors.
• For example, in nonlinear simulations such as elasto-plastic solid simulations or fully coupled Navier-Stokes flow simulations, the matrices must be updated at every nonlinear iteration.
• This part strongly depends on the application/physics, so it is very difficult to develop general libraries for it, such as those for iterative linear solvers.
  – It also includes complicated processes that are difficult to vectorize.
Typical Procedure for Calculating the Coefficient Matrix in FEM
• Apply Galerkin's method on each element.
• Integrate over each element to get the element matrix.
• Element matrices are accumulated at each node to obtain the global matrix ⇒ global linear equations.
• Matrix assembling/formation is an embarrassingly parallel procedure due to its element-by-element nature.
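A minimal sketch of the accumulation step (illustrative only; emat and nodes are hypothetical names, while D, AU/AL, index_U/index_L, item_U/item_L follow the CRS conventions used elsewhere in these slides): for every pair of local nodes of an element, the corresponding global-matrix entry is located by searching that row and then incremented.

      ! Accumulate one 4-node element's matrix emat(4,4) into the global matrix.
      ! nodes(1:4): global node IDs of the element; D: diagonal entries;
      ! AU/AL: upper/lower CRS parts. Note the search and the if-then-else.
      do ie= 1, 4
        do je= 1, 4
          ii= nodes(ie)
          jj= nodes(je)
          if (ii.eq.jj) then
            D(ii)= D(ii) + emat(ie,je)
          else if (jj.gt.ii) then
            do k= index_U(ii-1)+1, index_U(ii)       ! search row ii
              if (item_U(k).eq.jj) then
                AU(k)= AU(k) + emat(ie,je)
                exit
              endif
            enddo
          else
            do k= index_L(ii-1)+1, index_L(ii)
              if (item_L(k).eq.jj) then
                AL(k)= AL(k) + emat(ie,je)
                exit
              endif
            enddo
          endif
        enddo
      enddo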
Element-by-Element Operations
• Integration over each element ⇒ element matrix
• Element matrices are accumulated at each node ⇒ global matrix
• Linear equations for each node
[Figure: a structured mesh of 24 elements.]
Element-by-Element Operations (cont.)
[Figure: the same mesh with its 35 nodes numbered; element matrices from the 24 elements are accumulated at these nodes.]
Element-by-Element Operations (cont.)
[Figure: non-zero (X) pattern of the matrix.]
Element-by-Element Operations (cont.)
Element matrix for each element:

  [ E11 E12 E13 E14 ]
  [ E21 E22 E23 E24 ]
  [ E31 E32 E33 E34 ]
  [ E41 E42 E43 E44 ]
Element-by-Element Operations (cont.)
[Figure: the mesh's 35 nodes; the accumulated element contributions at each node yield one (block) row of the global linear equations.]
Element-by-Element Operations (cont.)
Accumulating all element matrices gives the global linear equations:

  [ a(1,1)  a(1,2)   ...                                ] [ u1  ]   [ f1  ]
  [ a(2,1)  a(2,2)   ...                                ] [ u2  ]   [ f2  ]
  [           ...   a(3,3)   ...                        ] [ u3  ] = [ f3  ]
  [                    ...  a(33,33)   ...              ] [ ... ]   [ ... ]
  [                          ...  a(34,34)  a(34,35)    ] [ u34 ]   [ f34 ]
  [                               a(35,34)  a(35,35)    ] [ u35 ]   [ f35 ]
Element-by-Element Operations (cont.)
• When you calculate a(23,16) and a(16,23), you have to consider the contributions of both the 13th and the 14th elements (the two elements sharing nodes 16 and 23).
[Figure: the two neighboring elements, 13 and 14, and the corresponding entries of the global matrix.]
Current Approach
[Figure: the two neighboring elements, 13 and 14.]

  do icel= 1, ICELTOT
    do ie= 1, 4
      do je= 1, 4
        - assemble element matrix
        - accumulate element matrix into global matrix
      enddo
    enddo
  enddo
Current Approach (cont.)
[Figure: local node IDs ①-②-③-④ for each bi-linear 4-node element; ie and je in the loops above run over these local node IDs.]
Current Approach (cont.)
• Nice for cache reuse because of its localized operations.
• Not suitable for vector processors:
  – a(16,23) and a(23,16) might not be calculated properly: if the element loop is vectorized, elements sharing a node may update the same entry simultaneously.
  – short innermost loops
  – many “if-then-else”s
(See the timing table for 3×64³ = 786,432 DOF above: on the ES the DJDS matrix part runs at only 240 MFLOPS.)
Inside the loop: integration at the Gaussian quadrature points

  do jpn= 1, 2
    do ipn= 1, 2
      coef= dabs(DETJ(ipn,jpn))*WEI(ipn)*WEI(jpn)
      PNXi= PNX(ipn,jpn,ie)
      PNYi= PNY(ipn,jpn,ie)
      PNXj= PNX(ipn,jpn,je)
      PNYj= PNY(ipn,jpn,je)
      a11= a11 + (valX*PNXi*PNXj + valB*PNYi*PNYj)*coef
      a22= a22 + (valX*PNYi*PNYj + valB*PNXi*PNXj)*coef
      a12= a12 + (valA*PNXi*PNYj + valB*PNXj*PNYi)*coef
      a21= a21 + (valA*PNYi*PNXj + valB*PNYj*PNXi)*coef
    enddo
  enddo
Remedy
• a(16,23) and a(23,16) might not be calculated properly.
  – Color the elements: elements that do not share any nodes are put in the same color.
[Figure: the two neighboring elements, 13 and 14, and the element assembly loop shown before.]
Coloring of Elements
[Figure: the 35-node mesh before element coloring.]
Coloring of Elements (cont.)
Elements sharing the 16th node are assigned to different colors.
[Figure: the mesh with its elements colored so that no two elements sharing a node have the same color.]
Remedy (cont.)
• a(16,23) and a(23,16) might not be calculated properly.
  – Color the elements: elements that do not share any nodes are put in the same color.
• Short innermost loops
  – loop exchange
Remedy (cont.)
• a(16,23) and a(23,16) might not be calculated properly.
  – Color the elements: elements that do not share any nodes are put in the same color.
• Short innermost loops
  – loop exchange
• Many “if-then-else”s
  – define an ELEMENT-to-MATRIX array
Define the ELEMENT-to-MATRIX Array
ELEMmat(icel, ie, je): for element icel and local node pair (ie, je), the address of the corresponding entry of the global coefficient matrix.
[Figure: elements 13 and 14, with local node IDs ①-②-③-④, both contributing to the (16,23) and (23,16) entries.]

  if kkU= index_U(16-1+k) and item_U(kkU)= 23 then
    ELEMmat(13,2,3)= +kkU
    ELEMmat(14,1,4)= +kkU
  endif

  if kkL= index_L(23-1+k) and item_L(kkL)= 16 then
    ELEMmat(13,3,2)= -kkL
    ELEMmat(14,4,1)= -kkL
  endif
Define the ELEMENT-to-MATRIX Array (cont.)
[Figure and pseudocode as on the previous slide.]
“ELEMmat” specifies the relationship between the node pairs of each element and the addresses of the global coefficient matrix; in the pseudocode above, positive values point into the upper-triangular part and negative values into the lower-triangular part.
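A sketch of how such an array might be used inside the accumulation loop of the optimized procedure below (illustrative only, based on the sign convention of the pseudocode above; emat is a hypothetical per-element matrix array, and the fragment sits inside the color loop over icol):

      ! Accumulate element matrices using the precomputed ELEMmat addresses;
      ! no index search remains, and only a simple sign test is left in the
      ! innermost loop, which vectorizes under NODEP.
      do ie= 1, 4
        do je= 1, 4
!CDIR NODEP
          do ic0= index_COL(icol-1)+1, index_COL(icol)
            icel= item_COL(ic0)
            k   = ELEMmat(icel,ie,je)
            if (k.gt.0) AU( k)= AU( k) + emat(icel,ie,je)
            if (k.lt.0) AL(-k)= AL(-k) + emat(icel,ie,je)
          enddo
        enddo
      enddo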
08-APR22 104
Optimized Procedure

  do icol= 1, NCOLOR_E_tot
    do ie= 1, 4
      do je= 1, 4
        do ic0= index_COL(icol-1)+1, index_COL(icol)
          icel= item_COL(ic0)
          - define the “ELEMmat” array
        enddo
      enddo
    enddo
  enddo

  do icol= 1, NCOLOR_E_tot
    do ie= 1, 4
      do je= 1, 4
        do ic0= index_COL(icol-1)+1, index_COL(icol)
          icel= item_COL(ic0)
          - assemble element matrix
        enddo
      enddo
    enddo
    do ie= 1, 4
      do je= 1, 4
        do ic0= index_COL(icol-1)+1, index_COL(icol)
          icel= item_COL(ic0)
          - accumulate element matrix into global matrix
        enddo
      enddo
    enddo
  enddo

Extra storage for:
• the ELEMmat array
• element-matrix components for the elements in each color
• < 10% increase
Extra computation for ELEMmat.
Optimized Procedure (cont.)
• PART I: “integer” operations to define “ELEMmat”. In nonlinear cases this part is done just once (before the initial iteration), as long as the mesh connectivity does not change.
• PART II: “floating-point” operations for matrix assembling/accumulation. In nonlinear cases this part is repeated at every nonlinear iteration.
Time for 3×64³ = 786,432 DOF: sec. (MFLOPS)

              ES, 8.0 GFLOPS peak                              Opteron 1.8 GHz, 3.6 GFLOPS peak
              DJDS original   DJDS improved   DCRS             DJDS original   DJDS improved   DCRS
Matrix        34.2 (240)      12.5 (643)      28.6 (291)       12.4 (663)      21.2 (381)      10.2 (818)
Solver        21.7 (3246)     21.7 (3246)     360  (171)       271  (260)      271  (260)      225  (275)
Total         55.9            34.2            389              283             292             235
Time for 3×64³ = 786,432 DOF (same table as above)
On the Opteron, the improved DJDS matrix part (21.2 sec.) is slower than the original (12.4 sec.) because of the long innermost loops: data locality has been lost.
Matrix + Solver
[Figure: computation time (sec.) vs. DOF, split into Matrix and Solver parts, for DJDS original and DJDS improved, on the ES and on the Opteron.]
Computation Time vs. Problem Size
[Figure: total time (sec.) vs. DOF, split into Matrix and Solver parts, for ES (DJDS improved), Opteron (DJDS improved), and Opteron (DCRS).]
“Matrix” computation time for the improved DJDS version
[Figure: time (sec.) vs. DOF, split into the “integer” (ELEMmat) part and the “floating-point” part, on the ES and on the Opteron.]
Optimization of “Matrix” assembling/formation on the ES
• The DJDS matrix part has been much improved compared to the original, but it is still slower than the DCRS version on the Opteron.
• The “integer” (ELEMmat) part is slower on the ES, but the “floating-point” part is much faster than on the Opteron.
• In nonlinear simulations the “integer” part is executed only once (just before the initial iteration), so its cost is amortized over the repeated, fast floating-point part; therefore the ES outperforms the Opteron if the number of nonlinear iterations is more than 2.
Suppose a “virtual” mode where …
• the “integer” operation part runs on a scalar processor, and
• the “floating-point” operation part and the linear solvers run on the vector processor.
• The scalar performance of the ES (500 MHz) is lower than that of a Pentium III.
Time for 3×64³ = 786,432 DOF: sec. (MFLOPS)

              ES, 8.0 GFLOPS peak                            Opteron 1.8 GHz, 3.6 GFLOPS peak
              DJDS virtual   DJDS improved   DCRS            DJDS improved   DCRS
Matrix        1.88 (4431)    12.5 (643)      28.6 (291)      21.2 (381)      10.2 (818)
Solver        21.7 (3246)    21.7 (3246)     360  (171)      271  (260)      225  (275)
Total         23.6           34.2            389             292             235
Summary: Vectorization of FEM Applications
• NOT so easy.
• FEM's good feature of local operations is not necessarily suitable for vector processors.
  – Preconditioned iterative solvers can be vectorized rather more easily, because their target is the “global” matrix.
• Sometimes a major revision of the original code is required.
  – Usually more memory, more lines of code, additional operations …
• Performance of code optimized for vector processors is not necessarily good on scalar processors (e.g. matrix assembling in FEM).