Integrating Trilinos Solvers to SEAM code
Dagoberto A.R. Justo – UNM
Tim Warburton – UNM
Bill Spotz – Sandia
SEAM (NCAR)
Spectral Element Atmospheric Method
Trilinos (Sandia Lab)
AztecOO, Epetra, Nox, Ifpack, PETSc, Komplex
AztecOO
Solvers
– CG, CGS, BICGStab, GMRES, Tfqmr
Preconditioners
– Diagonal Jacobi, Least Square, Neumann, Domain Decomposition, Symmetric Gauss-Seidel
Matrix Free implementation
C++ (Fortran interface)
MPI
Implementation
[Diagram: routines linking SEAM and Aztec, with the language of each box]
SEAM code (F90): calls Pcg_solver
Pcg_solver (F90): calls Aztec_solvers( )
Sub Aztec_solvers (C): calls AZ_Iterate( )
AZTEC
Matrix_vector_C (C)
Matrix_vector (F90)
Prec_Jacobi_C (C)
Prec_Jacobi (F90)
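The idea behind the diagram, a solver that never sees the matrix and only calls back into a matrix-vector routine and a preconditioner routine, can be sketched in a few lines. This is a minimal illustration in plain Python, not the SEAM or Aztec code: the toy operator, the function names pcg and make_toy_operator, and the tolerances are all assumptions made for the example. The two callbacks only play the roles that Matrix_vector and Prec_Jacobi play in the diagram.

```python
def dot(u, v):
    """Inner product; in a parallel run this is where a global reduction would occur."""
    return sum(ui * vi for ui, vi in zip(u, v))


def make_toy_operator(n):
    """Hypothetical SPD operator: tridiagonal, diagonal 2+i, off-diagonals -1."""
    diag = [2.0 + i for i in range(n)]

    def matrix_vector(v):
        # Apply A to v without ever storing A (matrix-free).
        out = []
        for i in range(n):
            s = diag[i] * v[i]
            if i > 0:
                s -= v[i - 1]
            if i < n - 1:
                s -= v[i + 1]
            out.append(s)
        return out

    def prec_jacobi(r):
        # Diagonal (Jacobi) preconditioner: divide by the diagonal of A.
        return [ri / di for ri, di in zip(r, diag)]

    return matrix_vector, prec_jacobi


def pcg(matvec, precond, b, tol=1e-10, max_iter=200):
    """Preconditioned CG for A x = b; A is reachable only through matvec."""
    x = [0.0] * len(b)
    r = list(b)                          # residual for the zero initial guess
    b_norm = dot(b, b) ** 0.5 or 1.0
    if dot(r, r) ** 0.5 <= tol * b_norm:
        return x, 0
    z = precond(r)
    p = list(z)
    rz = dot(r, z)
    for it in range(1, max_iter + 1):
        ap = matvec(p)                   # callback: matrix times vector
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, it
        z = precond(r)                   # callback: preconditioner
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter
```

Calling pcg(*make_toy_operator(20), b) with any right-hand side b of length 20 returns the solution and the iteration count. Swapping in a different solver changes only pcg, while the two callbacks stay untouched, which is the separation the diagram suggests.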
Machines used
Pentium III Notebook (serial)
– Linux, LAM-MPI, Intel Compilers
Los Lobos at HPC@UNM
– Linux Cluster
– 256 nodes
– IBM Pentium III 750 MHz, 256 KB L2 Cache, 1 GB RAM
– Portland Group compiler
– MPICH for Myrinet interconnections
Graphical Results from SEAM
[Figure: plots of Energy and Mass]
Memory (in Mbytes per processor)
[Bar chart: vertical axis 0 to 30 Mbytes; groups p=2, p=4, p=8, p=16; series SEAM 6x6x6, SEAM+Aztec 6x6x6, SEAM 12x12x6, SEAM+Aztec 12x12x6]
Speed Up
From 1 to 160 processors.
Time of Simulation
– 144 time iterations x 300 s = 12 h simulation
Verify results using mass, energy, …
– (Different result for 1 proc)
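The arithmetic on this slide and the quantity plotted on the following ones can be written out as a short sketch. The slides do not define speedup explicitly, so the usual definition is assumed here: speedup is the one-processor wall time divided by the p-processor wall time, and efficiency is speedup divided by p. The function names are made up for the example.

```python
STEPS = 144          # time iterations, from the slide
DT_SECONDS = 300     # seconds of simulated time per iteration, from the slide


def simulated_hours(steps, dt_seconds):
    """Length of the simulated period in hours."""
    return steps * dt_seconds / 3600.0


def speedup(t_serial, t_parallel):
    """Assumed definition: wall time on 1 processor over wall time on p processors."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, procs):
    """Speedup per processor; 1.0 would be ideal scaling."""
    return speedup(t_serial, t_parallel) / procs
```

simulated_hours(STEPS, DT_SECONDS) gives 12.0, matching the 12 h quoted on the slide. The two timing helpers take measured wall times as input; no timings from the actual runs are reproduced here.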
Speed Up – SEAM
selecting # of elements ne=24x24x6
Speed Up – SEAM
selecting order np=6
Speed Up – SEAM+Aztec
best: cgs solver
Speed Up – SEAM+Aztec
best: cgs solver + Least Square preconditioner
Speed Up – SEAM+Aztec
increasing np -> increases speedup
Upshot – SEAM (One CG iteration)
Upshot – SEAM (matrix times vector communication)
Upshot – SEAM+Aztec (One CG iteration)
Upshot – SEAM+Aztec (Matrix times vector communication)
Upshot – SEAM+Aztec (Vector Reduction)
Time (24x24x6 elements, 2 proc.)
Solver           Iter.     Time (loop)   Time/iter
SEAM p=6         33.0 it   7.48 s        0.22 s/it
SEAM p=12        56.9 it   81.2 s        1.42 s/it
Cg p=6           87.1 it   28.2 s        0.32 s/it
Cgs p=6          74.1 it   28.6 s        0.38 s/it
Tfqmr p=6        75.2 it   31.1 s        0.41 s/it
Bicg p=6         94.1 it   29.4 s        0.31 s/it
Cgs ls p=6       35.1 it   42.0 s        1.19 s/it
CG Jacobi p=6    45.8 it   17.2 s        0.37 s/it
Cgs Jacobi p=6   31.7 it   15.3 s        0.48 s/it
Cgs p=12         60.4 it   274 s         4.53 s/it
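The Time/iter column is the loop time divided by the iteration count, and the table can be checked with a few lines of Python. The numbers below are copied from the table. The two ratios at the end are one plausible reading of how rows might be compared: the slides do not say which rows were paired, so that pairing is an assumption.

```python
# (iterations, loop time in s, reported s/it), copied from the table
TABLE = {
    "SEAM p=6":       (33.0, 7.48, 0.22),
    "SEAM p=12":      (56.9, 81.2, 1.42),
    "Cg p=6":         (87.1, 28.2, 0.32),
    "Cgs p=6":        (74.1, 28.6, 0.38),
    "Tfqmr p=6":      (75.2, 31.1, 0.41),
    "Bicg p=6":       (94.1, 29.4, 0.31),
    "Cgs ls p=6":     (35.1, 42.0, 1.19),
    "CG Jacobi p=6":  (45.8, 17.2, 0.37),
    "Cgs Jacobi p=6": (31.7, 15.3, 0.48),
    "Cgs p=12":       (60.4, 274.0, 4.53),
}


def time_per_iter(name):
    """Loop time divided by iteration count, in seconds per iteration."""
    iters, seconds, _reported = TABLE[name]
    return seconds / iters


def loop_time_ratio(name, baseline):
    """How many times longer the whole loop takes than the baseline row."""
    return TABLE[name][1] / TABLE[baseline][1]


def per_iter_ratio(name, baseline):
    """How many times longer one iteration takes than in the baseline row."""
    return time_per_iter(name) / time_per_iter(baseline)


# Assumed pairings (not stated in the slides):
fastest_aztec_vs_seam = loop_time_ratio("Cgs Jacobi p=6", "SEAM p=6")
cg_iteration_vs_seam = per_iter_ratio("Cg p=6", "SEAM p=6")
```

Every recomputed time per iteration agrees with the reported column to within 0.01 s/it; the reported values look truncated rather than rounded. Under the assumed pairings, the fastest SEAM+Aztec row takes about twice the loop time of SEAM at p=6, and one unpreconditioned Aztec CG iteration costs roughly 40 to 45% more than one SEAM iteration.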
Conclusions & Suggested Future Efforts
SEAM+Aztec works!
SEAM+Aztec is 2x slower
– difference in CG algorithms
SEAM+Aztec time per iteration is 50% slower
– 0.1% of time lost in calls, preparation for Aztec.
More time → better tune-up.
Domain decomposition Preconditioners