An Extreme-Scale Implicit Solver for Complex PDEs: Highly Heterogeneous Flow in Earth's Mantle
Johann Rudi∗, A. Cristiano I. Malossi†, Tobin Isaac∗, Georg Stadler‖, Michael Gurnis‡, Peter W. J. Staar†, Yves Ineichen†, Costas Bekas†, Alessandro Curioni†, Omar Ghattas∗¶
∗Institute for Computational Engineering and Sciences, The University of Texas at Austin, USA
†Foundations of Cognitive Solutions, IBM Research – Zurich, Switzerland
‖Courant Institute of Mathematical Sciences, New York University, USA
‡Seismological Laboratory, California Institute of Technology, USA
¶Jackson School of Geosciences and Dept. of Mechanical Engineering, The University of Texas at Austin, USA
Abstract—Mantle convection is the fundamental physical process within earth's interior responsible for the thermal and geological evolution of the planet, including plate tectonics. The mantle is modeled as a viscous, incompressible, non-Newtonian fluid. The wide range of spatial scales, extreme variability and anisotropy in material properties, and severely nonlinear rheology have made global mantle convection modeling with realistic parameters prohibitive. Here we present a new implicit solver that exhibits optimal algorithmic performance and is capable of extreme scaling for hard PDE problems, such as mantle convection. To maximize accuracy and minimize runtime, the solver incorporates a number of advances, including aggressive multi-octree adaptivity, mixed continuous-discontinuous discretization, arbitrarily-high-order accuracy, hybrid spectral/geometric/algebraic multigrid, and novel Schur-complement preconditioning. These features present enormous challenges for extreme scalability. We demonstrate that—contrary to conventional wisdom—algorithmically optimal implicit solvers can be designed that scale out to 1.5 million cores for severely nonlinear, ill-conditioned, heterogeneous, and anisotropic PDEs.
Submission Category: Scalability
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SC '15, November 15–20, 2015, Austin, TX, USA.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-3723-6/15/11...$15.00
DOI: http://dx.doi.org/10.1145/2807591.2807675
I. EARTH’S MANTLE CONVECTION
Earth is a dynamic system in which mantle convection drives plate tectonics and continental drift and, in turn, controls much activity ranging from the occurrence of earthquakes and volcanoes to mountain building and long-term sea level change. Despite its central role in solid earth dynamics, we have enormous first-order gaps in our knowledge of mantle convection, with questions as basic as what the principal driving and resisting forces on plate tectonics are, and what the energy balance of the planet as a whole is. Indeed, understanding mantle convection has been designated one of the "10 Grand Research Questions in Earth Sciences" in a recent National Academies report [1]. We seek to address such fundamental questions as: (i) What are the main drivers of plate motion—negative buoyancy forces or convective shear traction? (ii) What is the key process governing the occurrence of great earthquakes—the material properties between the plates or the tectonic stress?
Addressing these questions requires global models of earth's mantle convection and associated plate tectonics, with realistic parameters and high resolutions down to faulted plate boundaries. Historically, modeling at this scale has been out of the question due to the enormous computational complexity associated with numerical solution of the underlying mantle flow equations. However, with the advent of multi-petaflops supercomputers as well as significant advances in seismic tomography and space geodesy placing key observational constraints on mantle convection, we now have the opportunity to address these fundamental questions.
Instantaneous flow of the mantle is modeled by the
Figure 1: Cross section through a subducting slab of the Pacific plate showing effective mantle viscosity (colors) in a simulation of nonlinear mantle flow. Viscosity of plates reaches µmax (dark blue), except at the trench due to plastic yielding. In the thin plate boundary region, viscosity drops to µmin (dark red), creating a contrast of 10^6. Strain rate weakening reduces the viscosity underneath the plates and, combined with plastic yielding, is the reason for earth's highly nonlinear rheology.
nonlinear incompressible Stokes equations:

−∇ · [µ(T,u) (∇u + ∇u⊤)] + ∇p = f(T),   (1a)
∇ · u = 0,   (1b)
where u, T, and p are the velocity, temperature, and pressure fields, respectively; f is the temperature-dependent forcing derived from the Boussinesq approximation; and the temperature- and velocity-dependent effective viscosity µ is characterized by the constitutive law
µ(T,u) = µmin + min( τyield / (2 ε̇II(u)), w min( µmax, a(T) ε̇II(u)^(1/n − 1) ) ).   (1c)

The effective viscosity depends on a power of the square root of the second invariant of the strain rate tensor, ε̇II := ((1/2) ε̇ : ε̇)^(1/2), where ":" represents the inner product of second order tensors and ε̇ := (1/2)(∇u + ∇u⊤). The viscosity decays exponentially with temperature via the Arrhenius relationship, symbolized by a(T). The constitutive relation incorporates plastic yielding with yield stress τyield, lower/upper bounds on viscosity µmin and µmax, and a decoupling factor w(x) to model plate boundaries. For whole earth mantle flow models, (1) are augmented with free-slip conditions (i.e., no tangential traction, no normal flow) at the core–mantle and top surface boundaries.
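To make the constitutive law (1c) concrete, the following is a minimal scalar sketch of the effective viscosity and of the strain rate invariant. All numerical values (viscosity bounds, yield stress, power-law exponent, Arrhenius prefactor) are illustrative assumptions, not the paper's calibrated parameters.

```python
import numpy as np

def strain_rate_invariant(grad_u):
    """edot_II := ((1/2) edot : edot)^(1/2) with edot = (grad u + grad u^T)/2."""
    edot = 0.5 * (grad_u + grad_u.T)
    return np.sqrt(0.5 * np.tensordot(edot, edot))

def effective_viscosity(edot_II, a_T, w, mu_min=1e18, mu_max=1e24,
                        tau_yield=1e8, n=3.0):
    """Scalar sketch of (1c). Arguments: second invariant of the strain rate
    [1/s], Arrhenius factor a(T), plate-decoupling factor w(x) in (0, 1].
    All default parameter values are illustrative."""
    creep = a_T * edot_II ** (1.0 / n - 1.0)   # strain rate weakening branch
    upper = w * min(mu_max, creep)             # bounded, decoupled viscosity
    plastic = tau_yield / (2.0 * edot_II)      # plastic yielding branch
    return mu_min + min(plastic, upper)
```

Note that the additive µmin and the min() against µmax are what keep the law bounded (and, with a suitable smooth min, differentiable), which is needed later for Newton's method.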
Successful solution of realistic mantle flow problems must overcome a number of computational challenges due to the severe nonlinearity, heterogeneity, and anisotropy of earth's rheology. Nonlinear behavior at narrow plate boundary regions influences the motion of whole plates at continental scales, resulting in a wide range of spatial scales. Crucial features are highly localized with respect to earth's radius (∼6371 km), including plate thickness of order ∼50 km and shear zones at plate boundaries of order ∼5 km. Desired resolution at plate boundaries is below ∼1 km. However, a mesh of earth's mantle with uniform resolution of 0.5 km would result in O(10^13) degrees of freedom (DOF), which would be prohibitive for models with such complexity. Thus adaptive methods are essential. Six orders of magnitude viscosity contrast is characteristic of the shear zones at plate boundaries, yielding sharp viscosity gradients and leading to severe ill-conditioning. Furthermore, the viscosity's dependence on a power of the second invariant of the strain rate tensor and plastic yielding phenomena lead to severely nonlinear behavior.
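The O(10^13) estimate for a uniform mesh can be checked with back-of-envelope arithmetic. The mantle-shell radii and the (conservative) per-element DOF count below are illustrative assumptions, not values stated in the paper.

```python
import math

def uniform_mesh_dofs(r_cmb_km=3480.0, r_surf_km=6371.0, h_km=0.5,
                      dofs_per_element=4):
    """Rough check of the O(10^13) DOF claim for a uniform 0.5 km mesh of
    the mantle shell. Radii (core-mantle boundary ~3480 km, surface
    ~6371 km) and the DOF-per-element count are illustrative assumptions."""
    volume = 4.0 / 3.0 * math.pi * (r_surf_km**3 - r_cmb_km**3)  # km^3
    elements = volume / h_km**3       # ~7e12 hexahedral elements
    return elements * dofs_per_element
```

With these assumptions the count lands in the low 10^13 range, consistent with the text.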
Overcoming major obstacles in adaptivity [2], we advanced toward these challenges with global models having 20 km thick plate boundaries, nonlinear viscosity, and yielding, and in turn demonstrated an unanticipated level of coupling between plate motion and the deep mantle [3], bounds on energy dissipation within plates [4], and rapid motion of small tectonic plates adjacent to large ones [5]. Nevertheless, such models did not close the gap between the fine-scale (∼1 to 10 km) patterns of earthquakes, stress, and topography along plate boundaries and plate motions. Narrowing the local-to-global divide is essential for extracting the key observations allowing one to reach a new understanding of the physics of earth processes (Figure 1).
The central computational challenge in doing so is to design implicit solvers and implementations for high-resolution realistic mantle flow models that can handle the resulting extreme degrees of nonlinearity and ill-conditioning, the wide ranges of length scales and material properties, and the highly adapted meshes and required advanced discretizations, while also scaling to the O(10^6) cores characteristic of leadership-class supercomputers. While the conventional view is that this goal is impossible, we demonstrate that with a careful redesign of discretization, algorithms, solvers, and implementation, this goal is indeed possible. These advances open the door to merging two fundamental geophysical questions: the origin of great earthquakes and the balance of forces driving plate motions.
II. STATE OF THE ART
Earth's mantle convection is one of a large number of complex PDE problems that require implicit solution on extreme-scale systems. The complexity arises from the presence of a wide range of length scales and strong heterogeneities, as well as localizations and anisotropies. Complex PDE problems often require aggressive adaptive mesh refinement, such as that provided by the parallel forest-of-octree library p4est [6] that originated in our group. They also often require advanced discretizations, such as the high-order, hanging-node, mixed continuous-velocity/discontinuous-pressure element pair employed here. The physics complexities combined with the discretization complexities conspire to present enormous challenges for the design of solvers that are not only algorithmically optimal, but also scale well in parallel. These challenges are well documented in a number of blue ribbon panel reports (e.g., [7]).
In our context of time-independent non-Newtonian flows, an implicit solver means a combination of nonlinear and linear solvers and preconditioners. We employ Newton's method, the gold standard for nonlinear solvers. It can deliver asymptotic quadratic convergence, independent of problem size, for many problems. However, differentiating complex constitutive laws such as (1c) to obtain the linearized Newton operator creates an even more complex system to be solved. Combining the Newton method with an appropriately truncated Krylov linear solver avoids oversolving far from the region of fast convergence [8]. The crucial point is then the preconditioner, which must simultaneously globalize information to maximize algorithmic efficiency and localize it to maximize parallel performance. For preconditioning, we target multilevel solvers, which are algorithmically optimal for many problems (i.e., they require O(n) work, where n is the number of unknowns) and parallelize well (requiring O(log n) depth), at least for simple elliptic PDE operators.
The state of the art in extreme-scale multilevel solvers is exemplified by the Hybrid Hierarchical Grids (HHG) geometric multigrid (GMG) method [9], the GMG solver underlying the UG package [10], the algebraic multigrid (AMG) solver BoomerAMG from the hypre library [11], the multilevel balancing domain decomposition solver in FEMPAR [12], and the AMG solver for heterogeneous coefficients from the DUNE project [13]. These multigrid solvers have all been demonstrated to scale up to several hundred thousand cores (458K cores in some cases), but only for constant coefficient linear operators, uniformly-refined meshes, and low-order discretizations (with the exception of the DUNE solver, which has demonstrated scalability on a problem with heterogeneous coefficients but otherwise with uniform and low order grids). The complex PDE problems we target—characterized by advanced high-order discretizations, highly-locally adapted meshes, extreme (six orders-of-magnitude variation) heterogeneities, anisotropies, and severely nonlinear rheology—are significantly more difficult. We are not aware of any solver today that is capable of solving such problems at large scale with algorithmic and parallel efficiency.

[Figure 2 plots residual reduction (10^0 down to 10^-6) over GMRES iterations (0–300, with restarts marked), comparing HMG-LSC vs. AMG-mass solvers at plate boundary resolutions of 4.0, 1.5, and 0.7 km.]
Figure 2: Comparison of algorithmic performance of conventional state-of-the-art (dashed lines) vs. new (solid lines) Stokes solver for a sequence of increasingly difficult problems (indicated by colors), reflecting increasingly narrower plate boundary regions.
When O(10^5) cores and beyond are needed for implicit solution of such complex PDE problems, the usual approach has been to retreat to algorithmically suboptimal but easily-parallelizable solvers (such as explicit or simply-preconditioned implicit). This is clearly not a tenable situation, and the performance gap between optimal and suboptimal solvers only increases as problems grow larger. Thus our goal here is to present an implicit solver (significantly going beyond our previous work [14], [15], [2]) that delivers optimal algorithmic complexity while scaling with high parallel efficiency to the full size of leadership-class supercomputers for the class of complex PDE problems targeted here, with particular application to our driving global mantle convection problem.
Figure 2 illustrates the power of algorithmically optimal solvers for our mantle convection problem. The curves show the reduction in residual as a function of Krylov iterations for a sequence of increasingly difficult problems (different colors). The dashed curves represent a contemporary, well-regarded solver, such as that found in the state-of-the-art community mantle convection code ASPECT [16]. This combines AMG to precondition the (1,1) block of the Stokes system along with a diagonal mass matrix approximation of the (2,2) Schur complement. Our new solver (see next section) combines a sophisticated hybrid spectral-geometric-algebraic multigrid (HMG) with a novel HMG-preconditioned improved Schur complement approximation. The massive enhancement in algorithmic performance (over 4 orders of magnitude lower residual for the same number of iterations) seen in the figure is due to the improvement of the Schur complement. This is what makes the solution of the high-fidelity mantle flow models we are targeting tractable. Although the improved Schur complement increases the cost per iteration, as we will see, we are still able to obtain excellent scalability out to 1.5M cores, together with the several orders of magnitude improvement in run time. Key to achieving this scalability is: (i) avoiding AMG setup/communication costs with a spectral and geometric multigrid approach; and (ii) eliminating AMG's requirement for matrix assembly and storage for differential operators and intergrid transfer operations.
III. INNOVATIVE CONTRIBUTIONS
A. Summary of contributions
Hybrid spectral-geometric-algebraic multigrid (HMG). High-order discretizations on locally refined meshes for implicit problems with extreme variations in coefficients pose challenges for extreme-scale PDE solvers. We develop a multigrid scheme based on matrix-free operators that requires no collective communication and repartitions meshes at coarse multigrid levels. The latter is achieved using a hierarchy of MPI communicators for point-to-point communication. In this way, we obtain optimal time-to-solution.
Preconditioner. The Schur complement approximation in our solver is known to be critical for problems with extreme variations in the coefficients. We propose a new HMG-based approach for preconditioning the Schur complement of the nonlinear Stokes equations. It extends a Schur complement method whose reliance on discrete arguments limited its application to AMG. Since AMG is difficult to scale to millions of cores and has a significant memory footprint, we have developed an HMG method based on a PDE operator that mimics the algebraic operator occurring in the preconditioner.
Nonlinear solver. For the first time, a grid-continuation, inexact Newton-Krylov method is used for a realistic and severely nonlinear rheology over the entire earth. The nonlinearity originates from power law shear thinning, viscosity bounds, and plastic yielding [17]. This new method enables us to simulate the global instantaneous mantle flow in the entire earth with unprecedented accuracy.
B. Algorithm overview
We employ an inexact Newton-Krylov method for the nonlinear Stokes equations (1), i.e., we use a sequence of linearizations of (1) and approximately solve the resulting linearized systems using a preconditioned Krylov method. The design of the preconditioner required the majority of the algorithmic innovations, but the nonlinear solver components needed careful consideration as well. In particular, we define the rheology (1c) such that it incorporates bounds for the viscosity in a differentiable manner, permitting the use of Newton's method. To compute a Newton update (ũ, p̃), we find the (inexact) solution of the linearized Stokes system

−∇ · [µ′ (∇ũ + ∇ũ⊤)] + ∇p̃ = −r_mom,
∇ · ũ = −r_mass,   (2)

with

µ′ = µ I + ε̇II (∂µ/∂ε̇II) [ (∇u + ∇u⊤) ⊗ (∇u + ∇u⊤) / ‖∇u + ∇u⊤‖²_F ],   (3)

where the current velocity and pressure are u and p, respectively, and the residuals of the momentum and mass equations appear on the right-hand side of (2). Note that what plays the role of viscosity in the Newton step is an anisotropic fourth-order tensor (3).
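The outer iteration described above can be sketched generically. In the real solver the inner solve is preconditioned GMRES truncated by a forcing tolerance (loose far from the solution, to avoid oversolving); in this runnable sketch a direct solve stands in for the Krylov method, and the callbacks are hypothetical placeholders for the discretized residual and linearized operator.

```python
import numpy as np

def inexact_newton(residual, jacobian, x0, rtol=1e-7, max_iter=50):
    """Sketch of an (inexact) Newton iteration: linearize, solve the Newton
    system, update, repeat until the nonlinear residual drops by rtol.
    `residual(x)` returns the nonlinear residual vector; `jacobian(x)` the
    linearized operator as a matrix."""
    x = np.asarray(x0, dtype=float).copy()
    r = residual(x)
    r0 = np.linalg.norm(r)
    for _ in range(max_iter):
        if np.linalg.norm(r) <= rtol * r0:
            break
        # In the paper's solver, a forcing term like
        # eta = min(0.5, sqrt(|r|/|r0|)) would set the inner GMRES tolerance.
        dx = np.linalg.solve(jacobian(x), -r)  # stand-in for the Krylov solve
        x += dx
        r = residual(x)
    return x
```

For a toy nonlinear system such as F(x) = x³ − 8 the iteration converges quadratically near the root, mirroring the asymptotic behavior cited for Newton's method.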
We discretize earth's mantle using locally adaptively refined hexahedral meshes. Extreme local refinement is critical to resolve plate boundaries down to a few hundred meters, while away from these regions significantly coarser meshes can be used that still capture global-scale behavior. Parallel adaptive forest-of-octrees algorithms, implemented in the p4est parallel AMR library, are used for efficient parallel mesh refinement/coarsening, mesh balancing, and repartitioning [2], [6], [18]. In (2), the velocity is discretized with high-order, non-conforming, continuous nodal finite elements of polynomial order k ≥ 2, and the pressure with discontinuous modal elements of order k−1. This velocity-pressure pairing yields optimal asymptotic convergence with decreasing mesh element size and conserves mass locally at the element level. It is provably inf-sup stable and thus avoids stabilization terms, which can degrade the accuracy of the solution, especially in mantle convection simulations.
This discretization of the Newton step results in an extremely ill-conditioned algebraic system with up to hundreds of billions of unknowns, which requires a preconditioned Krylov iterative method. Such a Krylov method needs only the application of the left hand side
[Figure 3 diagram residue: the HMG hierarchy proceeds from high-order continuous nodal finite elements through spectral p-coarsening, then geometric h-coarsening to trilinear elements on successively fewer cores (small MPI communicators below 1000 cores), then algebraic coarsening, and finally a direct solve on a single core. The pressure space uses discontinuous modal elements with a modal-to-nodal projection; the V-cycle stages p-MG, h-MG, AMG, and direct are linked by high-order L2-projection, linear L2-projection, and linear projection operators.]
Figure 3: Left image: visualization of a simulation and part of the computational domain showing the adaptively refined mesh. The color coding illustrates the effective mantle viscosity and the arrows depict the motion of the tectonic plates. Central diagram: illustration of the multigrid hierarchy. From top to bottom, first, the multigrid levels are obtained by spectral coarsening. Next, the mesh is geometrically coarsened and repartitioned on successively fewer cores to minimize communication. Finally, AMG further reduces problem size and core count. The multigrid hierarchy used in the Schur complement additionally involves smoothing in the discontinuous modal pressure space (green). Right diagram: the multigrid V-cycle consists of smoothing at each level of the hierarchy (circles) and intergrid transfer operators (arrows downward for restriction and arrows upward for interpolation). To enhance the efficacy of the V-cycle as a preconditioner, different types of projection operators are employed depending on the phase within the V-cycle.
operator in (2) to vectors, which we implement in a matrix-free fashion using elemental loops. We exploit the tensor-product structure of the element-level basis functions, resulting in a reduced number of operations [19]. We use GMRES as the Krylov solver, with right preconditioning based on the upper triangular block matrix:

[ A  B⊤ ] [ Ã  B⊤ ]⁻¹ [ ũ ]   [ r₁ ]
[ B  0  ] [ 0   S̃ ]    [ p̃ ] = [ r₂ ] ,   (4)

where the first factor is the Stokes operator and the second is the preconditioner.
Here, approximations of the inverse of the viscous block, Ã⁻¹ ≈ A⁻¹, and the inverse of the Schur complement, S̃⁻¹ ≈ (B A⁻¹ B⊤)⁻¹, are required, where B and B⊤ denote the discrete divergence and gradient matrices. This particular combination of Krylov method and preconditioner type is known to converge in only two iterations for optimal choices of Ã⁻¹ and S̃⁻¹ [20].
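Applying the inverse of the upper-triangular preconditioner in (4) is a block back-substitution: a pressure solve followed by a velocity solve. The sketch below makes that explicit; the two callables stand in for the multigrid V-cycles used in the real solver, and all names are illustrative.

```python
import numpy as np

def apply_block_triangular_prec(A_tilde_inv, S_tilde_inv, Bt, r1, r2):
    """Back-substitution with the upper-triangular block preconditioner:
    solve the Schur complement block for the pressure update first, then
    the viscous block for the velocity update. A_tilde_inv / S_tilde_inv
    are callables (one HMG V-cycle each in the paper's solver)."""
    p = S_tilde_inv(r2)              # pressure block solve
    u = A_tilde_inv(r1 - Bt @ p)     # velocity block solve
    return u, p
```

With exact inverses Ã⁻¹ = A⁻¹ and S̃⁻¹ = (B A⁻¹ B⊤)⁻¹, the right-preconditioned operator has a minimal polynomial of degree two, which is why GMRES converges in two iterations in the ideal case [20].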
The inverse of the viscous block Ã⁻¹ is approximated by a multigrid V-cycle, as detailed below. For the inverse of the Schur complement S̃⁻¹, we use an improved version of the Least Squares Commutator method [21], [22]:

S̃⁻¹ = (B D⁻¹ B⊤)⁻¹ (B D⁻¹ A D⁻¹ B⊤) (B D⁻¹ B⊤)⁻¹,

where D := diag(A). It has been demonstrated to be robust with respect to extreme viscosity variations, as shown in Table I. This approach requires approximating the inverse of B D⁻¹ B⊤.
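As a small sanity check of the Least Squares Commutator formula, the following dense sketch applies S̃⁻¹ to a vector. It assembles everything explicitly, which is exactly what the paper's solver avoids; it is for illustration only, and the matrices in the test are made up.

```python
import numpy as np

def lsc_schur_apply(A, B, r):
    """Dense sketch of the LSC approximation
    S~^{-1} = (B D^{-1} B^T)^{-1} (B D^{-1} A D^{-1} B^T) (B D^{-1} B^T)^{-1},
    with D := diag(A). In the paper's solver each outer inverse is applied
    via a multigrid V-cycle rather than a dense solve."""
    Dinv = np.diag(1.0 / np.diag(A))
    P = B @ Dinv @ B.T                # B D^{-1} B^T (Poisson-like operator)
    M = B @ Dinv @ A @ Dinv @ B.T     # middle commutator factor
    return np.linalg.solve(P, M @ np.linalg.solve(P, r))
```

A useful consistency property: when A is diagonal (A = D), the middle factor equals B D⁻¹ B⊤ and the expression collapses to the exact Schur complement inverse (B D⁻¹ B⊤)⁻¹ r.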
Multigrid is a natural choice to invert this matrix. However, AMG would require matrix assembly and an expensive setup. The problem lies in the particular coupling between the matrices B and B⊤. Computing the product matrix results in large communication requirements and in a large number of nonzero entries; the number of nonzero entries of the product matrix increases similarly as when squaring the matrix of a discrete Laplacian. We thus developed an approach that uses the analogy between B D⁻¹ B⊤ and an anisotropic, variable-coefficient elliptic PDE operator. This operator is discretized with continuous, kth order, nodal finite elements. This continuous, nodal Poisson operator, which we call K, is then inverted with an HMG V-cycle plus additional smoothing steps in the discontinuous modal pressure space. For smoothing in the pressure space we compute and store only the diagonal entries of B D⁻¹ B⊤, which requires no communication.
As illustrated in Figure 3, our HMG method is divided into four stages (or five stages in the pressure Poisson case). This hybrid multigrid setup sits at the core of our nonlinear solver, and thus a careful design of the intergrid transfer operators was critical for efficiency and performance. Our hybrid multigrid method combines high-order L2-restrictions/interpolations, uses the full fourth-order tensor coefficient in the Newton step (3) on all levels, and employs Chebyshev-accelerated point-Jacobi smoothers. This results in optimal algorithmic multigrid performance, i.e., iteration numbers are independent of mesh size and discretization order, and are robust with respect to the highly heterogeneous coefficients (six orders of magnitude viscosity and nine orders of magnitude Poisson coefficient contrast) occurring in the simulation of mantle flow with plates (see Section IV).
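A Chebyshev-accelerated point-Jacobi smoother can be sketched with the standard three-term Chebyshev recurrence applied to the diagonally preconditioned operator. The spectral-bound estimate and the lower-bound fraction below are illustrative assumptions (production multigrid codes estimate the largest eigenvalue with a few Lanczos/Arnoldi steps).

```python
import numpy as np

def chebyshev_jacobi_smooth(A, b, x, steps=3, lmin=None, lmax=None):
    """Chebyshev-accelerated point-Jacobi smoothing sketch for A x = b.
    The Chebyshev polynomial targets the interval [lmin, lmax] of the
    spectrum of D^{-1} A, damping the oscillatory error components that
    smoothing is responsible for in multigrid."""
    Dinv = 1.0 / np.diag(A)
    if lmax is None:
        # crude row-sum bound on the spectrum of D^{-1} A (assumption)
        lmax = np.max(np.sum(np.abs(Dinv[:, None] * A), axis=1))
    if lmin is None:
        lmin = 0.3 * lmax  # target the upper (oscillatory) part of the spectrum
    theta, delta = 0.5 * (lmax + lmin), 0.5 * (lmax - lmin)
    sigma = theta / delta
    rho = 1.0 / sigma
    r = Dinv * (b - A @ x)   # Jacobi-preconditioned residual
    d = r / theta
    for _ in range(steps):
        x = x + d
        r = r - Dinv * (A @ d)
        rho_new = 1.0 / (2.0 * sigma - rho)
        d = rho_new * rho * d + (2.0 * rho_new / delta) * r
        rho = rho_new
    return x
```

The appeal at extreme scale is that, unlike Gauss-Seidel, each step needs only a MatVec and the stored diagonal, so the smoother is matrix-free and communication-light.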
C. Implementation and optimization
From a high-level perspective, the challenge of a parallel multigrid implementation is to balance the performance of two critical components: (i) application of differential operators during smoothing, commonly referred to as MatVecs, and (ii) intergrid transfer operators that perform restriction and interpolation between multigrid levels. Further, this balance has to be maintained as the number of cores grows to extreme scales. Both MatVecs and intergrid operators rely on point-to-point communication, such that optimizing the runtime of one deteriorates the performance of the other. In the case of our complex mantle flow solver, we deal with four different kinds of MatVecs (viscous stress A, divergence/gradient B/B⊤, continuous nodal Poisson operator K, and Stokes operator) and six different intergrid operators (restriction and interpolation for each of: modal to nodal projection, p-projection in spectral multigrid, and h-projection in geometric multigrid). Optimization efforts have to target all of these operators to be successful. Additionally, this task is highly non-trivial, since the HMG V-cycle has to be performed on unstructured, highly locally-adapted meshes.
To obtain optimal load balance for MatVecs during the V-cycle, we repartition the coarser multigrid levels uniformly across the cores and gradually reduce the size of the MPI communicators as we progress through the coarser levels. The reduction of the MPI communicator size is done such that neither MatVecs nor intergrid transfer operations become a bottleneck at large scale. Moreover, point-to-point communication is overlapped with computations for optimal scalability. No collective communication is used in the V-cycle. The HMG setup cost is minimized with a matrix-free approach for differential and intergrid operators, which additionally produces a lightweight memory footprint.
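The shrinking-communicator policy can be sketched as a simple sizing rule: keep roughly a fixed number of unknowns per core on each level, never exceeding the core count of the finer level. The threshold value below is purely illustrative and is not the paper's actual policy constant.

```python
def coarse_level_cores(total_cores, level_dofs, dofs_per_core=25000):
    """Sketch of coarse-grid repartitioning: each multigrid level keeps
    roughly `dofs_per_core` unknowns per core, so coarse levels live on
    progressively smaller MPI sub-communicators. `level_dofs` lists the
    unknown counts from finest to coarsest; all numbers are illustrative."""
    cores, c = [], total_cores
    for n in level_dofs:
        c = min(c, max(1, n // dofs_per_core))  # never grow, never drop below 1
        cores.append(c)
    return cores
```

In an MPI code the resulting counts would be realized with sub-communicators (e.g., MPI_Comm_split), keeping coarse-level MatVecs from degenerating into latency-bound operations on over-decomposed meshes.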
These key principles were at the foundation of our extreme-scale multigrid implementation. Further improvements of time-to-solution and performance were carried out in a number of successive optimization steps (see Figure 4a). Overall, we decreased the time to solution for the targeted hardware architecture (see Section V) by a factor of over 1000 and increased performance on a compute node by a factor of ∼200. With this performance, our complex mantle flow solver as a whole, including spectral, geometric and algebraic multigrid phases on highly adaptively refined meshes, is as performant as a routine for sparse matrix-vector
[Figure 4a plots GFlops/s per node (0–10) and normalized time (10^0–10^4 s) over optimization phases A–H; Figure 4b shows the BG/Q node roofline model, GFlops/s per node (1–1000) vs. flops per off-chip byte (0.1–100), with the SpMV operating point marked.]
Figure 4: (a) Performance improvement and time-to-solution reduction over a sequence of optimization steps (time is normalized by GMRES iterations per 1024 BG/Q nodes per billion DOF). Pt. A is base performance before optimization. Pt. B: reduction of blocking MPI communication. Pt. C: minimizing integer operations in inner MatVec for-loops and reducing the number of cache misses. Pt. D: computation of derivatives by applying precomputed CSR-matrices at the element level and SIMD vectorization. Pt. E: OpenMP threading of major loops in MatVecs. Pt. F: MPI communication reduction, overlapping with computations, and OpenMP threading in intergrid operators. Pt. G: low-level optimization of finite element kernels via improving flop-byte ratio and consecutive memory access, and better pipelining of floating point operations. Pt. H: various low-level optimizations including enforcement of boundary conditions and interpolation of hanging finite element nodes. (b) BG/Q node roofline model (theoretical peak performance) and SpMV performance with max flop-byte ratio of 0.25 [24].
multiplications. This is supported by the roofline model analysis [23], from which we obtain an optimal performance of approximately 8 GFlops/s per node for sparse MatVecs (Figure 4b). Note that implicit solvers for PDEs inherently exhibit a sparsity structure, and hence performance will always be memory-bound. This argument shows that the performance of our memory-bound solver is close to what is optimally achievable. This is further supported by numerical results in Section VI.
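The roofline bound cited above is a one-line computation: attainable performance is the minimum of machine peak and memory bandwidth times arithmetic intensity. The per-node bandwidth figure below is an illustrative assumption (not measured in the paper); with the ~0.25 flop/byte ceiling of sparse MatVecs [24] it lands near the ~8 GFlops/s operating point.

```python
def roofline_gflops(bandwidth_gb_s, flops_per_byte, peak_gflops):
    """Attainable GFlops/s = min(peak, bandwidth x arithmetic intensity).
    E.g. an assumed ~30 GB/s effective stream bandwidth per BG/Q node at
    0.25 flop/byte gives ~7.5 GFlops/s, i.e. memory-bound well below the
    204.8 GFlops/s node peak."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)
```

This is why a sparse, matrix-free implicit solver can at best match SpMV throughput: the kernel's intensity, not the FPU, sets the ceiling.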
IV. EXPERIMENTAL SETUP AND VERIFICATION OF OPTIMALITY AND ROBUSTNESS OF THE SOLVER
In this section we describe the physical problem and solver parameters used to carry out the parallel performance analysis in Section VI. We also present results of tests of optimal algorithmic scalability and robustness of the solver, which together with parallel scalability demonstrate overall scalability of the solver.
The important physical parameter that determines the difficulty of the problem is the viscosity field. In our subsequent performance analysis, we use real earth data to generate a physically realistic representation of viscosity (which is a function of T, w, and ε̇II). The viscosity varies over six orders of magnitude globally. However, what makes realistic mantle flow problems even more highly ill-conditioned and nonlinear is the extremely thin layer in which this contrast develops. The viscosity drops by six orders of magnitude within a thin layer between two plates (the plate boundary).
To assess solver robustness and algorithmic scalability, we generate plate boundaries down to a thickness of 5 km and a factor of 10^6 viscosity drop over 7 km. For the weak and strong scalability measurements, the 10^6 factor viscosity drop occurs within just 3 km. Since tectonic plates (the largest surface structures) are 2,000–14,000 km across, and earth's circumference is 40,075 km, this results in a very wide range of length scales of interest. To capture the viscosity variation, the mesh is refined to ∼75 m local resolution in our largest simulations, resulting in a mesh with 9 levels of refinement. For all performance results, we use a velocity discretization with polynomial order k = 2.
The cost of solving a nonlinear earth mantle flow problem is dominated by the cost of the linear solve in each Newton step (2). The cost of a linear solve is determined by the number of MatVecs and HMG intergrid operations. MatVecs are encountered in the Krylov method and in the HMG smoothers. In all subsequently reported performance results, we use three smoothing iterations for pre- and post-smoothing within the HMG V-cycle for both the viscous block Ã⁻¹ and the Schur complement S̃⁻¹, which amounts to three V-cycles per application of the Stokes preconditioner. Therefore each GMRES iteration has the same cost, and it is sufficient to compare the number of GMRES iterations.
A. Robustness of Stokes solver
The robustness of the HMG preconditioner for the Stokes system (1) is assessed by observing the number of GMRES iterations (see Table I) required for convergence while decreasing the thickness of the plate boundaries. This increases the range of length scales in the problem and the resulting nonlinearity and ill-conditioning. Our mesh refinement algorithm, which is based on the norm of the viscosity gradients as well as the magnitude of ε̇II, locally refines the mesh to resolve the extreme viscosity variations. This results in an overall increase in the number of DOF. The third and fourth columns in Table I demonstrate the robustness of the solvers for the (1,1) Stokes block and for the complete Stokes solver. The GMRES iterations are seen to scale independently of the plate boundary thickness and thus viscosity gradient. Similar independence of viscosity is observed for nonlinear iterations.
Table I: Robustness with respect to plate boundary thickness of the HMG-preconditioned GMRES solver for the (1,1) block of Stokes and the linear (full) Stokes solver. The number of GMRES iterations to reduce the residual by a factor of 10⁻⁶ is reported.
Plate boundary thickness [km] | DOF [×10^9] | GMRES iterations to solve Au = f | GMRES iterations to solve Stokes
15 | 1.16 | 115 | 461
10 | 1.41 | 129 | 488
5  | 3.01 | 123 | 445
B. Algorithmic scalability
Algorithmic scalability, i.e., the independence of the solver iterations from the resolution of the mesh, is critical for overall scalability of implicit solvers. To study algorithmic scalability, we consider a nonlinear mantle flow problem with one subducting slab. The plate boundary region between the subducting plate and the overriding plate has a thickness of 5 km. We refine the mesh locally in the regions of highest viscosity variations by tightening the refinement criteria. Thus the total number of DOF grows slowly, though significantly greater resolution is obtained in these regions. The required numbers of linear and nonlinear iterations are shown in Table II, where the cost of the nonlinear solver is measured by the total number of GMRES iterations across nonlinear iterations. As can be seen, the linear solver requires a number of iterations that is largely independent of the resolution of the problem.
We have demonstrated how the combination of our preconditioner and linear and nonlinear solver yields an implicit method whose number of iterations scales independently of model fidelity. Here, fidelity is understood as the resolution of the mesh with finite element discretization and the size of the smallest-scale features, which are the plate boundary regions. This results in an algorithmically optimal method, despite the severely nonlinear rheology, high viscosity gradients, effective anisotropy, and large heterogeneities. Moreover, the cost of the solver is reduced by adaptive mesh refinement, which reduces the number of DOF—in this
Table II: Optimal algorithmic scalability of the inexact Newton–Krylov method for solving a nonlinear mantle flow problem with one subducting slab and a 5 km plate boundary. Simulation cost, expressed in the total number of GMRES iterations, is largely independent of the maximal resolution of the adaptive mesh (a 10⁻⁷ Newton residual reduction is used as the stopping criterion). A two times higher resolution increases the DOF of the adaptively refined mesh only by about a factor of 2–3. In contrast, the factor would be eight with uniform refinement.
Max level of   Finest           DOF      Newton       GMRES
refinement     resolution [m]   [×10⁶]   iterations   iterations
10             2443             0.96     14           1408
11             1222             2.67     18           1160
12             611              5.58     21           1185
13             305              11.82    21           1368
14             153              36.35    27           1527
case by four orders of magnitude, from the O(10¹³) needed for a uniform mesh of earth's mantle to just the O(10⁹) required here using aggressive refinement. Further reducing the number of DOF are the third-order accurate finite elements employed here, along with a mass-conserving discretization. The algorithmic scalability and the greatly reduced number of DOF exhibited by our solver are critical for the overall goal of reducing time-to-solution (for a given accuracy). The remaining component is parallel scalability, which we study next.
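The refinement factors quoted in the caption of Table II can be verified with simple arithmetic (a sketch using the DOF column of Table II; in 3D, one level of uniform refinement multiplies the DOF by 2³ = 8):

```python
# DOF (in millions) for refinement levels 10-14 from Table II.
adaptive_dof = [0.96, 2.67, 5.58, 11.82, 36.35]

# Growth per added level: adaptive refinement stays around 2-3x,
# while uniform refinement in 3D would multiply the DOF by 8.
for lo, hi in zip(adaptive_dof, adaptive_dof[1:]):
    print(f"adaptive growth {hi / lo:.2f}x vs. uniform 8x")

# Four extra uniform levels would cost a factor of 8^4 = 4096 in DOF,
# which is how a uniform mantle mesh reaches O(10^13) DOF while the
# adaptive mesh stays at O(10^9)-O(10^10).
print(f"uniform factor over 4 levels: {8**4}x")
```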
V. SYSTEMS AND MEASUREMENT METHODOLOGY
The target architecture in this work is the IBM Blue Gene/Q¹ (BG/Q) supercomputer [25]. Table III summarizes the size and peak performance of several systems we used. The smaller systems were used for testing, optimization, scaling, and full science runs. The largest runs were performed on the Sequoia supercomputer at Lawrence Livermore National Laboratory (LLNL). Sequoia consists of 96 IBM Blue Gene/Q racks, reaching a theoretical peak performance of 20.1 PFlops/s. Each rack consists of 1024 compute nodes, each hosting an 18-core A2 chip that runs at 1.6 GHz. Of these 18 cores, 16 are devoted to computation, one to the lightweight O/S kernel, and one to redundancy. Every core supports 4 H/W threads; thus, in total, Sequoia has 1,572,864 cores and can support up to 6,291,456 H/W threads. The total available system memory is 1.458 PBytes. BG/Q nodes are connected by a five-dimensional (5-D) bidirectional network, with a network bandwidth of 2 GBytes/s for sending and receiving data. Each BG/Q rack features
¹IBM and Blue Gene/Q are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies.
Table III: Blue Gene/Q supercomputers
System     Racks   Cores       H/W threads   Peak [PFlops/s]
AMOS       5       81,920      327,680       1.0
Vulcan     24      393,216     1,572,864     5.0
JUQUEEN    28      458,752     1,835,008     5.8
Sequoia    96      1,572,864   6,291,456     20.1
dedicated I/O nodes with 4 GBytes/s I/O bandwidth. The system implements optimized collective communication and allows specialized tuning of point-to-point communication. We obtained all timing and performance measurements by means of the IBM HPC Toolkit for BG/Q. The toolkit retrieves performance information about the processor, memory hierarchy, and interconnect. Power measurements are available through hardware sensors for each node board (accommodating 32 nodes) at a rate of 2 samples per second [26]. All runs involved double-precision arithmetic, and the code was compiled using the IBM XL C compilers for BG/Q, version 12.1.10.
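As a cross-check on the peak numbers in Table III, peak performance follows from the core count and the per-core throughput (a sketch; the 8 flops/cycle per core assumes the A2's 4-wide QPX SIMD unit with fused multiply-add):

```python
# Peak Flops/s for the Blue Gene/Q systems in Table III.
GHZ = 1.6e9             # A2 core clock rate
FLOPS_PER_CYCLE = 8     # 4-wide QPX SIMD with fused multiply-add
CORES_PER_RACK = 1024 * 16  # 1024 nodes/rack, 16 compute cores/node

for name, racks in [("AMOS", 5), ("Vulcan", 24),
                    ("JUQUEEN", 28), ("Sequoia", 96)]:
    peak_pf = racks * CORES_PER_RACK * GHZ * FLOPS_PER_CYCLE / 1e15
    print(f"{name:8s} {racks:3d} racks -> {peak_pf:5.1f} PFlops/s")
```

For Sequoia this gives 1,572,864 cores × 1.6 GHz × 8 flops/cycle ≈ 20.1 PFlops/s, matching the table.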
VI. PERFORMANCE RESULTS
A. Weak and strong scalability
We present weak and strong scalability results on the Vulcan and Sequoia BG/Q supercomputers, from 1 rack with 16,384 cores up to 96 racks with 1,572,864 cores. Scalability measurements corresponding to 1, 2, and 4 racks were obtained on Vulcan, whereas the remaining runs on 8–96 racks were performed on Sequoia.
The cost of our large-scale nonlinear mantle convection simulations is overwhelmingly dominated by the cost of the GMRES iterations during a linear Stokes solve (everything else, including setup and I/O, is negligible). These GMRES iterations include HMG V-cycles for the (1,1) Stokes block and in the Schur complement approximation, as explained earlier. For the extreme-scale runs on Sequoia, we had limited access to the system, which allowed just 10 representative GMRES iterations. However, we illustrate the influence of I/O and setup costs by extrapolating the number of GMRES iterations to those expected for a full nonlinear solution. Note that we did observe that the setup time for HMG is largely bounded independent of the number of cores.
The main result is the weak scalability shown in Figure 5. The solver maintained 97% parallel efficiency (red curve) over a 96-fold increase in problem size, from 16K to 1.5M cores of the full Sequoia system. The largest problem involved 602 billion DOF. If we take into
[Figure 5 plot: billions of DOF per second per GMRES iteration vs. number of Blue Gene/Q cores (16,384 to 1,572,864) on Vulcan and Sequoia, with solver, full-code, and ideal scalability curves; per-point efficiencies range from 0.96 to 1.03.]
Figure 5: Weak scalability results on Vulcan and Sequoia from 1 to 96 racks. Performance is normalized by time and number of GMRES iterations. Numbers along the graph lines indicate efficiency w.r.t. ideal speedup (the efficiency baseline is the 1-rack result). We report both the weak scalability for the linear solver only (red) and for the projected total runtime of a nonlinear solve (green). The largest problem size on 96 racks has 602 billion DOF.
account the setup time of the problem, including I/O of input data and solver setup time, we would still arrive at a weak scalability efficiency of 96% (green curve) for the total runtime of a nonlinear simulation, demonstrating the negligibility of I/O and setup time. The I/O for writing output data has to be performed only once at the end of a nonlinear solve. The problem sizes used in the weak scalability runs would produce ∼8.5 GBytes of output per BG/Q I/O node. With an I/O bandwidth of 4 GBytes/s, we can also consider the writing of the output to be negligible for the overall runtime (note that we did not output solution fields, since the full nonlinear simulation could not be run to completion due to limited access). The negligible time for I/O and problem setup stems from the advantages of adaptive implicit solvers: adaptivity results in the problem itself being generated online as part of the solver; implicitness means that fewer outputs/checkpoints are required.
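The efficiency numbers reported along the curves of Figure 5 are obtained by normalizing throughput per GMRES iteration by the 1-rack baseline. A minimal sketch, with illustrative (not measured) timings and a hypothetical 6.3 billion DOF per rack:

```python
# Weak-scaling efficiency: throughput per GMRES iteration relative to
# ideal scaling from a single-rack baseline. Numbers are illustrative.
def weak_efficiency(dof, seconds_per_iter, racks,
                    base_dof, base_seconds, base_racks=1):
    """Measured throughput (DOF/s per iteration) over ideal throughput."""
    throughput = dof / seconds_per_iter
    base_throughput = base_dof / base_seconds
    ideal = base_throughput * racks / base_racks
    return throughput / ideal

# 96 racks, 96x the DOF, unchanged time per iteration -> efficiency 1.0
eff = weak_efficiency(dof=96 * 6.3e9, seconds_per_iter=10.0,
                      racks=96, base_dof=6.3e9, base_seconds=10.0)
print(f"weak-scaling efficiency: {eff:.2f}")
```

Any growth in the per-iteration time at scale pushes the ratio below 1, which is exactly what the 0.96–0.97 values at 96 racks reflect.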
In Figure 6, we show strong scalability results for a mantle convection simulation with 8.3 billion DOF. Starting from one rack with 16,384 cores (a granularity of 506K DOF/core), we achieve a 32-fold speedup on 96 racks with 1,572,864 cores (a granularity of 5K DOF/core), indicating 33% solver efficiency in strong scalability, an impressive number considering the coarse granularity of the largest problem.
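The 33% strong-scaling efficiency is simply the achieved speedup divided by the ideal speedup, a quick check:

```python
# Strong scaling: the same 8.3e9-DOF problem on increasing core counts.
base_cores = 16_384        # 1 rack
max_cores = 1_572_864      # 96 racks
speedup = 32               # measured speedup reported in Figure 6

ideal_speedup = max_cores // base_cores   # 96-fold more cores
efficiency = speedup / ideal_speedup      # 32/96 = 1/3
dof_per_core = 8.3e9 / max_cores          # granularity at full machine

print(f"ideal {ideal_speedup}x, achieved {speedup}x, "
      f"efficiency {efficiency:.0%}, {dof_per_core:.0f} DOF/core")
```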
Contrary to conventional wisdom, this shows that algorithmically optimal implicit finite element solvers for severely nonlinear, ill-conditioned, heterogeneous, indefinite PDEs can be designed to scale to O(10⁶) cores.
B. Node performance analysis
The performance results on BG/Q compute nodes further support our scalability results. The top pie charts of
[Figure 6 plot: speedup vs. number of Blue Gene/Q cores (16,384 to 1,572,864) on Vulcan and Sequoia, with solver, full-code, and ideal speedup curves; per-point efficiencies decrease from 1.03 at small scale to 0.32–0.33 at 96 racks.]
Figure 6: Strong scalability results on Vulcan and Sequoia from 1 to 96 racks. Numbers along the graph lines indicate efficiency with respect to ideal speedup (the efficiency baseline is the 1-rack result). We report both the strong scalability for the linear solver only (red) and for the projected total runtime of a nonlinear solve (green).
Figure 7 decompose the overall runtime into the largest contributors. We can observe that the (highly optimized) matrix-free apply routines dominate with 80.6% in the 1-rack case. Furthermore, their portion remains very stable, at 78% on 96 racks. This result demonstrates a key component of a highly scalable, parallel multigrid implementation. The percent runtime for intergrid transfer operations is low compared to MatVecs and stays low even at 1.5 million cores. Hence, we have achieved a balance between MatVecs and intergrid operations that results in nearly optimal scalability.
MatVecs represent the portion of the code where the maximal performance in terms of flops can be achieved. Since they dominate the runtime, we are able to push total performance close to its maximum. In this way, our implementation performs at the limits of the roofline model, as predicted in Figure 4b, and this is achieved even at extreme scales of O(10⁶) cores.
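The node-level limit mentioned above can be estimated with the roofline model: attainable performance is the minimum of the peak flop rate and arithmetic intensity times memory bandwidth. In the sketch below, only the peak flop rate follows from the hardware description in Section V; the sustained memory bandwidth and the arithmetic intensity of the matrix-free apply are illustrative assumptions, not measured values from the paper.

```python
# Roofline estimate for a memory-bandwidth-bound MatVec on one BG/Q node.
PEAK_GFLOPS = 16 * 1.6 * 8   # 16 cores x 1.6 GHz x 8 flops/cycle = 204.8
STREAM_BW_GB = 28.0          # assumed sustained memory bandwidth (GB/s)
AI = 0.4                     # assumed flops/byte of the matrix-free apply

attainable = min(PEAK_GFLOPS, AI * STREAM_BW_GB)
print(f"peak {PEAK_GFLOPS:.1f} GFlops/s, "
      f"roofline bound {attainable:.1f} GFlops/s per node")
```

With these assumptions the bound lands at roughly 11 GFlops/s per node, an order of magnitude below peak, so sustaining a large fraction of the roofline bound (rather than of peak) is the relevant target for bandwidth-bound MatVecs.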
C. MPI communication analysis
Figure 8 summarizes the MPI communication time measured during the weak and strong scalability runs: tasks with minimum, median, and maximum communication time are displayed. Indeed, for weak scalability, we clearly observe that the percentage of time spent in MPI communication remains nearly constant relative to runtime (Figure 8a). This contributes to the nearly perfect scalability results presented in Figure 5. The increase in median and maximum communication time in the 64-rack case can be justified by the lack of 5-D torus connectivity in that particular configuration (due to specific job partitioning). Another reason can be found in a more aggressive repartitioning of coarser multigrid levels, which leaves a greater number of cores idle during a short period of time in the V-cycle. This is suggested by the higher percentage of MPI_Waitall time on 64 racks in
[Figure 7 charts: runtime-fraction pie charts for 1 rack (7.5 TFlops), 32 racks (239 TFlops), 64 racks (445 TFlops), and 96 racks (687 TFlops), and histograms of GFlops/s per node across 1–96 racks for the A, K, B/Bᵀ, Stokes, and intergrid operators and the total.]
Figure 7: Analysis of MatVecs and intergrid operators within the Stokes solves of the weak scalability runs on Vulcan and Sequoia. Pie charts show the fraction of time in each routine, while the histograms show the corresponding average GFlops/s per BG/Q node. The symbols denote MatVecs for the viscous stress A, the continuous, nodal Poisson operator K, and the divergence/gradient B/Bᵀ (see Section III). The empty slices in the pie charts consist of all other routines with generally low GFlops/s per node (e.g., GMRES orthogonalization, null space projections).
[Figure 8 plots: MPI communication time relative to total runtime vs. number of Blue Gene/Q racks (1–96) for the MPI tasks with minimum, median, and maximum communication time; panel (a) weak scalability, panel (b) strong scalability.]
Figure 8: MPI communication time relative to total runtime for (a) weak scalability and (b) strong scalability on Vulcan and Sequoia.
Figure 9. However, this does not need to affect scalability negatively, since the fewer active cores may perform the same coarse-level work more quickly owing to the higher granularity of DOF per core.
For the strong scalability runs, we observe a gradual increase of the relative MPI communication time (Figure 8b), as is expected for implicit solvers. Note that the increase begins only at 4 racks, where communication time at its maximum is still below 30% of the runtime; communication time exceeds 50% of the overall runtime only at about 1 million cores.
D. Energy consumption analysis
Finally, we analyze the energy efficiency of the scalability runs. As expected, during weak scalability the energy consumption increases linearly with the amount of used resources (Figure 10a). With an estimated cost of $0.06 per kWh [27], the energy cost per GMRES iteration on
[Figure 9 charts: pie charts of communication time spent in MPI_Isend + MPI_Irecv, MPI_Waitall, and MPI_Allreduce for the minimum- and maximum-communication tasks on 1, 32, 64, and 96 racks.]
Figure 9: MPI routine communication time for the weak scalability runs on the Vulcan and Sequoia supercomputers.
[Figure 10 plots: energy per GMRES iteration in MJ (total, chip + SRAM, DDR RAM, network + link chip) vs. number of Blue Gene/Q racks (1–96); panel (a) weak scalability, panel (b) strong scalability.]
Figure 10: Scalability of energy (excluding cooling) on Vulcan and Sequoia: (a) weak scalability, (b) strong scalability.
96 racks is nearly $1.20 (excluding cooling). On the other hand, the loss of strong scalability on the full size of the machine is reflected in the energy usage (Figure 10b). The 33% speedup efficiency on 96 racks (see Figure 6) is combined with a 32% energy efficiency; in other words, the power consumption per node does not change significantly, and energy efficiency is mainly driven by time-to-solution.
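The quoted cost per iteration implies the energy per iteration directly (a quick check; 1 kWh = 3.6 MJ):

```python
# Energy per GMRES iteration on 96 racks of Sequoia, back-computed
# from the quoted cost per iteration and electricity rate [27].
PRICE_PER_KWH = 0.06   # USD per kWh
COST_PER_ITER = 1.20   # USD per GMRES iteration (excluding cooling)

kwh_per_iter = COST_PER_ITER / PRICE_PER_KWH   # 20 kWh
mj_per_iter = kwh_per_iter * 3.6               # 72 MJ

print(f"{kwh_per_iter:.0f} kWh = {mj_per_iter:.0f} MJ per GMRES iteration")
```

The resulting ~72 MJ per iteration is consistent with the total-energy curve at 96 racks in Figure 10a.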
VII. IMPLICATIONS FOR MANTLE FLOW MODELING
Building on the algorithmic innovations for implicit solvers for complex PDEs described in this paper, we are able to address the depth and distribution of oceanic trenches—the most extreme topographic features on earth's surface—for the first time. Trench depth reflects both the downward pull from plate-driving forces [28] and the variable resistance associated with seismic coupling from great earthquakes [29]. We forward-predict the width (∼50 km) and depth (∼10 km) of oceanic trenches on a global scale while predicting plate motions (Figure 11). The simultaneous prediction of these quantities—large-scale flow and fine-scale stress at plate boundaries—in a model with realistic, nonlinear rheology employing scalable, robust solvers opens new
directions for geophysical research. Solver robustness to plate boundary thickness is crucial, as can be seen in Figure 12, where we observe a great sensitivity of the simulation outcome (in terms of plate velocities) to the thickness of the plate boundary regions. The scalable solver presented in this paper, in combination with adjoints, which are a byproduct of the Newton solver, will allow systematic inference of uncertain parameters in global mantle flow systems with tectonic plates. For regional mantle models that are functionally equivalent to the global computation presented here, we have recently illustrated a systematic inference approach for the nonlinear constitutive parameters n and τyield, and plate coupling factors w(x), for several subduction zones [30]. Adjoint-based inversions will require thousands of forward model solutions, so the availability of a scalable implicit solver such as that described here is paramount.
Bringing observations on topography (trench depth), plate motions, and others into a global inversion will allow the merging of two distinct geophysical approaches at different scales addressing different questions. First, what is the degree of coupling associated with great earthquakes? In particular, we seek to determine whether that coupling is due to the frictional properties of the incoming plate or the magnitude of normal stress across the fault driven by tectonic processes [31], [32]. The second question concerns the forces driving and resisting global plate motion and the degree to which inter-plate coupling governs plate motions [4], [33], [34]. These questions have eluded solution over the past three decades because of their intimate coupling. Bridging the local-to-global scales with modern data sets will arguably allow us to make an important leap toward the simultaneous solution of two of the most fundamental questions in earth sciences.
ACKNOWLEDGMENTS
We wish to acknowledge the contributions of W. Scott Futral (LLNL) and Roy Musselman (IBM), who were instrumental in helping us achieve the scaling results on Sequoia. Their contributions came after SC's July deadline for finalizing the author list had passed, but they should be regarded as co-authors.
We wish to offer our deepest thanks to Lawrence Livermore National Laboratory, Jülich Supercomputing Center, the RPI Center for Computational Innovation, and the Texas Advanced Computing Center for granting us the computing resources required to prove the pioneering nature of this work. This research was partially
Figure 11: Results of the nonlinear mantle flow simulation at the earth's surface: views centered on 180°W (left) and 90°W (right) showing the north-westward motion of the Pacific Plate (black arrows) and the total normal stress field (color coded). This stress is proportional to the dynamic topography, and for the first time we are able to forward-predict narrow (∼50 km in width) ocean trenches (narrow lines with dark blue color) along plate boundaries in a global model with plate motions.
Figure 12: Comparison of earth plate velocities of a low-fidelity model (left) and a high-fidelity model (right) with thinner plate boundaries. A significant sensitivity of the velocities of the Cocos Plate (in center) is observed. This illustrates the importance of the solver's ability to handle a wide range of values for plate boundary thickness.
supported by NSF grants CMMI-1028889 and ARC-0941678 and DOE grants DE-FC02-13ER26128 and DE-FG02-09ER25914, as well as the EU FP7 EXA2GREEN and NANOSTREAMS projects. We also thank Carsten Burstedde for his dedicated work on the p4est library.
REFERENCES
[1] D. Depaolo, T. Cerling, S. Hemming, A. Knoll, F. Richter, L. Royden, R. Rudnick, L. Stixrude, and J. Trefil, “Origin and Evolution of Earth: Research Questions for a Changing Planet,” National Academies Press, Committee on Grand Research Questions in the Solid Earth Sciences, National Research Council of the National Academies, 2008.
[2] C. Burstedde, O. Ghattas, M. Gurnis, T. Isaac, G. Stadler, T. Warburton, and L. C. Wilcox, “Extreme-scale AMR,” in Proceedings of SC10. ACM/IEEE, 2010.
[3] L. Alisic, M. Gurnis, G. Stadler, C. Burstedde, and O. Ghattas, “Multi-scale dynamics and rheology of mantle flow with plates,” Journal of Geophysical Research, vol. 117, p. B10402, 2012.
[4] G. Stadler, M. Gurnis, C. Burstedde, L. C. Wilcox, L. Alisic, and O. Ghattas, “The dynamics of plate tectonics and mantle flow: From local to global scales,” Science, vol. 329, no. 5995, pp. 1033–1038, 2010.
[5] L. Alisic, M. Gurnis, G. Stadler, C. Burstedde, L. C. Wilcox, and O. Ghattas, “Slab stress and strain rate as constraints on global mantle flow,” Geophysical Research Letters, vol. 37, p. L22308, 2010.
[6] C. Burstedde, L. C. Wilcox, and O. Ghattas, “p4est: Scalable algorithms for parallel adaptive mesh refinement on forests of octrees,” SIAM Journal on Scientific Computing, vol. 33, no. 3, pp. 1103–1133, 2011.
[7] J. Dongarra, J. Hittinger, J. Bell, L. Chacón, R. Falgout, M. Heroux, P. Hovland, E. Ng, C. Webster, and S. Wild, “Applied mathematics research for exascale computing,” Report of the DOE/ASCR Exascale Mathematics Working Group, 2014.
[8] S. C. Eisenstat and H. F. Walker, “Choosing the forcing terms in an inexact Newton method,” SIAM Journal on Scientific Computing, vol. 17, pp. 16–32, 1996.
[9] B. Gmeiner, U. Rüde, H. Stengel, C. Waluga, and B. Wohlmuth, “Performance and scalability of Hierarchical Hybrid Multigrid solvers for Stokes systems,” SIAM Journal on Scientific Computing, vol. 37, no. 2, pp. C143–C168, 2015.
[10] S. Reiter, A. Vogel, I. Heppner, M. Rupp, and G. Wittum, “A massively parallel geometric multigrid solver on hierarchically distributed grids,” Computing and Visualization in Science, vol. 16, no. 4, pp. 151–164, 2013.
[11] A. H. Baker, R. D. Falgout, T. V. Kolev, and U. M. Yang, “Scaling hypre’s multigrid solvers to 100,000 cores,” in High-Performance Scientific Computing, M. W. Berry, K. A. Gallivan, E. Gallopoulos, A. Grama, B. Philippe, Y. Saad, and F. Saied, Eds. Springer London, 2012, pp. 261–279.
[12] S. Badia, A. F. Martín, and J. Principe, “A highly scalable parallel implementation of balancing domain decomposition by constraints,” SIAM Journal on Scientific Computing, vol. 36, no. 2, pp. C190–C218, 2014.
[13] O. Ippisch and M. Blatt, “Scalability test of µϕ and the Parallel Algebraic Multigrid solver of DUNE-ISTL,” in Jülich Blue Gene/P Extreme Scaling Workshop, no. FZJ-JSC-IB-2011-02. Jülich Supercomputing Centre, 2011.
[14] H. Sundar, G. Biros, C. Burstedde, J. Rudi, O. Ghattas, and G. Stadler, “Parallel geometric-algebraic multigrid on unstructured forests of octrees,” in Proceedings of SC12. ACM/IEEE, 2012.
[15] C. Burstedde, O. Ghattas, M. Gurnis, E. Tan, T. Tu, G. Stadler, L. C. Wilcox, and S. Zhong, “Scalable adaptive mantle convection simulation on petascale supercomputers,” in Proceedings of SC08. ACM/IEEE, 2008.
[16] M. Kronbichler, T. Heister, and W. Bangerth, “High accuracy mantle convection simulation through modern numerical methods,” Geophysical Journal International, vol. 191, no. 1, pp. 12–29, 2012.
[17] G. Ranalli, Rheology of the Earth. Springer, 1995.
[18] T. Isaac, C. Burstedde, L. C. Wilcox, and O. Ghattas, “Recursive algorithms for distributed forests of octrees,” SIAM Journal on Scientific Computing (to appear), 2015, http://arxiv.org/abs/1406.0089.
[19] M. O. Deville, P. F. Fischer, and E. H. Mund, High-Order Methods for Incompressible Fluid Flow, ser. Cambridge Monographs on Applied and Computational Mathematics. Cambridge, UK: Cambridge University Press, 2002, vol. 9.
[20] M. Benzi, G. H. Golub, and J. Liesen, “Numerical solution of saddle point problems,” Acta Numerica, vol. 14, pp. 1–137, 2005.
[21] H. Elman, V. Howle, J. Shadid, R. Shuttleworth, and R. Tuminaro, “Block preconditioners based on approximate commutators,” SIAM Journal on Scientific Computing, vol. 27, no. 5, pp. 1651–1668, 2006.
[22] D. A. May and L. Moresi, “Preconditioned iterative methods for Stokes flow problems arising in computational geodynamics,” Physics of the Earth and Planetary Interiors, vol. 171, pp. 33–47, 2008.
[23] D. Rossinelli, B. Hejazialhosseini, P. Hadjidoukas, C. Bekas, A. Curioni, A. Bertsch, S. Futral, S. J. Schmidt, N. A. Adams, and P. Koumoutsakos, “11 PFlop/s simulations of cloud cavitation collapse,” in Proceedings of SC13. ACM/IEEE, 2013.
[24] V. Karakasis, T. Gkountouvas, K. Kourtis, G. I. Goumas, and N. Koziris, “An extended compression format for the optimization of sparse matrix-vector multiplication,” IEEE Trans. Parallel Distrib. Syst., vol. 24, no. 10, pp. 1930–1940, 2013.
[25] J. Milano and P. Lembke, “IBM system Blue Gene solution: Blue Gene/Q hardware overview and installation planning,” IBM, Tech. Rep. SG24-7872-01, May 2013.
[26] K. Yoshii, K. Iskra, R. Gupta, P. Beckman, V. Vishwanath, C. Yu, and S. Coghlan, “Evaluating power-monitoring capabilities on IBM Blue Gene/P and Blue Gene/Q,” in Proc. of the IEEE Int. Conf. Cluster Computing, Beijing, 2012, pp. 36–44.
[27] U.S. Energy Information Administration (EIA), “Electric power monthly, with data from January 2015,” U.S. Department of Energy, Tech. Rep., March 2015. [Online]. Available: http://www.eia.gov/electricity/monthly/pdf/epm.pdf
[28] S. Zhong and M. Gurnis, “Controls on trench topography from dynamic models of subducted slabs,” J. Geophys. Res., vol. 99, pp. 15,683–15,695, 1994.
[29] T. R. A. Song and M. Simons, “Large trench-parallel gravity variations predict seismogenic behavior in subduction zones,” Science, vol. 301, pp. 630–633, 2003.
[30] V. Ratnaswamy, G. Stadler, and M. Gurnis, “Adjoint-based estimation of plate coupling in a non-linear mantle flow model: theory and examples,” Geophysical Journal International, vol. 202, no. 2, pp. 768–786, 2015.
[31] L. Ruff and H. Kanamori, “Seismic coupling and uncoupling at subduction zones,” Tectonophysics, vol. 99, no. 2, pp. 99–117, 1983.
[32] C. H. Scholz and J. Campos, “The seismic coupling of subduction zones revisited,” Journal of Geophysical Research: Solid Earth, vol. 117, no. B5, 2012.
[33] D. Forsyth and S. Uyeda, “On the relative importance of the driving forces of plate motion,” Geophysical Journal International, vol. 43, no. 1, pp. 163–200, 1975.
[34] B. Hager and R. O’Connell, “A simple global model of plate dynamics and mantle convection,” J. Geophys. Res., vol. 86, pp. 4843–4867, 1981.