Massively Parallel Phase Field Simulations using HPC Framework waLBerla
SIAM CSE 2015, March 15th 2015
Martin Bauer, Florian Schornbaum, Christian Godenschwager, Johannes Hötzer, Harald Köstler and Ulrich Rüde
Chair for System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Outline

• Motivation
• waLBerla Framework
• Phase Field Method
  • Overview
  • Optimizations
  • Performance Modelling
  • Managing I/O
• Summary and Outlook
Massively Parallel Phase-Field Simulations using HPC Framework waLBerlaMartin Bauer - Chair for System Simulation, FAU Erlangen-Nürnberg – March 15, 2015
Motivation

• large domain required to reduce boundary influence
• some physical patterns only occur in highly resolved simulations (e.g. spiral patterns)
• ⇒ simulate big domains in 3D
• an unoptimized, general-purpose phase field code from KIT is available
• goal: write an optimized, parallel version for this specific model
The waLBerla Framework
waLBerla Framework
• waLBerla: widely applicable Lattice Boltzmann from Erlangen
• HPC software framework, originally developed for CFD simulations with the Lattice Boltzmann Method (LBM)
• evolved into a general framework for algorithms on structured grids
• coupling with the in-house rigid body physics engine pe
Application examples: Vocal Fold Study (Florian Schornbaum), Fluid Structure Interaction (Simon Bogner), Free Surface Flow
Block Structured Grids
• structured grid
• domain is decomposed into blocks
• blocks are the container data structure for simulation data (lattice)
• blocks are the basic unit of load balancing
• distributed memory parallelization: MPI
• data exchange at borders between blocks via ghost layers
• support for overlapping communication and computation
• some advanced models (e.g. free surface) require more complex communication patterns
Hybrid Parallelization
(figure: ghost layer exchange between a sender and a receiver process)
(slightly more complicated for non-uniform domain decompositions, but the same general ideas still apply)
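The ghost-layer exchange can be sketched as a pack/unpack of boundary layers. This is a serial stand-in with the actual MPI send/receive omitted; `Block`, `packRightBoundary`, and `unpackLeftGhost` are illustrative names, not waLBerla's API:

```cpp
#include <vector>

// A block stores an (nx+2) x (ny+2) array: one ghost layer around
// an nx x ny interior region.
struct Block {
    int nx, ny;
    std::vector<double> data; // row-major, size (nx+2)*(ny+2)
    Block(int nx_, int ny_) : nx(nx_), ny(ny_), data((nx_ + 2) * (ny_ + 2), 0.0) {}
    double& at(int x, int y) { return data[y * (nx + 2) + x]; } // x,y in [0, nx+1]
};

// Sender side: pack the rightmost interior column into a send buffer
// (this buffer would be handed to MPI in the parallel version).
std::vector<double> packRightBoundary(Block& b) {
    std::vector<double> buf(b.ny);
    for (int y = 1; y <= b.ny; ++y)
        buf[y - 1] = b.at(b.nx, y);
    return buf;
}

// Receiver side: unpack the received buffer into the left ghost column.
void unpackLeftGhost(Block& b, const std::vector<double>& buf) {
    for (int y = 1; y <= b.ny; ++y)
        b.at(0, y) = buf[y - 1];
}
```

After the exchange, the receiver's ghost layer mirrors the sender's boundary, so stencil kernels can run on the interior without special boundary cases.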
Phase field in waLBerla
Phase field algorithm
• two lattices (fields):
  • phase field 𝜙 with 4 entries per cell
  • chemical potential 𝜇 with 2 entries per cell
• two time steps stored in “src” and “dst” fields
• spatial discretization: finite differences
• temporal discretization: explicit Euler method
per-cell cost of the two sweeps:
  sweep 1: 940 FLOP, 34 loads/stores
  sweep 2: 2214 FLOP, 168 loads/stores
Roofline Performance Model
performance data per cell:
  FLOPs: 3154
  loads/stores: 202
  loads from RAM: 101 (the rest is served from cache)
  code balance: 31.2 FLOP per double loaded from RAM

Sandy Bridge architecture:
  RAM bandwidth per core: 6.4 GB/s
  peak per core @ 2.7 GHz: 21.6 GFLOP/s
  machine balance: 25 FLOP/double

⇒ code balance > machine balance ⇒ compute bound
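The roofline argument can be reproduced in a few lines. Note that the plain ratio of the quoted peak (21.6 GFLOP/s) and bandwidth (6.4 GB/s) gives a machine balance of about 27 FLOP/double, close to the slide's 25; either way it lies below the kernel's code balance of 31.2:

```cpp
// Roofline check with the numbers from the slide. The machine balance is the
// FLOP budget available per double streamed from RAM; if the kernel's code
// balance exceeds it, the kernel is compute bound.
struct Roofline {
    double peakFlops;  // FLOP/s per core
    double bandwidth;  // bytes/s per core
    double machineBalance() const { return peakFlops / (bandwidth / 8.0); }
};

bool isComputeBound(const Roofline& m, double flopsPerCell, double ramDoublesPerCell) {
    const double codeBalance = flopsPerCell / ramDoublesPerCell; // FLOP per double
    return codeBalance > m.machineBalance();
}
```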
Optimizations of Phase Field algorithm
Optimization Roadmap
• single core optimizations
  • based on the results of the performance model
  • save floating point operations; pre-compute and store values where possible
  • presented here using the 𝜇-sweep as an example
• scaling
  • performance behavior of the parallelization
  • challenges related to input/output
• performance data presented for SuperMUC
Implementation in waLBerla
• starting point: general prototyping code
• new model-specific implementation in waLBerla
• performance-guided design:
  • no indirect or virtual calls
  • optimized traversal over the grid
Step 1: Replace / Remove expensive operations
• pre-compute common subexpressions
• fast inverse square root approximation: replaces division and sqrt with bit-level operations and adds/muls
• reduce the number of divisions using table lookups where possible
Gibbs Energy subterm pre-computation
• many quantities depend on local temperature only
• in this scenario temperature is a function of one coordinate: T = 𝑇(𝑧)
• these quantities can be computed once per (𝑥, 𝑦)-slice
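The slice-wise pre-computation can be sketched as follows; `temperature` and `expensiveTerm` are stand-ins for the actual temperature profile and Gibbs-energy subterms:

```cpp
#include <cmath>
#include <vector>

double temperature(int z) { return 300.0 + 0.5 * z; }   // T depends on z only
double expensiveTerm(double T) { return std::exp(-1000.0 / T); }

// Because T = T(z), any purely temperature-dependent term is constant on each
// (x, y)-slice: evaluate it once per z and reuse it nx*ny times.
std::vector<double> sweepWithSlicePrecompute(int nx, int ny, int nz) {
    std::vector<double> result(nx * ny * nz);
    for (int z = 0; z < nz; ++z) {
        const double term = expensiveTerm(temperature(z)); // once per slice
        for (int y = 0; y < ny; ++y)
            for (int x = 0; x < nx; ++x)
                result[(z * ny + y) * nx + x] = term;      // cheap reuse
    }
    return result;
}
```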
SIMD
• single instruction, multiple data (SIMD)
• architecture-specific instructions:
  • Intel: SSE, AVX, AVX2
  • Blue Gene: QPX
• modern compilers do auto-vectorization
• it is still often beneficial to write SIMD instructions explicitly via intrinsics
• problem: separate code for each architecture
• lightweight SIMD abstraction layer in waLBerla to write portable code
(AVX example: vaddpd adds four packed doubles, ymm0 ← ymm0 + ymm1, i.e. 𝑐ᵢ = 𝑎ᵢ + 𝑏ᵢ for i = 0…3)
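A minimal sketch of such an abstraction layer. Only a portable scalar fallback is spelled out; an AVX build would implement the same operator via `_mm256_add_pd`, a Blue Gene build via QPX intrinsics. Names are illustrative, not waLBerla's actual API:

```cpp
// One vector type with overloaded operators; the per-architecture mapping
// lives behind #ifdefs in the abstraction layer, so the numerical kernels
// stay architecture-agnostic.
struct Vec4d {
    double v[4];
    double operator[](int i) const { return v[i]; }
};

inline Vec4d operator+(const Vec4d& a, const Vec4d& b) {
    Vec4d c;                                   // scalar fallback
    for (int i = 0; i < 4; ++i) c.v[i] = a.v[i] + b.v[i];
    return c;
}
```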
Buffering of staggered values

• to calculate the divergence, values at staggered grid positions are required
• these pre-computed values can be buffered
• trade-off: more loads and stores, fewer floating point operations
• the same technique can also be applied in the 𝜙-sweep
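The buffering idea in 1D; `flux` is a stand-in for the expensive staggered-point computation:

```cpp
#include <cmath>
#include <vector>

// Expensive quantity at a staggered position between two cells.
double flux(double ul, double ur) { return std::sqrt(0.5 * (ul * ul + ur * ur)); }

// The divergence at cell i needs the fluxes at i-1/2 and i+1/2. Each staggered
// value is shared by two neighbouring cells, so it is computed once and kept
// in a buffer, halving the flux evaluations at the price of extra loads/stores.
std::vector<double> divergenceBuffered(const std::vector<double>& u) {
    const int n = static_cast<int>(u.size());
    std::vector<double> div(n, 0.0);
    double left = flux(u[0], u[0]);            // flux at the boundary face
    for (int i = 0; i < n; ++i) {
        const double right = (i + 1 < n) ? flux(u[i], u[i + 1]) : flux(u[i], u[i]);
        div[i] = right - left;                 // each flux computed exactly once
        left = right;                          // buffer it for the next cell
    }
    return div;
}
```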
⇒ 80× faster than the original version
Intranode Scaling
(chart: intranode weak scaling on SuperMUC)
Single Node Optimization Summary
Single node optimizations:
• replace/remove expensive operations like square roots and divisions
• pre-compute and buffer values where possible
• SIMD intrinsics

Percent of peak performance on SuperMUC:
  𝜙-sweep: 21 %
  𝜇-sweep: 27 %
  complete program: 25 %

Why not 100 % of peak?
• unbalanced number of multiplications and additions
• divisions are counted as 1 FLOP but cost roughly 43 times as much as a multiplication or addition
Scaling
• scaling on SuperMUC up to 32,768 cores
• ghost layer based communication
• communication hiding
Managing I/O
• I/O is necessary to store results (frequently) and for checkpointing (seldom)
• for highly parallel simulations, the output of results quickly becomes a bottleneck
• example: storing one time step of a 940 × 940 × 2080 domain takes 87 GB
• solution: generate a surface mesh from the voxel data during the simulation, locally on each process, using a marching cubes algorithm
• one mesh for each phase boundary
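A back-of-the-envelope check of the 87 GB figure, under the assumption of 6 doubles per cell (4 phase-field plus 2 chemical-potential entries) and no ghost layers; the raw product lands at about 88 GB, close to the quoted number, with the small gap plausibly explained by the output format:

```cpp
#include <cstdint>

// Raw size of one stored time step: cells * values-per-cell * sizeof(double).
std::uint64_t rawOutputBytes(std::uint64_t nx, std::uint64_t ny, std::uint64_t nz,
                             std::uint64_t valuesPerCell) {
    return nx * ny * nz * valuesPerCell * 8;   // 8 bytes per double
}
```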
• the surface meshes are still unnecessarily finely resolved: one triangle per interface cell
(figure: local fine meshes generated by marching cubes are reduced to one coarse mesh on the root process)

• quadric-based edge reduction algorithm (cglib)
• crucial: the mesh reduction step preserves boundary vertices
• hierarchical mesh coarsening and reduction during the simulation
• result: one coarse mesh with a size on the order of several MB
Summary and Outlook

Summary:
• an efficient phase field algorithm is necessary to simulate certain physical effects (spiral patterns)
• systematic performance engineering on several levels
• speedup by a factor of 80 compared to the original version
• around 25 % of peak performance reached on SuperMUC
• parallel output data processing during the simulation to reduce result file size

Outlook:
• GPU implementation
• coupling to the Lattice Boltzmann Method
• improved discretization scheme (implicit method)
Thank you!
Questions?