Performance Optimization of a Massively Parallel Phase-Field Method
Using the HPC Framework waLBerla
SIAM PP 2016, April 13th 2016
Martin Bauer, Florian Schornbaum, Christian Godenschwager, Johannes Hötzer, Harald Köstler and Ulrich Rüde
Chair for System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Outline
• Motivation
• waLBerla Framework
• Phase-Field Method in waLBerla
• Single-Core Optimizations
• Asynchronous Communication
• I/O and Post-processing
• In-Situ Processing with Python
• Summary and Outlook
Optimization of a Phase-Field Method using HPC Framework waLBerla
Martin Bauer - Chair for System Simulation, FAU Erlangen-Nürnberg – April 13, 2016
Motivation
• large domains are required to reduce the influence of the boundaries
• some physical patterns occur only in highly resolved simulations (spirals)
• simulate big domains in 3D
• an unoptimized, general-purpose phase-field code from KIT is available
• goal: write an optimized parallel version for the specific model
The waLBerla Framework
• widely applicable Lattice-Boltzmann from Erlangen
• HPC software framework, originally developed for CFD simulations with the Lattice Boltzmann Method (LBM)
• evolved into a general framework for algorithms on block-structured grids
• www.walberla.net
[Figures: Vocal Fold Study (Florian Schornbaum); Fluid Structure Interaction (Simon Bogner); Free Surface Flow]
waLBerla Framework
• Written in C++ with Python extensions
• Hybrid parallelization (MPI + OpenMP)
• No data structures that grow with the number of processes involved
• Scales from laptops to recent petascale machines
• Parallel I/O
• Portable (compiler/OS)
• Automated tests / CI servers
• Open Source release planned
[Logo: llvm/clang]
Block-structured Grids
• Domain Decomposition & Distribution to Processes:
  • regular decomposition into blocks containing uniform grids
  • grid refinement: octree-like decomposition
Regular decomposition: in most cases, if a regular decomposition of a uniform grid is used, exactly one block is assigned to each process.
Forest of octrees: each block contains a uniform grid of the same size → 2:1 balance between neighboring cells on level transitions.
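The regular decomposition described on this slide can be sketched in a few lines. This is only an illustration, not waLBerla's API: the function name `decompose`, the domain size, and the block counts are hypothetical, and the sketch assumes the domain divides evenly into blocks.

```python
from itertools import product

def decompose(domain, blocks_per_axis):
    """Split a uniform grid into equally sized blocks, one block per process rank."""
    # Assumption: the domain size is divisible by the block count on each axis.
    for n, b in zip(domain, blocks_per_axis):
        if n % b != 0:
            raise ValueError("domain size must be divisible by the block count")
    block_size = tuple(n // b for n, b in zip(domain, blocks_per_axis))
    blocks = {}
    for rank, idx in enumerate(product(*(range(b) for b in blocks_per_axis))):
        lower = tuple(i * s for i, s in zip(idx, block_size))
        upper = tuple(lo + s for lo, s in zip(lower, block_size))
        blocks[rank] = (lower, upper)  # half-open cell interval [lower, upper)
    return block_size, blocks

# Hypothetical example: a 64 x 64 x 32 domain on 2 x 2 x 1 processes.
block_size, blocks = decompose((64, 64, 32), (2, 2, 1))
```

Every block has the same shape, so each process runs the same kernel on a grid of identical size.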
Hybrid Parallelization
• Distributed Memory Parallelization: MPI
  • data exchange on borders between blocks via ghost layers
  • support for overlapping communication and computation
  • some advanced models require more complex communication patterns (e.g. free-surface and fluid-structure interaction)
[Figure: ghost-layer exchange between sender process and receiver process]
(slightly more complicated for non-uniform domain decompositions, but the same general ideas still apply)
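A minimal sketch of the ghost-layer idea, assuming two neighboring 1D blocks with one ghost cell on each side. The helper names are hypothetical, and plain list copies stand in for the MPI send/receive pair.

```python
def make_block(interior):
    """Return a 1D block with one ghost cell on each side of the interior cells."""
    return [0.0] + list(interior) + [0.0]

def exchange_ghost_layers(left, right):
    """Copy the border cells of each block into the neighbor's ghost layer."""
    # In an MPI code these two copies would be a send/receive pair per neighbor.
    right[0] = left[-2]   # left block's last interior cell -> right block's left ghost cell
    left[-1] = right[1]   # right block's first interior cell -> left block's right ghost cell

left = make_block([1.0, 2.0, 3.0])
right = make_block([4.0, 5.0, 6.0])
exchange_ghost_layers(left, right)
```

After the exchange, a stencil kernel can update every interior cell of a block without knowing that some neighbor values live on another process.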
Phase field in waLBerla
Phase field algorithm
• two lattices (fields):
  • phase field 𝜙 with 4 components
  • chemical potential 𝜇 with 2 components
• storing two time steps in “src” and “dst” fields
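The "src"/"dst" scheme can be illustrated with a toy update. The stencil below is a simple 1D diffusion step chosen only for illustration; it is not the phase-field model, and `alpha` and the fixed boundary values are assumptions.

```python
def step(src, dst, alpha=0.1):
    """One explicit update: read only from src, write only to dst."""
    n = len(src)
    for i in range(1, n - 1):
        dst[i] = src[i] + alpha * (src[i - 1] - 2.0 * src[i] + src[i + 1])
    dst[0], dst[-1] = src[0], src[-1]  # fixed boundary values (illustrative choice)

src = [0.0, 0.0, 1.0, 0.0, 0.0]
dst = [0.0] * len(src)
for _ in range(2):
    step(src, dst)
    src, dst = dst, src  # swap roles instead of copying data
```

Because the kernel never writes to the field it reads from, cells can be updated in any order, which is what makes the sweep easy to vectorize and parallelize.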
Optimizations of Phase Field algorithm
Model Layer
• moving window technique
• simplifications due to special setup (e.g. analytical temperature gradient)
• shortcuts: some terms can be neglected in certain cell types
Algorithm Layer
• access patterns / stencils
• overlapping computation and communication
• eliminate common subexpressions
Hardware Layer
• SIMDification
• Memory layout (AoS vs. SoA)
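To make the AoS vs. SoA distinction concrete, the following sketch computes the linear memory offset of a field value under both layouts. The function names and the cell count are hypothetical; the slides do not state which layout was finally chosen.

```python
def aos_index(cell, comp, n_cells, n_comp):
    """Array of Structures: all components of one cell are adjacent in memory."""
    return cell * n_comp + comp

def soa_index(cell, comp, n_cells, n_comp):
    """Structure of Arrays: one component of all cells is adjacent in memory."""
    return comp * n_cells + cell

n_cells, n_comp = 8, 4  # 4 components, like the phase field in the slides
# Memory offsets touched when component 0 is read for four consecutive cells:
aos_offsets = [aos_index(c, 0, n_cells, n_comp) for c in range(4)]
soa_offsets = [soa_index(c, 0, n_cells, n_comp) for c in range(4)]
```

With SoA, the same component of neighboring cells sits at unit stride, which typically suits SIMD loads across cells; with AoS, all components of one cell share a cache line, which can suit kernels that process one cell at a time. Which is faster depends on the kernel.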
Single Core Optimization Results
• test system: one SuperMUC core (Intel Xeon E5-2680 8C)
• for both kernels: systematic performance engineering leads to 80x faster code
[Figure: bar chart of performance in MLUP/s (𝜙-kernel only, axis from 0 to 12), with separate bars for interface, solid and liquid cells. Variants, top to bottom: with shortcuts (cellwise branching); with staggered buffer; with T(z) optimization; with SIMD intrinsics, single cell; basic waLBerla implementation; general purpose C code. Annotated speedups: 6, 28, 59, 67, 101.]
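The metric used in the chart, MLUP/s (million lattice cell updates per second), and the derived speedup can be computed as follows. The timings in the example are hypothetical placeholders, not measurements from the talk.

```python
def mlups(cells, time_steps, seconds):
    """Million lattice cell updates per second."""
    return cells * time_steps / seconds / 1e6

def speedup(optimized_mlups, baseline_mlups):
    """Ratio of two throughput values measured on the same problem."""
    return optimized_mlups / baseline_mlups

# Hypothetical timings for a 100^3 block and 10 time steps:
baseline = mlups(cells=100**3, time_steps=10, seconds=100.0)
optimized = mlups(cells=100**3, time_steps=10, seconds=2.0)
ratio = speedup(optimized, baseline)
```

Reporting throughput per cell update makes runs with different domain sizes and step counts directly comparable.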
Algorithm Layer Optimization
Communication Overlap
communication of 𝜇 can be overlapped without kernel adaptations
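One way such an overlap can work is sketched below, under the assumption (made for illustration, not stated on the slide) that the kernel running during the exchange reads 𝜇 only at the cell itself and never at neighbors. A worker thread stands in for non-blocking MPI communication; all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def exchange_mu_ghosts(mu, left_value, right_value):
    """Stand-in for an MPI ghost-layer exchange: fills the two ghost cells of mu."""
    mu[0], mu[-1] = left_value, right_value

def phi_kernel(phi, mu):
    """Illustrative kernel: reads mu only at the cell itself, never at neighbors."""
    return [phi[i] + 0.5 * mu[i] for i in range(1, len(phi) - 1)]

phi = [0.0, 1.0, 2.0, 3.0, 0.0]   # one ghost cell at each end
mu = [0.0, 4.0, 5.0, 6.0, 0.0]

with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(exchange_mu_ghosts, mu, 7.0, 8.0)  # start communication
    new_interior = phi_kernel(phi, mu)                       # compute meanwhile
    pending.result()                                         # wait before mu's ghosts are read
```

Since the kernel never touches the ghost cells of `mu`, it needs no splitting into inner and outer parts. A kernel that does read neighbor values of the communicated field would have to be split so that only its border cells wait for the exchange.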
Scaling Results
[Figure panels: SuperMUC, JUQUEEN]
I/O and Postprocessing
Managing I/O
• I/O is necessary to store results (frequently) and for checkpointing (seldom)
• for highly parallel simulations, the output of results quickly becomes a bottleneck
• Example: storing one time step of a (2420 x 2420 x 1474) domain as a voxel file: 386 GB
• Solution: generate a surface mesh from the voxel data during the simulation, locally on each process, using a marching cubes algorithm
• one mesh for each phase boundary
• mesh size: < 10 MB
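The 386 GB figure for the voxel file can be reproduced with a short calculation. Two assumptions are made here that the slide does not state: each cell stores 6 double-precision values (4 components of 𝜙 plus 2 of 𝜇), and "GB" means 2^30 bytes.

```python
nx, ny, nz = 2420, 2420, 1474
values_per_cell = 4 + 2      # assumption: 4 phi components + 2 mu components
bytes_per_value = 8          # assumption: double precision

cells = nx * ny * nz
size_bytes = cells * values_per_cell * bytes_per_value
size_gib = size_bytes / 2**30   # assumption: GB counted as 2**30 bytes
```

Under these assumptions the domain has about 8.6 billion cells and the file size rounds to 386, which is consistent with the number on the slide.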
• the surface meshes are still resolved more finely than necessary: one triangle per interface cell
[Figure: local fine meshes generated by marching cubes are reduced to one coarse mesh on root]
• quadric edge reduce algorithm ( cglib )
• crucial: the mesh reduction step preserves boundary vertices
• hierarchical mesh coarsening and reduction during the simulation
• result: one coarse mesh with a size on the order of several MB
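The hierarchical scheme can be sketched as a pairwise tree reduction. In this illustration a mesh is represented only by its triangle count, and `coarsen` is a hypothetical placeholder for the actual simplification step; the reduction factor of 4 is an arbitrary assumption.

```python
def coarsen(triangles, factor=4):
    """Placeholder for mesh simplification: keeps roughly 1/factor of the triangles."""
    return max(1, triangles // factor)

def hierarchical_reduce(local_meshes):
    """Merge meshes pairwise, coarsening after each merge, until one mesh remains."""
    level = list(local_meshes)
    rounds = 0
    while len(level) > 1:
        merged = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            merged.append(coarsen(sum(pair)))  # stitch two neighbors, then simplify
        level = merged
        rounds += 1
    return level[0], rounds

# Hypothetical triangle counts of 8 process-local meshes:
final_size, rounds = hierarchical_reduce([1000] * 8)
```

The number of rounds grows logarithmically with the process count, and no process ever holds the full fine mesh. Stitching neighbors only works if the simplification keeps the vertices on block boundaries in place, which is why the slide calls that property crucial.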
Python Coupling
[Diagram: waLBerla python_coupling module, boost::python, libpython]
• extracting relevant data while the simulation is running
• direct, efficient array access via Python numpy package – data is shared, not copied
• using boost::python library to connect C++ code with Python
• further applications:
  • flexible configuration
  • model development: Matlab-like functionality available
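What "shared, not copied" means can be shown with Python's buffer protocol, the mechanism that also lets numpy wrap existing memory. The sketch uses only the standard library to stay self-contained; it does not use waLBerla's Python interface, and the `array` object merely stands in for field data owned by the C++ side.

```python
from array import array

# Stand-in for simulation data owned by the C++ side.
field = array("d", [0.0, 1.0, 2.0, 3.0])

view = memoryview(field)        # no copy: the view refers to the same memory
copy = array("d", field)        # explicit copy, for comparison

view[1] = 42.0                  # writing through the view changes the original
```

The write through `view` is visible in `field`, while `copy` keeps the old value. For large fields, avoiding the copy is what keeps in-situ analysis cheap.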
Simplify Workflow
Python Coupling
Method 1: Using Python from C++ (host language: C++)
Method 2: Using C++ from Python (host language: Python)
[Diagram labels: waLBerla python_coupling module, boost::python, libpython, Python interpreter, walberla.so]
Demo
Summary / Outlook
Summary
• an efficient phase-field algorithm is necessary to simulate certain physical effects (spirals)
• systematic performance engineering on several levels
• speedup by a factor of 80 compared to the original version
• parallel processing of output data during the simulation to reduce the result file size
• coupling to the Python scripting language for in-situ processing
Outlook
• GPU implementation
• coupling to Lattice Boltzmann Method
• improve discretization scheme (implicit method)
Thank you!
Questions?