Massively Parallel Phase Field Simulations using HPC Framework waLBerla
SIAM CSE 2015, March 15th 2015
Martin Bauer, Florian Schornbaum, Christian Godenschwager, Johannes Hötzer, Harald Köstler and Ulrich Rüde
Chair for System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Outline

• Motivation
• waLBerla Framework
• Phase Field Method
  • Overview
  • Optimizations
  • Performance Modelling
  • Managing I/O
• Summary and Outlook
Massively Parallel Phase-Field Simulations using HPC Framework waLBerlaMartin Bauer - Chair for System Simulation, FAU Erlangen-Nürnberg – March 15, 2015
Motivation

• large domain required to reduce boundary influence
• some physical patterns only occur in highly resolved simulations (e.g. spiral patterns)
• ⇒ simulate big domains in 3D
• an unoptimized, general-purpose phase field code from KIT is available
• goal: write an optimized, parallel version for this specific model
The waLBerla Framework
waLBerla Framework
• waLBerla: widely applicable Lattice Boltzmann from Erlangen
• HPC software framework, originally developed for CFD simulations with the Lattice Boltzmann Method (LBM)
• evolved into a general framework for algorithms on structured grids
• coupling with the in-house rigid body physics engine pe
Application examples: Vocal Fold Study (Florian Schornbaum), Fluid Structure Interaction (Simon Bogner), Free Surface Flow
Block Structured Grids
• structured grid
• domain is decomposed into blocks
• blocks are the container data structure for simulation data (lattice)
• blocks are the basic unit of load balancing
• distributed memory parallelization: MPI
• data exchange at borders between blocks via ghost layers
• support for overlapping communication and computation
• some advanced models (e.g. free surface) require more complex communication patterns
Hybrid Parallelization
(figure: ghost layer exchange between a sender and a receiver process)
(slightly more complicated for non-uniform domain decompositions, but the same general ideas still apply)
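The ghost-layer exchange can be sketched as a pack/unpack of boundary layers. This is a serial stand-in with the actual MPI send/receive omitted; `Block`, `packRightBoundary`, and `unpackLeftGhost` are illustrative names, not waLBerla's API:

```cpp
#include <vector>

// A block stores an (nx+2) x (ny+2) array: one ghost layer around
// an nx x ny interior region.
struct Block {
    int nx, ny;
    std::vector<double> data; // row-major, size (nx+2)*(ny+2)
    Block(int nx_, int ny_) : nx(nx_), ny(ny_), data((nx_ + 2) * (ny_ + 2), 0.0) {}
    double& at(int x, int y) { return data[y * (nx + 2) + x]; } // x,y in [0, nx+1]
};

// Sender side: pack the rightmost interior column into a send buffer
// (this buffer would be handed to MPI in the parallel version).
std::vector<double> packRightBoundary(Block& b) {
    std::vector<double> buf(b.ny);
    for (int y = 1; y <= b.ny; ++y)
        buf[y - 1] = b.at(b.nx, y);
    return buf;
}

// Receiver side: unpack the received buffer into the left ghost column.
void unpackLeftGhost(Block& b, const std::vector<double>& buf) {
    for (int y = 1; y <= b.ny; ++y)
        b.at(0, y) = buf[y - 1];
}
```

After the exchange, the receiver's ghost layer mirrors the sender's boundary, so stencil kernels can run on the interior without special boundary cases.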
Phase field in waLBerla
Phase field algorithm
• two lattices (fields):
  • phase field 𝜙 with 4 entries per cell
  • chemical potential 𝜇 with 2 entries per cell
• two time steps stored in “src” and “dst” fields
• spatial discretization: finite differences
• temporal discretization: explicit Euler method
per-cell cost of the two sweeps:
  sweep 1: 940 FLOP, 34 loads/stores
  sweep 2: 2214 FLOP, 168 loads/stores
Roofline Performance Model
performance data per cell:
  FLOPs: 3154
  loads/stores: 202
  loads from RAM: 101 (the rest is served from cache)
  code balance: 31.2 FLOP per double loaded from RAM

Sandy Bridge architecture:
  RAM bandwidth per core: 6.4 GB/s
  peak per core @ 2.7 GHz: 21.6 GFLOP/s
  machine balance: 25 FLOP/double

⇒ code balance > machine balance ⇒ compute bound
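The roofline argument can be reproduced in a few lines. Note that the plain ratio of the quoted peak (21.6 GFLOP/s) and bandwidth (6.4 GB/s) gives a machine balance of about 27 FLOP/double, close to the slide's 25; either way it lies below the kernel's code balance of 31.2:

```cpp
// Roofline check with the numbers from the slide. The machine balance is the
// FLOP budget available per double streamed from RAM; if the kernel's code
// balance exceeds it, the kernel is compute bound.
struct Roofline {
    double peakFlops;  // FLOP/s per core
    double bandwidth;  // bytes/s per core
    double machineBalance() const { return peakFlops / (bandwidth / 8.0); }
};

bool isComputeBound(const Roofline& m, double flopsPerCell, double ramDoublesPerCell) {
    const double codeBalance = flopsPerCell / ramDoublesPerCell; // FLOP per double
    return codeBalance > m.machineBalance();
}
```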
Optimizations of Phase Field algorithm
Optimization Roadmap
• single core optimizations
  • based on the results of the performance model
  • save floating point operations; pre-compute and store values where possible
  • presented here using the 𝜇-sweep as an example
• scaling
  • performance behavior of the parallelization
  • challenges related to input/output
• performance data presented for SuperMUC
Implementation in waLBerla
• starting point: general prototyping code
• new model-specific implementation in waLBerla
• performance-guided design:
  • no indirect or virtual calls
  • optimized traversal over the grid
Step 1: Replace / Remove expensive operations
• pre-compute common subexpressions
• fast inverse square root approximation: replaces division and sqrt with bit-level operations and adds/muls
• reduce the number of divisions using table lookups where possible
Gibbs Energy subterm pre-computation
• many quantities depend on local temperature only
• in this scenario temperature is a function of one coordinate: T = 𝑇(𝑧)
• these quantities can be computed once per (𝑥, 𝑦)-slice
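The slice-wise pre-computation can be sketched as follows; `temperature` and `expensiveTerm` are stand-ins for the actual temperature profile and Gibbs-energy subterms:

```cpp
#include <cmath>
#include <vector>

double temperature(int z) { return 300.0 + 0.5 * z; }   // T depends on z only
double expensiveTerm(double T) { return std::exp(-1000.0 / T); }

// Because T = T(z), any purely temperature-dependent term is constant on each
// (x, y)-slice: evaluate it once per z and reuse it nx*ny times.
std::vector<double> sweepWithSlicePrecompute(int nx, int ny, int nz) {
    std::vector<double> result(nx * ny * nz);
    for (int z = 0; z < nz; ++z) {
        const double term = expensiveTerm(temperature(z)); // once per slice
        for (int y = 0; y < ny; ++y)
            for (int x = 0; x < nx; ++x)
                result[(z * ny + y) * nx + x] = term;      // cheap reuse
    }
    return result;
}
```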
SIMD
• single instruction, multiple data (SIMD)
• architecture-specific instructions:
  • Intel: SSE, AVX, AVX2
  • Blue Gene: QPX
• modern compilers do auto-vectorization
• it is still often beneficial to write SIMD instructions explicitly via intrinsics
• problem: separate code for each architecture
• lightweight SIMD abstraction layer in waLBerla to write portable code
(AVX example: vaddpd adds four packed doubles, ymm0 ← ymm0 + ymm1, i.e. 𝑐ᵢ = 𝑎ᵢ + 𝑏ᵢ for i = 0…3)
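A minimal sketch of such an abstraction layer. Only a portable scalar fallback is spelled out; an AVX build would implement the same operator via `_mm256_add_pd`, a Blue Gene build via QPX intrinsics. Names are illustrative, not waLBerla's actual API:

```cpp
// One vector type with overloaded operators; the per-architecture mapping
// lives behind #ifdefs in the abstraction layer, so the numerical kernels
// stay architecture-agnostic.
struct Vec4d {
    double v[4];
    double operator[](int i) const { return v[i]; }
};

inline Vec4d operator+(const Vec4d& a, const Vec4d& b) {
    Vec4d c;                                   // scalar fallback
    for (int i = 0; i < 4; ++i) c.v[i] = a.v[i] + b.v[i];
    return c;
}
```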
Buffering of staggered values

• to calculate the divergence, values at staggered grid positions are required
• these pre-computed values can be buffered
• trade-off: more loads and stores, fewer floating point operations
• the same technique can also be applied in the 𝜙-sweep
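The buffering idea in 1D; `flux` is a stand-in for the expensive staggered-point computation:

```cpp
#include <cmath>
#include <vector>

// Expensive quantity at a staggered position between two cells.
double flux(double ul, double ur) { return std::sqrt(0.5 * (ul * ul + ur * ur)); }

// The divergence at cell i needs the fluxes at i-1/2 and i+1/2. Each staggered
// value is shared by two neighbouring cells, so it is computed once and kept
// in a buffer, halving the flux evaluations at the price of extra loads/stores.
std::vector<double> divergenceBuffered(const std::vector<double>& u) {
    const int n = static_cast<int>(u.size());
    std::vector<double> div(n, 0.0);
    double left = flux(u[0], u[0]);            // flux at the boundary face
    for (int i = 0; i < n; ++i) {
        const double right = (i + 1 < n) ? flux(u[i], u[i + 1]) : flux(u[i], u[i]);
        div[i] = right - left;                 // each flux computed exactly once
        left = right;                          // buffer it for the next cell
    }
    return div;
}
```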
⇒ 80× faster than the original version
Intranode Scaling
(chart: intranode weak scaling on SuperMUC)
Single Node Optimization Summary
Single node optimizations:
• replace/remove expensive operations like square roots and divisions
• pre-compute and buffer values where possible
• SIMD intrinsics

Percent of peak performance on SuperMUC:
  𝜙-sweep: 21 %
  𝜇-sweep: 27 %
  complete program: 25 %

Why not 100 % of peak?
• unbalanced number of multiplications and additions
• divisions are counted as 1 FLOP but cost roughly 43 times as much as a multiplication or addition
Scaling
• scaling on SuperMUC up to 32,768 cores
• ghost layer based communication
• communication hiding
Managing I/O
• I/O is necessary to store results (frequently) and for checkpointing (seldom)
• for highly parallel simulations, the output of results quickly becomes a bottleneck
• example: storing one time step of a 940 × 940 × 2080 domain takes 87 GB
• solution: generate a surface mesh from the voxel data during the simulation, locally on each process, using a marching cubes algorithm
• one mesh for each phase boundary
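A back-of-the-envelope check of the 87 GB figure, under the assumption of 6 doubles per cell (4 phase-field plus 2 chemical-potential entries) and no ghost layers; the raw product lands at about 88 GB, close to the quoted number, with the small gap plausibly explained by the output format:

```cpp
#include <cstdint>

// Raw size of one stored time step: cells * values-per-cell * sizeof(double).
std::uint64_t rawOutputBytes(std::uint64_t nx, std::uint64_t ny, std::uint64_t nz,
                             std::uint64_t valuesPerCell) {
    return nx * ny * nz * valuesPerCell * 8;   // 8 bytes per double
}
```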
• the surface meshes are still unnecessarily finely resolved: one triangle per interface cell
(figure: local fine meshes generated by marching cubes are reduced to one coarse mesh on the root process)

• quadric-based edge reduction algorithm (cglib)
• crucial: the mesh reduction step preserves boundary vertices
• hierarchical mesh coarsening and reduction during the simulation
• result: one coarse mesh with a size on the order of several MB
Summary and Outlook

Summary:
• an efficient phase field algorithm is necessary to simulate certain physical effects (spiral patterns)
• systematic performance engineering on several levels
• speedup by a factor of 80 compared to the original version
• around 25 % of peak performance reached on SuperMUC
• parallel output data processing during the simulation to reduce result file size

Outlook:
• GPU implementation
• coupling to the Lattice Boltzmann Method
• improved discretization scheme (implicit method)
Thank you!
Questions?