IPCC @ RWTH Aachen University
Optimization of multibody and long-range solvers in LAMMPS
Rodrigo Canales, William McDoniel, Markus Hohnerbach, Ahmed E. Ismail, Paolo Bientinesi
IPCC Showcase – November 2016
Team
RWTH
Prof. Paolo Bientinesi, Rodrigo Canales, William McDoniel, Markus Hohnerbach, Prof. Ahmed Ismail
Intel
Georg Zitzlsberger, Klaus-Dieter Oertel, Michael W. Brown
Introduction
2015
I May: Kickoff – IPCC @ RWTH Aachen, optimizing LAMMPS kernels
I Oct.: First results on Xeon & KNC, @ EMEA IPCC
2016
I Feb.: Showcase 1st year
I March: First results on KNL, @ IPCC & IXPUG Forum
I May: KNL Access
I Nov.: Showcase
2017
I May: End 2nd year
Agenda
I Intro to MD, LAMMPS
I Achievements 1st year
I Goals & Progress 2nd year
I AIREBO
I REBO
I PPPM Electrostatics
I PPPM Dispersion
I Future Projects
LAMMPS
Large-scale Atomic/Molecular Massively Parallel Simulator
I Sandia National Labs: http://lammps.sandia.gov
I Widely used open-source MD code
I Support for OpenMP, Xeon Phi, and GPU (CUDA and OpenCL)
Molecular Dynamics
I Many-particle systems
I Computes interactions between pairs of atoms
Φ_LJ = 4ε [ (σ/r_ij)^12 − (σ/r_ij)^6 ]
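As a quick illustration, a minimal C++ sketch of evaluating this pair energy for one atom pair; the function and parameter names are ours, not the LAMMPS API:

#include <cmath>

// Lennard-Jones pair energy for one pair at distance r_ij.
// epsilon and sigma are the model parameters from the formula above.
double phi_lj(double r_ij, double epsilon, double sigma) {
    const double sr6 = std::pow(sigma / r_ij, 6);  // (σ/r_ij)^6
    return 4.0 * epsilon * (sr6 * sr6 - sr6);      // 4ε[(σ/r)^12 − (σ/r)^6]
}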
First Year
I Pair Potentials
I KNL Ready
Buckingham: KNC vs. KNL - Full Node
[Figure: tau/day on a full node, KNC vs. KNL; series: Default Mode, HBM (via numactl)]
Tersoff: KNC
        Thread   Core     Full
Cores   1        1        60
SMT     1        4        4
Atoms   32,000   32,000   512,000
Measurements in 1000 atom-ns/day/core; SMT minimizes runtime.
[Figure: atom-ns/day/core for the Thread, Core, and Full configurations; series: Ref, Double, Single, Mixed]
Tersoff: KNL
        Thread   Core     Full
Cores   1        1        64
SMT     1        4        4
Atoms   32,000   32,000   512,000
HBM     Yes      Yes      Yes
Measurements in 1000 atom-ns/day/core; SMT minimizes runtime.
[Figure: atom-ns/day/core for the Thread, Core, and Full configurations; series: Ref, Double, Single, Mixed]
The Vectorization of the Tersoff Many-Body Potential: An Exercise in Performance Portability
I Initial work: workshop on MD simulation software @ SC’15
I Full portability across existing Intel archs
I Focus on vector operation wrapper
I Submitted to SC’16 technical program
I Additional architectures
I KNL results
I KNL measurements via Mike (Thanks!)
I For submission: NDA waiver
I Accepted
I Best Student Paper Finalist
I (Maybe) part of replication initiative SC’17
Second Year (After Q2)
I Multi-body Potentials
I Long Range Interactions
Multi-body Potential: REBO
I Similar to Tersoff
I Applicable to hydrocarbons
I Improves Tersoff through additional terms
I Additional neighbor finding routines needed by REBO (Ready)
I Vectorized/Optimized code for KNC/KNL (Ready)
I Optimized code for CPU, same approach as Tersoff (Ready)
I Vectorized/Optimized code for CPU (In Progress)
I Offloading Performance (In Progress)
I Speedup KNL: ca. 2.5x total, and ca. 3x on kernel
I Bottleneck: Neighbor Lists
REBO Results – KNL
[Figure: time (s) for Ref vs. Opt, broken down into Pair and Neigh]
AIREBO
I Based on REBO
I Two additional terms: Torsion and Lennard-Jones
I Torsion: Easy to vectorize (Ready)
I Lennard-Jones: Hard to vectorize (In Progress)
I Search through neighbor list and branch
I Idea: Separate expensive and cheap cases (sketched below)
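A minimal sketch of that separation, assuming a plain neighbor array and a hypothetical predicate is_expensive() standing in for the actual distance/bond-order test:

#include <vector>

// Partition a neighbor list so that the common, cheap interactions can
// run in a branch-free (vectorizable) loop, while the rare, expensive
// ones fall back to the scalar code path. All names are illustrative.
void split_neighbors(const std::vector<int>& neigh,
                     bool (*is_expensive)(int),
                     std::vector<int>& cheap,
                     std::vector<int>& expensive) {
    cheap.clear();
    expensive.clear();
    for (int j : neigh)
        (is_expensive(j) ? expensive : cheap).push_back(j);
}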
Long Range Interactions: PPPM
I Cutoff distances make pair potential calculations feasible
[Figure: pair potential V/ε vs. r/σ, with repulsive and attractive regions and the cutoff r_c marked]
I Long-range calculations can still be important:
I Electrostatics
I Interfaces
I Particle-Particle Particle-Mesh (PPPM) approximates long-range forces without requiring pair-wise calculations
PPPM
Four Steps:
1. Determine the charge distribution ρ by mapping particle charges to a grid
2. Take the Fourier transform of the charge distribution to find the potential (see the sketch after this list):
∇²Φ = −ρ/ε₀
3. Obtain forces due to all interactions by inverse Fourier transform:
F⃗ = −∇Φ
4. Map forces back to the particles
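Steps 2 and 3 reduce to a division per Fourier mode: ∇²Φ = −ρ/ε₀ becomes Φ̂(k) = ρ̂(k)/(ε₀|k|²). A minimal C++ sketch, assuming the gridded charge has already been transformed (rho_hat) and |k|² is precomputed per grid point (ksq); the names are illustrative, not the LAMMPS data structures:

#include <complex>
#include <cstddef>
#include <vector>

// Solve the Poisson equation mode by mode in k-space:
// Phi_hat(k) = rho_hat(k) / (eps0 * |k|^2), skipping the k = 0 mode.
std::vector<std::complex<double>>
poisson_kspace(const std::vector<std::complex<double>>& rho_hat,
               const std::vector<double>& ksq, double eps0) {
    std::vector<std::complex<double>> phi_hat(rho_hat.size());
    for (std::size_t n = 0; n < rho_hat.size(); ++n)
        phi_hat[n] = (ksq[n] > 0.0) ? rho_hat[n] / (eps0 * ksq[n])
                                    : std::complex<double>(0.0);
    return phi_hat;
}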
PPPM: Charge Mapping
I Threading across atoms instead of across grid points
I Look-up table for stencil coefficients
I Vectorized inner stencil loop (see the sketch below)
I Larger stencil takes advantage of KNL vector length
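A sketch of one atom's deposit with precomputed stencil weights (the look-up table) and a contiguous, dependency-free inner loop; the layout and names are illustrative, and the thread-safety needed when threading across atoms (private grids or coloring) is omitted:

// Deposit charge q of one atom onto a P×P×P stencil of the grid.
// grid is a flattened nz*ny*nx array, (ix, iy, iz) the stencil origin,
// wx/wy/wz the per-atom weights from the look-up table.
void deposit_charge(double* grid, int nx, int ny,
                    int ix, int iy, int iz, double q,
                    const double* wx, const double* wy, const double* wz,
                    int P) {
    for (int k = 0; k < P; ++k) {
        for (int j = 0; j < P; ++j) {
            const double qw = q * wz[k] * wy[j];
            double* row = grid + (static_cast<long>(iz + k) * ny + (iy + j)) * nx + ix;
            // Contiguous in x and free of dependencies, so it vectorizes;
            // a larger P fills the KNL vector length.
            for (int i = 0; i < P; ++i)
                row[i] += qw * wx[i];
        }
    }
}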
PPPM: Charge Mapping on KNL 1c/1t
[Figure: time per atom (s) vs. problem size (40.5k–324k atoms); series: baseline, new threading, vectorized]
PPPM: Distributing Forces
I Look-up table for stencil coefficients
I Vectorized inner stencil loop
I Larger stencil takes advantage of KNL vector length
I Repack force data to get multiple components simultaneously (sketched below)
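A matching sketch of the gather, assuming the three field components are repacked into one interleaved array (three doubles per grid point) so a single stencil pass collects all of them; names and layout are again illustrative:

// Interpolate the force on one atom from an interleaved field grid
// (Ex, Ey, Ez per point), using the same precomputed stencil weights.
void gather_force(const double* field, int nx, int ny,
                  int ix, int iy, int iz, double q,
                  const double* wx, const double* wy, const double* wz,
                  int P, double f[3]) {
    f[0] = f[1] = f[2] = 0.0;
    for (int k = 0; k < P; ++k) {
        for (int j = 0; j < P; ++j) {
            const double w = wz[k] * wy[j];
            const double* p =
                field + 3 * ((static_cast<long>(iz + k) * ny + (iy + j)) * nx + ix);
            for (int i = 0; i < P; ++i) {  // vectorizable inner stencil loop
                const double wi = w * wx[i];
                f[0] += wi * p[3 * i + 0];
                f[1] += wi * p[3 * i + 1];
                f[2] += wi * p[3 * i + 2];
            }
        }
    }
    for (int d = 0; d < 3; ++d) f[d] *= q;
}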
PPPM: Distributing Forces on KNL 1c/1t
[Figure: time per atom (s) vs. problem size (40.5k–324k atoms); series: baseline, atom loop vectorized, inner stencil vectorized, repacking]
PPPM: FFTs
I Larger stencil allows coarser grid while preserving accuracy
I Reduce communication by doing 2D → remap → 1D → remap instead of 1D → remap → 1D → remap → 1D → remap (see the sketch below)
I Vectorization elsewhere makes ad differentiation relatively more appealing than ik differentiation – half as many FFTs in exchange for more work in stencil loops
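A sketch of the batched 2D stage with FFTW, assuming each rank owns nz complete xy-planes of a complex nx × ny × nz grid stored plane by plane (x fastest); after the single remap into z-pencils, a batched 1D plan along z completes the 3D transform:

#include <fftw3.h>

// Plan one batched 2D FFT over all local xy-planes. This covers two of
// the three 1D sweeps at once, so only one remap remains.
fftw_plan plan_xy_planes(fftw_complex* data, int nx, int ny, int nz) {
    int n[2] = { ny, nx };                   // dimensions of one plane
    return fftw_plan_many_dft(2, n, nz,      // nz planes in one batch
                              data, nullptr, 1, nx * ny,
                              data, nullptr, 1, nx * ny,
                              FFTW_FORWARD, FFTW_ESTIMATE);
}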
Water Benchmark – KNL 1c/1t
[Figure: two panels, time (s) vs. cutoff (4–8 angstroms), broken down into PPPM non-FFT, PPPM FFT, Pair, and Other. Left: Reference. Right: Optimized, with per-cutoff speedups of 3.5X, 3.4X, 3.1X, 3.0X, and 2.4X]
Water Benchmark – KNL 64c/1t
[Figure: two panels, time (s) vs. cutoff (4–9 angstroms), broken down into PPPM non-FFT, PPPM FFT, Pair, and Other. Left: Reference. Right: Optimized, with per-cutoff speedups of 2.8X, 2.9X, 2.5X, 2.3X, 1.9X, and 1.5X]
PPPM Dispersion
Similar particle mapping concept, but with two potentials:
I Electrostatics ∼ 1/r
I Dispersion interactions ∼ 1/r^6
I Optimize compatible pair potentials (Ready)
I Buckingham (buck/long/coul/long)
I Lennard-Jones (lj/long/coul/long)
I Optimize PPPM-dispersion solver (In Progress)
Buckingham – Dispersion
I SiO₂ model, 19,200 atoms – Coulomb and Buckingham potentials
I KNL Cache mode
I Reference: USER-OMP
[Figure: speedup on KNL for 1 thread, 2 threads, and 64 × 2 threads; series: ref, double, single, mixed]
PPPM Dispersion: Components
Having multiple types of forces requires different mixing rules:
I Equivalent routines operate on different stencils
I 2 versions of particle mapping
I 4 versions of charge density
I 12 versions of force distribution & Poisson solver
I We use templates, optimizing only once (see the sketch below)
I Minimize control structures
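A minimal sketch of the template idea with an illustrative mixing-rule parameter; the enum and function names are ours, not the actual LAMMPS types:

#include <cmath>

enum class Mixing { Geometric, Arithmetic };

// The mixing rule is a compile-time parameter: the kernel is written and
// optimized once, and each required variant is a separate instantiation
// with no runtime branch left in the inner loop.
template <Mixing M>
inline double mix(double a, double b) {
    return (M == Mixing::Geometric) ? std::sqrt(a * b) : 0.5 * (a + b);
}

template <Mixing M>
void charge_density(const double* ci, const double* cj, double* rho, int n) {
    for (int k = 0; k < n; ++k)     // vectorizable: M is a constant here
        rho[k] = mix<M>(ci[k], cj[k]);
}

// One instantiation per variant, e.g.:
// charge_density<Mixing::Geometric>(ci, cj, rho, n);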
PPPM Dispersion: Results
I Optimized charge & particle mapping, double precision + single-precision FFTs
I Between 1.4X and 1.6X speedup on K-space
I Potential speedups for Poisson & force
[Figure: time (ms) for Base vs. Opt, broken down into Poisson, force dist., part. mapping, and charge density]
Code Availability
Code                           GitHub¹   LAMMPS
Tersoff                        X         X
Buckingham                     X         X
Buckingham Coul Long           X         X
Buckingham Long Coul Long      X         . . .
Lennard-Jones Long Coul Long   X         . . .
PPPM                           . . .     . . .
PPPM Dispersion                X         . . .
REBO                           . . .     . . .
AIREBO                         . . .     . . .
¹ Our group’s repositories are at github.com/HPAC.
Dissemination and Community Involvement
I SIAM CSE 2017, Atlanta: MD Exascale Mini-Symposium
I Bientinesi (Aachen), McDoniel (Aachen), Tchipev (München)
I ISC’17 Paper
I SC’16 Technical Program Talk
I IPCC Meeting Toulouse Talk
I Paper for IXPUG Workshop @ ISC’16: “Dynamic SIMD Lane Scheduling”
I Krzikalla (Dresden), Wende (Berlin), Hohnerbach (Aachen)
I ISC’16 Booth and IPCC Meeting Talks
I Parallel’16 Talk
I Krzikalla (Dresden), Hohnerbach (Aachen)
I IPCC Meeting Ostrava Talk
I SC’15 Workshop Talk
I IPCC Meeting München Code Dungeon
Other activities & future work
Other research activities
I Tensor operations
I Tensor transposition, summations, contractions
I Applications from Chemistry and Machine Learning
I Collaboration with IPCC UT Austin
I BLAS
I Idea: CPU + stream to Phi
I MKL – limited functionality
I Application in Density Functional Theory
I Initial results: 1610 vs. 1350 GFLOP/s (MKL)
LAMMPS
I Continue collaboration with Mike Brown
I Additional Long-Range Solvers
I Multi-Level Summation (MSM): O(n) algorithm
I 2/3 of routines similar to PPPM (particle to grid and back)
I 1/3: stencil application (research topic)
I MSM dispersion solver developed by our group
I Gaussian split Ewald: mesh-based real/frequency space
I Might provide better accuracy than MSM
I First implementation in LAMMPS
I Extend KOKKOS to enable vector classes (avoid GPU bias)
DSMC
I Particle-based method for rarefied gases
I Similar to molecular dynamics and LAMMPS
Buckingham: vectorization, single thread
[Figure: speedup on KNL (1 thread) for buck/cut, buck/coul/cut, and buck/cut/coul/long; series: Double, Single, Mixed]
Buckingham: vectorization, full node
[Figure: speedup on KNL (MPI + multithread) for buck/cut, buck/coul/cut, and buck/cut/coul/long; series: Double, Single, Mixed]
Part 2: Tersoff Multibody Potential – Markus Hohnerbach
for i in local atoms of the current thread do
    for j in atoms neighboring i do
        ζ_ij ← 0
        for k in atoms neighboring i do
            ζ_ij ← ζ_ij + ζ(i, j, k)
        E ← E + V(i, j, ζ_ij)
        F_i ← F_i − ∂x_i V(i, j, ζ_ij)
        F_j ← F_j − ∂x_j V(i, j, ζ_ij)
        δζ ← ∂ζ V(i, j, ζ_ij)
        for k in atoms neighboring i do
            F_i ← F_i − δζ · ∂x_i ζ(i, j, k)
            F_j ← F_j − δζ · ∂x_j ζ(i, j, k)
            F_k ← F_k − δζ · ∂x_k ζ(i, j, k)