IPCC @ RWTH Aachen University
Optimization of multibody and long-range solvers in LAMMPS
Rodrigo Canales, William McDoniel, Markus Hohnerbach, Ahmed E. Ismail, Paolo Bientinesi
IPCC Showcase – November 2016
Team
RWTH
Prof. Paolo Bientinesi, Rodrigo Canales, William McDoniel, Markus Hohnerbach, Prof. Ahmed Ismail
Intel
Georg Zitzlsberger, Klaus-Dieter Oertel, Michael W. Brown
Introduction
2015
I May: Kickoff – IPCC @ RWTH Aachen, optimizing LAMMPS kernels
I Oct.: First results on Xeon & KNC, @ EMEA IPCC
2016
I Feb.: Showcase 1st year
I March: First results on KNL, @ IPCC & IXPUG Forum
I May: KNL Access
I Nov.: Showcase
2017
I May: End 2nd year
Agenda
I Intro to MD, LAMMPS
I Achievements 1st year
I Goals & Progress 2nd year
I AIREBO
I REBO
I PPPM Electrostatics
I PPPM Dispersion
I Future Projects
LAMMPS
Large-scale Atomic/Molecular Massively Parallel Simulator
I Sandia National Labs: http://lammps.sandia.gov
I Widely used open-source MD code
I Support for OpenMP, Xeon Phi, and GPU (CUDA and OpenCL)
Molecular Dynamics
I Many-particle systems
I Computes interactions between pairs of atoms
Φ_LJ = 4ε [ (σ/r_ij)^12 − (σ/r_ij)^6 ]
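As a quick illustration, a minimal C++ sketch of evaluating this pair energy for one atom pair; the function and parameter names are ours, not the LAMMPS API:

#include <cmath>

// Lennard-Jones pair energy for one pair at distance r_ij.
// epsilon and sigma are the model parameters from the formula above.
double phi_lj(double r_ij, double epsilon, double sigma) {
    const double sr6 = std::pow(sigma / r_ij, 6);  // (σ/r_ij)^6
    return 4.0 * epsilon * (sr6 * sr6 - sr6);      // 4ε[(σ/r)^12 − (σ/r)^6]
}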
First Year
I Pair Potentials
I KNL Ready
Buckingham: KNC vs. KNL - Full Node
[Figure: tau/day on a full node, KNC vs. KNL; series: Default Mode, HBM (via numactl)]
Tersoff: KNC
        Thread   Core     Full
Cores   1        1        60
SMT     1        4        4
Atoms   32,000   32,000   512,000
Measurements in 1000 atom-ns/day/core; SMT minimizes runtime.
[Figure: atom-ns/day/core for the Thread, Core, and Full configurations; series: Ref, Double, Single, Mixed]
Tersoff: KNL
        Thread   Core     Full
Cores   1        1        64
SMT     1        4        4
Atoms   32,000   32,000   512,000
HBM     Yes      Yes      Yes
Measurements in 1000 atom-ns/day/core; SMT minimizes runtime.
[Figure: atom-ns/day/core for the Thread, Core, and Full configurations; series: Ref, Double, Single, Mixed]
The Vectorization of the Tersoff Many-Body Potential: An Exercise in Performance Portability
I Initial work: workshop on MD simulation software @ SC’15
I Full portability across existing Intel archs
I Focus on vector operation wrapper
I Submitted to SC’16 technical program
I Additional architectures
I KNL results
I KNL measurements via Mike (Thanks!)
I For submission: NDA waiver
I Accepted
I Best Student Paper Finalist
I (Maybe) part of replication initiative SC’17
Second Year (After Q2)
I Multi-body Potentials
I Long Range Interactions
Multi-body Potential: REBO
I Similar to Tersoff
I Applicable to hydrocarbons
I Improves Tersoff through additional terms
I Additional neighbor finding routines needed by REBO (Ready)
I Vectorized/Optimized code for KNC/KNL (Ready)
I Optimized code for CPU, same approach as Tersoff (Ready)
I Vectorized/Optimized code for CPU (In Progress)
I Offloading Performance (In Progress)
I Speedup KNL: ca. 2.5x total, and ca. 3x on kernel
I Bottleneck: Neighbor Lists
REBO Results – KNL
[Figure: time (s) for Ref vs. Opt, broken down into Pair and Neigh]
AIREBO
I Based on REBO
I Two additional terms: Torsion and Lennard-Jones
I Torsion: Easy to vectorize (Ready)
I Lennard-Jones: Hard to vectorize (In Progress)
I Search through neighbor list and branch
I Idea: Separate expensive and cheap cases (sketched below)
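A minimal sketch of that separation, assuming a plain neighbor array and a hypothetical predicate is_expensive() standing in for the actual distance/bond-order test:

#include <vector>

// Partition a neighbor list so that the common, cheap interactions can
// run in a branch-free (vectorizable) loop, while the rare, expensive
// ones fall back to the scalar code path. All names are illustrative.
void split_neighbors(const std::vector<int>& neigh,
                     bool (*is_expensive)(int),
                     std::vector<int>& cheap,
                     std::vector<int>& expensive) {
    cheap.clear();
    expensive.clear();
    for (int j : neigh)
        (is_expensive(j) ? expensive : cheap).push_back(j);
}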
Long Range Interactions: PPPM
I Cutoff distances make pair potential calculations feasible
[Figure: pair potential V/ε vs. r/σ, with repulsive and attractive regions and the cutoff r_c marked]
I Long-range calculations can still be important:
I Electrostatics
I Interfaces
I Particle-Particle Particle-Mesh (PPPM) approximates long-range forces without requiring pair-wise calculations
PPPM
Four Steps:
1. Determine the charge distribution ρ by mapping particle charges to a grid
2. Take the Fourier transform of the charge distribution to find the potential (see the sketch after this list):
∇²Φ = −ρ/ε₀
3. Obtain forces due to all interactions by inverse Fourier transform:
F⃗ = −∇Φ
4. Map forces back to the particles
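Steps 2 and 3 reduce to a division per Fourier mode: ∇²Φ = −ρ/ε₀ becomes Φ̂(k) = ρ̂(k)/(ε₀|k|²). A minimal C++ sketch, assuming the gridded charge has already been transformed (rho_hat) and |k|² is precomputed per grid point (ksq); the names are illustrative, not the LAMMPS data structures:

#include <complex>
#include <cstddef>
#include <vector>

// Solve the Poisson equation mode by mode in k-space:
// Phi_hat(k) = rho_hat(k) / (eps0 * |k|^2), skipping the k = 0 mode.
std::vector<std::complex<double>>
poisson_kspace(const std::vector<std::complex<double>>& rho_hat,
               const std::vector<double>& ksq, double eps0) {
    std::vector<std::complex<double>> phi_hat(rho_hat.size());
    for (std::size_t n = 0; n < rho_hat.size(); ++n)
        phi_hat[n] = (ksq[n] > 0.0) ? rho_hat[n] / (eps0 * ksq[n])
                                    : std::complex<double>(0.0);
    return phi_hat;
}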
PPPM: Charge Mapping
I Threading across atoms instead of across grid points
I Look-up table for stencil coefficients
I Vectorized inner stencil loop (see the sketch below)
I Larger stencil takes advantage of KNL vector length
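A sketch of one atom's deposit with precomputed stencil weights (the look-up table) and a contiguous, dependency-free inner loop; the layout and names are illustrative, and the thread-safety needed when threading across atoms (private grids or coloring) is omitted:

// Deposit charge q of one atom onto a P×P×P stencil of the grid.
// grid is a flattened nz*ny*nx array, (ix, iy, iz) the stencil origin,
// wx/wy/wz the per-atom weights from the look-up table.
void deposit_charge(double* grid, int nx, int ny,
                    int ix, int iy, int iz, double q,
                    const double* wx, const double* wy, const double* wz,
                    int P) {
    for (int k = 0; k < P; ++k) {
        for (int j = 0; j < P; ++j) {
            const double qw = q * wz[k] * wy[j];
            double* row = grid + (static_cast<long>(iz + k) * ny + (iy + j)) * nx + ix;
            // Contiguous in x and free of dependencies, so it vectorizes;
            // a larger P fills the KNL vector length.
            for (int i = 0; i < P; ++i)
                row[i] += qw * wx[i];
        }
    }
}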
PPPM: Charge Mapping on KNL 1c/1t
[Figure: time per atom (s) vs. problem size (40.5k–324k atoms); series: baseline, new threading, vectorized]
PPPM: Distributing Forces
I Look-up table for stencil coefficients
I Vectorized inner stencil loop
I Larger stencil takes advantage of KNL vector length
I Repack force data to get multiple components simultaneously (sketched below)
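A matching sketch of the gather, assuming the three field components are repacked into one interleaved array (three doubles per grid point) so a single stencil pass collects all of them; names and layout are again illustrative:

// Interpolate the force on one atom from an interleaved field grid
// (Ex, Ey, Ez per point), using the same precomputed stencil weights.
void gather_force(const double* field, int nx, int ny,
                  int ix, int iy, int iz, double q,
                  const double* wx, const double* wy, const double* wz,
                  int P, double f[3]) {
    f[0] = f[1] = f[2] = 0.0;
    for (int k = 0; k < P; ++k) {
        for (int j = 0; j < P; ++j) {
            const double w = wz[k] * wy[j];
            const double* p =
                field + 3 * ((static_cast<long>(iz + k) * ny + (iy + j)) * nx + ix);
            for (int i = 0; i < P; ++i) {  // vectorizable inner stencil loop
                const double wi = w * wx[i];
                f[0] += wi * p[3 * i + 0];
                f[1] += wi * p[3 * i + 1];
                f[2] += wi * p[3 * i + 2];
            }
        }
    }
    for (int d = 0; d < 3; ++d) f[d] *= q;
}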
PPPM: Distributing Forces on KNL 1c/1t
[Figure: time per atom (s) vs. problem size (40.5k–324k atoms); series: baseline, atom loop vectorized, inner stencil vectorized, repacking]
PPPM: FFTs
I Larger stencil allows coarser grid while preserving accuracy
I Reduce communication by doing 2D → remap → 1D → remap instead of 1D → remap → 1D → remap → 1D → remap (see the sketch below)
I Vectorization elsewhere makes ad differentiation relatively more appealing than ik differentiation – half as many FFTs in exchange for more work in stencil loops
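A sketch of the batched 2D stage with FFTW, assuming each rank owns nz complete xy-planes of a complex nx × ny × nz grid stored plane by plane (x fastest); after the single remap into z-pencils, a batched 1D plan along z completes the 3D transform:

#include <fftw3.h>

// Plan one batched 2D FFT over all local xy-planes. This covers two of
// the three 1D sweeps at once, so only one remap remains.
fftw_plan plan_xy_planes(fftw_complex* data, int nx, int ny, int nz) {
    int n[2] = { ny, nx };                   // dimensions of one plane
    return fftw_plan_many_dft(2, n, nz,      // nz planes in one batch
                              data, nullptr, 1, nx * ny,
                              data, nullptr, 1, nx * ny,
                              FFTW_FORWARD, FFTW_ESTIMATE);
}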
Water Benchmark – KNL 1c/1t
[Figure: two panels, time (s) vs. cutoff (4–8 angstroms), broken down into PPPM non-FFT, PPPM FFT, Pair, and Other. Left: Reference. Right: Optimized, with per-cutoff speedups of 3.5X, 3.4X, 3.1X, 3.0X, and 2.4X]
Water Benchmark – KNL 64c/1t
[Figure: two panels, time (s) vs. cutoff (4–9 angstroms), broken down into PPPM non-FFT, PPPM FFT, Pair, and Other. Left: Reference. Right: Optimized, with per-cutoff speedups of 2.8X, 2.9X, 2.5X, 2.3X, 1.9X, and 1.5X]
PPPM Dispersion
Similar particle mapping concept, but with two potentials:
I Electrostatics ∼ 1/r
I Dispersion interactions ∼ 1/r^6
I Optimize compatible pair potentials (Ready)
I Buckingham (buck/long/coul/long)
I Lennard-Jones (lj/long/coul/long)
I Optimize PPPM-dispersion solver (In Progress)
Buckingham – Dispersion
I SiO₂ model, 19,200 atoms – Coulomb and Buckingham potentials
I KNL Cache mode
I Reference: USER-OMP
[Figure: speedup on KNL for 1 thread, 2 threads, and 64 × 2 threads; series: ref, double, single, mixed]
PPPM Dispersion: Components
Having multiple types of forces requires different mixing rules:
I Equivalent routines operate on different stencils
I 2 versions of particle mapping
I 4 versions of charge density
I 12 versions of force distribution & Poisson solver
I We use templates, optimizing only once (see the sketch below)
I Minimize control structures
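A minimal sketch of the template idea with an illustrative mixing-rule parameter; the enum and function names are ours, not the actual LAMMPS types:

#include <cmath>

enum class Mixing { Geometric, Arithmetic };

// The mixing rule is a compile-time parameter: the kernel is written and
// optimized once, and each required variant is a separate instantiation
// with no runtime branch left in the inner loop.
template <Mixing M>
inline double mix(double a, double b) {
    return (M == Mixing::Geometric) ? std::sqrt(a * b) : 0.5 * (a + b);
}

template <Mixing M>
void charge_density(const double* ci, const double* cj, double* rho, int n) {
    for (int k = 0; k < n; ++k)     // vectorizable: M is a constant here
        rho[k] = mix<M>(ci[k], cj[k]);
}

// One instantiation per variant, e.g.:
// charge_density<Mixing::Geometric>(ci, cj, rho, n);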
PPPM Dispersion: Results
I Optimized charge & particle mapping, double precision + single-precision FFTs
I Between 1.4X and 1.6X speedup on K-space
I Potential speedups for Poisson & force
[Figure: time (ms) for Base vs. Opt, broken down into Poisson, force dist., part. mapping, and charge density]
Code Availability
Code                           GitHub¹   LAMMPS
Tersoff                        X         X
Buckingham                     X         X
Buckingham Coul Long           X         X
Buckingham Long Coul Long      X         . . .
Lennard-Jones Long Coul Long   X         . . .
PPPM                           . . .     . . .
PPPM Dispersion                X         . . .
REBO                           . . .     . . .
AIREBO                         . . .     . . .
¹ Our group’s repositories are at github.com/HPAC.
Dissemination and Community Involvement
I SIAM CSE 2017, Atlanta: MD Exascale Mini-Symposium
I Bientinesi (Aachen), McDoniel (Aachen), Tchipev (München)
I ISC’17 Paper
I SC’16 Technical Program Talk
I IPCC Meeting Toulouse Talk
I Paper for IXPUG Workshop @ ISC’16: “Dynamic SIMD Lane Scheduling”
I Krzikalla (Dresden), Wende (Berlin), Hohnerbach (Aachen)
I ISC’16 Booth and IPCC Meeting Talks
I Parallel’16 Talk
I Krzikalla (Dresden), Hohnerbach (Aachen)
I IPCC Meeting Ostrava Talk
I SC’15 Workshop Talk
I IPCC Meeting München Code Dungeon
Other activities & future work
Other research activities
I Tensor operations
I Tensor transposition, summations, contractions
I Applications from Chemistry and Machine Learning
I Collaboration with IPCC UT Austin
I BLAS
I Idea: CPU + stream to Phi
I MKL – limited functionality
I Application in Density Functional Theory
I Initial results: 1610 vs. 1350 GFLOP/s (MKL)
LAMMPS
I Continue collaboration with Mike Brown
I Additional Long-Range Solvers
I Multi-Level Summation (MSM): O(n) algorithm
I 2/3 of routines similar to PPPM (particle to grid and back)
I 1/3: stencil application (research topic)
I MSM dispersion solver developed by our group
I Gaussian split Ewald: mesh-based real/frequency space
I Might provide better accuracy than MSM
I First implementation in LAMMPS
I Extend KOKKOS to enable vector classes (avoid GPU bias)
DSMC
I Particle-based method for rarefied gases
I Similar to molecular dynamics and LAMMPS
Buckingham: vectorization, single thread
[Figure: speedup on KNL (1 thread) for buck/cut, buck/coul/cut, and buck/cut/coul/long; series: Double, Single, Mixed]
Buckingham: vectorization, full node
[Figure: speedup on KNL (MPI + multithread) for buck/cut, buck/coul/cut, and buck/cut/coul/long; series: Double, Single, Mixed]
Part 2: Tersoff Multibody Potential – Markus Hohnerbach
for i in local atoms of the current thread do
    for j in atoms neighboring i do
        ζ_ij ← 0
        for k in atoms neighboring i do
            ζ_ij ← ζ_ij + ζ(i, j, k)
        E ← E + V(i, j, ζ_ij)
        F_i ← F_i − ∂x_i V(i, j, ζ_ij)
        F_j ← F_j − ∂x_j V(i, j, ζ_ij)
        δζ ← ∂ζ V(i, j, ζ_ij)
        for k in atoms neighboring i do
            F_i ← F_i − δζ · ∂x_i ζ(i, j, k)
            F_j ← F_j − δζ · ∂x_j ζ(i, j, k)
            F_k ← F_k − δζ · ∂x_k ζ(i, j, k)