LLNL-PRES-687782 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Burning on the GPU: Fast and Accurate Chemical Kinetics
GPU Technology Conference
Russell Whitesides April 7, 2016
Session 6195
Funded by: U.S. Department of Energy
Vehicle Technologies Program. Program Managers: Gurpreet Singh & Leo Breton
Why?
To make it go faster?
Why?
We burn a lot of gasoline.
• Transportation efficiency
• Chemistry is vital to predictive simulations
• Chemistry can be > 90% of simulation time
Why?
National lab compute power and industry need.
Supercomputing @ DOE labs: strong investment in GPUs with an eye towards exascale
OEM engine designers: require fast turnaround with desktop-class hardware
“Colorful Fluid Dynamics”
[Figure: O2 mass fraction (Y_O2) and temperature fields from a “typical” engine simulation with detailed chemistry]
Detailed Chemistry in Reacting Flow CFD:
Each cell is treated as an isolated system for chemistry.
Operator splitting technique: solve an independent set of ordinary differential equations (ODEs) in each cell to calculate the chemical source terms for the species and energy advection/diffusion equations.
[Diagram: each cell advances independently from t to t+∆t]
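The uncoupled structure can be sketched as below. The one-variable decay model and all names here are illustrative stand-ins for the real stiff chemistry system (species + energy) solved in each cell:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy per-cell "chemistry": dy/dt = -k*y, a stand-in for the real stiff
// ODE system solved in each cell.
struct Cell { double y; };

// Advance one cell from t to t+dt with fixed-step explicit Euler.
// Production codes use an adaptive implicit stiff integrator per cell.
inline void advance_cell(Cell& c, double k, double dt, int n_sub) {
    const double h = dt / n_sub;
    for (int i = 0; i < n_sub; ++i)
        c.y += h * (-k * c.y);
}

// Operator splitting makes every cell an isolated system, so this loop has
// no cross-cell data dependencies -- which is what allows batching on a GPU.
inline void advance_all(std::vector<Cell>& cells, double k, double dt, int n_sub) {
    for (std::size_t i = 0; i < cells.size(); ++i)
        advance_cell(cells[i], k, dt, n_sub);
}
```

Because the per-cell solves are independent, they can be run in any order, on any mix of processors.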
CPU (un-coupled) chemistry integration
Each cell is treated as an isolated system for chemistry.
[Diagram: per-cell integration from t to t+∆t, one cell per CPU core]
GPU (batched) chemistry integration
On the GPU we solve chemistry in batches of cells simultaneously.
[Diagram: batched integration of many cells from t to t+∆t on the GPU]
Previously at GTC:
See also Whitesides & McNenly, GTC 2015; McNenly & Whitesides, GTC 2014
n_gpu = 0;
Note: most CFD simulations are done on distributed memory systems
[Diagram: 8 MPI ranks (rank0–rank7), each running chemistry on a single CPU core]
++n_gpu; //now what?
Note: most CFD simulations are done on distributed memory systems
[Diagram: the same 8 MPI ranks (rank0–rank7), each with a CPU core, now with a GPU added]
Here CPU is a single core.
Ideal CPU-GPU Work-sharing
S_GPU = walltime(CPU) / walltime(GPU)
Let’s make use of the whole machine.
Ideal CPU-GPU Work-sharing
§ # CPU cores = N_CPU
§ # GPU devices = N_GPU
S_total = (N_CPU + N_GPU · (S_GPU − 1)) / N_CPU

[Figure: S_total (1–8) vs. N_GPU (1–4) for N_CPU = 4, 8, 16, and 32, with S_GPU = 8; asterisks mark Titan (S_total = 1.4375) and Surface (S_total = 1.8750)]
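As a quick check of the ideal work-sharing formula, the footnoted machine values follow from assuming Titan-like nodes (16 cores + 1 GPU) and Surface-like nodes (16 cores + 2 GPUs):

```cpp
#include <cassert>

// Ideal work-sharing speedup from the slide's formula:
//   S_total = (N_CPU + N_GPU * (S_GPU - 1)) / N_CPU
// i.e. each of the N_GPU GPU-driving cores runs S_GPU times faster
// than a plain CPU core.
inline double total_speedup(int n_cpu, int n_gpu, double s_gpu) {
    return (n_cpu + n_gpu * (s_gpu - 1.0)) / n_cpu;
}
```

With S_GPU = 8, a 16-core node with one GPU gives (16 + 7)/16 = 1.4375, and with two GPUs gives (16 + 14)/16 = 1.8750, matching the Titan and Surface markers on the plot.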
Distribute based on number of cells and give more to GPU.
Good performance in simple case with both CPU and GPU doing work
[Figure: chemistry time (seconds, log scale, 100–10,000) vs. number of processors (1–16); series: CPU Chemistry, GPU Chemistry (std work sharing)]
Distribute based on number of cells and give more to GPU.
Good performance in simple case with both CPU and GPU doing work
[Figure: chemistry time (seconds, log scale, 100–10,000) vs. number of processors (1–16); series: CPU Chemistry, GPU Chemistry (std work sharing), GPU Chemistry (custom work sharing)]
S_GPU = 7; S_total = 1.7 (S_GPU = 6.6)
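The "give more to the GPU" split can be sketched as weighting each rank by its expected throughput; `split_cells` and the unit/S_GPU weights are illustrative, not the code's actual heuristic:

```cpp
#include <cassert>

// Cells assigned to a plain CPU rank vs. a GPU-driving rank.
struct Split { long cpu_rank_cells; long gpu_rank_cells; };

// Distribute n_cells in proportion to expected throughput: a plain CPU
// rank has weight 1, a GPU-driving rank has weight s_gpu (its measured
// speedup over a single core).
inline Split split_cells(long n_cells, int n_cpu_ranks, int n_gpu_ranks,
                         double s_gpu) {
    const double total_weight = n_cpu_ranks + n_gpu_ranks * s_gpu;
    const long per_cpu = static_cast<long>(n_cells / total_weight);
    const long per_gpu = static_cast<long>(n_cells * s_gpu / total_weight);
    return {per_cpu, per_gpu};
}
```

For example, 3000 cells across 16 CPU ranks and 2 GPU ranks at S_GPU = 7 gives 100 cells per CPU rank and 700 per GPU rank.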
Let’s go!
First attempt @
engine calculation on GPU+CPU
What happened?
First attempt @
engine calculation on GPU+CPU
§ 2x Xeon E5-2670 (16 cores) => 21.2 hours
§ 2x Xeon E5-2670 + 2x Tesla K40m => 17.6 hours
§ S_total = 21.2 / 17.6 = 1.20 (S_GPU = 2.6)
Integrator performance when doing batch solution
If the systems are not similar how much extra work needs to be done?
[Figure: similar vs. dissimilar reactor states within a batch]
Batches of dissimilar reactors will suffer from excessive extra steps
What penalty do we pay when batching?
Possibly a lot of extra steps.
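A toy model of that penalty, under the assumption that a batch keeps sub-stepping until its slowest reactor converges (all names illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// If the whole batch must step until its slowest member finishes, the batch
// costs batch_size * max(steps) rather than sum(steps).
// Assumes a non-empty step list.
inline long batched_steps(const std::vector<long>& steps_per_reactor) {
    const long max_steps = *std::max_element(steps_per_reactor.begin(),
                                             steps_per_reactor.end());
    return max_steps * static_cast<long>(steps_per_reactor.size());
}

// Cost if each reactor could be integrated independently.
inline long ideal_steps(const std::vector<long>& steps_per_reactor) {
    long total = 0;
    for (long s : steps_per_reactor) total += s;
    return total;
}
```

A batch of reactors needing {1, 2, 4, 100} steps costs 107 steps uncoupled but 400 steps batched: nearly a 4x penalty, which is why grouping similar reactors matters.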
Sort reactors by how many steps they took to solve on the last CFD step
Easy as pie?
[Diagram: reactors sorted by n_steps (from 1 to >100) and grouped into batch0 through batch3]
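The sorting step might look like this sketch (names illustrative): order reactor indices by the step count recorded on the previous CFD step, then cut the sorted list into fixed-size batches so each batch holds reactors of similar cost.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Group reactors into batches of similar expected cost, using the step
// count from the last CFD step as the cost estimate.
inline std::vector<std::vector<int>>
make_batches(const std::vector<int>& last_n_steps, std::size_t batch_size) {
    std::vector<int> idx(last_n_steps.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(),
              [&](int a, int b) { return last_n_steps[a] < last_n_steps[b]; });
    std::vector<std::vector<int>> batches;
    for (std::size_t i = 0; i < idx.size(); i += batch_size) {
        const std::size_t end = std::min(i + batch_size, idx.size());
        batches.emplace_back(idx.begin() + static_cast<std::ptrdiff_t>(i),
                             idx.begin() + static_cast<std::ptrdiff_t>(end));
    }
    return batches;
}
```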
Have to manage the sorting and load-balancing in distributed memory system
Not so fast.
[Diagram: reactors of differing cost scattered across ranks rank0–rank7]
Load balance based on expected cost and expected performance.
MPI communication to re-balance for chemistry.
[Diagram: ranks rank0–rank7 exchanging reactors via MPI to re-balance the chemistry load]
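One way to sketch the cost-based balancing is a greedy assignment of work items to the least-loaded rank, with GPU ranks weighted by their expected speedup; this ignores the MPI transfer cost that the production scheme must also account for, and all names are illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assign each work item (e.g. a batch of reactors) to the rank whose
// accumulated expected time is smallest. item_cost[i] is the expected cost
// of item i on a plain CPU core; rank_speed[r] is rank r's relative speed
// (1.0 for a CPU core, ~S_GPU for a GPU-driving rank).
inline std::vector<int> assign_to_ranks(const std::vector<double>& item_cost,
                                        const std::vector<double>& rank_speed) {
    std::vector<double> load(rank_speed.size(), 0.0);
    std::vector<int> owner(item_cost.size());
    for (std::size_t i = 0; i < item_cost.size(); ++i) {
        std::size_t best = 0;
        for (std::size_t r = 1; r < rank_speed.size(); ++r)
            if (load[r] < load[best]) best = r;
        load[best] += item_cost[i] / rank_speed[best];
        owner[i] = static_cast<int>(best);
    }
    return owner;
}
```

Processing items in descending cost order (the classic LPT heuristic) generally tightens the greedy balance further.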
Let’s go again!
Second attempt @
engine calculation on GPU+CPU
How much of a difference does it make?
Total steps significantly reduced by batching appropriately
Engine results with improved work-sharing and reactor sorting
[Chart: chemistry run times of 13.0 hrs, 9.1 hrs, and 7.6 hrs]
~40% reduction in chemistry time; ~36% reduction in overall time
S_total = 1.7 (S_GPU = 6.6)
§ Improve S_GPU
  • Derivative kernels
  • Matrix operations
§ Extrapolative integration methods
  • Less “startup” cost when re-initializing
  • Potentially well suited for GPU
§ Non-chemistry calc’s on GPU
  • Multi-species transport
  • Particle spray
Future directions
Possibilities for significant further improvements.
§ Much improved CFD chemistry work-sharing with GPU
§ ~40% reduction in chemistry time for real engine case (~36% total time)
§ Working on further improvement
Summary
Thank you!