LLNL-PRES-687782 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Burning on the GPU: Fast and Accurate Chemical Kinetics
GPU Technology Conference
Russell Whitesides April 7, 2016
Session 6195
Funded by: U.S. Department of Energy
Vehicle Technologies Program. Program Managers: Gurpreet Singh & Leo Breton
Why?
To make it go faster?
Why?
We burn a lot of gasoline.
• Transportation efficiency
• Chemistry is vital to predictive simulations
• Chemistry can be > 90% of simulation time
Why?
National lab compute power and industry need.
Supercomputing @ DOE labs: strong investment in GPUs with an eye towards exascale
OEM engine designers: require fast turnaround with desktop-class hardware
“Colorful Fluid Dynamics”
[Figure: O2 mass fraction (Y_O2) and temperature fields from a “typical” engine simulation with detailed chemistry]
Detailed Chemistry in Reacting Flow CFD:
Each cell is treated as an isolated system for chemistry.
Operator splitting technique: solve an independent set of ordinary differential equations (ODEs) in each cell to calculate the chemical source terms for the species and energy advection/diffusion equations.
[Diagram: each cell advances independently from t to t+∆t]
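The uncoupled structure can be sketched as below. The one-variable decay model and all names here are illustrative stand-ins for the real stiff chemistry system (species + energy) solved in each cell:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy per-cell "chemistry": dy/dt = -k*y, a stand-in for the real stiff
// ODE system solved in each cell.
struct Cell { double y; };

// Advance one cell from t to t+dt with fixed-step explicit Euler.
// Production codes use an adaptive implicit stiff integrator per cell.
inline void advance_cell(Cell& c, double k, double dt, int n_sub) {
    const double h = dt / n_sub;
    for (int i = 0; i < n_sub; ++i)
        c.y += h * (-k * c.y);
}

// Operator splitting makes every cell an isolated system, so this loop has
// no cross-cell data dependencies -- which is what allows batching on a GPU.
inline void advance_all(std::vector<Cell>& cells, double k, double dt, int n_sub) {
    for (std::size_t i = 0; i < cells.size(); ++i)
        advance_cell(cells[i], k, dt, n_sub);
}
```

Because the per-cell solves are independent, they can be run in any order, on any mix of processors.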
CPU (un-coupled) chemistry integration
Each cell is treated as an isolated system for chemistry.
[Diagram: per-cell integration from t to t+∆t, one cell per CPU core]
GPU (batched) chemistry integration
On the GPU we solve chemistry in batches of cells simultaneously.
[Diagram: batched integration of many cells from t to t+∆t on the GPU]
Previously at GTC:
See also Whitesides & McNenly, GTC 2015; McNenly & Whitesides, GTC 2014
n_gpu = 0;
Note: most CFD simulations are done on distributed memory systems
[Diagram: 8 MPI ranks (rank0–rank7), each running chemistry on a single CPU core]
++n_gpu; //now what?
Note: most CFD simulations are done on distributed memory systems
[Diagram: the same 8 MPI ranks (rank0–rank7), each with a CPU core, now with a GPU added]
Here CPU is a single core.
Ideal CPU-GPU Work-sharing
S_GPU = walltime(CPU) / walltime(GPU)
Let’s make use of the whole machine.
Ideal CPU-GPU Work-sharing
§ # CPU cores = N_CPU
§ # GPU devices = N_GPU
S_total = (N_CPU + N_GPU · (S_GPU − 1)) / N_CPU

[Figure: S_total (1–8) vs. N_GPU (1–4) for N_CPU = 4, 8, 16, and 32, with S_GPU = 8; asterisks mark Titan (S_total = 1.4375) and Surface (S_total = 1.8750)]
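As a quick check of the ideal work-sharing formula, the footnoted machine values follow from assuming Titan-like nodes (16 cores + 1 GPU) and Surface-like nodes (16 cores + 2 GPUs):

```cpp
#include <cassert>

// Ideal work-sharing speedup from the slide's formula:
//   S_total = (N_CPU + N_GPU * (S_GPU - 1)) / N_CPU
// i.e. each of the N_GPU GPU-driving cores runs S_GPU times faster
// than a plain CPU core.
inline double total_speedup(int n_cpu, int n_gpu, double s_gpu) {
    return (n_cpu + n_gpu * (s_gpu - 1.0)) / n_cpu;
}
```

With S_GPU = 8, a 16-core node with one GPU gives (16 + 7)/16 = 1.4375, and with two GPUs gives (16 + 14)/16 = 1.8750, matching the Titan and Surface markers on the plot.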
Distribute based on number of cells and give more to GPU.
Good performance in simple case with both CPU and GPU doing work
[Figure: chemistry time (seconds, log scale, 100–10,000) vs. number of processors (1–16); series: CPU Chemistry, GPU Chemistry (std work sharing)]
Distribute based on number of cells and give more to GPU.
Good performance in simple case with both CPU and GPU doing work
[Figure: chemistry time (seconds, log scale, 100–10,000) vs. number of processors (1–16); series: CPU Chemistry, GPU Chemistry (std work sharing), GPU Chemistry (custom work sharing)]
S_GPU = 7; S_total = 1.7 (S_GPU = 6.6)
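The "give more to the GPU" split can be sketched as weighting each rank by its expected throughput; `split_cells` and the unit/S_GPU weights are illustrative, not the code's actual heuristic:

```cpp
#include <cassert>

// Cells assigned to a plain CPU rank vs. a GPU-driving rank.
struct Split { long cpu_rank_cells; long gpu_rank_cells; };

// Distribute n_cells in proportion to expected throughput: a plain CPU
// rank has weight 1, a GPU-driving rank has weight s_gpu (its measured
// speedup over a single core).
inline Split split_cells(long n_cells, int n_cpu_ranks, int n_gpu_ranks,
                         double s_gpu) {
    const double total_weight = n_cpu_ranks + n_gpu_ranks * s_gpu;
    const long per_cpu = static_cast<long>(n_cells / total_weight);
    const long per_gpu = static_cast<long>(n_cells * s_gpu / total_weight);
    return {per_cpu, per_gpu};
}
```

For example, 3000 cells across 16 CPU ranks and 2 GPU ranks at S_GPU = 7 gives 100 cells per CPU rank and 700 per GPU rank.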
Let’s go!
First attempt @
engine calculation on GPU+CPU
What happened?
First attempt @
engine calculation on GPU+CPU
§ 2x Xeon E5-2670 (16 cores) => 21.2 hours
§ 2x Xeon E5-2670 + 2x Tesla K40m => 17.6 hours
§ S_total = 21.2 / 17.6 = 1.20 (S_GPU = 2.6)
Integrator performance when doing batch solution
If the systems are not similar how much extra work needs to be done?
[Figure: similar vs. dissimilar reactor states within a batch]
Batches of dissimilar reactors will suffer from excessive extra steps
What penalty do we pay when batching?
Possibly a lot of extra steps.
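A toy model of that penalty, under the assumption that a batch keeps sub-stepping until its slowest reactor converges (all names illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// If the whole batch must step until its slowest member finishes, the batch
// costs batch_size * max(steps) rather than sum(steps).
// Assumes a non-empty step list.
inline long batched_steps(const std::vector<long>& steps_per_reactor) {
    const long max_steps = *std::max_element(steps_per_reactor.begin(),
                                             steps_per_reactor.end());
    return max_steps * static_cast<long>(steps_per_reactor.size());
}

// Cost if each reactor could be integrated independently.
inline long ideal_steps(const std::vector<long>& steps_per_reactor) {
    long total = 0;
    for (long s : steps_per_reactor) total += s;
    return total;
}
```

A batch of reactors needing {1, 2, 4, 100} steps costs 107 steps uncoupled but 400 steps batched: nearly a 4x penalty, which is why grouping similar reactors matters.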
Sort reactors by how many steps they took to solve on the last CFD step
Easy as pie?
[Diagram: reactors sorted by n_steps (from 1 to >100) and grouped into batch0 through batch3]
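The sorting step might look like this sketch (names illustrative): order reactor indices by the step count recorded on the previous CFD step, then cut the sorted list into fixed-size batches so each batch holds reactors of similar cost.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Group reactors into batches of similar expected cost, using the step
// count from the last CFD step as the cost estimate.
inline std::vector<std::vector<int>>
make_batches(const std::vector<int>& last_n_steps, std::size_t batch_size) {
    std::vector<int> idx(last_n_steps.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(),
              [&](int a, int b) { return last_n_steps[a] < last_n_steps[b]; });
    std::vector<std::vector<int>> batches;
    for (std::size_t i = 0; i < idx.size(); i += batch_size) {
        const std::size_t end = std::min(i + batch_size, idx.size());
        batches.emplace_back(idx.begin() + static_cast<std::ptrdiff_t>(i),
                             idx.begin() + static_cast<std::ptrdiff_t>(end));
    }
    return batches;
}
```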
Have to manage the sorting and load-balancing in distributed memory system
Not so fast.
[Diagram: reactors of differing cost scattered across ranks rank0–rank7]
Load balance based on expected cost and expected performance.
MPI communication to re-balance for chemistry.
[Diagram: ranks rank0–rank7 exchanging reactors via MPI to re-balance the chemistry load]
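One way to sketch the cost-based balancing is a greedy assignment of work items to the least-loaded rank, with GPU ranks weighted by their expected speedup; this ignores the MPI transfer cost that the production scheme must also account for, and all names are illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assign each work item (e.g. a batch of reactors) to the rank whose
// accumulated expected time is smallest. item_cost[i] is the expected cost
// of item i on a plain CPU core; rank_speed[r] is rank r's relative speed
// (1.0 for a CPU core, ~S_GPU for a GPU-driving rank).
inline std::vector<int> assign_to_ranks(const std::vector<double>& item_cost,
                                        const std::vector<double>& rank_speed) {
    std::vector<double> load(rank_speed.size(), 0.0);
    std::vector<int> owner(item_cost.size());
    for (std::size_t i = 0; i < item_cost.size(); ++i) {
        std::size_t best = 0;
        for (std::size_t r = 1; r < rank_speed.size(); ++r)
            if (load[r] < load[best]) best = r;
        load[best] += item_cost[i] / rank_speed[best];
        owner[i] = static_cast<int>(best);
    }
    return owner;
}
```

Processing items in descending cost order (the classic LPT heuristic) generally tightens the greedy balance further.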
Let’s go again!
Second attempt @
engine calculation on GPU+CPU
How much of a difference does it make?
Total steps significantly reduced by batching appropriately
Engine results with improved work-sharing and reactor sorting
[Chart: chemistry run times of 13.0 hrs, 9.1 hrs, and 7.6 hrs]
~40% reduction in chemistry time; ~36% reduction in overall time
S_total = 1.7 (S_GPU = 6.6)
§ Improve S_GPU
  • Derivative kernels
  • Matrix operations
§ Extrapolative integration methods
  • Less “startup” cost when re-initializing
  • Potentially well suited for GPU
§ Non-chemistry calc’s on GPU
  • Multi-species transport
  • Particle spray
Future directions
Possibilities for significant further improvements.
§ Much improved CFD chemistry work-sharing with GPU
§ ~40% reduction in chemistry time for real engine case (~36% total time)
§ Working on further improvement
Summary
Thank you!