
Single-GPU Results

Multi-GPU High-Order Unstructured Solver for Compressible Navier-Stokes Equations

P. Castonguay, D. Williams, P. Vincent, and A. Jameson

Aeronautics and Astronautics Department, Stanford University

Introduction

Algorithm: Unstructured high-order compressible flow solver for Navier-Stokes equations

Why a higher order flow solver?

For applications with low error tolerance, high order methods are more cost effective

Why is higher order suitable for the GPU?

More operations per degree of freedom

Local operations within cells allow use of shared memory

Objectives: Develop a fast, high-order multi-GPU solver to simulate viscous compressible flows over complex geometries such as micro air vehicles, birds, and Formula 1 cars

Multi-GPU Results

[Plot residue from Figures 3, 4, 7, and 8 omitted. Recoverable axes and legends: Performance (Gflops) vs. Order of Accuracy (3 to 8); per-kernel annotations of 49.82, 97.13, 88.94, 79.71, 85.18, and 98.54 Gflops; Speedup relative to 1 GPU vs. Number of GPUs (2, 4, 8, 12, 16) with curves for 20000, 40000, 80000, and 160000 cells; Efficiency (75% to 105%) vs. Number of GPUs for N=3 to N=7; Speedup vs. Order of Accuracy (3 to 8).]

Figure 3: Overall double precision performance on Tesla C2050 relative to 1 core of Xeon X5670 vs. order of accuracy for quads

Figure 4: (Top) Profile of GPU algorithm vs. order of accuracy for quads (Bottom) Performance per kernel vs. order of accuracy for quads

Figure 5: Spanwise vorticity contours for unsteady flow over SD7003 airfoil at Re = 10000, M=0.2

Conclusions and Future Work

Summary:

1. Developed a 2D unstructured compressible viscous flow solver on multiple GPUs

2. Achieved 70x speedup (double precision) on single GPU relative to 1 core of Xeon X5670 CPU

3. Obtained good scalability on up to 16 GPUs and peak performance of 1.2 Teraflops

Future work will involve extension to 3D and application to more computationally intensive cases


Figure 6: Mesh partition obtained using ParMetis

Figure 7: Speedup for the MPI-CUDA implementation relative to single GPU code for 6th order calculation

Figure 8: Weak scalability of the MPI-CUDA implementation. Domain size varies from 10000 to 160000 cells

Approach:

Use mixed MPI-CUDA implementation to make use of multiple GPUs working in parallel

Divide mesh between processors (figure on the right)

Overlap CUDA computation with the CUDA memory copy and MPI communication (see pseudo-code below)

Pseudo-Code:

// Some code here

// Pack boundary data on the GPU and copy it to the host
Pack_mpi_buf<<<grid_size, block_size>>>(...);
cudaMemcpy(..., cudaMemcpyDeviceToHost);

// Start non-blocking MPI exchange with neighboring partitions
MPI_Isend(mpi_out_buf, ...);
MPI_Irecv(mpi_in_buf, ...);

// Compute discontinuous residual (cell local); runs while messages are in flight

MPI_Wait(...);
cudaMemcpy(..., cudaMemcpyHostToDevice);

// Compute continuous residual (using neighboring cell information)

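As a concrete illustration of this overlap pattern, the sketch below shows one way it could be arranged with blocking copies and non-blocking MPI calls. It is a minimal sketch, not the authors' code: the kernel names, buffer names, sizes, and launch configurations are assumptions.

// Minimal sketch of the compute/communication overlap (illustrative assumptions only)
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void pack_mpi_buf(double* d_send, int n)                                  { /* gather boundary data (stub) */ }
__global__ void discontinuous_residual(double* d_q, int n_cells)                     { /* cell-local work (stub) */ }
__global__ void continuous_residual(double* d_q, const double* d_recv, int n_cells)  { /* uses neighbor data (stub) */ }

void exchange_and_compute(double* d_q, double* d_send, double* d_recv,
                          double* h_send, double* h_recv,
                          int n_bdy, int n_cells, int nbr_rank)
{
    MPI_Request reqs[2];

    // 1. Pack boundary data on the device and copy it to the host
    pack_mpi_buf<<<(n_bdy + 255) / 256, 256>>>(d_send, n_bdy);
    cudaMemcpy(h_send, d_send, n_bdy * sizeof(double), cudaMemcpyDeviceToHost);

    // 2. Start the non-blocking exchange with the neighboring partition
    MPI_Isend(h_send, n_bdy, MPI_DOUBLE, nbr_rank, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(h_recv, n_bdy, MPI_DOUBLE, nbr_rank, 0, MPI_COMM_WORLD, &reqs[1]);

    // 3. Overlap: the discontinuous residual needs no neighbor data, so this kernel
    //    runs on the GPU while the MPI messages are in flight
    discontinuous_residual<<<(n_cells + 255) / 256, 256>>>(d_q, n_cells);

    // 4. Finish the exchange, move received data to the device, complete the residual
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    cudaMemcpy(d_recv, h_recv, n_bdy * sizeof(double), cudaMemcpyHostToDevice);
    continuous_residual<<<(n_cells + 255) / 256, 256>>>(d_q, d_recv, n_cells);
}

The key point is that the host waits in MPI_Waitall while the GPU is busy with the cell-local kernel, so most of the communication cost is hidden behind computation.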

1.2 Teraflops Double Precision!

Scalability:

- Multi-GPU code run on up to 16 GPUs

- 2 Tesla C2050s per compute node

- C2050 specs:

Memory Bandwidth: 144 GB/s

Single precision peak: 1.03 Tflops

Double precision peak: 515 Gflops

- 2 Xeon X5670 CPUs per compute node

- PCIe x16 slots

- InfiniBand Interconnect

Matrix Multiplication Example

Strategy: Make use of shared and texture memory. Allow multiple cells per block

Challenges: Avoid bank-conflicts and non-coalesced memory accesses

Extrapolation matrix M is too big to fit in shared memory for 3D grids

// Accumulate the four conserved fields at flux point ifp from the cell's solution points
for (i = 0; i < n_qpts; i++)
{
    m  = i*n_ed_fgpts + ifp;           // extrapolation matrix entry for this flux point (texture fetch)
    m1 = n_fields*n_qpts*ic_loc + i;   // solution point i, field 0, of the local cell in shared memory

    q0 += fetch_double(t_interp_mat_0, m)*s_q_qpts[m1];  m1 += n_qpts;
    q1 += fetch_double(t_interp_mat_0, m)*s_q_qpts[m1];  m1 += n_qpts;
    q2 += fetch_double(t_interp_mat_0, m)*s_q_qpts[m1];  m1 += n_qpts;
    q3 += fetch_double(t_interp_mat_0, m)*s_q_qpts[m1];
}

Figure 2: Matrix multiplication example. Extrapolate state from solution points to flux points
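For a self-contained view of this operation, the kernel below is a minimal sketch of extrapolating the conserved state from solution points to flux points as a small dense matrix product per cell, staging each cell's solution in shared memory. It is not the authors' kernel: the names (extrapolate_to_flux_pts, interp_mat, q_spts, q_fpts), the row-major matrix layout, and the one-cell-per-block mapping are assumptions, and it omits the texture fetches and multiple-cells-per-block blocking used in the actual code.

// Minimal sketch (not the authors' kernel): extrapolate the conserved state from
// solution points to flux points; one block per cell, one thread per flux point.
#include <cuda_runtime.h>

#define N_FIELDS 4   // 2D compressible flow: rho, rho*u, rho*v, E

__global__ void extrapolate_to_flux_pts(const double* __restrict__ interp_mat, // [n_fpts x n_spts], row-major
                                        const double* __restrict__ q_spts,     // [n_cells][N_FIELDS][n_spts]
                                        double*       __restrict__ q_fpts,     // [n_cells][N_FIELDS][n_fpts]
                                        int n_spts, int n_fpts)
{
    extern __shared__ double s_q[];   // this cell's solution, reused by all flux-point threads
    int cell = blockIdx.x;
    int fpt  = threadIdx.x;

    // Cooperatively stage the cell's solution-point values in shared memory
    for (int i = fpt; i < N_FIELDS * n_spts; i += blockDim.x)
        s_q[i] = q_spts[cell * N_FIELDS * n_spts + i];
    __syncthreads();

    if (fpt < n_fpts) {
        double q[N_FIELDS] = {0.0, 0.0, 0.0, 0.0};
        for (int i = 0; i < n_spts; i++) {
            double m = interp_mat[fpt * n_spts + i];   // extrapolation matrix entry
            for (int f = 0; f < N_FIELDS; f++)
                q[f] += m * s_q[f * n_spts + i];
        }
        for (int f = 0; f < N_FIELDS; f++)
            q_fpts[cell * N_FIELDS * n_fpts + f * n_fpts + fpt] = q[f];
    }
}

// Example launch: one block per cell, blockDim >= n_fpts, dynamic shared memory
// sized for one cell's solution:
//   extrapolate_to_flux_pts<<<n_cells, n_fpts, N_FIELDS * n_spts * sizeof(double)>>>(
//       d_interp_mat, d_q_spts, d_q_fpts, n_spts, n_fpts);

Each shared-memory value is reused by every flux-point thread, which is the reuse the Strategy above refers to; in the real solver the extrapolation matrix is read through texture memory and several cells share a block.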

[Plot residue from Figure 4 (top) omitted: Fraction of total time vs. Order of Accuracy (3 to 8); the per-kernel legend labels did not survive extraction.]

Solution Method

Flux Reconstruction Approach:

1. Use solution points to define polynomial representation of solution inside the cell (a minimal sketch follows this list)

2. Use flux points to communicate information between cells

Flux Exchange: Find a common flux at points on the boundary between cells

Flux Reconstruction: Propagate new flux information back into each cell interior using higher order correction polynomial (see figure above)

3. Update the solution using new information
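As a minimal illustration of step 1 (a 1D sketch only; the poster does not give the basis details, and the solver operates on quads), a nodal Lagrange basis defined by the solution points lets the solution polynomial be evaluated anywhere in the reference cell, for example at a flux point:

// 1D illustration only (assumed nodal Lagrange basis; not taken from the poster)
__host__ __device__ inline double lagrange_basis(int j, double x,
                                                 const double* x_spts, int n_spts)
{
    double l = 1.0;
    for (int i = 0; i < n_spts; i++)
        if (i != j)
            l *= (x - x_spts[i]) / (x_spts[j] - x_spts[i]);
    return l;   // j-th basis polynomial evaluated at x
}

// Evaluate the solution polynomial at reference coordinate x from nodal values q_spts
__host__ __device__ inline double eval_solution(double x, const double* x_spts,
                                                const double* q_spts, int n_spts)
{
    double q = 0.0;
    for (int j = 0; j < n_spts; j++)
        q += lagrange_basis(j, x, x_spts, n_spts) * q_spts[j];
    return q;
}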

Types of Operations:

Matrix multiplication: Ex: Extrapolate state between flux and solution points

Algebraic: Ex: Find common flux in Flux Exchange procedure
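As an example of such an algebraic operation, the sketch below computes one possible common interface flux at a single flux point, a Rusanov (local Lax-Friedrichs) flux for the inviscid part of the 2D equations. This is an assumption for illustration: the poster does not state which interface flux, or which viscous treatment, the solver uses.

// Illustration only: Rusanov (local Lax-Friedrichs) common flux at one flux point
// for the 2D Euler terms. q[] holds conserved variables {rho, rho*u, rho*v, E};
// (nx, ny) is the unit face normal and gam the ratio of specific heats.
#include <math.h>

__host__ __device__ inline void normal_flux(const double q[4], double nx, double ny,
                                            double gam, double f[4])
{
    double u  = q[1] / q[0], v = q[2] / q[0];
    double p  = (gam - 1.0) * (q[3] - 0.5 * q[0] * (u * u + v * v));
    double vn = u * nx + v * ny;                      // normal velocity
    f[0] = q[0] * vn;
    f[1] = q[1] * vn + p * nx;
    f[2] = q[2] * vn + p * ny;
    f[3] = (q[3] + p) * vn;
}

__host__ __device__ inline double max_wave_speed(const double q[4], double nx, double ny, double gam)
{
    double u = q[1] / q[0], v = q[2] / q[0];
    double p = (gam - 1.0) * (q[3] - 0.5 * q[0] * (u * u + v * v));
    return fabs(u * nx + v * ny) + sqrt(gam * p / q[0]);   // |v.n| + speed of sound
}

// Common flux shared by the two cells adjacent to the interface (qL: left state, qR: right state)
__host__ __device__ inline void common_flux(const double qL[4], const double qR[4],
                                            double nx, double ny, double gam, double fc[4])
{
    double fL[4], fR[4];
    normal_flux(qL, nx, ny, gam, fL);
    normal_flux(qR, nx, ny, gam, fR);
    double lam = fmax(max_wave_speed(qL, nx, ny, gam), max_wave_speed(qR, nx, ny, gam));
    for (int k = 0; k < 4; k++)
        fc[k] = 0.5 * (fL[k] + fR[k]) - 0.5 * lam * (qR[k] - qL[k]);
}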

Figure 1: Reference element (left), showing flux points and solution points. Correction function (right)
