Multi-GPU High-Order Unstructured Solver for Compressible Navier-Stokes Equations
P. Castonguay, D. Williams, P. Vincent, and A. Jameson
Aeronautics and Astronautics Department, Stanford University

Introduction
Algorithm: Unstructured high-order compressible flow solver for the Navier-Stokes equations
Why a high-order flow solver?
For applications with low error tolerance, high-order methods are more cost effective
Why is high order suitable for the GPU?
More operations per degree of freedom
Local operations within cells allow use of shared memory
Objectives: Develop a fast, high-order multi-GPU solver to simulate viscous compressible flows over complex geometries such as micro air vehicles, birds, and Formula 1 cars
Multi-GPU Results
[Figure: speedup relative to 1 GPU vs. number of GPUs (2, 4, 8, 12, 16) for meshes of 20000, 40000, 80000, and 160000 cells]
[Figure: efficiency vs. number of GPUs for orders N=3 to N=7; per-GPU performance labels range from 49.82 to 98.54 Gflops]

Single-GPU Results
[Figure: double-precision performance (Gflops) and speedup vs. order of accuracy, orders 3 to 8]
Figure 3: Overall double precision performance on Tesla C2050 relative to 1 core of Xeon X5670 vs. order of accuracy for quads
Figure 4: (Top) Profile of GPU algorithm vs. order of accuracy for quads (Bottom) Performance per kernel vs. order of accuracy for quads
Figure 5: Spanwise vorticity contours for unsteady flow over the SD7003 airfoil at Re = 10,000, M = 0.2
Conclusions and Future Work
Summary:
1. Developed a 2D unstructured compressible viscous flow solver running on multiple GPUs
2. Achieved a 70x speedup (double precision) on a single GPU relative to 1 core of a Xeon X5670 CPU
3. Obtained good scalability on up to 16 GPUs and a peak performance of 1.2 Teraflops
Future work will involve extension to 3D and application to more computationally intensive cases.
Figure 6: Mesh partition obtained using ParMetis
Figure 7: Speedup for the MPI-CUDA implementation relative to the single-GPU code for a 6th-order calculation
Figure 8: Weak scalability of the MPI-CUDA implementation. Domain size varies from 10000 to 160000 cells
Approach:
Use a mixed MPI-CUDA implementation so that multiple GPUs work in parallel
Divide the mesh between processors (see Figure 6)
Overlap CUDA computation with the CUDA memory copies and MPI communication (see the pseudo-code below)
Pseudo-Code:
// Pack boundary data on the GPU, then start non-blocking communication
pack_mpi_buf<<<grid_size, block_size>>>(...);
cudaMemcpy(..., cudaMemcpyDeviceToHost);
MPI_Isend(mpi_out_buf, ...);
MPI_Irecv(mpi_in_buf, ...);
// Compute discontinuous residual (cell local) while messages are in flight
MPI_Wait(...);
cudaMemcpy(..., cudaMemcpyHostToDevice);
// Compute continuous residual (using neighboring cell information)
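As a more concrete illustration, here is a minimal MPI-CUDA sketch of this overlap pattern for one neighboring partition; the function and buffer names (exchange_and_compute, d_out, h_out, the residual routines) are ours, not taken from the solver:

// Minimal sketch of the overlap pattern; all names here are hypothetical
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void pack_mpi_buf(double* buf, int n);  // gathers ghost-face data
void compute_discontinuous_residual();             // cell-local, no neighbor data
void compute_continuous_residual();                // needs neighbor data

void exchange_and_compute(double* d_out, double* d_in,
                          double* h_out, double* h_in,
                          int n_vals, int neighbor, MPI_Comm comm)
{
    MPI_Request reqs[2];
    size_t bytes = n_vals * sizeof(double);

    // 1. Pack boundary data on the device, then copy it to the host
    pack_mpi_buf<<<(n_vals + 255) / 256, 256>>>(d_out, n_vals);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    // 2. Start non-blocking sends/receives with the neighboring rank
    MPI_Isend(h_out, n_vals, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
    MPI_Irecv(h_in,  n_vals, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);

    // 3. Overlap: compute the residual terms that need no neighbor data
    compute_discontinuous_residual();

    // 4. Finish communication, move data to the device, compute interface terms
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    compute_continuous_residual();
}

In the full solver this exchange runs for every neighboring partition; pinned host buffers and cudaMemcpyAsync on a separate stream would additionally let the copies themselves overlap with computation.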
1.2 Teraflops Double Precision!
Scalability:
- Multi-GPU code run on up to 16 GPUs
- 2 Tesla C2050s per compute node
- C2050 specs:
  - Memory bandwidth: 144 GB/s
  - Single-precision peak: 1.03 Tflops
  - Double-precision peak: 515 Gflops
- 2 Xeon X5670 CPUs per compute node
- PCIe x16 slots
- InfiniBand interconnect
Matrix Multiplication Example
Strategy: Make use of shared and texture memory. Allow multiple cells per block.
Challenges: Avoid bank conflicts and non-coalesced memory accesses. The extrapolation matrix M is too big to fit in shared memory for 3D grids.
// Extrapolate the four conservative variables from the solution points to
// one flux point: the interpolation matrix is read through texture memory
// (fetch_double) and the cell's solution is staged in shared memory (s_q_qpts)
for (i = 0; i < n_qpts; i++) {
  m  = i*n_ed_fgpts + ifp;          // index into the interpolation matrix
  m1 = n_fields*n_qpts*ic_loc + i;  // index into this cell's solution
  q0 += fetch_double(t_interp_mat_0, m)*s_q_qpts[m1]; m1 += n_qpts;
  q1 += fetch_double(t_interp_mat_0, m)*s_q_qpts[m1]; m1 += n_qpts;
  q2 += fetch_double(t_interp_mat_0, m)*s_q_qpts[m1]; m1 += n_qpts;
  q3 += fetch_double(t_interp_mat_0, m)*s_q_qpts[m1];
}
Figure 2: Matrix multiplication example. Extrapolate state from solution points to flux points
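A minimal self-contained sketch of this strategy follows. The names and storage layouts are our assumptions (the poster does not specify them), and the poster reads M through texture memory rather than the plain load shown here:

// Sketch: extrapolate the state from n_spts solution points to n_fpts flux
// points, several cells per block, staging each cell's solution in shared memory
#include <cuda_runtime.h>

#define N_FIELDS 4          // conservative variables in 2D
#define CELLS_PER_BLOCK 4   // multiple cells per block, as on the poster

__global__ void extrapolate_to_fpts(const double* __restrict__ M,      // n_fpts x n_spts
                                    const double* __restrict__ q_spts, // per-cell solution
                                    double* q_fpts,
                                    int n_spts, int n_fpts, int n_cells)
{
    extern __shared__ double s_q[];          // CELLS_PER_BLOCK*n_spts*N_FIELDS
    int local_cell = threadIdx.x / n_fpts;   // which cell within this block
    int ifp        = threadIdx.x % n_fpts;   // which flux point of that cell
    int cell       = blockIdx.x * CELLS_PER_BLOCK + local_cell;
    bool active    = (cell < n_cells);

    // Stage this cell's solution in shared memory (strided, coalesced copy)
    double* s_qc = s_q + local_cell * n_spts * N_FIELDS;
    if (active)
        for (int k = ifp; k < n_spts * N_FIELDS; k += n_fpts)
            s_qc[k] = q_spts[cell * n_spts * N_FIELDS + k];
    __syncthreads();
    if (!active) return;

    // q_fpt = M * q_spt for all fields at this flux point; s_qc reads with the
    // same index across a cell's threads broadcast without bank conflicts
    double q[N_FIELDS] = {0.0, 0.0, 0.0, 0.0};
    for (int i = 0; i < n_spts; i++) {
        double m = M[ifp * n_spts + i];   // poster: fetch_double via texture
        for (int f = 0; f < N_FIELDS; f++)
            q[f] += m * s_qc[f * n_spts + i];
    }
    for (int f = 0; f < N_FIELDS; f++)
        q_fpts[(cell * n_fpts + ifp) * N_FIELDS + f] = q[f];
}

The kernel assumes a launch with block size CELLS_PER_BLOCK * n_fpts and CELLS_PER_BLOCK * n_spts * N_FIELDS * sizeof(double) bytes of dynamic shared memory; staging the solution in shared memory lets every flux-point thread reuse the same cell data without repeated global loads.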
[Figure 4 chart: fraction of total time per kernel vs. order of accuracy, orders 3 to 8]
Solution Method
Flux Reconstruction Approach:
1. Use solution points to define polynomial representation of solution inside the cell
2. Use flux points to communicate information between cells
Flux Exchange: Find a common flux at points on the boundary between cells
Flux Reconstruction: Propagate the new flux information back into each cell interior using a higher-order correction polynomial (see Figure 1)
3. Update the solution using new information
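In one dimension, the reconstruction step can be summarized as follows (a standard flux reconstruction statement; the notation below is ours, not taken from the poster):

$$ f^\delta(\xi) = f^D(\xi) + \left[f^I_L - f^D(-1)\right] g_L(\xi) + \left[f^I_R - f^D(+1)\right] g_R(\xi) $$

where $f^D$ is the discontinuous flux polynomial built from the solution points, $f^I_L$ and $f^I_R$ are the common interface fluxes found in the flux exchange, and $g_L$, $g_R$ are correction polynomials satisfying $g_L(-1) = 1$, $g_L(+1) = 0$ (mirrored for $g_R$), so that $f^\delta$ matches the common fluxes at both cell boundaries. The solution is then updated from $\partial u / \partial t = -\partial f^\delta / \partial \xi$ evaluated at the solution points.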
Types of Operations:
Matrix multiplication: e.g. extrapolate the state between solution and flux points (see Figure 2)
Algebraic: e.g. find the common flux in the flux exchange procedure (a sketch follows below)
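The poster does not name the Riemann solver used for the common flux; as one concrete possibility, here is a minimal CUDA sketch of a Rusanov (local Lax-Friedrichs) flux for a single scalar component, one thread per interface flux point (names are ours):

__global__ void common_flux(const double* uL, const double* uR,
                            const double* fL, const double* fR,
                            const double* lambda,   // local max wave speed
                            double* f_common, int n_pts)
{
    // Rusanov flux: average of the two extrapolated fluxes plus upwind dissipation
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_pts) return;
    f_common[i] = 0.5 * (fL[i] + fR[i]) - 0.5 * lambda[i] * (uR[i] - uL[i]);
}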
Figure 1: Reference element showing solution points and flux points (left). Correction function (right)