
Single-GPU Results

Multi-GPU High-Order Unstructured Solver for Compressible Navier-Stokes Equations

P. Castonguay, D. Williams, P. Vincent, and A. Jameson

Aeronautics and Astronautics Department, Stanford University

Introduction

Algorithm: Unstructured high-order compressible flow solver for Navier-Stokes equations

Why a higher order flow solver?

For applications with low error tolerance, high order methods are more cost effective

Why is higher order suitable for the GPU?

More operations per degree of freedom

Local operations within cells allow use of shared memory

Objectives: Develop a fast, high-order multi-GPU solver to simulate viscous compressible flows over complex geometries such as micro air vehicles, birds, and Formula 1 cars

Multi-GPU Results

[Plot residue from Figures 3, 4, 7, and 8 omitted. Recoverable axes and legends: Performance (Gflops) vs. Order of Accuracy (3 to 8); per-kernel annotations of 49.82, 97.13, 88.94, 79.71, 85.18, and 98.54 Gflops; Speedup relative to 1 GPU vs. Number of GPUs (2, 4, 8, 12, 16) with curves for 20000, 40000, 80000, and 160000 cells; Efficiency (75% to 105%) vs. Number of GPUs for N=3 to N=7; Speedup vs. Order of Accuracy (3 to 8).]

Figure 3: Overall double precision performance on Tesla C2050 relative to 1 core of Xeon X5670 vs. order of accuracy for quads

Figure 4: (Top) Profile of GPU algorithm vs. order of accuracy for quads (Bottom) Performance per kernel vs. order of accuracy for quads

Figure 5: Spanwise vorticity contours for unsteady flow over SD7003 airfoil at Re = 10000, M=0.2

Conclusions and Future Work

Summary:

1. Developed a 2D unstructured compressible viscous flow solver on multiple GPUs

2. Achieved 70x speedup (double precision) on single GPU relative to 1 core of Xeon X5670 CPU

3. Obtained good scalability on up to 16 GPUs and peak performance of 1.2 Teraflops

Future work will involve extension to 3D and application to more computationally intensive cases


Figure 6: Mesh partition obtained using ParMetis

Figure 7: Speedup for the MPI-CUDA implementation relative to single GPU code for 6th order calculation

Figure 8: Weak scalability of the MPI-CUDA implementation. Domain size varies from 10000 to 160000 cells

Approach:

Use mixed MPI-CUDA implementation to make use of multiple GPUs working in parallel

Divide mesh between processors (figure on the right)

Overlap CUDA computation with the CUDA memory copy and MPI communication (see pseudo-code below)

Pseudo-Code:

// Some code here

// Pack boundary data on the GPU and copy it to the host
Pack_mpi_buf<<<grid_size, block_size>>>(...);
cudaMemcpy(..., cudaMemcpyDeviceToHost);

// Start non-blocking MPI exchange with neighboring partitions
MPI_Isend(mpi_out_buf, ...);
MPI_Irecv(mpi_in_buf, ...);

// Compute discontinuous residual (cell local); runs while messages are in flight

MPI_Wait(...);
cudaMemcpy(..., cudaMemcpyHostToDevice);

// Compute continuous residual (using neighboring cell information)

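As a concrete illustration of this overlap pattern, the sketch below shows one way it could be arranged with blocking copies and non-blocking MPI calls. It is a minimal sketch, not the authors' code: the kernel names, buffer names, sizes, and launch configurations are assumptions.

// Minimal sketch of the compute/communication overlap (illustrative assumptions only)
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void pack_mpi_buf(double* d_send, int n)                                  { /* gather boundary data (stub) */ }
__global__ void discontinuous_residual(double* d_q, int n_cells)                     { /* cell-local work (stub) */ }
__global__ void continuous_residual(double* d_q, const double* d_recv, int n_cells)  { /* uses neighbor data (stub) */ }

void exchange_and_compute(double* d_q, double* d_send, double* d_recv,
                          double* h_send, double* h_recv,
                          int n_bdy, int n_cells, int nbr_rank)
{
    MPI_Request reqs[2];

    // 1. Pack boundary data on the device and copy it to the host
    pack_mpi_buf<<<(n_bdy + 255) / 256, 256>>>(d_send, n_bdy);
    cudaMemcpy(h_send, d_send, n_bdy * sizeof(double), cudaMemcpyDeviceToHost);

    // 2. Start the non-blocking exchange with the neighboring partition
    MPI_Isend(h_send, n_bdy, MPI_DOUBLE, nbr_rank, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(h_recv, n_bdy, MPI_DOUBLE, nbr_rank, 0, MPI_COMM_WORLD, &reqs[1]);

    // 3. Overlap: the discontinuous residual needs no neighbor data, so this kernel
    //    runs on the GPU while the MPI messages are in flight
    discontinuous_residual<<<(n_cells + 255) / 256, 256>>>(d_q, n_cells);

    // 4. Finish the exchange, move received data to the device, complete the residual
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    cudaMemcpy(d_recv, h_recv, n_bdy * sizeof(double), cudaMemcpyHostToDevice);
    continuous_residual<<<(n_cells + 255) / 256, 256>>>(d_q, d_recv, n_cells);
}

The key point is that the host waits in MPI_Waitall while the GPU is busy with the cell-local kernel, so most of the communication cost is hidden behind computation.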

1.2 Teraflops Double Precision!

Scalability:

- Multi-GPU code run on up to 16 GPUs

- 2 Tesla C2050s per compute node

- C2050 specs:

Memory Bandwidth: 144 GB/s

Single precision peak: 1.03 Tflops

Double precision peak: 515 Gflops

- 2 Xeon X5670 CPUs per compute node

- PCIe x16 slots

- InfiniBand Interconnect

Matrix Multiplication Example

Strategy: Make use of shared and texture memory. Allow multiple cells per block

Challenges: Avoid bank-conflicts and non-coalesced memory accesses

Extrapolation matrix M is too big to fit in shared memory for 3D grids

// Accumulate the four conserved fields at flux point ifp from the cell's solution points
for (i = 0; i < n_qpts; i++)
{
    m  = i*n_ed_fgpts + ifp;           // extrapolation matrix entry for this flux point (texture fetch)
    m1 = n_fields*n_qpts*ic_loc + i;   // solution point i, field 0, of the local cell in shared memory

    q0 += fetch_double(t_interp_mat_0, m)*s_q_qpts[m1];  m1 += n_qpts;
    q1 += fetch_double(t_interp_mat_0, m)*s_q_qpts[m1];  m1 += n_qpts;
    q2 += fetch_double(t_interp_mat_0, m)*s_q_qpts[m1];  m1 += n_qpts;
    q3 += fetch_double(t_interp_mat_0, m)*s_q_qpts[m1];
}

Figure 2: Matrix multiplication example. Extrapolate state from solution points to flux points
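For a self-contained view of this operation, the kernel below is a minimal sketch of extrapolating the conserved state from solution points to flux points as a small dense matrix product per cell, staging each cell's solution in shared memory. It is not the authors' kernel: the names (extrapolate_to_flux_pts, interp_mat, q_spts, q_fpts), the row-major matrix layout, and the one-cell-per-block mapping are assumptions, and it omits the texture fetches and multiple-cells-per-block blocking used in the actual code.

// Minimal sketch (not the authors' kernel): extrapolate the conserved state from
// solution points to flux points; one block per cell, one thread per flux point.
#include <cuda_runtime.h>

#define N_FIELDS 4   // 2D compressible flow: rho, rho*u, rho*v, E

__global__ void extrapolate_to_flux_pts(const double* __restrict__ interp_mat, // [n_fpts x n_spts], row-major
                                        const double* __restrict__ q_spts,     // [n_cells][N_FIELDS][n_spts]
                                        double*       __restrict__ q_fpts,     // [n_cells][N_FIELDS][n_fpts]
                                        int n_spts, int n_fpts)
{
    extern __shared__ double s_q[];   // this cell's solution, reused by all flux-point threads
    int cell = blockIdx.x;
    int fpt  = threadIdx.x;

    // Cooperatively stage the cell's solution-point values in shared memory
    for (int i = fpt; i < N_FIELDS * n_spts; i += blockDim.x)
        s_q[i] = q_spts[cell * N_FIELDS * n_spts + i];
    __syncthreads();

    if (fpt < n_fpts) {
        double q[N_FIELDS] = {0.0, 0.0, 0.0, 0.0};
        for (int i = 0; i < n_spts; i++) {
            double m = interp_mat[fpt * n_spts + i];   // extrapolation matrix entry
            for (int f = 0; f < N_FIELDS; f++)
                q[f] += m * s_q[f * n_spts + i];
        }
        for (int f = 0; f < N_FIELDS; f++)
            q_fpts[cell * N_FIELDS * n_fpts + f * n_fpts + fpt] = q[f];
    }
}

// Example launch: one block per cell, blockDim >= n_fpts, dynamic shared memory
// sized for one cell's solution:
//   extrapolate_to_flux_pts<<<n_cells, n_fpts, N_FIELDS * n_spts * sizeof(double)>>>(
//       d_interp_mat, d_q_spts, d_q_fpts, n_spts, n_fpts);

Each shared-memory value is reused by every flux-point thread, which is the reuse the Strategy above refers to; in the real solver the extrapolation matrix is read through texture memory and several cells share a block.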

[Plot residue from Figure 4 (top) omitted: Fraction of total time vs. Order of Accuracy (3 to 8); the per-kernel legend labels did not survive extraction.]

Solution Method

Flux Reconstruction Approach:

1. Use solution points to define polynomial representation of solution inside the cell (a minimal sketch follows this list)

2. Use flux points to communicate information between cells

Flux Exchange: Find a common flux at points on the boundary between cells

Flux Reconstruction: Propagate new flux information back into each cell interior using higher order correction polynomial (see figure above)

3. Update the solution using new information
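As a minimal illustration of step 1 (a 1D sketch only; the poster does not give the basis details, and the solver operates on quads), a nodal Lagrange basis defined by the solution points lets the solution polynomial be evaluated anywhere in the reference cell, for example at a flux point:

// 1D illustration only (assumed nodal Lagrange basis; not taken from the poster)
__host__ __device__ inline double lagrange_basis(int j, double x,
                                                 const double* x_spts, int n_spts)
{
    double l = 1.0;
    for (int i = 0; i < n_spts; i++)
        if (i != j)
            l *= (x - x_spts[i]) / (x_spts[j] - x_spts[i]);
    return l;   // j-th basis polynomial evaluated at x
}

// Evaluate the solution polynomial at reference coordinate x from nodal values q_spts
__host__ __device__ inline double eval_solution(double x, const double* x_spts,
                                                const double* q_spts, int n_spts)
{
    double q = 0.0;
    for (int j = 0; j < n_spts; j++)
        q += lagrange_basis(j, x, x_spts, n_spts) * q_spts[j];
    return q;
}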

Types of Operations:

Matrix multiplication: Ex: Extrapolate state between flux and solution points

Algebraic: Ex: Find common flux in Flux Exchange procedure
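As an example of such an algebraic operation, the sketch below computes one possible common interface flux at a single flux point, a Rusanov (local Lax-Friedrichs) flux for the inviscid part of the 2D equations. This is an assumption for illustration: the poster does not state which interface flux, or which viscous treatment, the solver uses.

// Illustration only: Rusanov (local Lax-Friedrichs) common flux at one flux point
// for the 2D Euler terms. q[] holds conserved variables {rho, rho*u, rho*v, E};
// (nx, ny) is the unit face normal and gam the ratio of specific heats.
#include <math.h>

__host__ __device__ inline void normal_flux(const double q[4], double nx, double ny,
                                            double gam, double f[4])
{
    double u  = q[1] / q[0], v = q[2] / q[0];
    double p  = (gam - 1.0) * (q[3] - 0.5 * q[0] * (u * u + v * v));
    double vn = u * nx + v * ny;                      // normal velocity
    f[0] = q[0] * vn;
    f[1] = q[1] * vn + p * nx;
    f[2] = q[2] * vn + p * ny;
    f[3] = (q[3] + p) * vn;
}

__host__ __device__ inline double max_wave_speed(const double q[4], double nx, double ny, double gam)
{
    double u = q[1] / q[0], v = q[2] / q[0];
    double p = (gam - 1.0) * (q[3] - 0.5 * q[0] * (u * u + v * v));
    return fabs(u * nx + v * ny) + sqrt(gam * p / q[0]);   // |v.n| + speed of sound
}

// Common flux shared by the two cells adjacent to the interface (qL: left state, qR: right state)
__host__ __device__ inline void common_flux(const double qL[4], const double qR[4],
                                            double nx, double ny, double gam, double fc[4])
{
    double fL[4], fR[4];
    normal_flux(qL, nx, ny, gam, fL);
    normal_flux(qR, nx, ny, gam, fR);
    double lam = fmax(max_wave_speed(qL, nx, ny, gam), max_wave_speed(qR, nx, ny, gam));
    for (int k = 0; k < 4; k++)
        fc[k] = 0.5 * (fL[k] + fR[k]) - 0.5 * lam * (qR[k] - qL[k]);
}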

Figure 1: Reference element (left), showing flux points and solution points. Correction function (right)
