GPU acceleration of a non-hydrostatic ocean model with a
multigrid Poisson/Helmholtz solver
Takateru Yamagishi1, Yoshimasa Matsumura2
1 Research Organization for Information Science and Technology
2 Institute of Low Temperature Science, Hokkaido University
6th International Workshop on Advances in High-Performance Computational Earth Sciences: Applications & Frameworks
Table of Contents
Motivation
Numerical ocean model ‘kinaco’
GPU implementation and Optimization
Evaluation and validation
Summary
Motivation
Significance of numerical ocean modelling: global climate, weather, marine resources, etc.
The GPU's high computational performance enables explicit and detailed representation, long-time simulations, and many experimental cases.
Previous studies: Bleichrodt et al. (2012), Milakov et al. (2013), Werkhoven et al. (2013), Xu et al. (2015). They showed high performance, but were limited to experimental studies.
We aim at realistic and practical studies.
Non-hydrostatic numerical ocean model ‘kinaco’
Formation of Antarctic Bottom Water in the southern Weddell Sea
We try to accelerate this model with the GPU.
Basic equations of dynamics in kinaco
3D Navier–Stokes equations (fluid dynamics)
Poisson/Helmholtz equations: Δp = b, (Δ + λ)h = 0
Discretization: stencil access to the 6 adjacent grid cells
Solving the system of equations Ax = b: sparse matrix-vector multiplication
An efficient solver for Ax = b is required.
CG method with a multigrid preconditioner (MGCG)
A fast and scalable iterative method (Matsumura and Hasumi, 2008)
Preconditioner: the multigrid method solves the equation on grids of various resolutions.
[Figure: multigrid method, hierarchy of grid resolutions]
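To make the MGCG idea concrete, here is a minimal two-grid sketch in NumPy: CG on a 1D Poisson matrix, preconditioned by one V-cycle (weighted-Jacobi smoothing plus an exact Galerkin coarse-grid correction). The 1D setup and the names `poisson1d`, `mg_preconditioner`, `mgcg` are illustrative assumptions, not the kinaco solver, which is 3D and multi-level.

```python
import numpy as np

def poisson1d(n):
    # 1D Poisson matrix with the (-1, 2, -1) stencil (illustrative stand-in
    # for the model's 3D Poisson/Helmholtz operator)
    return 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def interpolation(nc):
    # linear interpolation from nc coarse points to n = 2*nc + 1 fine points
    n = 2*nc + 1
    P = np.zeros((n, nc))
    for j in range(nc):
        P[2*j,     j] += 0.5
        P[2*j + 1, j] += 1.0
        P[2*j + 2, j] += 0.5
    return P

def mg_preconditioner(A, P, Ac, b, sweeps=2, w=2.0/3.0):
    # one symmetric two-grid V-cycle applied to A e = b, starting from e = 0
    d = np.diag(A)
    e = np.zeros_like(b)
    for _ in range(sweeps):                      # pre-smooth (weighted Jacobi)
        e += w * (b - A @ e) / d
    r = b - A @ e
    e += P @ np.linalg.solve(Ac, P.T @ r)        # coarse-grid correction
    for _ in range(sweeps):                      # post-smooth
        e += w * (b - A @ e) / d
    return e

def mgcg(A, b, M, tol=1e-10, maxit=500):
    # preconditioned conjugate gradients; M(r) approximates A^{-1} r
    x = np.zeros_like(b)
    r = b - A @ x
    z = M(r)
    p = z.copy()
    rz = r @ z
    for it in range(1, maxit + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, it
        z = M(r)
        rz, rz_old = r @ z, rz
        p = z + (rz / rz_old) * p
    return x, maxit

nc = 31
n = 2*nc + 1
A = poisson1d(n)
P = interpolation(nc)
Ac = P.T @ A @ P                     # Galerkin coarse-grid operator
b = np.ones(n)
x, iters = mgcg(A, b, lambda r: mg_preconditioner(A, P, Ac, r))
```

With the multigrid preconditioner, the iteration count stays small and nearly independent of the grid size, which is the scalability property exploited in the solver.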
Implementation on the GPU: CUDA Fortran
kinaco is written in Fortran 90; CUDA instructions are available in CUDA Fortran, which is almost the same as CUDA C.
Following the original structure of the CPU code, good performance vs. the CPU is achieved.
We aimed at further acceleration!
Optimization of the MGCG solver
The MGCG solver accounts for 21% of the total simulation time and mainly consists of sparse matrix-vector multiplications.
Optimization:
1. Memory access
2. Hiding latency by thread-/instruction-level parallelism
3. Mixed-precision preconditioning in MGCG
Memory access in CPU kernel
DO k=1, n3DO j=1, n2
DO i=1, n1out(i,j,k) = a(-3,i,j,k) * x(i, j, k-1) &
+ a(-2,i,j,k) * x(i, j-1,k ) &+ a(-1,i,j,k) * x(i-1,j, k ) &+ a( 0,i,j,k) * x(i, j, k ) &+ a( 1,i,j,k) * x(i+1,j, k ) &+ a( 2,i,j,k) * x(i, j+1,k ) &+ a( 3,i,j,k) * x(i, j, k+1)
END DOEND DO
END DO
Sparse matrix-vector kernel in the CPU code
[Figure: locations of the matrix coefficients a(-3,i,j,k) to a( 3,i,j,k) on the 7-point stencil: -3, -2, -1, 0, 1, 2, 3]
Each CPU thread loads the array 'a' along a cache line.
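The kernel above is a 7-point stencil sparse matrix-vector product. As a reference for the GPU variants that follow, here is a vectorized NumPy sketch of the same operation; the function name `stencil_matvec` and the one-cell zero halo around `x` are assumptions for this illustration, not kinaco code.

```python
import numpy as np

def stencil_matvec(a, x):
    # 7-point stencil SpMV: a has shape (7, n1, n2, n3), holding the
    # coefficients for offsets (k-1, j-1, i-1, center, i+1, j+1, k+1);
    # x has shape (n1+2, n2+2, n3+2) with a one-cell halo.
    return ( a[0] * x[1:-1, 1:-1,  :-2]    # x(i,  j,  k-1)
           + a[1] * x[1:-1,  :-2, 1:-1]    # x(i,  j-1,k  )
           + a[2] * x[ :-2, 1:-1, 1:-1]    # x(i-1,j,  k  )
           + a[3] * x[1:-1, 1:-1, 1:-1]    # x(i,  j,  k  )
           + a[4] * x[2:,   1:-1, 1:-1]    # x(i+1,j,  k  )
           + a[5] * x[1:-1, 2:,   1:-1]    # x(i,  j+1,k  )
           + a[6] * x[1:-1, 1:-1, 2:  ])   # x(i,  j,  k+1)

rng = np.random.default_rng(0)
n1, n2, n3 = 4, 5, 6
a = rng.standard_normal((7, n1, n2, n3))
x = np.zeros((n1 + 2, n2 + 2, n3 + 2))
x[1:-1, 1:-1, 1:-1] = rng.standard_normal((n1, n2, n3))
out = stencil_matvec(a, x)
```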
Memory access in the GPU kernel
With the CPU layout a(-3:3,i,j,k), each GPU thread accesses array "a" with a stride of 7: thread(id), thread(id+1), and thread(id+2) read a(-3,i,j,k), a(-3,i+1,j,k), and a(-3,i+2,j,k), which are 7 elements apart in memory.
Transposing the coefficient index to the last dimension, a(-3:3,i,j,k) -> a(i,j,k,-3:3), makes thread(id), thread(id+1), and thread(id+2) read the adjacent elements a(i,j,k,-3), a(i+1,j,k,-3), and a(i+2,j,k,-3): coalesced access to array "a".
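The effect of the index transposition can be checked on the host with Fortran-ordered NumPy arrays (a hypothetical sketch of the layout, not kinaco code): the memory distance between the elements touched by consecutive threads drops from 7 words to 1.

```python
import numpy as np

n1, n2, n3 = 256, 256, 32
word = 8  # bytes per double-precision element

# CPU-friendly layout a(-3:3, i, j, k): the 7 coefficients of one grid point
# sit next to each other, which suits a CPU thread streaming a cache line.
a_cpu = np.zeros((7, n1, n2, n3), order='F')
stride_i_cpu = a_cpu.strides[1] // word   # words between i and i+1

# GPU-friendly layout a(i, j, k, -3:3): for a fixed coefficient, consecutive
# threads (consecutive i) touch adjacent words -> coalesced access.
a_gpu = np.zeros((n1, n2, n3, 7), order='F')
stride_i_gpu = a_gpu.strides[0] // word   # words between i and i+1

print(stride_i_cpu, stride_i_gpu)  # -> 7 1
```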
Hiding latency by thread-/instruction-level parallelism
Hiding latency = doing other operations while waiting out the latency.
Thread-level parallelism: switch threads to hide latency.
Instruction-level parallelism (Volkov, 2010): one thread issues several independent operations.
We compare the two forms of parallelism.
Case 1: Thread-level parallelism

i = threadidx%x + blockdim%x * (blockidx%x-1)
j = threadidx%y + blockdim%y * (blockidx%y-1)
k = threadidx%z + blockdim%z * (blockidx%z-1)
out(i,j,k) = a(i,j,k,-3) * x(i,  j,  k-1) &
           + a(i,j,k,-2) * x(i,  j-1,k  ) &
           + a(i,j,k,-1) * x(i-1,j,  k  ) &
           + a(i,j,k, 0) * x(i,  j,  k  ) &
           + a(i,j,k, 1) * x(i+1,j,  k  ) &
           + a(i,j,k, 2) * x(i,  j+1,k  ) &
           + a(i,j,k, 3) * x(i,  j,  k+1)

Set as many threads as possible over (i, j, k):
• 3D (i, j, k) threads are launched
• One thread per grid point
Latency is hidden by switching among many threads.
Case 2: Instruction-level parallelism
Independent operations are repeated within one thread.

i = threadidx%x + blockdim%x * (blockidx%x-1)
j = threadidx%y + blockdim%y * (blockidx%y-1)
DO k=1, n3
  out(i,j,k) = a(i,j,k,-3) * x(i,  j,  k-1) &
             + a(i,j,k,-2) * x(i,  j-1,k  ) &
             + a(i,j,k,-1) * x(i-1,j,  k  ) &
             + a(i,j,k, 0) * x(i,  j,  k  ) &
             + a(i,j,k, 1) * x(i+1,j,  k  ) &
             + a(i,j,k, 2) * x(i,  j+1,k  ) &
             + a(i,j,k, 3) * x(i,  j,  k+1)
END DO

Latency is hidden by independent instructions:
• 2D (i, j) threads are launched
• One thread per (i, j) column
Case 2 is faster.
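The two cases compute the same stencil; only the mapping of work to threads differs. A small Python sketch of the two mappings (the "threads" are simulated sequentially; function names are illustrative assumptions): in Case 2 the iterations of the k loop have no k-to-k data dependence, so on the GPU they supply independent in-flight instructions that overlap memory latency.

```python
import numpy as np

def point(a, x, i, j, k):
    # one output element of the 7-point stencil; x carries a one-cell halo
    return ( a[i,j,k,0]*x[i+1,j+1,k  ] + a[i,j,k,1]*x[i+1,j,  k+1]
           + a[i,j,k,2]*x[i,  j+1,k+1] + a[i,j,k,3]*x[i+1,j+1,k+1]
           + a[i,j,k,4]*x[i+2,j+1,k+1] + a[i,j,k,5]*x[i+1,j+2,k+1]
           + a[i,j,k,6]*x[i+1,j+1,k+2] )

def case1(a, x, n1, n2, n3):
    # Case 1: one "thread" per grid point (i, j, k)
    out = np.empty((n1, n2, n3))
    for i, j, k in np.ndindex(n1, n2, n3):   # each index = one thread
        out[i, j, k] = point(a, x, i, j, k)
    return out

def case2(a, x, n1, n2, n3):
    # Case 2: one "thread" per column (i, j); the k iterations inside one
    # thread are mutually independent -> instruction-level parallelism
    out = np.empty((n1, n2, n3))
    for i, j in np.ndindex(n1, n2):          # each index = one thread
        for k in range(n3):
            out[i, j, k] = point(a, x, i, j, k)
    return out
```

Both mappings produce identical results; the performance difference on the GPU comes purely from how latency is hidden.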
Mixed precision for multigrid preconditioning
Low precision makes better use of GPU resources.
For preconditioning, low precision is enough; on the GPU, performance deteriorates on the coarse grids of the multigrid hierarchy.
The number of iterations of the CG method is unchanged with or without mixed precision.
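The observation that low precision suffices for the preconditioner can be reproduced with a toy preconditioned CG in NumPy. This sketch uses a hypothetical 1D problem and a Jacobi preconditioner rather than the MGCG solver itself: applying the preconditioner in float32 while keeping the CG recurrences in float64 leaves the iteration count essentially unchanged.

```python
import numpy as np

def pcg(A, b, precond, tol=1e-8, maxit=1000):
    # preconditioned conjugate gradients; precond(r) approximates A^{-1} r
    x = np.zeros_like(b)
    r = b.copy()
    z = precond(r)
    p = z.copy()
    rz = r @ z
    for it in range(1, maxit + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, it
        z = precond(r)
        rz, rz_old = r @ z, rz
        p = z + (rz / rz_old) * p
    return x, maxit

rng = np.random.default_rng(1)
n = 200
A = (2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
     + np.diag(1.0 + rng.random(n)))      # SPD test matrix
b = rng.standard_normal(n)
d = np.diag(A)
d32 = d.astype(np.float32)

jacobi64 = lambda r: r / d                # preconditioner in double precision
jacobi32 = lambda r: (r.astype(np.float32) / d32).astype(np.float64)

x64, it64 = pcg(A, b, jacobi64)           # full double precision
x32, it32 = pcg(A, b, jacobi32)           # mixed precision preconditioner
```

Only the preconditioner application is demoted to float32; the residual and search-direction updates stay in float64, which is why convergence is unaffected.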
Evaluation: experimental setting
CPU (Fujitsu SPARC64 VIIIfx) vs. GPU (NVIDIA K20c), 1 CPU vs. 1 GPU
Study of baroclinic instability (Visbeck et al., 1996)
Forcing: Coriolis force, temperature forcing
Structured, isotropic domain of size (256, 256, 32)
Time step: 2 min; simulation times: 5 hours (150 steps) and 5 days (3600 steps)
Performance

Elapsed time [s]: CPU vs. GPU

                           CPU     GPU_1   GPU_2   GPU_3   Speedup (GPU_3)
All components             174.2   42.6    39.2    37.3    4.7
Poisson/Helmholtz solver   36.8    15.8    12.4    10.5    3.5
Others                     137.4   26.9    26.8    26.8    5.1

CPU:   original CPU code
GPU_1: basic, typical implementation on the GPU
GPU_2: GPU_1 + memory optimization and latency hiding
GPU_3: GPU_2 + mixed-precision preconditioning

The GPU achieved a 4.7x speedup over the CPU.
Validation: 5 hours (150 steps)
[Figure: surface ocean current/velocity field, CPU vs. GPU_2 vs. GPU_3]
Good reproduction of the growing meanders due to baroclinic instability.
[Figure: temperature at the cross section, CPU vs. GPU_2]
Good reproduction of the vertical convection of water.
Summary and future works
Numerical ocean model on the GPU (K20c) vs. the CPU (SPARC64 VIIIfx): 4.7x faster than the CPU.
The errors due to the GPU implementation are not significant for oceanic studies.
Further works: application of mixed precision to other kernels, MPI implementation, realistic experiments.