
Rolls-Royce Hydra CFD Code on GPUs using OP2 Abstraction

I. Z. Reguly¹, G. R. Mudalige¹, C. Bertolli², M. B. Giles¹, A. Betts³, P. H. J. Kelly³, D. Radford⁴

¹Oxford e-Research Centre, University of Oxford, UK
²IBM TJ Watson Research Center, USA
³Department of Computing, Imperial College London
⁴Rolls-Royce plc.

OP2 Abstraction for Unstructured Grid Computations

[Figure: an example unstructured mesh with cells c1–c6, edges e1–e17, and vertices v1–v12]

Sets:
  op_decl_set(num_cells, "cells")
  op_decl_set(num_edges, "edges")
  op_decl_set(num_vertices, "vertices")

Mappings between sets:
  Cells2Edges: ..., 2,3,6,9, ...
    op_decl_map(cells, edges, 4, ...)
  Edges2Vertices: ..., 3,5,3,8, ...
    op_decl_map(edges, vertices, 2, ...)

Data on sets:
  Coordinates on vertices: ...xyz,xyz,xyz,...
    op_decl_dat(vertices, 3, "double", ...)
  Flow variables on cells: ...uvwp,uvwp,uvwp,...
    op_decl_dat(cells, 4, "double", ...)

Parallel loops:
Iterate over all the elements of a set in parallel, executing a user-defined "kernel function" on each and passing a number of data arguments, accessed either on the iteration set directly or through at most one level of indirection, with a declared access type (read / write / increment).

void res(const double *edge_flux, double *left_cell, double *right_cell) {
  // Computational code
  double f_x = edge_flux[0];
  right_cell[0] += f_x * 0.25;
  ...
}
...
op_par_loop(res, "res", edges,
            op_arg_dat(flux, -1, OP_ID,       3, "double", OP_READ),
            op_arg_dat(flow,  0, edges2cells, 4, "double", OP_INC),
            op_arg_dat(flow,  1, edges2cells, 4, "double", OP_INC));

Based on this information, OP2 can automatically handle data dependencies and generate code for parallel execution on a range of heterogeneous architectures.

[Figure: a mesh partitioned into Block 1 and Block 2 across an MPI boundary, illustrating owner-compute halo exchanges]

Organizing parallelism

Parallelization happens on three levels (a sketch of the coloring scheme follows below):
(1) distributed memory: MPI, with the standard owner-compute approach and halo exchanges;
(2) coarse-grained shared memory: OpenMP threads and CUDA thread blocks, where a partition is broken up into blocks that are colored based on potential data races and then executed color by color;
(3) fine-grained shared memory: CUDA threads within the same thread block, where set elements are colored based on potential data races and accesses are serialized by color.
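To make level (2) concrete, the following minimal C++/OpenMP sketch executes blocks color by color; Block, blocks_by_color, and run_block are illustrative names over assumed data structures, not OP2's actual runtime API:

  #include <vector>

  // Blocks that share a color touch disjoint data, so they may run in
  // parallel; the colors themselves must run one after another.
  struct Block { int start, end; };  // contiguous range of set elements

  void execute_by_color(const std::vector<std::vector<Block>> &blocks_by_color,
                        void (*run_block)(const Block &)) {
    for (const auto &same_color : blocks_by_color) {  // colors in sequence
      #pragma omp parallel for                        // one color in parallel
      for (int b = 0; b < (int)same_color.size(); ++b)
        run_block(same_color[b]);
      // implicit barrier here: the next color waits for this one to finish
    }
  }

Level (3) applies the same idea one step down: within a CUDA thread block, set elements get colors and threads apply their increments color by color.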

Rolls-Royce Hydra
• A complex and configurable full-scale industrial application used for the simulation and design of turbomachinery components.
• Solves the Reynolds-Averaged Navier-Stokes equations (second-order PDEs), using a 5-step Runge-Kutta method for time-marching (sketched below), accelerated by multigrid and block-Jacobi preconditioning.
• A Fortran code, built on the OPlus library, consisting of 300+ parallel loops.
• The OPlus library only supports distributed-memory parallelism over MPI, but the latest hardware also requires shared-memory programming models (OpenMP, CUDA).
• Hydra was adapted to use OP2 in order to support modern heterogeneous architectures.
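For reference, a generic 5-stage Runge-Kutta scheme of the kind used in such CFD solvers can be written as follows; the stage coefficients $\alpha_k$ are scheme-specific, and Hydra's exact choice is not stated on the poster:

  $u^{(0)} = u^{n}, \qquad u^{(k)} = u^{(0)} + \alpha_k \, \Delta t \, R\big(u^{(k-1)}\big), \quad k = 1, \dots, 5, \qquad u^{n+1} = u^{(5)}$

where $R(\cdot)$ is the discretized residual of the RANS equations.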

Acknowledgements
This research has been funded by the UK Technology Strategy Board and Rolls-Royce plc. through the Siloet project, the UK Engineering and Physical Sciences Research Council projects EP/I006079/1 and EP/I00677X/1 on "Multi-layered Abstractions for PDEs", and the "Algorithms, Software for Emerging Architectures" (ASEArch) project EP/J010553/1. The authors would like to acknowledge the use of the University of Oxford Advanced Research Computing (ARC) facility in carrying out this work.

[Bar chart: single-node execution time in seconds; OPlus on CPU vs. OP2 CUDA on K20 (baseline, SoA, tuned variants, best), 2x K20 (best), and K40 (best)]

Single node performance
Optimizations can be enabled in the code generator and parameters can be automatically tuned (see the indexing sketch below):
• Transposing data to Structure of Arrays (SoA)
• Moving read-only data through the texture cache
• Tuning thread block size and register count
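To illustrate the SoA transposition, here is a minimal C++ sketch; the helper names and plain array layouts are assumptions for illustration, not OP2's generated code:

  // AoS keeps an element's dim components adjacent; SoA strides them by the
  // set size n, so consecutive threads reading the same component of
  // consecutive elements touch consecutive memory (coalesced on GPUs).
  inline double aos_get(const double *data, int i, int c, int dim) {
    return data[i * dim + c];  // AoS: ...uvwp,uvwp,uvwp,...
  }
  inline double soa_get(const double *data, int i, int c, int n) {
    return data[c * n + i];    // SoA: ...uuu..., vvv..., www..., ppp...
  }

  void transpose_aos_to_soa(const double *aos, double *soa, int n, int dim) {
    for (int i = 0; i < n; ++i)
      for (int c = 0; c < dim; ++c)
        soa[c * n + i] = aos[i * dim + c];
  }

With dim = 4 flow variables per cell, thread i's read of a component lands next to thread i+1's, which is what yields the coalescing and reduced cache contention reported under strong scaling.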

Timings from a machine with a dual-socket Intel Xeon E2640 CPU and 2 NVIDIA Tesla K20c GPUs.

!"#$%

!"$%

&%

#%

'%

(%

&)%

*#%

&% #% '% (% &)% *#% )'% &#(%

+,-./0

12%03-%45-.6%

718-5%

9:;/5%9:#%<:=%4:>[email protected]%9:#%<:=B9<:%4:>[email protected]%9:#%<:=BCDEF%4:>[email protected]%

Strong scaling performance
SoA is especially advantageous with high-dimensional datasets: it gives coalesced memory accesses and decreases cache contention. Using the texture cache helps maximize bandwidth and provides data reuse for indirect accesses.

Fully automated distributed execution (a latency-hiding sketch follows below):
• Partitioning using ParMetis or PT-Scotch
• Latency hiding
• Halo exchanges and redundant execution to facilitate the owner-compute approach
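The latency hiding works by overlapping halo communication with computation on elements that need no halo data; a minimal sketch, assuming hypothetical helpers (start_halo_exchange, compute) that stand in for OP2's internal runtime:

  #include <mpi.h>
  #include <vector>

  struct Range { int start, end; };  // contiguous range of set elements

  // Post non-blocking halo transfers (one Irecv/Isend pair per neighbor
  // rank; details elided) and return the pending requests.
  static std::vector<MPI_Request> start_halo_exchange() {
    std::vector<MPI_Request> reqs;
    // ... MPI_Irecv / MPI_Isend per neighbor, appending to reqs ...
    return reqs;
  }

  static void compute(Range r) { /* run the user kernel over [r.start, r.end) */ }

  void loop_with_latency_hiding(Range interior, Range boundary) {
    auto reqs = start_halo_exchange();  // communication now in flight...
    compute(interior);                  // ...overlapped with interior work
    MPI_Waitall((int)reqs.size(), reqs.data(), MPI_STATUSES_IGNORE);
    compute(boundary);                  // boundary elements read received halos
  }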

!"

#"

$"

%"

!&"

!" #" $" %" !&"

'()*+,

-.",/)"01)*2"

3-4)1"

567+1"56#"869"06:;*-<*=2"56#"869>586"06:;?5:?@2"56#"869>?ABC"06:;?5:?@2"

Weak scaling performance
With good utilization, OP2 delivers up to a 2x performance gain over fully utilized HECToR nodes (32 cores) at low node counts when strong scaling, and the gain is maintained when weak scaling, i.e. doubling the problem size when doubling the node count.

Timings for the CPU are from HECToR (32 AMD Opteron cores per node) and for the GPU from Jade (2 Tesla K20m cards per node); strong scaling uses an 800k-vertex mesh, weak scaling 500k vertices per node.

!"

#"

$"

%"

&"

'!"

'#"

'$"

'%"

'&"

!()" '" '()" #" #()" *" *()" $"

+,-./0

12"03-"45-.6"

7890012"5:;-"<8=82.-"

',>#!."?7@"',>#!."?7@"A"#,+#%$!"B7@"

edge

con

accu

med

ges

iflux

edge

vflu

xedg

e

srcs

a

Hybrid CPU-GPU execution
With OP2, it is possible to utilize the CPU and the GPU at the same time. The key challenge is to find the right load balance when there are performance differences between loops on different hardware: with a good balance, a 15% speedup can be achieved over a single GPU by also utilizing the CPUs in the system. (A sketch of a simple balancing rule follows below.)

Timings from a machine with a dual-socket Intel Xeon E2640 CPU and 2 NVIDIA Tesla K20c GPUs.
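One simple way to choose the split, sketched below, is to give each device work in proportion to its measured throughput so that both finish together; this is an illustrative rule, not necessarily the balancing scheme OP2 uses:

  // Fraction of a set's elements to assign to the GPU, from measured
  // per-element times of the same loop on each device (illustrative).
  double gpu_fraction(double t_gpu_per_elem, double t_cpu_per_elem) {
    double gpu_rate = 1.0 / t_gpu_per_elem;   // GPU elements per second
    double cpu_rate = 1.0 / t_cpu_per_elem;   // CPU elements per second
    return gpu_rate / (gpu_rate + cpu_rate);  // e.g. 5x faster GPU -> 5/6
  }

Because the CPU/GPU performance ratio differs from loop to loop (edgecon, ifluxedge, etc.), the best single balance is a compromise across all loops.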

Platform-specific code generation and compilation

!"#$%&'()*+,&+%&'()*$-&./01*($2/3,4&5$6*5*(7,&(8$

!"#$"179&(.$%/*)0:)$!/;.0<*=$>7)?*5=$

10@(7(0*A$$

-&5B*5;&571$-&./01*($C$)&./01*($D76A$$2*E6E$F))G$5B))G$/6))8$

H7(=I7(*$

J05?$%0561*$K&=*$-LMN$

%0561*$K&=*$!/*5O"$$

-1'A,*($O"F$$

-1'A,*($O"FC-LMN$$

L5A,('),'(*=$O*A4$N//10)7;&5$'A056$!"#$2$-P-CC$&($Q!RSRNK$N"F8$

"179&(.$%/*)0:)$!/;.0<*=$N//10)7;&5$Q01*A$

O*A4$$24=TUG$N%-F8$

"179&(.$%/*)0:)$>057(3$VW*)',7@1*$

Abstract
Developers of scientific codes face an important dilemma: in order to achieve high performance, it is increasingly necessary to implement low-level optimisations that are specific to particular hardware. At the same time, there is considerable uncertainty about which programming approaches and hardware are best suited to different problems and how they will change in the future; it would be infeasible to refactor large codes for every new generation of hardware. Domain Specific Languages address this problem by providing a high-level abstraction for a specific class of applications. OP2 defines such an abstraction for unstructured grid computations that hides the details of parallelism and data movement, and enables efficient mapping of execution to various programming abstractions and hardware.

Conclusions
• A highly complex industrial application, such as Rolls-Royce Hydra, can be adapted to use the OP2 Domain Specific High-Level abstraction framework.
• Using OP2 does not compromise performance compared to the hand-coded original; in fact, OP2 matches and outperforms it.
• By using OP2, Hydra is now capable of exploiting the latest hardware, such as GPUs.
• OP2 provides code maintainability and future-proofing to the application developers.
