LLNL-PRES-681292 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Nested-Loop RAJA Extensions for Deterministic Transport
DOE Centers of Excellence Performance Portability Meeting
Adam J. Kunen
April 19, 2016
Kripke is a research tool that is informing the development of ASC codes

Internal Research and Development:
- Kripke v1.0: L, L+, Sweep (Aug 2014)
- Kripke v1.1: L, L+, Sweep, Source, Scattering (Sept 2015)
- Nested-Loop RAJA: Developing RAJA extensions for use in ARDRA (Ongoing)
- ARDRA: Porting to RAJA during FY16/FY17
- Research with Kripke motivated nested-loop abstractions

External PM Research/Development Collaborations:
- RAJA (L2): Demoed OMP/CUDA portability (FY15)
- Legion: Ongoing (Sam White, UIUC)
- STAPL: Ongoing (TAMU)
- OCCA: Demoed OKL DSL (David Medina, 2015)
- CORAL CoE: Exploring CUDA algorithms (Ongoing)
- Charm++: Ongoing (Sam White, UIUC)
Kripke demonstrates performance is sensitive to data layout, architecture and problem size

- Abstracting data layout and loop order enables performance portability
- Data layout and loop interchange affect performance
- … so do problem dimensions and architecture

[Figure: performance comparison. Left: Sandy Bridge, P4, 12^3 zones/thread, 64 groups, 96 directions. Right: BG/Q, P9, 12^3 zones/thread, 64 groups, 96 directions.]
Discrete ordinates transport needs a portable PM that treats multidimensional loops and data as first-class citizens

§ Sn transport is dominated by nested loops
  - High-dimensional phase space
  - Often loops are nested 2 to 5 deep (sometimes even more)
  - Many of our loops are perfectly nested
  - Complex iteration patterns (sweeps, multi-material)

§ Kripke shows performance is sensitive to:
  - Data layout, loop-nesting order
  - Architecture, compiler (vendor and version)
  - Problem specifications (zones, groups, directions, moments, etc.)

§ Given Architecture + Compiler + Problem:
  - Choose data layout and loop-nest order
  - Choose execution policies

§ Preparation for Sierra requires porting to GPU
  - CUDA? OpenMP?
  - What about exascale? What PM will we need then?

Sn transport needs performance-portable multidimensional PM concepts
Why do we need RAJA::forallN instead of just nesting the existing RAJA::forall?

RAJA::forall(IndexSet_I, [=](int i) {
  RAJA::forall(IndexSet_J, [=](int j) {
    RAJA::forall(IndexSet_K, [=](int k) {
      y[i*nj + j] += a[k] * x[j*nk + k];
    });
  });
});

Nested loops and multidimensional arrays are needed in addition to RAJA

Multi-Dimensional Arrays
- Manual array index calculations are error prone
- Hard-wires code for specific data layouts

Nested Loop Constructs
- Nesting RAJA::forall's works, but doesn't enable complex loop transformations
- Hard-wires code for specific patterns
- CUDA kernels require building a thread+block index space from multiple loop indices… very difficult with nested foralls
- Nested lambdas are problematic for OMP4 and CUDA
Nested-Loop RAJA extensions provide multidimensional support

§ Maintains the RAJA philosophy
  - Separate concepts of loop execution, iteration patterns and loop bodies
  - Minor code structure changes
  - Allows incremental transition to RAJA
  - Leverages existing RAJA code (forall, IndexSet, etc.)
  - Basic nested loops are functionally equivalent to nested RAJA::forall()'s

§ Arbitrary dimensionality
  - RAJA::forallN for any N
  - Using variadic template metaprogramming, no codegen (almost)

§ Loop transformations (for perfectly nested loops)
  - Loop interchange
  - Multi-level tiling/blocking
  - Mapping to CUDA threads and blocks
  - Loop collapsing (OpenMP collapse(N))

§ Data layouts
  - Arbitrary data striding orders

§ Portability (and hopefully performance)
  - Sequential, SIMD, OpenMP (CPU/GPU), CUDA
RAJA is a good starting point for an Sn programming model
How do we extend the RAJA::forall abstraction for single loops to nested loops?

Hand coded loops are inflexible and non-portable

Kripke v1.1: LTimes for DGZ layout

double *ell_ptr;
double *psi_ptr;
double *phi_ptr;
for (int nm = 0; nm < num_moments; ++nm) {
  double *ell_nm = ell_ptr + nm * num_loc_dir;
  double * __restrict__ phi_nm = phi_ptr + nm * num_loc_gz;
  for (int d = 0; d < num_loc_dir; d++) {
    double * __restrict__ psi_d = psi_ptr + d * num_loc_gz;
    double ell_nm_d = ell_nm[d];
    for (int gz = 0; gz < num_loc_gz; ++gz) {
      phi_nm[gz] += ell_nm_d * psi_d[gz];
    }
  }
}

Issues with this coding style:
§ Loop-nest order is fixed
  - Loop interchange requires a rewrite
§ Inner 2 loops are collapsed
  - An arch-specific optimization
§ Fixed layout of each variable
§ No obvious mapping to CUDA
  - Would need a rewrite
§ Code is just downright ugly
Nested-Loop RAJA abstracts nested loops, loop interchange, and data layouts

Kripke v1.1: LTimes for DGZ layout

double *ell_ptr;
double *psi_ptr;
double *phi_ptr;
for (int nm = 0; nm < num_moments; ++nm) {
  double *ell_nm = ell_ptr + nm * num_loc_dir;
  double * __restrict__ phi_nm = phi_ptr + nm * num_loc_gz;
  for (int d = 0; d < num_loc_dir; d++) {
    double * __restrict__ psi_d = psi_ptr + d * num_loc_gz;
    double ell_nm_d = ell_nm[d];
    for (int gz = 0; gz < num_loc_gz; ++gz) {
      phi_nm[gz] += ell_nm_d * psi_d[gz];
    }
  }
}

Kripke+RAJA: LTimes for all layouts

View_Psi psi(psi_ptr, …);
View_Phi phi(phi_ptr, …);
View_Ell ell(ell_ptr, …);

forallN(is_moment, is_dir, is_group, is_zone,
  [=](NM nm, Dir d, Group g, Zone z) {
    phi(nm, g, z) += ell(d, nm) * psi(d, g, z);
  });

Nested-loop abstraction promotes flexibility, while maintaining code structure
forallN extends the RAJA concepts needed for nested loops

RAJA::forallN(is_moment, is_dir, is_group, is_zone,
  [=](NM nm, Dir d, Group g, Zone z) {
    phi(nm, g, z) += ell(d, nm) * psi(d, g, z);
  });

Slide callouts:
- forallN() abstracts an N-nested loop
- Execution policies for each loop nest; loop transformations
- An IndexSet for each loop nest
- Loop body with type-safe indices
- Views abstract data access and provide type safety

Extensions provide nested loop concepts and promote code correctness
Execution policies enable rapid testing of diverse parallelization strategies

Sequential policy:

using Pol = NestedPolicy<ExecList<…>>;

Parallel region, with OpenMP threading over groups:

using Pol = NestedPolicy<ExecList<…>, OMP_Parallel<…>>;

Parallel region, with OpenMP threading over groups, collapse(2), tiling zones by 512:

using Pol = NestedPolicy<ExecList<…>, OMP_Parallel<Permute<…>>>;

CUDA policy, mapping groups to threads and blocks in X (with 32 threads/block), and zones to blocks in Y:

using Pol = NestedPolicy<ExecList<…>>;

Complex loop nesting constructs are easy to implement without kernel changes
Kernel performance depends on choosing the execution policy that matches the architecture

[Figure: Grind Time for DGZ LTimes Kernel in Kripke for Various Execution Policies. Y axis: grind time (seconds/unknown); X axis: number of unknowns. Policies compared: collapse(2), omp_seq, par_collapse(2), par_omp_seq, par_tile256_omp_seq, par_tile512_omp_seq.]
DGZ Dirs/Sets: 96/8 Grp/Sets: 32/1 Zones: 8x8x8, 10x10x10, … (Git Hash: d3799680b5fe930423a29e0478329d2edaa2e8a7 )
Performance can be tuned with policies, without modifying kernels
Conclusion
§ Nested loop constructs are now officially in RAJA
  - Offload to threads (OpenMP) and GPU (CUDA)
  - Complex loop transformations are possible without impacting code

§ CoE interactions have been crucial
  - Vendor cooperation has been good
  - Intel machines are looking good…
    - Optimization improvements would make a marked improvement
    - OpenMP kernels in VTune are problematic
  - IBM and NVIDIA (nvcc) have the most issues to work out
    - Starting to explore solutions; may impact implementation details

§ Starting to explore implementation in ARDRA
  - Starting with concepts that seem less risky and incorporating them into ARDRA
  - Just starting to move to C++11 (Sequoia + XLC is the only blocker)
  - Hoping that we get more vendor issues worked out soon

§ Questions?