LLNL-PRES-681292 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Nested-Loop RAJA Extensions for Deterministic Transport
DOE Centers of Excellence Performance Portability Meeting
Adam J. Kunen
April 19, 2016
Kripke is a research tool that is informing the development of ASC codes

Internal Research and Development:
- Kripke v1.0: L, L+, Sweep (Aug 2014)
- Kripke v1.1: L, L+, Sweep, Source, Scattering (Sept 2015)
- Nested-Loop RAJA: Developing RAJA extensions for use in ARDRA (Ongoing)
- ARDRA: Porting to RAJA during FY16/FY17
- Research with Kripke motivated nested-loop abstractions

External PM Research/Development Collaborations:
- RAJA (L2): Demoed OMP/CUDA portability (FY15)
- Legion: Ongoing (Sam White, UIUC)
- STAPL: Ongoing (TAMU)
- OCCA: Demoed OKL DSL (David Medina, 2015)
- CORAL CoE: Exploring CUDA algorithms (Ongoing)
- Charm++: Ongoing (Sam White, UIUC)
Kripke demonstrates performance is sensitive to data layout, architecture and problem size

- Abstracting data layout and loop order enables performance portability
- Data layout and loop interchange affect performance
- … so do problem dimensions and architecture

[Figure: performance comparison. Left: Sandy Bridge, P4, 12^3 zones/thread, 64 groups, 96 directions. Right: BG/Q, P9, 12^3 zones/thread, 64 groups, 96 directions.]
Discrete ordinates transport needs a portable PM that treats multidimensional loops and data as first-class citizens

§ Sn transport is dominated by nested loops
  - High-dimensional phase space
  - Often loops are nested 2 to 5 deep (sometimes even more)
  - Many of our loops are perfectly nested
  - Complex iteration patterns (sweeps, multi-material)

§ Kripke shows performance is sensitive to:
  - Data layout, loop-nesting order
  - Architecture, compiler (vendor and version)
  - Problem specifications (zones, groups, directions, moments, etc.)

§ Given Architecture + Compiler + Problem:
  - Choose data layout and loop-nest order
  - Choose execution policies

§ Preparation for Sierra requires porting to GPU
  - CUDA? OpenMP?
  - What about exascale? What PM will we need then?

Sn transport needs performance-portable multidimensional PM concepts
Why do we need RAJA::forallN instead of just nesting the existing RAJA::forall?

RAJA::forall(IndexSet_I, [=](int i) {
  RAJA::forall(IndexSet_J, [=](int j) {
    RAJA::forall(IndexSet_K, [=](int k) {
      y[i*nj + j] += a[k] * x[j*nk + k];
    });
  });
});

Nested loops and multidimensional arrays are needed in addition to RAJA

Multi-Dimensional Arrays
- Manual array index calculations are error prone
- Hard-wires code for specific data layouts

Nested Loop Constructs
- Nesting RAJA::forall's works, but doesn't enable complex loop transformations
- Hard-wires code for specific patterns
- CUDA kernels require building a thread+block index space from multiple loop indices… very difficult with nested foralls
- Nested lambdas are problematic for OMP4 and CUDA
Nested-Loop RAJA extensions provide multidimensional support

§ Maintains the RAJA philosophy
  - Separate concepts of loop execution, iteration patterns and loop bodies
  - Minor code structure changes
  - Allows incremental transition to RAJA
  - Leverages existing RAJA code (forall, IndexSet, etc.)
  - Basic nested loops are functionally equivalent to nested RAJA::forall()'s

§ Arbitrary dimensionality
  - RAJA::forallN for any N
  - Using variadic template metaprogramming, no codegen (almost)

§ Loop transformations (for perfectly nested loops)
  - Loop interchange
  - Multi-level tiling/blocking
  - Mapping to CUDA threads and blocks
  - Loop collapsing (OpenMP collapse(N))

§ Data layouts
  - Arbitrary data striding orders

§ Portability (and hopefully performance)
  - Sequential, SIMD, OpenMP (CPU/GPU), CUDA
RAJA is a good starting point for an Sn programming model
How do we extend the RAJA::forall abstraction for single loops to nested loops?

Hand coded loops are inflexible and non-portable

Kripke v1.1: LTimes for DGZ layout

double *ell_ptr;
double *psi_ptr;
double *phi_ptr;
for (int nm = 0; nm < num_moments; ++nm) {
  double *ell_nm = ell_ptr + nm * num_loc_dir;
  double * __restrict__ phi_nm = phi_ptr + nm * num_loc_gz;
  for (int d = 0; d < num_loc_dir; d++) {
    double * __restrict__ psi_d = psi_ptr + d * num_loc_gz;
    double ell_nm_d = ell_nm[d];
    for (int gz = 0; gz < num_loc_gz; ++gz) {
      phi_nm[gz] += ell_nm_d * psi_d[gz];
    }
  }
}

Issues with this coding style:
§ Loop-nest order is fixed
  - Loop interchange requires a rewrite
§ Inner 2 loops are collapsed
  - An arch-specific optimization
§ Fixed layout of each variable
§ No obvious mapping to CUDA
  - Would need a rewrite
§ Code is just downright ugly
Nested-Loop RAJA abstracts nested loops, loop interchange, and data layouts

Kripke v1.1: LTimes for DGZ layout

double *ell_ptr;
double *psi_ptr;
double *phi_ptr;
for (int nm = 0; nm < num_moments; ++nm) {
  double *ell_nm = ell_ptr + nm * num_loc_dir;
  double * __restrict__ phi_nm = phi_ptr + nm * num_loc_gz;
  for (int d = 0; d < num_loc_dir; d++) {
    double * __restrict__ psi_d = psi_ptr + d * num_loc_gz;
    double ell_nm_d = ell_nm[d];
    for (int gz = 0; gz < num_loc_gz; ++gz) {
      phi_nm[gz] += ell_nm_d * psi_d[gz];
    }
  }
}

Kripke+RAJA: LTimes for all layouts

View_Psi psi(psi_ptr, …);
View_Phi phi(phi_ptr, …);
View_Ell ell(ell_ptr, …);

forallN(is_moment, is_dir, is_group, is_zone,
  [=](NM nm, Dir d, Group g, Zone z) {
    phi(nm, g, z) += ell(d, nm) * psi(d, g, z);
  });

Nested-loop abstraction promotes flexibility, while maintaining code structure
forallN extends the RAJA concepts needed for nested loops

RAJA::forallN(is_moment, is_dir, is_group, is_zone,
  [=](NM nm, Dir d, Group g, Zone z) {
    phi(nm, g, z) += ell(d, nm) * psi(d, g, z);
  });

Slide callouts:
- forallN() abstracts an N-nested loop
- Execution policies for each loop nest; loop transformations
- An IndexSet for each loop nest
- Loop body with type-safe indices
- Views abstract data access and provide type safety

Extensions provide nested loop concepts and promote code correctness
Execution policies enable rapid testing of diverse parallelization strategies

Sequential policy:

using Pol = NestedPolicy<ExecList<…>>;

Parallel region, with OpenMP threading over groups:

using Pol = NestedPolicy<ExecList<…>, OMP_Parallel<…>>;

Parallel region, with OpenMP threading over groups, collapse(2), tiling zones by 512:

using Pol = NestedPolicy<ExecList<…>, OMP_Parallel<Permute<…>>>;

CUDA policy, mapping groups to threads and blocks in X (with 32 threads/block), and zones to blocks in Y:

using Pol = NestedPolicy<ExecList<…>>;

Complex loop nesting constructs are easy to implement without kernel changes
Kernel performance depends on choosing the execution policy that matches the architecture

[Figure: Grind Time for DGZ LTimes Kernel in Kripke for Various Execution Policies. Y axis: grind time (seconds/unknown); X axis: number of unknowns. Policies compared: collapse(2), omp_seq, par_collapse(2), par_omp_seq, par_tile256_omp_seq, par_tile512_omp_seq.]
DGZ Dirs/Sets: 96/8 Grp/Sets: 32/1 Zones: 8x8x8, 10x10x10, … (Git Hash: d3799680b5fe930423a29e0478329d2edaa2e8a7 )
Performance can be tuned with policies, without modifying kernels
Conclusion
§ Nested loop constructs are now officially in RAJA
  - Offload to threads (OpenMP) and GPU (CUDA)
  - Complex loop transformations are possible without impacting code

§ CoE interactions have been crucial
  - Vendor cooperation has been good
  - Intel machines are looking good…
    - Optimization improvements would make a marked improvement
    - OpenMP kernels in VTune are problematic
  - IBM and NVIDIA (nvcc) have the most issues to work out
    - Starting to explore solutions; may impact implementation details

§ Starting to explore implementation in ARDRA
  - Starting with concepts that seem less risky and incorporating them into ARDRA
  - Just starting to move to C++11 (Sequoia + XLC is the only blocker)
  - Hoping that we get more vendor issues worked out soon

§ Questions?