GPU Hackathon 2017- OpenACC
Programming GPUs with OpenACC
Saber Feki, Computational Scientist Lead
Supercomputing Core Laboratory, KAUST [email protected]
GPU architecture
CPU-GPU memory model
PCIe interconnect, 16x: 8 GB/s (gen2) and 15.75 GB/s (gen3), a very thin pipe!
Kepler K40: 2,880 CUDA cores, 1.48 Tflop/s
GPU programming
OpenACC, the standard
• By NVIDIA, Cray, PGI and CAPS
• The standard was announced in Nov 2011 at the SC11 conference
• http://www.openacc-standard.org
• OpenACC 2.0 released in summer 2013
• Now, 20+ partners from academia and industry
OpenACC advantages
• Easy: Directives are the easy path to accelerate compute-intensive applications
• Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors
• Powerful: GPU directives allow complete access to the massive parallel power of a GPU
PGI and CAPS compilers study (I)
S. Feki, A. Al-Jarro, H. Bağcı. Porting an Explicit Time Domain Volume Integral Equation Solver on GPUs with OpenACC, IEEE Antennas and Propagation Magazine, July 2014
CAPS:
#pragma acc kernels
{
for (l = 0; l < nt; ++l) { // time loop
  #pragma acc loop independent collapse(3)
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
      for (int k = 0; k < n; ++k) {
        B[i][j][k] = B[i][j][k] + ....
      }
    }
  }
  #pragma acc loop independent collapse(3)
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
      for (int k = 0; k < n; ++k) {
        B[i][j][k] = B[i][j][k] + ....
      }
    }
  }
} // end time loop
}

PGI:
#pragma acc data
for (l = 0; l < nt; ++l) { // time loop
  #pragma acc kernels
  #pragma acc loop independent gang
  for (int i = 0; i < n; ++i) {
    #pragma acc loop independent gang, vector
    for (int j = 0; j < n; ++j) {
      #pragma acc loop independent gang, vector
      for (int k = 0; k < n; ++k) {
        B[i][j][k] = B[i][j][k] + ....
      }
    }
  }
  #pragma acc kernels
  #pragma acc loop independent gang
  for (int i = 0; i < n; ++i) {
    #pragma acc loop independent gang, vector
    for (int j = 0; j < n; ++j) {
      #pragma acc loop independent gang, vector
      for (int k = 0; k < n; ++k) {
        B[i][j][k] = B[i][j][k] + ....
      }
    }
  }
} // end time loop
PGI and CAPS compilers study (II)
[Figure: speedup (axis 0 to 35) versus number of degrees of freedom (x1000; 6 to 176) for the CAPS and PGI compilers]
S. Feki, A. Al-Jarro, H. Bağcı. Porting an Explicit Time Domain Volume Integral Equation Solver on GPUs with OpenACC, IEEE Antennas and Propagation Magazine, July 2014
Directive syntax
• Fortran
  !$acc directive [clause [[,] clause] …]
  … often paired with a matching end directive
  !$acc end directive
• C
  #pragma acc directive [clause [[,] clause] …]
  Often followed by a structured code block
kernels: Your first OpenACC Directive
• Each loop is executed as a separate kernel (a parallel function that runs on the GPU)
!$acc kernels
do i=1,n
  a(i) = 0.0
  b(i) = 1.0
  c(i) = 2.0
end do
do i=1,n
  a(i) = b(i) + c(i)
end do
!$acc end kernels
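For readers working in C, the same kernels region can be sketched as follows. This is only an illustration: the function name init_and_add and the explicit copyout clauses are assumptions added here, not part of the slide. A compiler without OpenACC support simply ignores the pragma and runs the loops on the CPU.

```c
/* C counterpart of the Fortran kernels region: two loops, each of which
 * the compiler may turn into its own accelerator kernel.
 * init_and_add is a hypothetical name used only for this sketch. */
void init_and_add(float *restrict a, float *restrict b, float *restrict c, int n)
{
    /* copyout: the arrays are only written here, so nothing is sent to the GPU */
    #pragma acc kernels copyout(a[0:n], b[0:n], c[0:n])
    {
        for (int i = 0; i < n; ++i) {   /* first kernel: initialization */
            a[i] = 0.0f;
            b[i] = 1.0f;
            c[i] = 2.0f;
        }
        for (int i = 0; i < n; ++i) {   /* second kernel: element-wise sum */
            a[i] = b[i] + c[i];
        }
    }
}
```

After a call such as init_and_add(a, b, c, n), every a[i] holds b[i] + c[i].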
Compile and run
• C: pgcc -acc [-Minfo=accel] -o saxpy_acc saxpy.c
• Fortran: pgf90 -acc [-Minfo=accel] -o saxpy_acc saxpy.f90
• Compiler output:
[sfeki@c4hdn saxpy]$ pgcc -acc -ta=nvidia -Minfo=accel -o saxpy saxpy.c
saxpy:
      5, Generating present_or_copyin(x[0:n])
         Generating present_or_copy(y[0:n])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
      6, Loop is parallelizable
         Accelerator kernel generated
          6, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             CC 1.0 : 8 registers; 48 shared, 0 constant, 0 local memory bytes
             CC 2.0 : 12 registers; 0 shared, 64 constant, 0 local memory bytes
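A source file consistent with that compiler output might look like the sketch below. The exact contents and line layout of saxpy.c are not shown on the slide, so this is an assumption: x is only read (hence a copyin), y is read and written (hence a copy), and a single parallelizable loop follows the kernels directive.

```c
/* Single-precision a*x + y. With restrict-qualified pointers and a loop
 * bounded by n, the compiler can infer the transfers x[0:n] and y[0:n]
 * on its own; no explicit data clauses are written here. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc kernels
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The -Minfo=accel flag is what makes the compiler report the data movement and loop schedule it chose.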
SAXPY example, revisited
Jacobi Iteration: C code
Jacobi Iteration: OpenACC code
PGI Accelerator Compiler output
What went wrong?
Excessive data transfer
Another way of detecting it: NVIDIA Profiler
• Use nvprof for profiling the GPU application:
• Use the NVVP GUI (NVIDIA Visual Profiler):
Data construct
• Fortran
  !$acc data [clause …]
    structured block
  !$acc end data
• C
  #pragma acc data [clause …]
  { structured block }
• Manages data movement. Data regions may be nested.
• General clauses
  if (condition)
  async (expression)
Data clauses
• copy(list): Allocates memory on GPU and copies data from host to GPU when entering region and copies data to the host when exiting region.
• copyin(list): Allocates memory on GPU and copies data from host to GPU when entering region.
• copyout(list): Allocates memory on GPU and copies data to the host when exiting region.
• create(list): Allocates memory on GPU but does not copy.
• present(list): Data is already present on GPU from another containing data region.
• and present_or_copy[in|out], present_or_create, deviceptr.
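The clauses above can be combined in one data region that spans an iterative loop, so the arrays stay on the GPU between kernels. The sketch below is illustrative only; the function accumulate_steps and its arguments are hypothetical names.

```c
/* in  : read-only input, sent to the GPU once        (copyin)
 * out : result, brought back to the host at the end  (copyout)
 * tmp : scratch space that never leaves the GPU      (create)  */
void accumulate_steps(int n, int steps, const float *restrict in,
                      float *restrict out, float *restrict tmp)
{
    #pragma acc data copyin(in[0:n]) copyout(out[0:n]) create(tmp[0:n])
    {
        /* copyout memory starts uninitialized on the device: zero it there */
        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            out[i] = 0.0f;

        for (int s = 0; s < steps; ++s) {
            /* no host/device transfers happen inside this loop */
            #pragma acc kernels
            {
                for (int i = 0; i < n; ++i)
                    tmp[i] = 0.5f * in[i];
                for (int i = 0; i < n; ++i)
                    out[i] += tmp[i];
            }
        }
    }
}
```

Without the enclosing data region, each kernels region would be free to move the arrays on every step, which is the excessive-transfer pattern the data construct is meant to avoid. Note that tmp on the host is not guaranteed to hold anything meaningful after the call when the code runs on a GPU.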
Array shaping
• Compiler sometimes cannot determine size of arrays
• Must specify explicitly using data clauses and array “shape”
• C
  #pragma acc data copyin(a[0:size]), copyout(b[s/4:3*s/4])
• Fortran
  !$acc data copyin(a(1:size)), copyout(b(s/4:3*s/4))
• Note: data clauses can be used on data, kernels or parallel
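A small sketch of array shaping in C follows. In C the shape is written [start:length], so b[s/4:3*s/4] describes 3*s/4 elements beginning at index s/4, whereas the Fortran form gives lower and upper bounds. The function fill_tail and the computation inside it are hypothetical, chosen only to exercise the clause from the slide; it assumes size is at least s/4 + 3*s/4.

```c
/* Writes only the shaped part of b, i.e. indices start .. start+count-1. */
void fill_tail(int size, int s, const float *restrict a, float *restrict b)
{
    int start = s / 4;        /* first element of b that is transferred */
    int count = 3 * s / 4;    /* number of elements of b transferred    */

    #pragma acc data copyin(a[0:size]) copyout(b[start:count])
    {
        #pragma acc kernels loop independent
        for (int i = start; i < start + count; ++i)
            b[i] = 2.0f * a[i];
    }
}
```

Elements of b outside the shaped range are neither allocated on the device nor touched on the host.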
Jacobi Iteration: OpenACC C Code, Revisited
Performance numbers
New NVIDIA profiles
CUDA Kernels
• Threads are grouped into blocks
• Blocks are grouped into a grid
• A kernel is executed as a grid of blocks of threads
Thread blocks
• Thread blocks allow cooperation
  – Cooperatively load/store blocks of memory that they all use
  – Share results with each other or cooperate to produce a single result
  – Synchronize with each other
• Thread blocks allow scalability
  – Blocks can execute in any order, concurrently or sequentially
  – This independence between blocks gives scalability: a kernel scales across any number of SMs
Mapping OpenACC to CUDA I
• The OpenACC execution model has three levels: gang, worker, and vector
• Allows mapping to an architecture that is a collection of Processing Elements (PEs)
  – One or more PEs per node
  – Each PE is multi-threaded
  – Each thread can execute vector instructions
• Tile clause in OpenACC 2.0
Mapping OpenACC to CUDA II
• For GPUs, the mapping is implementation-dependent. Some possibilities:
  – gang == block, worker == warp, and vector == threads of a warp
  – omit “worker” and just have gang == block, vector == threads of a block
• Depends on what the compiler thinks is the best mapping for the problem
• ... But explicitly specifying that a given loop should map to gangs, workers, and/or vectors is optional anyway
  – Further specifying the number of gangs/workers/vectors is also optional
  – So why do it? To tune the code to fit a particular target architecture in a straightforward and easily re-tuned way.
OpenACC loop directive and clauses
#pragma acc kernels loop
for (int i = 0; i < n; ++i)
  y[i] += a*x[i];
Uses whatever mapping to threads and blocks the compiler chooses. Perhaps 16 blocks, 256 threads each.

#pragma acc kernels loop gang(100), vector(128)
for (int i = 0; i < n; ++i)
  y[i] += a*x[i];
100 thread blocks, each with 128 threads, each thread executes one iteration of the loop, using kernels.

#pragma acc parallel num_gangs(100), vector_length(128)
{
  #pragma acc loop gang, vector
  for (int i = 0; i < n; ++i)
    y[i] += a*x[i];
}
100 thread blocks, each with 128 threads, each thread executes one iteration of the loop, using parallel.
Mapping OpenACC to CUDA threads and blocks
• Nested loops generate multi-dimensional blocks and grids:
#pragma acc kernels loop gang(100), vector(16)
for (…)
  #pragma acc loop gang(200), vector(32)
  for (…)
Outer loop: 100 blocks tall (row/Y direction), each block 16 threads tall
Inner loop: 200 blocks wide (column/X direction), each block 32 threads wide
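Filled in with a concrete loop body, the nested mapping could look like the sketch below. The clause values are copied from the slide; the function scale_2d, the flattened row-major array, and the copy clause are assumptions. Whether a given compiler accepts gang clauses on both nested loops depends on the compiler and the OpenACC version it implements, so treat this as illustrative rather than portable.

```c
/* Scales every element of a rows x cols matrix stored in row-major order. */
void scale_2d(int rows, int cols, float factor, float *restrict m)
{
    /* outer loop: Y direction of the grid, 16-thread-tall blocks */
    #pragma acc kernels loop gang(100), vector(16) copy(m[0:rows*cols])
    for (int r = 0; r < rows; ++r) {
        /* inner loop: X direction of the grid, 32-thread-wide blocks */
        #pragma acc loop gang(200), vector(32)
        for (int c = 0; c < cols; ++c)
            m[r * cols + c] *= factor;
    }
}
```

Placing the wider vector dimension on the inner loop matches the contiguous index of a row-major array, which is generally what favors coalesced memory access on a GPU.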
Other clauses for loop directive
#pragma acc loop [clauses]
• independent: for independent loops
• seq: for sequential execution of the loop
• reduction: for reduction operations such as min, max, etc.
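Two of these clauses are sketched below: reduction for a maximum over all iterations, and seq for a loop whose iterations depend on each other. Both functions (max_abs_diff and prefix_sum) are hypothetical examples, not taken from the slides.

```c
/* Largest absolute difference between two arrays: a max reduction. */
float max_abs_diff(int n, const float *restrict a, const float *restrict b)
{
    float err = 0.0f;
    #pragma acc kernels copyin(a[0:n], b[0:n])
    {
        #pragma acc loop reduction(max:err)
        for (int i = 0; i < n; ++i) {
            float d = a[i] - b[i];
            if (d < 0.0f) d = -d;       /* absolute value */
            if (d > err) err = d;       /* running maximum */
        }
    }
    return err;
}

/* In-place running sum: iteration i needs the result of iteration i-1,
 * so the loop is marked seq to keep it sequential on the device. */
void prefix_sum(int n, float *restrict x)
{
    #pragma acc kernels copy(x[0:n])
    {
        #pragma acc loop seq
        for (int i = 1; i < n; ++i)
            x[i] += x[i - 1];
    }
}
```

The same max reduction pattern is what a Jacobi solver typically uses to track the largest change between two sweeps.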
Jacobi example … again
With kernels and data directives
Jacobi example … again
After adding loop directive with gang and vector clauses
An opportunity for Auto-tuning
• Gang and vector values can be auto-tuned for the application, targeting the available accelerator device
[Figure: performance speedup (axis 1.00 to 2.60) across problem sizes; bar labels range from 1.10 to 2.54]
S. Siddiqui, F. Al-Zayer, S. Feki. Historic Learning Approach for Auto-tuning OpenACC Accelerated Scientific Applications, iWAPT 2014, Eugene, Oregon, USA
An opportunity for Auto-tuning
Input code annotated with OpenACC
#pragma acc kernels
#pragma acc loop independent
for (x = 4; x < nx-4; x++) {
  #pragma acc loop independent
  for (y = 4; y < ny-4; y++) {
    #pragma acc loop independent
    for (z = 4; z < nz-4; z++) {
      U[x][y][z] = c1*V[x][y][z] + ....
    }
  }
}
Accelerator Specification
Automatic code generator
#pragma acc kernels
#pragma acc loop independent gang(a), vector(b)
for (x = 4; x < nx-4; x++) {
  #pragma acc loop independent gang(c)
  for (y = 4; y < ny-4; y++) {
    #pragma acc loop independent vector(d)
    for (z = 4; z < nz-4; z++) {
      U[x][y][z] = c1*V[x][y][z] + ....
    }
  }
}

#pragma acc kernels
#pragma acc loop independent gang(a)
for (x = 4; x < nx-4; x++) {
  #pragma acc loop independent gang(b), vector(c)
  for (y = 4; y < ny-4; y++) {
    #pragma acc loop independent vector(d)
    for (z = 4; z < nz-4; z++) {
      U[x][y][z] = c1*V[x][y][z] + ....
    }
  }
}

#pragma acc kernels
#pragma acc loop independent gang(a), vector(b)
for (x = 4; x < nx-4; x++) {
  #pragma acc loop independent vector(c)
  for (y = 4; y < ny-4; y++) {
    #pragma acc loop independent gang(d), vector(e)
    for (z = 4; z < nz-4; z++) {
      U[x][y][z] = c1*V[x][y][z] + ....
    }
  }
}

#pragma acc kernels
#pragma acc loop independent
for (x = 4; x < nx-4; x++) {
  #pragma acc loop independent gang(a), vector(b)
  for (y = 4; y < ny-4; y++) {
    #pragma acc loop independent gang(c), vector(d)
    for (z = 4; z < nz-4; z++) {
      U[x][y][z] = c1*V[x][y][z] + ....
    }
  }
}
Runtime evaluation and selection
Database
Jacobi example … again
• Which other optimizations can we still apply?
  – Restructuring the code will enhance both the CPU and GPU versions
  – Hint: reduce memory operations
OpenACC Runtime Library
• In C: #include "openacc.h"
• In Fortran: #include 'openacc_lib.h' or use openacc
• Contains:
  – Prototypes of all routines
  – Definitions of the data types used in these routines, including an enumeration type describing types of accelerators
OpenACC Runtime Library Definitions
• openacc_version: has a value yyyymm (year and month of the OpenACC version)
• acc_device_t: type of accelerator device
  – acc_device_none
  – acc_device_default
  – acc_device_host
  – acc_device_not_host
OpenACC Runtime Library Routines I
• acc_get_num_devices: returns the number of devices of the given type attached to the host
• acc_set_device_type: tells which type of device to use when executing an accelerator parallel or kernels region
• acc_get_device_type: tells which type of device will be used for the next accelerated region
• acc_set_device_num: specifies which device to use
• acc_get_device_num: returns the device number of the specified device type that will be used to run the next accelerator parallel or kernels region
OpenACC Runtime Library Routines II
• acc_init: initializes the runtime; can be used to isolate the initialization cost from the computation cost
• acc_shutdown: shuts down the connection to the device and frees any allocated resources
• acc_malloc: allocates memory on the accelerator device
• acc_free: frees memory on the accelerator device
OpenACC Runtime Library Routines: use case
• Porting an MPI code to multiple GPUs.
• Example: running on 8 nodes with 4 GPUs each, i.e. 32 MPI processes
• acc_init(acc_device_not_host)
• acc_set_device_num(rank % 4, acc_device_not_host)
• Each node runs 4 MPI processes, each of which offloads compute kernels to a separate GPU
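The device selection step can be wrapped in a small helper, sketched below under a few assumptions: the rank is passed in as a plain integer (in a real code it would come from MPI_Comm_rank), ranks are placed consecutively on each node, and bind_rank_to_device is a hypothetical name. The runtime calls are guarded by the _OPENACC macro so the same source still builds with a compiler that has no OpenACC support.

```c
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Picks the GPU for one MPI process and returns the chosen device number. */
int bind_rank_to_device(int rank, int gpus_per_node)
{
    int dev = rank % gpus_per_node;   /* e.g. ranks 0..3 -> GPUs 0..3 */

#ifdef _OPENACC
    /* Only touch the runtime if at least one accelerator is attached. */
    int available = acc_get_num_devices(acc_device_not_host);
    if (available > 0) {
        acc_init(acc_device_not_host);
        acc_set_device_num(dev % available, acc_device_not_host);
    }
#endif
    return dev;
}
```

Calling the helper once, right after MPI initialization and before the first accelerated region, keeps every later kernels or parallel region of that process on the same GPU.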
S. Feki, A. Al-Jarro, H. Bağcı. Multiple GPUs Electromagnetics Simulations using MPI and OpenACC, Poster in GPU Technology Conference, San Jose, California, USA, March 24-27, 2014
OpenACC and CUDA libraries
GPU accelerated libraries
Sharing data with libraries
• CUDA libraries and OpenACC both operate on device arrays
• OpenACC provides mechanisms for interoperability with library calls
  – deviceptr data clause
  – host_data construct
• Note: the same mechanisms are useful for interoperability with custom CUDA C/C++/Fortran code
deviceptr Data Clause
deviceptr(list): Declares that the pointers in list are device pointers, so the data they point to need not be allocated or moved between the host and device.
Example:
• C: #pragma acc data deviceptr(d_input)
• Fortran: !$acc data deviceptr(d_input)
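A minimal sketch of deviceptr in use follows, pairing it with acc_malloc and acc_free from the runtime library. The function sum_of_squares_on_device is hypothetical. When the code is built without OpenACC, the _OPENACC guard falls back to ordinary malloc and free so the example still runs on the host.

```c
#include <stdlib.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Fills a device-resident buffer with 0..n-1 and returns the sum of squares.
 * Returns -1.0f if the allocation fails. */
float sum_of_squares_on_device(int n)
{
    float total = 0.0f;
    size_t bytes = (size_t)n * sizeof(float);

#ifdef _OPENACC
    float *d_input = (float *)acc_malloc(bytes);  /* device memory */
#else
    float *d_input = (float *)malloc(bytes);      /* host fallback */
#endif
    if (d_input == NULL)
        return -1.0f;

    /* d_input is already a device address: nothing to allocate or copy */
    #pragma acc data deviceptr(d_input)
    {
        #pragma acc kernels
        {
            #pragma acc loop independent
            for (int i = 0; i < n; ++i)
                d_input[i] = (float)i;

            #pragma acc loop reduction(+:total)
            for (int i = 0; i < n; ++i)
                total += d_input[i] * d_input[i];
        }
    }

#ifdef _OPENACC
    acc_free(d_input);
#else
    free(d_input);
#endif
    return total;
}
```

The same pattern applies when the device pointer comes from cudaMalloc or from a CUDA library instead of acc_malloc.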
host_data Construct
• Makes the address of device data available on the host.
• use_device(list): Tells the compiler to use the device address for any variable in list. Variables in the list must be present in device memory due to data regions that contain this construct.
• Example
  – C: #pragma acc host_data use_device(d_input)
  – Fortran: !$acc host_data use_device(d_input)
Summary on device pointers
• Use the deviceptr data clause to pass pre-allocated device data to OpenACC regions and loops
• Use host_data to get the device address for pointers inside acc data regions
• The same techniques shown here can be used to share device data between OpenACC loops and
  – Your custom CUDA C/C++/Fortran/etc. device code
  – Any CUDA library that uses CUDA device pointers
Thanks!