Using CAPS Compiler on NVIDIA Kepler and CARMA Systems
F. Bodin
CTO, CAPS entreprise
Introduction
• CAPS develops programming tools to help write a single source code that can be executed on existing accelerator technologies
o C / C++ / Fortran
• Fast-moving hardware systems require two directive sets
o OpenHMPP: easy to extend, integrates new hardware features quickly
o OpenACC: standardized, longer-term view but moving slowly
• Generates CUDA or OpenCL code
o Portable across AMD GPUs and APUs, Intel MIC, NVIDIA Kepler and CARMA, …
CAPS Technology
• Provides OpenACC and OpenHMPP directives
o OpenHMPP: codelet based
o OpenACC: code region based
#pragma hmpp myfunc codelet, …
void saxpy(int n, float alpha, float x[n], float y[n]){
#pragma hmppcg gridify(i)
for(int i = 0; i<n; ++i)
y[i] = alpha*x[i] + y[i];
}
#pragma acc kernels …
{
for(int i = 0; i<n; ++i)
y[i] = alpha*x[i] + y[i];
}
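For completeness, a hedged sketch of how the OpenHMPP codelet above could be called from host code (the callsite directive also appears in the examples later in this deck; the arrays x and y are illustrative):

// Hypothetical host-side use of the saxpy codelet declared above:
// the callsite directive makes the call execute on the accelerator,
// with the runtime handling the default data transfers for x and y.
#pragma hmpp myfunc callsite
saxpy(n, 2.0f, x, y);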
Compilation Process
• Source-to-source technology
[Diagram: the C, C++ and Fortran frontends feed an extraction module that splits the application into host code and codelets (Fun #1, Fun #2, Fun #3). Codelets go through OpenCL/CUDA generation and the native compilers to build the HWA code (a dynamic library); the host code goes through an instrumentation module and the CPU compiler (gcc, ifort, …) to build the executable (mybin.exe), which runs on top of the CAPS Runtime.]
A Few Typical Situations
1. Simple nested loops
2. Data transfer optimization
3. Complex loop nests
4. Code tuning
5. Integrating auto-tuning techniques
6. Dealing with accelerated libraries
7. Dealing with dynamic accelerated tasks scheduling
8. Using multiple accelerators
9. Nested parallelism using native
Simple Nested Loops - 1
• The simplest construct is to declare a parallel loop to be compiled and executed on an accelerator
o Iterations of the loop nest are converted into threads
• Data-in and data-out declarations are used to determine the data to move between the host and the accelerator
[Diagram: the host CPU code sends A, B and C to the accelerator, launches kernel 1, waits for the accelerator, then gets A back.]
Simple Nested Loops - 2
• Example of a stencil computation
...
#pragma acc kernels pcopyin(A[0:m]) pcopy(B[0:m])
{
float c11,c12,c13,c21,c22,c23,c31,c32,c33;
c11 = +2.0f; c21 = +5.0f; c31 = -8.0f; ...
#pragma acc loop independent
for (int i = 1; i < M - 1; ++i){
#pragma acc loop independent
for (int j = 1; j < N - 1; ++j){
B[i][j] = c11*A[i-1][j-1]+c12*A[i][j-1]...;
}}}
...
Data Transfer Optimization - 1
• Data transfers between the host CPU and the accelerator may have a very negative impact on performance
• A set of directives is provided to keep data on the accelerator beyond the execution of one kernel
[Diagram: without such directives, A is sent to the accelerator and retrieved from it around each call site (csite 1 to csite 4) of kernels 1 to 4, instead of staying resident on the device.]
Data Transfer Optimization - 2
• Example from HydroC*
void hydro_godunov (…)
{
#pragma acc data \
    create(qleft[0:H.nvar], qright[0:H.nvar], ...) \
copy(uold[0:H.nvar*H.nxt*H.nyt]) \
copyin(Hstep)
{
for (j = Hmin; j < Hmax; j += Hstep){
// compute many slices each pass
int jend = j + Hstep;
if (jend >= Hmax)
jend = Hmax;
. . .// the work here
} // for j
}//end of data region
...
Data is left on the GPU during the step loop; pcopy clauses are used in the called routines.
* Pierre-Francois Lavallee (IDRIS/CNRS), Guillaume Colin de Verdiere (CEA, Centre DAM), Philippe Wautelet (IDRIS/CNRS), Dimitri Lecas (IDRIS/CNRS), Jean-Michel Dupays (IDRIS/CNRS)
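As a hedged illustration of the remark above (the routine name and the computation are hypothetical, not taken from HydroC), a routine called from inside the data region can reuse the device copy of uold through a pcopy clause instead of triggering a new transfer:

// Hypothetical routine called from inside the "acc data" region above.
// pcopy (present_or_copy) reuses the device-resident uold created by the
// enclosing data region, so no host/device transfer happens here.
void scale_uold(double *uold, int nvar, int nxt, int nyt)
{
  const int len = nvar * nxt * nyt;
  #pragma acc kernels pcopy(uold[0:len])
  {
    #pragma acc loop independent
    for (int i = 0; i < len; ++i)
      uold[i] = 0.5 * uold[i];   /* placeholder computation */
  }
}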
[Call-tree profile of HydroC: main (100%) > hydro_godunov (97.0%), which breaks down into riemann (48.2%), slope (11.2%), trace (10.1%), updateConservativeVars (5.6%), gatherConservativeVars (5.4%), equationofstate (5.0%), qleftright (4.9%), constoprim (4.6%), cmpflx (3.0%); plus compute_deltat (2.9%).]
Complex Loop Nests - 1
• Non-perfectly nested loops can be challenging to parallelize efficiently
o OpenACC parallel regions provide control over the parallelization scheme
o This requires distributing the iteration spaces onto gangs and workers, as sketched below
[Diagram: on the device, gangs contain workers, which operate on vectors.]
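Before the full NIM extract below, here is a minimal sketch of this gang/worker distribution on a non-perfectly nested loop pair (array names and sizes are illustrative, not from the source):

/* Minimal sketch: the outer loop is spread over gangs, the inner loop
   over workers/vector lanes, mirroring the scheme of the NIM extract. */
#pragma acc parallel num_gangs(512) num_workers(64) vector_length(1) \
            copyin(in[0:npts*nz], coef[0:npts]) copyout(out[0:npts*nz])
{
  #pragma acc loop gang
  for (int ipn = 0; ipn < npts; ++ipn) {
    float s = 2.0f * coef[ipn];            /* per-gang scalar work */
    #pragma acc loop worker vector
    for (int k = 0; k < nz; ++k)
      out[ipn*nz + k] = s * in[ipn*nz + k];
  }
}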
Complex Loop Nests - 2
• Extract from the NOAA Nonhydrostatic Icosahedral Model (NIM) code
!$acc parallel present(nprox,prox,u,...) vector_length(1) num_workers(64) num_gangs(512)
!$acc loop gang private (rhsu,...) private(ipn,k,isn,...)
do ipn=ips,ipe
n = nprox(ipn)
ipp1 = prox(1,ipn)
...
!$acc loop worker vector
do k=1,nz-1
rhsu(k,1) = cs(1,ipn)*u(k ,ipp1)...
...
enddo !k-loop
k=nz-1
rhsu(k+1,1) = cs(1,ipn)*u(k ,ipp1)...
...
!$acc loop worker vector private(wk)
do k=1,nz
    ! ... many statements elided
enddo !k-loop
!$acc loop seq
do isn = 1,nprox(ipn)
!$acc loop worker vector
do k=1,nz-1
Tgtu(k,isn) = ...
enddo !k-loop
Tgtu(nz,isn) = 2.*Tgtu(nz-1,isn) - ...
end do ! isn-loop
!$acc loop seq
do isn = 1,nprox(ipn)
isp=mod(isn,nprox(ipn))+1
!$acc loop worker vector
do k = 2,nz-1
...
end do ! k -loop
sedgvar( 1,isn,ipn,1)=(zm(1,ipn)...
...
end do ! isn-loop
!$acc loop worker vector
do k=1,nz
kp1=min(nz,k+1)
...
end do
bedgvar(0,ipn,1)=...
enddo !ipn-loop
!$acc end parallel
Accelerating Heat Transfer Ray-Tracing Code with CAPS OpenACC Compiler
• PROMES Laboratory application: ray-by-ray heat transfer simulation (double precision)
• CAPS experiments on a FERMI C2050 and a K20, compared to a Sandy Bridge E5-2687W
o Speedup of 19x with the FERMI C2050
o Speedup of 34x with the K20
[Chart: execution times in seconds for various numbers of rays (3840, 15360, 76800, 200000) and configurations; the annotated point shows the 19x speedup.]
OpenACC on Nvidia CARMA (CUDA on ARM)
• Heat transfer ray-tracing code with the CAPS OpenACC compiler
o ARM CPU + accelerator target
o Speedup of 12x with the ARM cores and the CARMA GPU
OpenHMPP MD Example
• Molecular dynamics codes (HLRS / Colin Glass)
• From the APOS project (http://apos-project.eu)
Source: HLRS
OpenHMPP Stereo Vision Example
• Stereo Matching (ESAW) on a GTX 550 Ti
• Result from the ANR Compa project
From: Jinglin Zhang, Jean-Francois Nezan, Jean-Gabriel Cousin, Erwan Raffin, "Implementation of Stereo Matching Using A High Level Compiler for Parallel Computing Acceleration", IVCNZ '12, November 26-28, 2012, Dunedin, New Zealand
Code Tuning - 1
• The more optimized a code is, the less portable it is
o Optimized code tends to saturate some hardware resources
o The return on investment of parallelism varies a lot, i.e. the number of threads and the workload need to be tuned
o Many resources are not virtualized on the HWA (e.g. registers, number of threads)
[Charts: a radar plot compares an optimized and a non-optimized stencil code (normalized) on threads, registers per thread, L1 hit ratio, memory throughput and occupancy; a second plot sketches performance versus number of cores for two hardware platforms, HW1 and HW2.]
Code Tuning - 2
• Express code transformations via directives
#pragma hmpp <mygroup> sgemm codelet, args[tout].io=inout, &
#pragma hmpp & args[*].mirror args[*].transfer=manual
void sgemm( float alphav[1], float betav[1], const float t1[SIZE][SIZE],
const float t2[SIZE][SIZE], float tout[SIZE][SIZE] ) {
int j, i;
const float alpha = alphav[0], beta = betav[0];
#pragma hmppcg(OCL) unroll i:4, j:4, split(i), noremainder(i,j), jam
#pragma hmppcg gridify (j,i)
for( j = 0 ; j < SIZE ; j++ ) {
for( i = 0 ; i < SIZE ; i++ ) {
int k;
float prod = 0.0f;
for( k = 0 ; k < SIZE ; k++ ) {
prod += t1[k][i] * t2[j][k];
}
tout[j][i] = alpha * prod + beta * tout[j][i];
}
}
}
The hmppcg directives express the loop transformations (unroll, split, jam); the (OCL) qualifier applies them only when compiling for OpenCL.
Integrating Auto-Tuning Techniques - 1
• Adaptation of the code at runtime
o Use multiple calls to a kernel to find the most efficient variant
• Need to create an optimization space to explore
o Compiler issue: emit runtime-configurable #gangs, #workers, #vectors
• Need a way to explore the optimization space
o Auto-tuning driver issue
o May also focus on execution time or energy
[Diagram: the HMPP compiler turns the source code into an auto-tunable executable; at runtime, an auto-tuning driver uses the hmpp profiling interface to collect profiling data and explore the space of variants.]
Integrating Auto-Tuning Techniques - 2-a
• Auto-tuning implementation of a blur filter in OpenACC
• Explore dynamic parameters (e.g. #gangs, #workers)
size_t gangs[] = { 8, 16, 32, 64, 128, 128, 8, 16, 32, 64, 128, 256 };
size_t workers[] = { 16, 16, 16, 16, 16, 16, 24, 24, 24, 24, 24, 24 };
…
while (nber_of_iterations < max_iterations) {
…
variant = variantSelectorState("kernel.c:21",
(sizeof(gangs)/sizeof(size_t))-1);
blur(images[(currentImage + 1) % 2], image_caps, width, height,
blockSize, gangs[variant], workers[variant]);
…
}
#pragma acc parallel, copyin(dst_caps[0:height*width]), \
        copyout(src_caps[0:height*width]), num_gangs(gangs), \
        num_workers(workers), vector_length(32)
{
#pragma acc loop, gang
for (tileY = 0; tileY < tileCountY; tileY++) {
for (tileX = 0; tileX < tileCountX; tileX++) {
…
The gangs and workers arrays define the parameter space to explore; the auto-tuning driver selects a variant at each call, and num_gangs/num_workers parameterize the parallel region.
Integrating Auto-Tuning Techniques - 2-b
• Auto-tuning driver behavior with DNADist (bioinformatics)
• Data can be collected over multiple executions
[Charts: kernel computation time in seconds (lower is better) per kernel call; an exploration phase is followed by a steady state once the best configuration is found, e.g. config. 10 = 256 gangs x 128 workers and config. 8 = 14 gangs x 16 workers.]
Integrating Auto-Tuning Techniques - 3
• Variant-based auto-tuning in HMPP
• Explore compile-time code transformations / algorithms
#pragma hmpp <convolution> filter5x5 callsite variants( &
#pragma hmpp & filterStencil5x5@<convolution>[C], &
#pragma hmpp & filterStencil5x5_V1@<convolution>[CUDA], &
#pragma hmpp & filterStencil5x5_V2@<convolution>[CUDA]) &
#pragma hmpp & selector(filterVariantSelector)
filterStencil5x5(&fullHeigh, &width, stencil1, raster1, raster2);
void filterStencil5x5_V2(const uint32 p_heigh[1],
const uint32 p_width[1], const RasterType filter[5][5],
const RasterType *p_inRaster, RasterType *p_outRaster){
. . .
#pragma hmppcg grid blocksize "32x4"
#pragma hmppcg unroll 6, jam
for (i = stencil; i < heigh - stencil; i++) {
...
void filterStencil5x5_V1(const uint32 p_heigh[1],
const uint32 p_width[1], const RasterType filter[5][5],
const RasterType *p_inRaster, RasterType *p_outRaster){
. . .
#pragma hmppcg grid blocksize "32x4"
#pragma hmppcg unroll 6, jam
for (i = stencil; i < heigh - stencil; i++) {
...
The variants clause lists the codelet variants to choose from; the selector function (filterVariantSelector) picks one at each call.
Dealing with Accelerated Libraries - 1
• Library calls can usually only be partially replaced
o No one-to-one mapping between libraries (e.g. BLAS, FFTW, CuFFT, CULA, ArrayFire)
o No access to all application code (i.e. avoid side effects)
o Want a single source code
• Deal with multiple address spaces / multiple HWAs
o Data location may not be unique (copies, mirrors)
o Usual library calls assume shared memory
o Library efficiency depends on up-to-date data location
• Libraries can be written in many different languages
o CUDA, OpenCL, OpenHMPP, etc.
Dealing with Accelerated Libraries - 2
• Using the cuFFT accelerated version when the source code is FFTW based
o A set of proxies implements the FFT accelerated version of the calls
o Allows resource sharing between the library and user code to reduce data transfers between host and accelerator memories
o Only marked calls are executed on the accelerator
. . .
#pragma hmppalt cufft call, name="fftw_plan_dft_c2r_1d_sharing"
pc2r = fftw_plan_dft_c2r_1d(n, odata_intermediate,
odata_real_GPU,FFTW_ESTIMATE);
#pragma hmppalt cufft call, name="fftw_execute_sharing"
fftw_execute(pr2c);
#pragma hmpp <my_grp> filter callsite
filter(n, (double _Complex *)odata_intermediate, cf);
#pragma hmppalt cufft call, name="fftw_execute_sharing"
fftw_execute(pc2r);
. . .
Dealing with Dynamic Accelerated Tasks Scheduling - 1
• When dealing with a large number of small parallel tasks
o Need to exploit the accelerator's asynchronous execution
• Madness (Multiresolution ADaptive NumErical Scientific Simulation)
o Madness integrates its own task manager
o Study in collaboration with ORNL
[Diagram: the task C/Fortran source code is compiled by the CAPS compiler into OpenCL/CUDA code; tasks T0 to T9 are dispatched to accelerator queues Q1, Q2 and Q3, and dependent tasks are sent to the same queue.]
Dealing with Dynamic Accelerated Tasks Scheduling - 2
• Example of task queuing
• Exploits the CAPS compiler code generation
• Close to the OpenCL API, but the task code remains C/Fortran
// Enqueue the transfers and the codelet call on each queue
for(int i = 0; i < nb_arrays; i++){
  hmpprt::Queue *myQueue = (hmpprt::Queue *) myQueues[i];
  myQueue->enqueueUpload(g_A[i], h_A[i]);    // host -> device transfers
  myQueue->enqueueUpload(g_B[i], h_B[i]);
  hmpprt::ArgumentList myArguments;
  myArguments.addArgument(g_A[i]);
  myArguments.addArgument(g_B[i]);
  myArguments.addArgument(g_C[i]);
  myQueue->enqueueCall(myDevice, myCodelet, myArguments);  // codelet call
  myQueue->enqueueDownload(g_C[i], h_C[i]);  // device -> host transfer
}
// Start all queues asynchronously
for(int i = 0; i < nb_arrays; i++){
  hmpprt::Queue *myQueue = (hmpprt::Queue *) myQueues[i];
  myQueue->start();
}
// Wait for all queues to complete
for(int i = 0; i < nb_arrays; i++){
  hmpprt::Queue *myQueue = (hmpprt::Queue *) myQueues[i];
  myQueue->wait();
}
Using Multiple Accelerators - 1
• Nodes can have multiple accelerators
• A one-thread / one-MPI-process mapping is very limited
• Having to write CPU parallel code to exploit multiple accelerators is inconvenient
• The OpenHMPP multi-accelerator features are based on
o Data distribution
o An owner-compute rule to allocate the tasks/codelets to the devices
Using Multiple Accelerators - 2

#pragma hmpp <MyGroup> parallel
for(k=0;k<n;k++) {
  #pragma hmpp <MyGroup> f1 callsite
  myparallelfunc(d[k],n);
}

[Diagram: the data d0, d1, d2, d3 are distributed from main memory over the device memories of GPU 0, GPU 1 and GPU 2; each callsite executes on the device that owns its data.]
Using Multiple Accelerators - 3
• A parallel for loop implements a "map" operation
#pragma hmpp parallel, device="i%2"
for( i=0;i<NB;i++){
//Allocate the mirrors for vin1, vin2 and vout
#pragma hmpp <mygroup> allocate, data["vin1[i]","vin2[i]", \
  "vout_multi[i]"], size={size,size}, elementsize="4"
. . .
}
#pragma hmpp parallel
for( i=0;i<NB;i++) {
//launch the codelet on the device
#pragma hmpp <mygroup> sgemm callsite
sgemm( vin1[i], vin2[i], vout_multi[i] );
}
The first loop distributes the data over the accelerators; the second executes the tasks in parallel according to the owner-compute rule.
Nested Parallelism Using Native - 1
• Exploit the K20's new capability for nested (dynamic) parallelism
#pragma hmpp cudadp codelet, target=CUDA, args[A].io=in, ...
void codelet(float *A, float *B, float *C, int nm, int ms, int size)
{
int i;
#pragma hmppcg gridify
for(i = 0 ; i < 1 ; i++)
{
#pragma hmppcg(CUDA) include("native_dynamic_parallelism.h")
#pragma hmppcg(CUDA) native(dynamic_parallelism)
dynamic_parallelism(A, B, C, nm, ms, size);
}
}
void dynamic_parallelism(float *A,float *B,float *C, ...)
{
#ifndef __HMPP
int i, j, k, l;
for (k = 0 ; k < nm ; k++) {
for (l = 0 ; l < nm ; l++) {
...
cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
ms, ms, ms, 1.0f, loA, nm * ms, loB, nm * ms,
1.0f, loC, nm * ms);
for (i = 0 ; i < ms ; i ++) {
for (j = 0 ; j < ms ; j ++) {
loC[access(i,j)] /= 2.f;
...
The codelet above wraps a call to dynamic_parallelism; the #ifndef __HMPP block is the CPU version of the function, while the native CUDA device version is shown next.
Nested Parallelism Using Native - 2
• The native CUDA device function
__device__ void dynamic_parallelism(float *A, float *B, float *C, ...) {
cublasStatus_t status;
cublasHandle_t handle;
int i, j;
dim3 g(ms/8,ms/8,1);
dim3 b(8,8,1);
status = cublasCreate(&handle);
for (i = 0 ; i < nm ; i++) {
for (j = 0 ; j < nm ; j++) {
float *loA = A + offset(i,j); float *loB = B + offset(i,j);
float *loC = C + offset(i,j);
const float alpha = 1.f; const float beta = 1.f;
cudaStream_t s;
cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
cublasSetStream(handle, s);
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, ms, ms, ms,...);
halv<<<g,b,0,s>>>(loC,nm,ms);
cudaStreamDestroy(s);
}
}
cudaDeviceSynchronize();
status = cublasDestroy(handle);
}
Conclusion
• CAPS technology provides a flexible and portable programming environment for accelerator-based systems
o OpenACC and OpenHMPP are complementary sets of directives
o Many features to handle various coding issues
o CPU/accelerator library integration is important for code maintenance
• Auto-tuning techniques help to simplify code tuning and deployment
o The code adapts to the architecture configuration
See a demo on Booth #2330
http://www.caps-entreprise.com
http://twitter.com/CAPSentreprise
http://www.openacc-standard.org/
http://www.openhmpp.org