Page 1: Using CAPS Compiler on NVIDIA Kepler and CARMA Systems

F. Bodin
CTO, CAPS entreprise

Source: on-demand.gputechconf.com/...Bodin-CAPS-Compilers-Kepler-CARMA.pdf

Page 2: Introduction

• CAPS develops programming tools that help developers write a single source code that can be executed on existing accelerator technologies
  o C / C++ / Fortran
• Fast-moving hardware systems require two directive sets
  o OpenHMPP: easy to extend, integrates new hardware features
  o OpenACC: standardized, longer-term view but moving slowly
• Generates CUDA or OpenCL code
  o Portable across AMD GPU and APU, Intel MIC, Nvidia Kepler and CARMA, …

nvidia SC 2012, www.caps-entreprise.com

Page 3: CAPS Technology

• Provide OpenACC and OpenHMPP directives
  o OpenHMPP is codelet based
  o OpenACC is code-region based

OpenHMPP codelet:

#pragma hmpp myfunc codelet, …
void saxpy(int n, float alpha, float x[n], float y[n]){
#pragma hmppcg gridify(i)
  for(int i = 0; i<n; ++i)
    y[i] = alpha*x[i] + y[i];
}

OpenACC code region:

#pragma acc kernels …
{
  for(int i = 0; i<n; ++i)
    y[i] = alpha*x[i] + y[i];
}
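As a complement to the OpenACC fragment above, here is a minimal, self-contained sketch of the same saxpy loop as a complete function. The slide elides the clauses after `kernels`, so the `copyin`/`copy` data clauses, the function name `saxpy_acc`, and the `assert` precondition are assumptions made for illustration, not the author's code. A compiler without OpenACC support ignores the pragmas and runs the loop on the host.

```c
#include <assert.h>
#include <stddef.h>

/* y = alpha * x + y over n elements.
 * Assumed data clauses: x is only read (copyin), y is read and written (copy). */
void saxpy_acc(int n, float alpha, const float *x, float *y)
{
    assert(n >= 0 && x != NULL && y != NULL);
#pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    {
#pragma acc loop independent
        for (int i = 0; i < n; ++i)
            y[i] = alpha * x[i] + y[i];
    }
}
```

Calling `saxpy_acc(4, 2.0f, x, y)` with two 4-element arrays updates `y` in place; the result is the same whether or not the loop is offloaded.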

Page 4: Compilation Process

• Source-to-source technology

[Figure: compilation flow. Labels: C++ / C / Fortran frontends; extraction module (host code; codelets Fun #1, Fun #2, Fun #3); instrumentation module; CPU compiler (gcc, ifort, …); executable (mybin.exe); OpenCL/Cuda generation; native compilers; HWA code (dyn. library); CAPS Runtime]

Page 5: A Few Typical Situations

1. Simple nested loops
2. Data transfer optimization
3. Complex loop nests
4. Code tuning
5. Integrating auto-tuning techniques
6. Dealing with accelerated library
7. Dealing with dynamic accelerated tasks scheduling
8. Using multiple accelerators
9. Nested parallelism using native

Page 6: Simple Nested Loops - 1

• The simplest construct declares a parallel loop to be compiled and executed on an accelerator
  o Iterations of the loop nest are converted into threads
• Data in/out declarations determine which data move between the host and the accelerator

[Figure: host CPU code and accelerator timeline. Labels: send A,B,C; execute kern. 1; get A; wait for accelerator]

Page 7: Simple Nested Loops - 2

• Example of stencil computation

...
#pragma acc kernels pcopyin(A[0:M]) pcopy(B[0:M])
{
  float c11,c12,c13,c21,c22,c23,c31,c32,c33;
  c11 = +2.0f; c21 = +5.0f; c31 = -8.0f; ...
#pragma acc loop independent
  for (int i = 1; i < M - 1; ++i){
#pragma acc loop independent
    for (int j = 1; j < N - 1; ++j){
      B[i][j] = c11*A[i-1][j-1]+c12*A[i][j-1]...;
    }
  }
}
...
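The stencil excerpt above elides most of its coefficients and terms, so the following is only a sketch of a complete 3x3 stencil under stated assumptions: the coefficients are passed as a 3x3 array `c` (the slide uses nine scalars), `c[di+1][dj+1]` multiplies `A[i+di][j+dj]`, and the data clauses spell out both dimensions. The function name `stencil3x3` is hypothetical.

```c
#include <assert.h>

/* Apply a 3x3 stencil to the interior of an M x N grid.
 * Border cells of B are left untouched, as in the slide's loop bounds. */
void stencil3x3(int M, int N, float A[M][N], float B[M][N], float c[3][3])
{
    assert(M >= 3 && N >= 3);
#pragma acc kernels copyin(A[0:M][0:N], c[0:3][0:3]) copy(B[0:M][0:N])
    {
#pragma acc loop independent
        for (int i = 1; i < M - 1; ++i) {
#pragma acc loop independent
            for (int j = 1; j < N - 1; ++j) {
                float sum = 0.0f;
                /* Small fixed-size loops over the 3x3 neighbourhood. */
                for (int di = -1; di <= 1; ++di)
                    for (int dj = -1; dj <= 1; ++dj)
                        sum += c[di + 1][dj + 1] * A[i + di][j + dj];
                B[i][j] = sum;
            }
        }
    }
}
```

Each (i, j) iteration writes a distinct element of B and only reads A, which is what justifies the `independent` clause on both loops.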

Page 8: Data Transfer Optimization - 1

• Data transfers between the host CPU and the accelerator can severely hurt performance
• A set of directives is provided to keep data on the accelerator beyond the execution of one kernel

[Figure: host CPU code with call sites csite 1 to csite 4; the accelerator executes kern. 1 to kern. 4. Labels: send A, get A]

Page 9: Data Transfer Optimization - 2

• Example from HydroC*

void hydro_godunov (…)
{
#pragma acc data \
  create(qleft[0:H.nvar], qright[0:H.nvar], \
         ...) \
  copy(uold[0:H.nvar*H.nxt*H.nyt]) \
  copyin(Hstep)
  {
    for (j = Hmin; j < Hmax; j += Hstep){
      // compute many slices each pass
      int jend = j + Hstep;
      if (jend >= Hmax)
        jend = Hmax;
      . . .// the work here
    } // for j
  }//end of data region
  ...

Data stay on the GPU during the step loop. pcopy clauses are used in the called routines.

*Pierre-Francois Lavallee (a), Guillaume Colin de Verdiere (b), Philippe Wautelet (a), Dimitri Lecas (a), Jean-Michel Dupays (a); (a) IDRIS/CNRS, (b) CEA, Centre DAM

[Figure: call-graph profile. main (100%), hydro_godunov (97.0%), riemann (48.2%), slope (11.2%), trace (10.1%), qleftright (4.9%), cmpflx (3.0%), constoprim (4.6%), equationofstate (5.0%), compute_deltat (2.9%), updateConservativeVars (5.6%), gatherConservativeVars (5.4%)]
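The pattern described above (one data region around the step loop, `pcopy` in the called routines) can be sketched in a few lines. This is a toy illustration, not HydroC code: the routines `scale` and `shift` and the driver `run_steps` are hypothetical. Inside the data region, `pcopy` finds `u` already present on the device, so no transfer happens per call; called on their own, the same routines would copy `u` in and out.

```c
#include <assert.h>

/* Hypothetical kernel 1: multiply every element by s. */
static void scale(int n, float *u, float s)
{
#pragma acc kernels pcopy(u[0:n])
    for (int i = 0; i < n; ++i)
        u[i] *= s;
}

/* Hypothetical kernel 2: add d to every element. */
static void shift(int n, float *u, float d)
{
#pragma acc kernels pcopy(u[0:n])
    for (int i = 0; i < n; ++i)
        u[i] += d;
}

/* u is copied to the device once, stays there for all steps,
 * and is copied back once when the data region ends. */
void run_steps(int n, float *u, int steps)
{
    assert(n >= 0 && steps >= 0);
#pragma acc data copy(u[0:n])
    {
        for (int s = 0; s < steps; ++s) {
            scale(n, u, 2.0f);
            shift(n, u, 1.0f);
        }
    }
}
```

With `steps` iterations and two kernels per iteration, this replaces 2 x `steps` round trips of `u` with a single upload and a single download.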

Page 10: Complex Loop Nests - 1

• Non-perfectly nested loops can be challenging to parallelize efficiently
  o OpenACC parallel regions provide control over the parallelization scheme
  o This requires distributing the iteration spaces onto gangs and workers

[Figure labels: Device, Gang, Workers, Vectors]

Page 11: Complex Loop Nests - 2

• Extract from NOAA Nonhydrostatic Icosahedral Model (NIM) code

!$acc parallel present(nprox,prox,u,...) vector_length(1) num_workers(64) num_gangs(512)
!$acc loop gang private (rhsu,...) private(ipn,k,isn,...)
do ipn=ips,ipe
  n = nprox(ipn)
  ipp1 = prox(1,ipn)
  ...
!$acc loop worker vector
  do k=1,nz-1
    rhsu(k,1) = cs(1,ipn)*u(k ,ipp1)...
    ...
  enddo !k-loop
  k=nz-1
  rhsu(k+1,1) = cs(1,ipn)*u(k ,ipp1)...
  ...
!$acc loop worker vector private(wk)
  do k=1,nz
    ! Lots of statements
  enddo !k-loop
!$acc loop seq
  do isn = 1,nprox(ipn)
!$acc loop worker vector
    do k=1,nz-1
      Tgtu(k,isn) = ...
    enddo !k-loop
    Tgtu(nz,isn) = 2.*Tgtu(nz-1,isn) - ...
  end do ! isn-loop
!$acc loop seq
  do isn = 1,nprox(ipn)
    isp=mod(isn,nprox(ipn))+1
!$acc loop worker vector
    do k = 2,nz-1
      ...
    end do ! k -loop
    sedgvar( 1,isn,ipn,1)=(zm(1,ipn)...
    ...
  end do ! isn-loop
!$acc loop worker vector
  do k=1,nz
    kp1=min(nz,k+1)
    ...
  end do
  bedgvar(0,ipn,1)=...
enddo !ipn-loop
!$acc end parallel
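The NIM extract above is heavily elided, so here is a much smaller C sketch of the same loop shape: an outer loop spread over gangs, with two worker loops and a scalar statement between them. The arithmetic, the array names, and the function name `column_update` are invented for illustration; only the `num_gangs`, `num_workers` and `vector_length` values are taken from the slide.

```c
#include <assert.h>

/* Non-perfectly nested loops: per column, a worker loop, a single
 * fix-up statement on the last level, then a second worker loop. */
void column_update(int ncol, int nz, float u[ncol][nz], float rhs[ncol][nz])
{
    assert(ncol >= 0 && nz >= 2);
#pragma acc parallel copyin(u[0:ncol][0:nz]) copyout(rhs[0:ncol][0:nz]) \
    num_gangs(512) num_workers(64) vector_length(1)
    {
#pragma acc loop gang
        for (int c = 0; c < ncol; ++c) {
#pragma acc loop worker
            for (int k = 0; k < nz - 1; ++k)
                rhs[c][k] = u[c][k + 1] - u[c][k];

            /* Runs once per column, between the two worker loops. */
            rhs[c][nz - 1] = rhs[c][nz - 2];

#pragma acc loop worker
            for (int k = 0; k < nz; ++k)
                rhs[c][k] *= 0.5f;
        }
    }
}
```

Because the statement between the inner loops prevents collapsing the nest into one flat iteration space, the gang/worker split is stated explicitly rather than left to the compiler.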

Page 12: Accelerating Heat Transfer Ray-Tracing Code with CAPS OpenACC Compiler

PROMES Laboratory application: ray-by-ray heat transfer simulation (DP)
CAPS experiments were run on a FERMI C2050 and a K20 and compared to a Sandy Bridge E5-2687W

• Speedup x19
  o With a FERMI C2050
• Speedup x34
  o With a K20

[Figure: execution time (s), axis from 0 to 120000, for 3840, 15360, 76800 and 200000 rays. Caption: performance for various numbers of rays and configurations. Annotation: x19]

Page 13: OpenACC on Nvidia CARMA

• Accelerating Heat Transfer Ray-Tracing Code with CAPS OpenACC Compiler
  o ARM CPU + accelerator target
  o Speedup 12x: ARM cores and CARMA GPU

[Figure label: CUDA on ARM]

Page 14: OpenHMPP MD Example

• Molecular dynamics code (HLRS / Colin Glass)
• From the APOS project (http://apos-project.eu)

source: HLRS

Page 15: OpenHMPP Stereo Vision Example

• Stereo Matching (ESAW) on GTX 550 Ti
• Result from the ANR Compa project

From: Jinglin ZHANG, Jean-Francois NEZAN, Jean-Gabriel COUSIN, Erwan RAFFIN, "Implementation of Stereo Matching Using A High Level Compiler for Parallel Computing Acceleration", IVCNZ '12, November 26 - 28 2012, Dunedin, New Zealand

Page 16: Code Tuning - 1

• The more optimized a code is, the less portable it is
  o Optimized code tends to saturate some hardware resources
  o Parallelism ROI varies a lot
    • i.e. the number of threads and the workload need to be tuned
  o Many resources are not virtualized on HWA (e.g. registers, number of threads)

[Figure: radar chart, scale 0 to 1, axes Threads, Registers/threads, L1 Hit Ratio, Mem. Throughput, Occupancy; series Run 1 norm and Run 2 norm. Caption: example of an optimized versus a non-optimized stencil code]

[Figure: performance versus cores for HW1 and HW2]

Page 17: Code Tuning - 2

• Express code transformations via directives

#pragma hmpp <mygroup> sgemm codelet, args[tout].io=inout, &
#pragma hmpp & args[*].mirror args[*].transfer=manual
void sgemm( float alphav[1], float betav[1], const float t1[SIZE][SIZE],
            const float t2[SIZE][SIZE], float tout[SIZE][SIZE] ) {
  int j, i;
  const float alpha = alphav[0], beta = betav[0];
  /* Loop transformations; the (OCL) qualifier applies them only when compiling for OpenCL */
#pragma hmppcg(OCL) unroll i:4, j:4, split(i), noremainder(i,j), jam
#pragma hmppcg gridify (j,i)
  for( j = 0 ; j < SIZE ; j++ ) {
    for( i = 0 ; i < SIZE ; i++ ) {
      int k;
      float prod = 0.0f;
      for( k = 0 ; k < SIZE ; k++ ) {
        prod += t1[k][i] * t2[j][k];
      }
      tout[j][i] = alpha * prod + beta * tout[j][i];
    }
  }
}
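To make the `unroll ... jam` directive above more concrete, the sketch below applies unroll-and-jam by hand to the same sgemm loop nest: the outer `j` loop is unrolled by 2 and the two copies are fused ("jammed") into a single `i`/`k` body. This is a hand-written illustration of the transformation, not the code the CAPS compiler generates; the factor 2, the value `SIZE = 4`, and the function names are assumptions. As with `noremainder`, `SIZE` must be a multiple of the unroll factor.

```c
#include <assert.h>

#define SIZE 4 /* assumed size; must be a multiple of the unroll factor (2) */

/* Reference version, same indexing as the slide. */
void sgemm_ref(float alpha, float beta, float t1[SIZE][SIZE],
               float t2[SIZE][SIZE], float tout[SIZE][SIZE])
{
    for (int j = 0; j < SIZE; j++)
        for (int i = 0; i < SIZE; i++) {
            float prod = 0.0f;
            for (int k = 0; k < SIZE; k++)
                prod += t1[k][i] * t2[j][k];
            tout[j][i] = alpha * prod + beta * tout[j][i];
        }
}

/* j unrolled by 2, with both copies jammed into one i/k loop body.
 * Each t1[k][i] load is now reused for two output elements. */
void sgemm_unroll_jam(float alpha, float beta, float t1[SIZE][SIZE],
                      float t2[SIZE][SIZE], float tout[SIZE][SIZE])
{
    assert(SIZE % 2 == 0);
    for (int j = 0; j < SIZE; j += 2)
        for (int i = 0; i < SIZE; i++) {
            float prod0 = 0.0f, prod1 = 0.0f;
            for (int k = 0; k < SIZE; k++) {
                prod0 += t1[k][i] * t2[j][k];
                prod1 += t1[k][i] * t2[j + 1][k];
            }
            tout[j][i]     = alpha * prod0 + beta * tout[j][i];
            tout[j + 1][i] = alpha * prod1 + beta * tout[j + 1][i];
        }
}
```

Both versions perform the same floating-point operations in the same order for each output element, so their results match exactly; only the amount of work per loop iteration and the register pressure change, which is why the best factor differs between targets.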

Page 18: Integrating Auto-Tuning Techniques - 1

• Adaptation of the code at runtime
  o Use multiple calls to a kernel to find the most efficient one
• Need to create an optimization space to explore
  o Compiler issue: runtime-configurable #gangs, #workers, #vectors
• Need a way to explore the optimization space
  o Auto-tuning driver issue
  o May also focus on execution time or energy

[Figure: source code, HMPP Compiler, autotunable executable code, hmpp profiling interface, auto-tuning driver (collect profiling data, explore the variants space)]

Page 19: Integrating Auto-Tuning Techniques - 2-a

• Auto-tuning implementation of a Blur filter in OpenACC
• Explore dynamic parameters (e.g. #gangs, #workers)

/* parameter space to explore */
size_t gangs[]   = { 8, 16, 32, 64, 128, 128, 8, 16, 32, 64, 128, 256 };
size_t workers[] = { 16, 16, 16, 16, 16, 16, 24, 24, 24, 24, 24, 24 };

while (nber_of_iterations < max_iterations) {
  /* set auto-tuning driver on */
  variant = variantSelectorState("kernel.c:21",
                                 (sizeof(gangs)/sizeof(size_t))-1);
  blur(images[(currentImage + 1) % 2], image_caps, width, height,
       blockSize, gangs[variant], workers[variant]);
}

/* parameterized parallel region */
#pragma acc parallel, copyin(dst_caps[0:height*width]), \
            copyout(src_caps[0:height*width]), num_gangs(gangs), \
            num_workers(workers), vector_length(32)
{
#pragma acc loop, gang
  for (tileY = 0; tileY < tileCountY; tileY++) {
    for (tileX = 0; tileX < tileCountX; tileX++) {
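The slides do not show how the auto-tuning driver chooses a variant, so the following is only a sketch of one simple policy: try each configuration once (exploration phase), then keep using the fastest one seen (steady state). The `Tuner` type and the `tuner_*` functions are hypothetical and are not the CAPS `variantSelectorState` API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuner state: explore every variant once, then exploit the best. */
typedef struct {
    size_t n_variants; /* size of the parameter space */
    size_t calls;      /* selections made so far */
    size_t best;       /* index of the fastest variant seen */
    double best_time;  /* its measured time; negative means "none yet" */
} Tuner;

void tuner_init(Tuner *t, size_t n_variants)
{
    assert(t != NULL && n_variants > 0);
    t->n_variants = n_variants;
    t->calls = 0;
    t->best = 0;
    t->best_time = -1.0;
}

/* Exploration phase: variants 0..n-1 in order. Steady state: the best one. */
size_t tuner_select(Tuner *t)
{
    size_t v = (t->calls < t->n_variants) ? t->calls : t->best;
    t->calls++;
    return v;
}

/* Feed back the measured kernel time for the variant that was just run. */
void tuner_report(Tuner *t, size_t variant, double seconds)
{
    assert(variant < t->n_variants && seconds >= 0.0);
    if (t->best_time < 0.0 || seconds < t->best_time) {
        t->best = variant;
        t->best_time = seconds;
    }
}
```

In a loop like the one above, the caller would use the returned index to pick `gangs[v]` and `workers[v]`, time the kernel call, and pass the measurement to `tuner_report`. A real driver could also persist the measurements across executions or optimize for energy instead of time.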

Page 20: Integrating Auto-Tuning Techniques - 2-b

• Auto-tuning driver behavior with DNADist (bioinformatics)

[Figure: two plots of kernel computation time (in sec, lower is better) against #call, one over calls 1 to 16 (axis 0 to 10) and one over calls 1 to 25 (axis 0 to 1.2). Labels: exploration phase, steady state. Annotations: config 10 = 256 G x 128 W; config. 8 = 14 G x 16 W]

Data can be collected over multiple executions

Page 21: Integrating Auto-Tuning Techniques - 3

• Variant based auto-tuning in HMPP
• Explore compile time code transformations / algorithms

/* variants of codelets to use */
#pragma hmpp <convolution> filter5x5 callsite variants( &
#pragma hmpp & filterStencil5x5@<convolution>[C], &
#pragma hmpp & filterStencil5x5_V1@<convolution>[CUDA], &
#pragma hmpp & filterStencil5x5_V2@<convolution>[CUDA]) &
#pragma hmpp & selector(filterVariantSelector)
filterStencil5x5(&fullHeigh, &width, stencil1, raster1, raster2);

void filterStencil5x5_V2(const uint32 p_heigh[1],
    const uint32 p_width[1], const RasterType filter[5][5],
    const RasterType *p_inRaster, RasterType *p_outRaster){
  . . .
#pragma hmppcg grid blocksize "32x4"
#pragma hmppcg unroll 6, jam
  for (i = stencil; i < heigh - stencil; i++) {
  ...

void filterStencil5x5_V1(const uint32 p_heigh[1],
    const uint32 p_width[1], const RasterType filter[5][5],
    const RasterType *p_inRaster, RasterType *p_outRaster){
  . . .
#pragma hmppcg grid blocksize "32x4"
#pragma hmppcg unroll 6, jam
  for (i = stencil; i < heigh - stencil; i++) {
  ...

Page 22: Dealing with Accelerated Library - 1

• Library calls can usually only be partially replaced
  o No one-to-one mapping between libraries (e.g. BLAS, FFTW, CuFFT, CULA, ArrayFire)
  o No access to all application codes (i.e. avoid side effects)
  o Want a unique source code
• Deal with multiple address spaces / multi-HWA
  o Data location may not be unique (copies, mirrors)
  o Usual library calls assume shared memory
  o Library efficiency depends on updated data location
• Libraries can be written in many different languages
  o CUDA, OpenCL, OpenHMPP, etc.

Page 23: Dealing with Accelerated Library - 2

• Using the cuFFT accelerated version when the source code is FFTW based
  o A set of proxies implements the FFT accelerated version of the calls
  o Allows resource sharing between the library and users' codes to reduce data transfers between host and accelerator memories
  o Only marked calls are executed on the accelerators

. . .
#pragma hmppalt cufft call, name="fftw_plan_dft_c2r_1d_sharing"
pc2r = fftw_plan_dft_c2r_1d(n, odata_intermediate,
                            odata_real_GPU,FFTW_ESTIMATE);
#pragma hmppalt cufft call, name="fftw_execute_sharing"
fftw_execute(pr2c);
#pragma hmpp <my_grp> filter callsite
filter(n, (double _Complex *)odata_intermediate, cf);
#pragma hmppalt cufft call, name="fftw_execute_sharing"
fftw_execute(pc2r);
. . .

Page 24: Dealing with Dynamic Accelerated Tasks Scheduling - 1

• When dealing with a large number of small parallel tasks
  o Need to exploit accelerator asynchronous execution
• Madness (Multiresolution ADaptive NumErical Scientific Simulation)
  o Madness integrates its own task manager
  o Study in collaboration with ORNL

[Figure: task C/Fortran source code, CAPS compiler, OCL/CUDA code; accelerator queues Q1, Q2, Q3 with task lists T0,T6,T4,T5; T1,T2,T3,T7; T8,T9. Caption: dependent tasks are sent to the same queue]

Page 25: Dealing with Dynamic Accelerated Tasks Scheduling - 2

• Example of task queuing
• Exploits CAPS compiler code generation
• Close to the OpenCL API, but the task code remains C/Fortran

for(int i=0; i<nb_arrays;i++){
  hmpprt::Queue *myQueue = (hmpprt::Queue *) myQueues[i];
  myQueue->enqueueUpload(g_A[i],h_A[i]);
  myQueue->enqueueUpload(g_B[i],h_B[i]);
  hmpprt::ArgumentList myArguments;
  myArguments.addArgument(g_A[i]);
  myArguments.addArgument(g_B[i]);
  myArguments.addArgument(g_C[i]);
  myQueue->enqueueCall(myDevice, myCodelet, myArguments);
  myQueue->enqueueDownload(g_C[i],h_C[i]);
}
for(int i=0; i<nb_arrays;i++){
  hmpprt::Queue *myQueue = (hmpprt::Queue *) myQueues[i];
  myQueue->start();
}
for(int i=0; i<nb_arrays;i++){
  hmpprt::Queue *myQueue = (hmpprt::Queue *) myQueues[i];
  myQueue->wait();
}
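The previous slide states that dependent tasks are sent to the same queue, which keeps them ordered without explicit synchronization. The slides do not show how queues are chosen, so the sketch below is one possible policy only: a task with a dependency inherits the queue of the task it depends on, and independent tasks are spread round-robin. The function `assign_queues` and its `dep` encoding are hypothetical and independent of the `hmpprt` API.

```c
#include <assert.h>

/* dep[i] is the index of the task that task i depends on (must be < i),
 * or -1 if task i is independent. queue_of[i] receives the chosen queue. */
void assign_queues(int n_tasks, const int *dep, int n_queues, int *queue_of)
{
    assert(n_tasks >= 0 && n_queues > 0);
    int next = 0; /* next queue for an independent task */
    for (int i = 0; i < n_tasks; ++i) {
        if (dep[i] >= 0) {
            assert(dep[i] < i);
            /* Same queue as the producer, so in-queue order enforces the dependency. */
            queue_of[i] = queue_of[dep[i]];
        } else {
            queue_of[i] = next;
            next = (next + 1) % n_queues;
        }
    }
}
```

This handles at most one dependency per task; a task depending on results from several queues would still need an explicit wait between them.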

Page 26: Using Multiple Accelerators - 1

• Nodes can have multiple accelerators
• A one-thread / one-MPI-process mapping is very limited
• Having to use CPU parallel code to exploit multiple accelerators is inconvenient
• OpenHMPP multi-accelerator features are based on
  o Data distribution
  o Owner compute rule to allocate the tasks/codelets to the device

Page 27: Using Multiple Accelerators - 2

#pragma hmpp <MyGroup> parallel
for(k=0;k<n;k++) {
#pragma hmpp <MyGroup> f1 callsite
  myparallelfunc(d[k],n);
}

[Figure: main memory and the device memories of GPU 0, GPU 1 and GPU 2. Labels: d0 d1 d2 d3 in main memory; d1, d2, d3 in the device memories]

Page 28: Using Multiple Accelerators - 3

• A parallel for loop implements a "map" operation

/* Distribute the data over the accelerators */
#pragma hmpp parallel, device="i%2"
for( i=0;i<NB;i++){
  //Allocate the mirrors for vin1, vin2 and vout
#pragma hmpp <mygroup> allocate, data["vin1[i]","vin2[i]", &
#pragma hmpp & "vout_multi[i]"], size={size,size}, elementsize="4"
  . . .
}

/* Execute the tasks in parallel according to the owner compute rule */
#pragma hmpp parallel
for( i=0;i<NB;i++) {
  //launch the codelet on the device
#pragma hmpp <mygroup> sgemm callsite
  sgemm( vin1[i], vin2[i], vout_multi[i] );
}
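The owner compute rule used above can be illustrated without any accelerator: each data block is placed on a device (here with the slide's `i % 2` style mapping), and each task then runs on the device that owns its data. The sketch below only computes that placement; `owner_device` and `schedule_by_owner` are hypothetical helper names, not part of OpenHMPP.

```c
#include <assert.h>

/* Data distribution: block i lives on device i % n_devices. */
int owner_device(int i, int n_devices)
{
    assert(i >= 0 && n_devices > 0);
    return i % n_devices;
}

/* Owner compute rule: task i runs where block i was allocated.
 * device_of[i] receives the device for task i, load[d] the task count per device. */
void schedule_by_owner(int n_tasks, int n_devices, int *device_of, int *load)
{
    assert(n_tasks >= 0 && n_devices > 0);
    for (int d = 0; d < n_devices; ++d)
        load[d] = 0;
    for (int i = 0; i < n_tasks; ++i) {
        device_of[i] = owner_device(i, n_devices);
        load[device_of[i]]++;
    }
}
```

Because placement is decided once at allocation time, the second loop needs no device clause: the runtime can derive the target from where the arguments already live, and no data moves between devices.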

Page 29: Nested Parallelism Using Native - 1

• Exploit K20 new capabilities for nested parallelism

Codelet:

#pragma hmpp cudadp codelet, target=CUDA, args[A].io=in, ...
void codelet(float *A, float *B, float *C, int nm, int ms, int size)
{
  int i;
#pragma hmppcg gridify
  for(i = 0 ; i < 1 ; i++)
  {
#pragma hmppcg(CUDA) include("native_dynamic_parallelism.h")
#pragma hmppcg(CUDA) native(dynamic_parallelism)
    dynamic_parallelism(A, B, C, nm, ms, size);
  }
}

CPU version of the function:

void dynamic_parallelism(float *A,float *B,float *C, ...)
{
#ifndef __HMPP
  int i, j, k, l;
  for (k = 0 ; k < nm ; k++) {
    for (l = 0 ; l < nm ; l++) {
      ...
      cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                  ms, ms, ms, 1.0f, loA, nm * ms, loB, nm * ms,
                  1.0f, loC, nm * ms);
      for (i = 0 ; i < ms ; i ++) {
        for (j = 0 ; j < ms ; j ++) {
          loC[access(i,j)] /= 2.f;
          ...

Page 30: Nested Parallelism Using Native - 2

• The native CUDA device function

__device__ void dynamic_parallelism(float *A, float *B, float *C, ...) {
  cublasStatus_t status;
  cublasHandle_t handle;
  int i, j;
  dim3 g(ms/8,ms/8,1);
  dim3 b(8,8,1);
  status = cublasCreate(&handle);
  for (i = 0 ; i < nm ; i++) {
    for (j = 0 ; j < nm ; j++) {
      float *loA = A + offset(i,j); float *loB = B + offset(i,j);
      float *loC = C + offset(i,j);
      const float alpha = 1.f; const float beta = 1.f;
      cudaStream_t s;
      cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
      cublasSetStream(handle, s);
      cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, ms, ms, ms,...);
      halv<<<g,b,0,s>>>(loC,nm,ms);
      cudaStreamDestroy(s);
    }
  }
  cudaDeviceSynchronize();
  status = cublasDestroy(handle);
}

Page 31: Conclusion

• CAPS technology provides a flexible and portable programming environment for accelerator-based systems
  o OpenACC and OpenHMPP are complementary sets of directives
  o Many features to handle various coding issues
  o CPU/accelerator library integration is important for code maintenance
• Auto-tuning techniques help to simplify code tuning and deployment
  o Code adapts to the architecture configuration

See a demo on Booth #2330

Page 32

Keywords: Accelerator Programming Model, Parallelization, Directive-based programming, GPGPU, Manycore programming, Hybrid Manycore Programming, HPC community, OpenACC, Petaflops, Parallel computing, HPC open standard, Multicore programming, Exaflops, NVIDIA Cuda, Code speedup, Hardware accelerators programming, High Performance Computing, OpenHMPP, Parallel programming interface, Massively parallel, OpenCL

http://www.caps-entreprise.com
http://twitter.com/CAPSentreprise
http://www.openacc-standard.org/
http://www.openhmpp.org