Page 1: Task based Programming with OmpSs and its Application

www.bsc.es

Task-Based Programming with OmpSs and its Application

Facultad de Informática, UCM

Madrid, 4 de Nov 2014

Page 2: Task based Programming with OmpSs and its Application

2

Outline

  Motivation
  StarSs and OmpSs basics
  OmpSs flavors
  OmpSs environment
  eDSLs on top of OmpSs
  Conclusions

Page 3: Task based Programming with OmpSs and its Application

3

Exascale challenge, or how to make it the HPC comfort zone

  The Learning Zone model describes how a person's performance can be enhanced and their skills optimized

–  Comfort Zone: feel comfortable and do not have to take any risks

–  Learning Zone: just outside of our secure environment, we grow and learn

–  Panic Zone: all our energy is used up for managing/controlling our anxiety and no energy can flow into learning.

  Moving into the learning zone extends the comfort zone and pushes its boundary towards the panic zone

  When following a personal dream or vision, individuals need to move into the learning zone and take controlled risks in order to tackle the challenges that lie in their panic zone

Social pedagogy

* The Learning Zone Model (Senninger, 2000)

Exascale poses new challenges to HPC … away from the current comfort zone … maybe even in the panic zone?

Page 4: Task based Programming with OmpSs and its Application

4

The parallel programming comfort zone

  State of the art parallel programming
–  Where to place data
–  What to run where
–  How to communicate

  Parallel programming in the future
–  What do I need to compute
–  What data do I need to use
–  Hints (not necessarily very precise) on potential concurrency, locality, …

Static scheduling, all decisions controlled by the programmer

Dynamic scheduling, optimizations decided by the runtime, loss of control by the programmer

Comfort zone

Panic? Zone

Page 5: Task based Programming with OmpSs and its Application

5

Parallel programming evolution

  At the beginning there was one language

(Figure: applications written against a simple, sequential interface; ILP exploited below the ISA / API; programs "decoupled" from hardware.)

Page 6: Task based Programming with OmpSs and its Application

6

Parallel programming evolution

  Multicores and heterogeneous processors made the interface leak

(Figure: the ISA / API now exposes address spaces (hierarchy, transfers), specific instructions, …; applications end up mixing program logic with platform specificities.)

Page 7: Task based Programming with OmpSs and its Application

BSC vision in programming

  Need to decouple again

(Figure: applications again carry only the program logic and are architecture independent; a general-purpose, task-based programming model with a single address space offers a high-level, clean, abstract interface; power goes to the runtime, which can "reuse" architectural ideas under new constraints, on top of the ISA / API.)

Page 8: Task based Programming with OmpSs and its Application

BSC vision in programming

(Figure: for special-purpose programming, DSL1, DSL2, DSL3 sit on top of the same high-level, clean, abstract programming-model interface and runtime, above the ISA / API; DSLs must be easy to develop and maintain, giving fast development and more expressivity, with power to the runtime.)

Page 9: Task based Programming with OmpSs and its Application

StarSs basic idea

...
for (i=0; i<N; i++) {
   T1 (data1, data2);
   T2 (data4, data5);
   T3 (data2, data5, data6);
   T4 (data7, data8);
   T5 (data6, data8, data9);
}
...

Sequential Application

(Figure: the sequential code is unrolled into task instances T1_0, T2_0, …, T5_0, T1_1, …, which are scheduled onto Resource 1 … Resource N.)

Write: task selection + parameter directions (input, output, inout).
Runtime: task graph creation based on data precedence; scheduling, data transfers, task execution; synchronization, results transfer.
Execute: parallel resources (multicore, GPU, cluster, cloud, grid).

Decouple how we write from how it is executed.
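To make the write side concrete, here is a minimal sketch of how the five tasks above could be annotated. The in/out directions are assumptions chosen only to reproduce the dependences suggested by the figure (T1→T3, T2→T3, T3→T5, T4→T5); they are not taken from the slide.

#define N 1024                                   /* illustrative block size */
float data1[N], data2[N], data4[N], data5[N],
      data6[N], data7[N], data8[N], data9[N];

#pragma omp task in([N]d1) inout([N]d2)          /* T1 produces data2      */
void T1(float *d1, float *d2);
#pragma omp task in([N]d4) inout([N]d5)          /* T2 produces data5      */
void T2(float *d4, float *d5);
#pragma omp task in([N]d2, [N]d5) out([N]d6)     /* T3 waits for T1 and T2 */
void T3(float *d2, float *d5, float *d6);
#pragma omp task in([N]d7) inout([N]d8)          /* T4 produces data8      */
void T4(float *d7, float *d8);
#pragma omp task in([N]d6, [N]d8) out([N]d9)     /* T5 waits for T3 and T4 */
void T5(float *d6, float *d8, float *d9);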

Page 10: Task based Programming with OmpSs and its Application

10

OmpSs vs OpenMP

  OpenMP 3.0 includes tasks (2008)
–  No dependencies

  OpenMP 4.0 includes
–  Task dependencies (2013)
•  Overlapped or strided regions not supported
–  Support for accelerators
•  Static support for the device, without integration with dynamic scheduling
•  Based on compilation from C

  OmpSs supports task dependencies
–  Main feature (see the sketch after this list)

  OmpSs support for accelerators
–  Leveraging CUDA, OpenCL
–  Integrated in the dynamic scheduling
–  Support for multiple devices, automatic data transfers
–  Support for versioning

  Other OmpSs features
–  Support for overlapped/strided regions
–  Concurrent/Commutative clauses
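As a side-by-side illustration of the two dependence syntaxes (a sketch assuming a fixed block size N; not from the slides), OpenMP 4.0 attaches the dependences to the task construct, while OmpSs attaches them to the task function:

#define N 256

void compute(const float *a, float *b);

/* OpenMP 4.0: dependences declared at the task construct with depend() */
void omp4_style(float *a, float *b)
{
   #pragma omp task depend(in: a[0:N]) depend(out: b[0:N])
   compute(a, b);
   #pragma omp taskwait
}

/* OmpSs: dependences (with array sections) attached to the task function */
#pragma omp task in([N]a) out([N]b)
void compute_task(const float *a, float *b);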

Page 11: Task based Programming with OmpSs and its Application

OmpSs: task dependencies

int main (int argc, char **argv)
{
   int i, j, k;
   …
   initialize(A, B, C);
   for (i = 0; i < NB; i++)
      for (j = 0; j < NB; j++)
         for (k = 0; k < NB; k++)
            matmul_tile(A[i][k], B[k][j], C[i][j], BS);
}

#pragma omp task inout([BS*BS]C) in([BS*BS]A, [BS*BS]B)
void matmul_tile (float *A, float *B, float *C, int BS)
{
   int i, j, k;
   for (i = 0; i < BS; i++)
      for (j = 0; j < BS; j++)
         for (k = 0; k < BS; k++)
            C[i*BS+j] += A[i*BS+k] * B[k*BS+j];
}
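The main program above only creates tasks; before the result matrix is consumed, execution would normally be synchronized with a taskwait. A minimal sketch follows (the taskwait and the check_result() consumer are illustrative additions, not part of the slide):

int main (int argc, char **argv)
{
   int i, j, k;
   initialize(A, B, C);
   for (i = 0; i < NB; i++)
      for (j = 0; j < NB; j++)
         for (k = 0; k < NB; k++)
            matmul_tile(A[i][k], B[k][j], C[i][j], BS);

   #pragma omp taskwait      /* wait for all matmul_tile tasks */
   check_result(C);          /* hypothetical consumer of C     */
   return 0;
}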

Page 12: Task based Programming with OmpSs and its Application

OmpSs: defining array sections

Two equivalent notations: a[lower:upper] (inclusive bounds) and a[start;length].

int a[N][M];
#pragma omp task in(a[2:3][3:4])      // 2 x 2 subblock of a at a[2][3]
#pragma omp task in(a[2;2][3;2])      // same 2 x 2 subblock, start;length form

int a[N][M];
#pragma omp task in(a[1:2][0:M-1])    // rows 1 and 2

int a[N][M];
#pragma omp task in(a[0:N-1][0:M-1])  // whole matrix used to compute dependences
#pragma omp task in(a[0;N][0;M])      // same whole matrix, start;length form
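A small sketch of how such sections drive dependences in practice (illustrative code, not from the slides): tasks updating disjoint column blocks of a, a strided region that OmpSs can track, run in parallel, while a task reading the whole matrix waits for all of them.

#define N 8
#define M 8
#define CB 4                              /* column-block width (assumed) */

int a[N][M];

/* Updates columns [c, c+CB) of every row: a strided section of a */
#pragma omp task inout(a[0;N][c;CB])
void update_block(int c);

/* Reads the whole matrix: depends on all update_block() tasks */
#pragma omp task in(a[0;N][0;M])
void checksum(void);

void run(void)
{
   for (int c = 0; c < M; c += CB)
      update_block(c);
   checksum();
   #pragma omp taskwait
}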

Page 13: Task based Programming with OmpSs and its Application

OmpSs examples: Serialized reduction pattern

for (int j = 0; j < N; j += BS) {
   actual_size = (N - j > BS ? BS : N - j);
   #pragma omp task in(vec[j;actual_size]) inout(result)
   for (int count = 0; count < actual_size; count++, j++)
      result += vec[j];
}
#pragma omp task in(result)
printf("TOTAL is %d\n", result);
#pragma omp taskwait

(Figure: vec is processed in chunks of BS elements, the last chunk possibly smaller than BS; the chunk tasks updating result are serialized, followed by the print task.)

Page 14: Task based Programming with OmpSs and its Application

OmpSs: Concurrent

(Figure: the sum tasks over BS-element chunks of vec now run concurrently, with atomic access to the total, followed by the print task.)

double vec[N];
double result;
for (int j = 0; j < N; j += BS) {
   actual_size = (N - j > BS ? BS : N - j);
   #pragma omp task in(vec[j;actual_size]) concurrent(result)
   {
      double local_result = 0.0;
      for (int count = 0; count < actual_size; count++)
         local_result += vec[j++];
      #pragma omp atomic
      result += local_result;
   }
}
#pragma omp task in(result)
printf("TOTAL is %f\n", result);
#pragma omp taskwait

Page 15: Task based Programming with OmpSs and its Application

OmpSs: Commutative

(Figure: the sum tasks over BS-element chunks of vec feed the print task.)

Tasks executed out of order but not concurrently

for (int j = 0; j < N; j += BS) {
   actual_size = (N - j > BS ? BS : N - j);
   #pragma omp task in(vec[j;actual_size]) commutative(result)
   for (int count = 0; count < actual_size; count++, j++)
      result += vec[j];
}
#pragma omp task in(result)
printf("TOTAL is %d\n", result);
#pragma omp taskwait

No explicit mutual exclusion (atomic) required inside the tasks

Page 16: Task based Programming with OmpSs and its Application

OmpSs support of ISA heterogeneity

  Target directive
–  Source code parsing and backend invocation
–  The compiler parses the device-specific syntax and hands the code over to the appropriate backend compiler

#pragma omp target device (smp | cuda | opencl)
–  smp
•  Backend compiler: gcc, icc, xlc, … (see the sketch below)
–  CUDA
•  Mercurium parses CUDA
•  Backend compiler: nvcc
–  OpenCL
•  Backend compiler selected at runtime
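A minimal sketch of the smp case (illustrative task, not from the slides): the annotated function is compiled by the regular backend and scheduled like any other task.

#pragma omp target device(smp) copy_deps
#pragma omp task in([n]x) inout([n]y)
void saxpy_task(int n, float a, const float *x, float *y)
{
   for (int i = 0; i < n; i++)
      y[i] += a * x[i];       /* plain C body, built with gcc/icc/xlc */
}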

Page 17: Task based Programming with OmpSs and its Application

Only the kernel is written in CUDA; the runtime takes care of memory allocation, data transfers, task scheduling, synchronization, …

#pragma omp target device(cuda) copy_deps ndrange(2,NB,NB,16,16)
#pragma omp task inout([NB*NB]C) in([NB*NB]A,[NB*NB]B)
__global__ void Muld(REAL* A, REAL* B, int wA, int wB, REAL* C, int NB);

OmpSs@CUDA matmul

(Figure: matrices partitioned into DIM x DIM tiles of NB x NB elements each.)

void matmul(int m, int l, int n, int mDIM, int lDIM, int nDIM,
            REAL **tileA, REAL **tileB, REAL **tileC)
{
   int i, j, k;
   for (i = 0; i < mDIM; i++)
      for (k = 0; k < lDIM; k++)
         for (j = 0; j < nDIM; j++)
            Muld(tileA[i*lDIM+k], tileB[k*nDIM+j], NB, NB, tileC[i*nDIM+j], NB);
}

#include "matmul_auxiliar_header.h"

// Thread block size
#define BLOCK_SIZE 16

// Device multiplication function called by Mul()
// Compute C = A * B; wA is the width of A, wB is the width of B
__global__ void Muld(REAL* A, REAL* B, int wA, int wB, REAL* C, int NB)
{
   // Block index
   int bx = blockIdx.x;
   int by = blockIdx.y;
   // Thread index
   int tx = threadIdx.x;
   int ty = threadIdx.y;
   // Index of the first sub-matrix of A processed by the block
   int aBegin = wA * BLOCK_SIZE * by;
   // Index of the last sub-matrix of A processed by the block
   int aEnd = aBegin + wA - 1;
   // Step size used to iterate through the sub-matrices of A
   int aStep = BLOCK_SIZE;
   …
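For intuition, the ndrange(2,NB,NB,16,16) clause above asks for a 2-D global range of NB x NB work items in 16 x 16 groups; conceptually this corresponds to the manual CUDA launch sketched below. The real launch is performed by the Nanos++ runtime, so this code is illustrative only and assumes NB is a multiple of 16.

void launch_muld_manually(REAL *A, REAL *B, REAL *C, int NB)
{
   dim3 dimBlock(16, 16);              /* local (work-group) size  */
   dim3 dimGrid(NB / 16, NB / 16);     /* global size / local size */
   Muld<<<dimGrid, dimBlock>>>(A, B, NB, NB, C, NB);
}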

Page 18: Task based Programming with OmpSs and its Application

#define BLOCK_SIZE 16
__constant int BL_SIZE = BLOCK_SIZE;

#pragma omp target device(opencl) copy_deps ndrange(2,NB,NB,BL_SIZE,BL_SIZE)
#pragma omp task in([NB*NB]A,[NB*NB]B) inout([NB*NB]C)
__kernel void Muld(__global REAL* A, __global REAL* B, int wA, int wB,
                   __global REAL* C, int NB);

OmpSs@OpenCL matmul

(Figure: matrices partitioned into DIM x DIM tiles of NB x NB elements each.)

void matmul(int m, int l, int n, int mDIM, int lDIM, int nDIM,
            REAL **tileA, REAL **tileB, REAL **tileC)
{
   int i, j, k;
   for (i = 0; i < mDIM; i++)
      for (k = 0; k < lDIM; k++)
         for (j = 0; j < nDIM; j++)
            Muld(tileA[i*lDIM+k], tileB[k*nDIM+j], NB, NB, tileC[i*nDIM+j], NB);
}

#include "matmul_auxiliar_header.h"   // defines BLOCK_SIZE

// Device multiplication function
// Compute C = A * B; wA is the width of A, wB is the width of B
__kernel void Muld(__global REAL* A, __global REAL* B, int wA, int wB,
                   __global REAL* C, int NB)
{
   // Block index, thread index
   int bx = get_group_id(0);
   int by = get_group_id(1);
   int tx = get_local_id(0);
   int ty = get_local_id(1);
   // Indexes of the first/last sub-matrix of A processed by the block
   int aBegin = wA * BLOCK_SIZE * by;
   int aEnd = aBegin + wA - 1;
   // Step size used to iterate through the sub-matrices of A
   int aStep = BLOCK_SIZE;
   ...

Use __global for copy_in/copy_out arguments

Page 19: Task based Programming with OmpSs and its Application

OmpSs: support for multiple versions

int main (int argc, char **argv)
{
   int i, j, k;
   …
   initialize(A, B, C);
   for (i = 0; i < NB; i++)
      for (j = 0; j < NB; j++)
         for (k = 0; k < NB; k++)
            matmul_tile(A[i][k], B[k][j], C[i][j], BS);
}

#pragma omp target device(smp) copy_deps
#pragma omp task inout([BS*BS]C) in([BS*BS]A, [BS*BS]B)
void matmul_tile (float *A, float *B, float *C, int BS)
{
   int i, j, k;
   for (i = 0; i < BS; i++)
      for (j = 0; j < BS; j++)
         for (k = 0; k < BS; k++)
            C[i*BS+j] += A[i*BS+k] * B[k*BS+j];
}

#pragma omp target device(cuda) copy_deps implements(matmul_tile)
#pragma omp task inout([BS*BS]C) in([BS*BS]A, [BS*BS]B)
void matmul_tile_cuda (float *A, float *B, float *C, int BS)
{
   int hA, wA, wB;
   hA = NB; wA = NB; wB = NB;
   dim3 dimBlock, dimGrid;
   dimBlock.x = BS; dimBlock.y = BS;
   dimGrid.x = (wB / dimBlock.x);
   dimGrid.y = (hA / dimBlock.y);
   Muld <<<dimGrid, dimBlock>>> (A, B, wA, wB, C);
}

#pragma omp target device(opencl) copy_deps \
        implements(matmul_tile) ndrange(2,NB,NB,BL_SIZE,BL_SIZE)
#pragma omp task inout([BS*BS]C) in([BS*BS]A,[BS*BS]B)
__kernel void Muld(__global REAL* A, __global REAL* B, int wA, int wB,
                   __global REAL* C, int BS);

Page 20: Task based Programming with OmpSs and its Application

20

OmpSs: support for multiple versions

(Figure: execution trace showing the task versions used and the data transfers.)

Page 21: Task based Programming with OmpSs and its Application

OmpSs @ Cluster

21

void fft_round(long N_SQRT, long FFT_BS,
               fftw_complex (*A)[N_SQRT][N_SQRT],
               fftw_complex (*B)[N_SQRT][N_SQRT],
               char *plan, size_t plan_size)
{
   long innerBs = (FFT_BS / _TARGET_THDS);
   long restInnerBs = (FFT_BS % _TARGET_THDS);
   for (long J = 0; J < N_SQRT; J += FFT_BS) {
      #pragma omp target device(smp) copy_deps
      #pragma omp task firstprivate(N_SQRT, FFT_BS, J, innerBs, restInnerBs) \
              inout((*A)[J;FFT_BS][0;N_SQRT]) in([plan_size] plan)
      {
         ...
         fftw_complex (*b)[N_SQRT][N_SQRT] =
            malloc(N_SQRT * FFT_BS * sizeof(fftw_complex));
         for (long i = J; i < J+FFT_BS;
              i = i + (innerBs + ((((i-J)/myInnerBs) < restInnerBs) ? 1 : 0))) {
            #pragma omp task firstprivate(myN_SQRT, i, J, myInnerBs, my_plan, myRestInnerBs)
            {
               for (long j = i;
                    j < (i + (myInnerBs + (((i-J)/myInnerBs) < myRestInnerBs ? 1 : 0))) && j < myN_SQRT;
                    j++) {
                  HPCC_zfft1d(my_plan->n, &(*myA1)[j][0], &(*b)[j-J][0], -1, my_plan);
               }
            }
         }
         #pragma omp taskwait noflush
         free(b);
      }
   }
}

→ Focus on supporting distributed architectures
→ Same code; the nesting is better suited to the memory hierarchy

Page 22: Task based Programming with OmpSs and its Application

Hybrid MPI/OmpSs

  Overlap communication/computation
  Extend asynchronous data-flow execution to the outer level

→ Focus on adoption by the plethora of existing MPI codes

…
for (k = 0; k < N; k++) {
   if (mine) {
      Factor_panel(A[k]);
      send(A[k]);
   } else {
      receive(A[k]);
      if (necessary) resend(A[k]);
   }
   for (j = k+1; j < N; j++)
      update(A[k], A[j]);
}

#pragma omp task inout(A[SIZE])
void Factor_panel(float *A);
#pragma omp task in(A[SIZE]) inout(B[SIZE])
void update(float *A, float *B);

#pragma omp task in(A[SIZE])
void send(float *A);
#pragma omp task out(A[SIZE])
void receive(float *A);
#pragma omp task in(A[SIZE])
void resend(float *A);

(Figure: resulting task graph distributed across MPI processes P0, P1, P2.)
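The slide only shows the prototypes and pragmas of the communication tasks; a hedged sketch of what their bodies might look like as thin wrappers over MPI calls follows (next_rank, prev_rank and the [SIZE]A section form are illustrative assumptions). In practice such blocking communication tasks are typically serialized or bound to a dedicated communication resource so they do not stall all worker threads.

#include <mpi.h>

#define SIZE 1024
extern int next_rank, prev_rank;      /* assumed ring neighbours */

#pragma omp task in([SIZE]A)
void send(float *A)
{
   MPI_Send(A, SIZE, MPI_FLOAT, next_rank, 0, MPI_COMM_WORLD);
}

#pragma omp task out([SIZE]A)
void receive(float *A)
{
   MPI_Recv(A, SIZE, MPI_FLOAT, prev_rank, 0, MPI_COMM_WORLD,
            MPI_STATUS_IGNORE);
}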

Page 23: Task based Programming with OmpSs and its Application

23

Dynamic Load Balancing: MPI/OmpSs + LeWI

  Automatically achieved by the runtime
–  Load balance within the node
–  Fine grain
–  Complementary to application-level load balance
–  Leverages OmpSs malleability

  LeWI: Lend When Idle
–  An MPI process lends its CPUs when inside a blocking MPI call
–  Another MPI process in the same node can use the lent CPUs to run with more threads
–  When the MPI call finishes, the MPI process retrieves its CPUs

(Figure: timelines of the unbalanced application without and with LeWI; in both cases MPI 0 and MPI 1 reach an MPI call, and with LeWI the CPUs idle in the call are lent to the other process.)

Page 24: Task based Programming with OmpSs and its Application

OmpSs infrastructure: Mercurium Compiler

  Recognizes constructs and transforms them to calls to the runtime
  Manages code restructuring for different target devices
–  Device-specific handlers
–  May generate code in a separate file
–  Invokes different back-end compilers
•  gcc, icc, xlc, … for regular code
•  nvcc for NVIDIA

C/C++/Fortran

Page 25: Task based Programming with OmpSs and its Application

OmpSs infrastructure: The NANOS++ Runtime

  Nanos++
–  Common execution runtime (C, C++ and Fortran)
–  Target-specific features
–  Task creation, dependency management, resilience, …
–  Task scheduling (BF, Cilk, Priority, Socket, …)
–  Data management: unified directory/cache architecture
•  Transparently manages separate address spaces (host, device, cluster)…
•  … and data transfers between them

Page 26: Task based Programming with OmpSs and its Application

26

OmpSs behaviour

int main () {
   for (…) {
      createWD(…);
   }
   wait_completion();
   …
}

(Figure: Mercurium transforms the #pragma omp task annotations into host code and device code; native compilers (gcc, nvcc, …) then produce the application binary.)

(Figure: Nanos++ schedules work onto SMP, GPU and cluster resources, using a data directory to manage data on the GPGPU and remote nodes.)

Mercurium C/C++ source-to-source compiler

Nanos++ run-time

Page 27: Task based Programming with OmpSs and its Application

Some results: OmpSs @ SMP

2x Intel SandyBridge-EP E5-2670/1600 8-core at 2.6 GHz

Page 28: Task based Programming with OmpSs and its Application

29

OmpSs @ Cluster

  FFT performance (16k x 16k complex elements)
  Peak performance on par with the MPI implementation

Page 29: Task based Programming with OmpSs and its Application

30

MPI/OmpSs

  Scalapack: Cholesky factorization
–  Example of the issues in porting legacy code
–  Demonstration that it is feasible
–  Synchronization tasks to emulate array-section behavior
•  Overhead more than compensated by flexibility
–  The importance of scheduling
•  ~10% difference in global performance
–  Some difficulties with legacy codes
•  Structure of sequential code
•  Memory allocation

Page 30: Task based Programming with OmpSs and its Application

31

What is a DSL?

  Domain Specific Language
–  Language tailored to solve problems in one domain
–  The size of the domain can vary widely
•  Data query (SQL)
•  Numerical computing (Matlab)
•  Statistics (R)

Page 31: Task based Programming with OmpSs and its Application

32

What is a DSL for HPC?

  Domain Specific Language
–  Language tailored to solve problems in one domain
–  The size of the domain can vary widely
•  Data query (SQL)
•  Numerical computing (Matlab)
•  Statistics (R)
–  The DSL has additional performance requirements
•  To solve "interesting" problems it must run efficiently on an HPC system

Page 32: Task based Programming with OmpSs and its Application

33

DSL advantages & drawbacks

✓  Language very close to the problem domain
–  Best programmer productivity
•  Easy to understand by domain experts, even without previous knowledge of the language!
•  Easy to map and solve the domain problem
•  Easy to maintain and future-proof!
–  Language fully decoupled from hardware
–  Bad/Good/Best performance

✗  The development of a DSL is only justified when there is a large community behind it
–  Otherwise, there is no way to amortize the development cost of the DSL infrastructure

✗  The complexity of developing an HPC DSL is huge!
–  DSL compiler, tools, optimizer, distributed parallel runtime system, ...

✗  The complexity of developing a DSL is high
–  DSL compiler, tools, ...

Page 33: Task based Programming with OmpSs and its Application

34

BSC goal – CS department

  Develop a framework that can be shared by several DSLs
–  Compiler framework
•  Scala
•  Lightweight Modular Staging (LMS) from EPFL
•  Dataflow-superscalar framework DFL from BSC
–  Runtime framework
•  OmpSs (Mercurium & Nanos++)
•  OpenCL
•  MPI (future work)

Page 34: Task based Programming with OmpSs and its Application

35

  BSC - CASE expertise on Partial Differential Equations and HPC
–  Alya Red simulation environment

  Domain: Convection-Diffusion-Reaction equations
–  Well-known domain (by the CASE people)
–  Several implementations already available in C and Fortran
–  First design decisions of the DSL
•  Level of abstraction
•  Types
•  Operators

BSC - CS / CASE collaboration

Page 35: Task based Programming with OmpSs and its Application

36

  Simple and high-level syntax
–  High-level constructs that directly associate with domain knowledge
–  Efficient development/maintenance cycle

  High performance computing for free (for the end user)
–  Ability to solve large complex problems with 20 lines of clean, simple code

SAIPH: a DSL for solving CDR equations

Page 36: Task based Programming with OmpSs and its Application

37

def KFun(xp: Float, yp: Float, zp: Float) = {
  if (zp > 18.75) 0.02 else 0.15
}
val c = Cartesian(12.5, 25.0, 37.5)
val temp = Unknown(c)
val plane = Dirichlet(lowXZ of c, temp, 400)
val hv = Vector(0.5, 0.5, 0.5)
val pre = PreProcess(nsteps = 100000, deltaT = 0.125, h = hv)(plane)
val K = KFun _
val diffusion = K * lapla(temp) - dt(temp)
val post = snapshoot each 100 steps
solve(pre)(post) equation diffusion to "diffusion"

CDR: Example 1 – Pure diffusion phenomena

Runs on a system with a GPU: 10,000 time steps in 7 seconds

Page 37: Task based Programming with OmpSs and its Application

38

CDR: Example 1 – Pure diffusion phenomena

Page 38: Task based Programming with OmpSs and its Application

39

Underlying Technologies

Front end - Compile the program with the LMS Library and the compiler implementation together

Middle end - 1st stage - Domain Specific Opt. - LMS IR generation

Back end - 2nd stage - DFL code + OpenCL kernels

(Figure: Diffusion.sph is compiled by the Scala Virtualized Compiler together with the CDRs embedded compiler (LMS) into Diffusion.class, producing Diffusion.dfl and DiffusionEquation.rsveq; the DFL compiler (LMS) performs the host-side code generation into Diffusion.cpp for OmpSs, while the equation stencil compiler (LMS) performs the accelerator-side code generation into DiffusionKernels.cl.)

Page 39: Task based Programming with OmpSs and its Application

40

CDR: Example 1 – Pure diffusion phenomena

  CDRs generates
–  Two OpenCL kernels (tasks)
–  One I/O task
–  The initialization code + body of the application + OmpSs pragmas

  The OmpSs runtime orchestrates the execution
–  Schedules tasks based on data dependencies
–  Manages data transfers between host and GPU

(Figure: execution trace showing input/output tasks and GPU computation tasks.)

Page 40: Task based Programming with OmpSs and its Application

41

Translation process

def KFun(xp: Float, yp: Float, zp: Float) = {
  if (zp > 18.75) 0.02 else 0.15
}
// Defining mesh and conditions
val c = Cartesian(12.5, 25.0, 37.5)
val temp = Unknown(c)
val plane = Dirichlet(lowXZ of c, temp, 400)
// Defining preprocess
val hv = Vector(0.5, 0.5, 0.5)
val pre = PreProcess(nsteps = 150000, deltaT = 0.125, h = hv)(plane)
// Defining equation
val K = EqField(KFun _)
val diffusion = K * lapla(temp) - dt(temp)
// Defining postprocess
val post = snapshoot each 5000 steps
solve(pre)(post) equation diffusion to "diffusion"

Diffusion.sph

...
val solveStepx31 = Kernel(kc_x31, "solveStepx31")(In, In(3), In, In, In, In, In(13),
    InOut(x23), InOut(x23), In(x23), In(x23), In(6), In(6))
val expandBounds = Kernel(kc_x31, "expandBounds")(In, In, In, InOut(x23), In(6), In)
...
(4 until x26) foreach { i =>
  (4 until 5) foreach { j =>
    (4 until x25) foreach { k =>
      x24(i*x17*x13 + j*x13 + k) = 400.0000000000f
    } } }
...
(0 until 150000) foreach { i_x31 =>
  if (i_x31 % 2 == 0) {
    solveStepx31(0.1250000000f, x0, x13, x17, x21, 4, coeffs_x31, x24, x24_back_1,
        x29, x31_dirich_mask_unk0, x31_neumann_mask_unk0, x31_neumann_vals_unk0) using ndr_x31
    expandBounds(x13, x17, x21, x24, x31_periodics, 4) using ndr_x31
    if ((i_x31+1) % 5000 == 0) {
      ()
      Task(x24, x0)(In(x23), In(3)) {
        writeVTI(x24, x13, x17, x21, "diffusion", x0, 4, (i_x31+1)/5000)
      }
    }
  }
  ...
}
taskwait

Diffusion.dfl

Page 41: Task based Programming with OmpSs and its Application

42

__kernel void solveStepx31(
    float dt, __global float *H, int dx, int dy, int dz, int halo,
    __global float *coeffs, __global float *unk0_0, __global float *unk0_1,
    __global float *field0, __global int *dirich_mask0,
    __global int *neumann_mask0, __global float *neumann_vals0)
{
  int i = get_global_id(2);
  int j = get_global_id(1);
  int k = get_global_id(0);
  if (i < halo || j < halo || k < halo ||
      i >= (dz-halo) || j >= (dy-halo) || k >= (dx-halo)) return;
  int neum0DerType = 0;
  int neum0Direction;
  float neum0Value;
  if (i == halo) {
    if (neumann_mask0[0] > 0) {
      neum0DerType = neumann_mask0[0];
      neum0Direction = 0;
      neum0Value = neumann_vals0[0];
    }
  }
  if (j == halo) ... }
  int idx = i*dx*dy + j*dx + k;
  float x1 = unk0_1[idx];
  float x3 = unk0_1[idx];
  float x2 = field0[idx];
  float x4 = sosd(&unk0_1[idx], H, dx, dy, dz, neum0DerType, neum0Direction, neum0Value);
  float x5 = x2 * x4;
  float x6 = x5 - 0.0f;
  if (dirich_mask0[idx] == 0) {
    unk0_0[idx] = unk0_1[idx] + x6*dt;
  } else
    unk0_0[idx] = unk0_1[idx];
}

__kernel void expandBounds(
    int dx, int dy, int dz, __global float *unk, __global int *periodics, int halo)
{
  int i = get_global_id(2);
  int j = get_global_id(1);
  int k = get_global_id(0);
  if (i > (dz-1) || j > (dy-1) || k > (dx-1)) return;
  int idx = i*dx*dy + j*dx + k;
  int di = i; int dj = j; int dk = k;
  if (i < halo) di = periodics[0]; else if (i >= (dz-halo)) di = periodics[3];
  if (j < halo) dj = periodics[1]; else if (j >= (dy-halo)) dj = periodics[4];
  if (k < halo) dk = periodics[2]; else if (k >= (dx-halo)) dk = periodics[5];
  if (i != di || j != dj || k != dk)
    unk[idx] = unk[di*dx*dy + dj*dx + dk];
}

Diffusion.cl

Page 42: Task based Programming with OmpSs and its Application

43

...
for (int x126 = 0; x126 < 150000; x126++) {
  int x127 = x126 % 2;
  bool x128 = x127 == 0;
  if (x128) {
    #pragma omp target device(opencl) ndrange(3, 0, 0, 0, x8, x12, x16, 16, 16, 4) copy_deps
    #pragma omp task in([3] xa1, [13] xa6, [x18] xa9, [x18] xa10, [6] xa11, [6] xa12) \
                     inout([x18] xa7, [x18] xa8)
    __kernel void solveStepx31(float xa0, __global float* xa1, int xa2, int xa3, int xa4, int xa5,
        __global float* xa6, __global float* xa7, __global float* xa8, __global float* xa9,
        __global int* xa10, __global int* xa11, __global float* xa12);

    solveStepx31(0.125f, x4, x8, x12, x16, 4, x112, x19, x89, x60, x90, x101, x104);

    #pragma omp target device(opencl) ndrange(3, 0, 0, 0, x8, x12, x16, 16, 16, 4) copy_deps
    #pragma omp task in([6] xa4) inout([x18] xa3)
    __kernel void expandBounds(int xa0, int xa1, int xa2, __global float* xa3, __global int* xa4,
        int xa5);

    expandBounds(x8, x12, x16, x19, x111, 4);

    int x133 = x126 + 1;
    int x134 = x133 % 5000;
    bool x135 = x134 == 0;
    if (x135) {
      int x136 = x133 / 5000;
      #pragma omp target device(smp) copy_deps
      #pragma omp task in([x18] x19, [3] x4)
      writeVTI(x19, x8, x12, x16, string("diffusion"), x4, 4, x136);
    }
    ...
    #pragma omp taskwait

Diffusion.cpp

Page 43: Task based Programming with OmpSs and its Application

44

CDR: Example 1 – Pure diffusion phenomena

  CDRs generates
–  Two OpenCL kernels (tasks)
–  One I/O task
–  The initialization code + body of the application + OmpSs pragmas

  The OmpSs runtime orchestrates the execution
–  Schedules tasks based on data dependencies
–  Manages data transfers between host and GPU

(Figure: execution trace showing input/output tasks and GPU computation tasks.)

Page 44: Task based Programming with OmpSs and its Application

45

def hotCube(cx: Float, cy: Float, cz: Float, edgeSize: Float)
           (xp: Float, yp: Float, zp: Float) = {
  if (xp >= cx - edgeSize && xp <= cx + edgeSize &&
      yp >= cy - edgeSize && yp <= cy + edgeSize &&
      zp >= cz - edgeSize && zp <= cz + edgeSize) Some(10) else Some(5)
}
val c = Cartesian(25, 50, 75)
val temp = Unknown(c)
val cube = Source(hotCube(12.5, 25, 37.5, 6) _, temp)
val hv = Vector(1, 1, 1)
val pre = PreProcess(nsteps = 500, deltaT = 1, h = hv)(cube)(PeriodicHighZ)
val v = Vector(0, 0, 1)
val convection = dt(temp) + grad(temp) * v
solve(pre)(flush) equation convection to "convection"

CDR: Example 2 – Pure convection phenomena

Stabilization scheme done internally by CDR

Page 45: Task based Programming with OmpSs and its Application

46

CDR: Example 2 – Pure convection phenomena

Thanks to the stabilization, the numerical scheme does not introduce artificial diffusion; the cubic form is preserved

val v = Vector(0, 0, 1)
val convection = dt(temp) + grad(temp) * v
solve(pre)(flush) equation convection to "convection"

Stabilization scheme done internally by CDR

Page 46: Task based Programming with OmpSs and its Application

47

  Incomplete code

def CDef(x: Rep[Float], y: Rep[Float], z: Rep[Float]) = {
  if (x >= 300 && x <= 400 && y >= 300 && y <= 400) (1700*1700) else (2000*2000)
}
val c = Cartesian(500, 500, 9)
val pressure = Unknown(c)
val waveSource = PointSourceSource(250, 250, 5)(rickerWalet(20) _, pressure)
val hv = Vector(1, 1, 1)
val pre = PreProcess(nsteps = 50000, deltaT = 0.003333, h = hv)(waveSource)
val C = CDef _
val wavePropagation = C * lapla(pressure) - dt2(pressure)
val post = snapshoot each 10 steps
solve(pre)(post) equation wavePropagation to "wave"

CDR: Example 3 – Acoustic wave equation in a heterogeneous env.

Page 47: Task based Programming with OmpSs and its Application

48

CDR: Example 3 – Acoustic wave equation

Page 48: Task based Programming with OmpSs and its Application

49

Conclusions

  OmpSs is a task-based programming model
–  Supports an asynchronous task execution model
–  Supports heterogeneity and distributed memory
–  Extends OpenMP
•  Some OmpSs characteristics are now in the standard, e.g. the dependence clauses
•  Continuous feedback to the standardisation body
–  OmpSs can improve MPI behavior by enabling the overlap of communication with computation

  OmpSs is not just a research project
–  Whole team of researchers, developers and PhD students contributing
–  Distributed as open source, in a pseudo-professional way (i.e., git repository, ticketing, …)
–  Open to collaborations!

Page 49: Task based Programming with OmpSs and its Application

50

Conclusions

  Enabling developers to stay in their comfort zone when programming for Exascale computing is still a challenge
  Efforts like DSLs backed by powerful runtimes such as OmpSs seem to be a good strategy
–  Offer a language tailored to solve problems in one domain
–  Run efficiently on an HPC system

  Future work
–  Further develop/optimize the environment
–  Combine it with MPI
–  Continue optimizing the runtime for further scaling, fault tolerance, …

•  Contact: [email protected]
•  Source code available from http://pm.bsc.es/ompss/

Page 50: Task based Programming with OmpSs and its Application

Jesus Labarta, Eduard Ayguade, Rosa M. Badia, Xavier Martorell, Vicenç Bertran, Alex Duran (Intel), Roger Ferrer, Xavier Teruel, Javier Bueno, Judit Planas, Sergi Mateo, Guillermo Miranda,
Florentino Sainz, Victor Lopez, Marta Garcia, Josep M. Perez, Omer Subasi, Javier Arias, Harald Servat, Judit Gimenez, Kallia Chronaki, Alejandro Fernández, …

Contributors

Page 51: Task based Programming with OmpSs and its Application

www.bsc.es

Thank you!

52

