
OpenACC for Fortran: PGI Compilers for Heterogeneous Supercomputing

Page 1

OpenACC for Fortran
PGI Compilers for Heterogeneous Supercomputing

Page 2

OpenACC Features

Single source – many targets (host+gpu, multicore, ...)
Data management
- structured data region, unstructured data lifetime
- user-managed data coherence

Parallelism management
- parallel construct, kernels construct, loop directive
- gang, worker, vector levels of parallelism

Concurrency (async, wait)
Interoperability (CUDA, OpenMP)

Page 3

!$acc data copyin(a(:,:), v(:)) copy(x(:))
!$acc parallel
!$acc loop gang
do j = 1, n
   sum = 0.0
   !$acc loop vector reduction(+:sum)
   do i = 1, n
      sum = sum + a(i,j)*v(i)
   enddo
   x(j) = sum
enddo
!$acc end parallel
!$acc end data

Page 4

!$acc data copyin(a(:,:), v(:)) copy(x(:))
call matvec( a, v, x, n )
!$acc end data
...
subroutine matvec( m, v, r, n )
   real :: m(:,:), v(:), r(:)
   !$acc parallel present(m,v,r)
   !$acc loop gang
   do j = 1, n
      sum = 0.0
      !$acc loop vector reduction(+:sum)
      do i = 1, n
         sum = sum + m(i,j)*v(i)
      enddo
      r(j) = sum
   enddo
   !$acc end parallel
end subroutine

Page 5

!$acc data copyin(a(:,:), v(:)) copy(x(:))
call matvec( a, v, x, n )
!$acc end data
...
subroutine matvec( m, v, r, n )
   real :: m(:,:), v(:), r(:)
   !$acc parallel default(present)
   !$acc loop gang
   do j = 1, n
      sum = 0.0
      !$acc loop vector reduction(+:sum)
      do i = 1, n
         sum = sum + m(i,j)*v(i)
      enddo
      r(j) = sum
   enddo
   !$acc end parallel
end subroutine

Page 6

!$acc data copyin(a, v, ...) copy(x)
call init( x, n )
do iter = 1, niter
   call matvec( a, v, x, n )
   call interp( b, x, n )
   !$acc update host( x )
   write(...) x
   call exch( x )
   !$acc update device( x )
enddo
!$acc end data
...

Page 7

call init( v, n )
call fill( a, n )
!$acc data copy( x )
do iter = 1, niter
   call matvec( a, v, x, n )
   call interp( b, x, n )
   !$acc update host( x )
   write(...) x
   call exch( x )
   !$acc update device( x )
enddo
!$acc end data
...

subroutine init( v, n )
   real, allocatable :: v(:)
   allocate(v(n))
   v(1) = 0
   do i = 2, n
      v(i) = ....
   enddo
   !$acc enter data copyin(v)
end subroutine

Page 8

use vmod
use amod
call initv( n )
call fill( n )
!$acc data copy( x )
do iter = 1, niter
   call matvec( a, v, x, n )
   call interp( b, x, n )
   !$acc update host( x )
   write(...) x
   call exch( x )
   !$acc update device( x )
enddo
!$acc end data
...

module vmod
   real, allocatable :: v(:)
contains
   subroutine initv( n )
      allocate(v(n))
      v(1) = 0
      do i = 2, n
         v(i) = ....
      enddo
      !$acc enter data copyin(v)
   end subroutine
   subroutine finiv
      !$acc exit data delete(v)
      deallocate(v)
   end subroutine
end module

Page 9

use vmod
use amod
call initv( n )
call fill( n )
!$acc data copy( x )
do iter = 1, niter
   call matvec( a, v, x, n )
   call interp( b, x, n )
   !$acc update host( x )
   write(...) x
   call exch( x )
   !$acc update device( x )
enddo
!$acc end data
...

module vmod
   real, allocatable :: v(:)
   !$acc declare create(v)
contains
   subroutine initv( n )
      allocate(v(n))
      v(1) = 0
      do i = 2, n
         v(i) = ....
      enddo
      !$acc update device(v)
   end subroutine
   subroutine finiv
      deallocate(v)
   end subroutine
end module

Page 10

Data Management

Data construct – from acc data to acc end data
- single entry, single exit (no goto in or out, no return)

Data region – dynamic extent of the data construct
- the region includes any routines called during the data construct

Dynamic data lifetime – from enter data to exit data
Data is either present or not present on the device

Page 11

!$acc data copy(x) copyin(v)
do iter = 1, niter
   call matvec( a, v, x, n )
   call interp( b, x, n )
   !$acc update host( x )
   write(...) x
   call exch( x )
   !$acc update device( x )
enddo
!$acc end data
...

subroutine matvec( m, v, r, n )
   real :: m(:,:), v(:), r(:)
   !$acc parallel present(v,r)
   !$acc loop gang
   do j = 1, n
      sum = 0.0
      !$acc loop vector &
      !$acc& reduction(+:sum)
      do i = 1, n
         sum = sum + m(i,j)*v(i)
      enddo
      r(j) = sum
   enddo
   !$acc end parallel
end subroutine

Page 12

Data Management

Data clauses:
- copy – allocate+copyin at entry, copyout+deallocate at exit
- copyin – allocate+copyin at entry, deallocate at exit
- copyout – allocate at entry, copyout+deallocate at exit
- create – allocate at entry, deallocate at exit
- delete – deallocate at exit (only on exit data)
- present – data must be present

No data movement if data is already present
- use the update directive for unconditional data movement (see the sketch below)
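
A minimal sketch of how the clauses combine on one construct; the array names b, c, and tmp and the compute routine here are illustrative, not from the slides:

! b is read-only input, c is output-only, tmp is device scratch,
! x is read and written on both sides of the region.
!$acc data copyin(b) copyout(c) create(tmp) copy(x)
   ! b:   allocate+copyin at entry, deallocate at exit
   ! c:   allocate at entry, copyout+deallocate at exit
   ! tmp: allocate at entry, deallocate at exit, never copied
   call compute( b, c, tmp, x, n )
!$acc end data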

Page 13

!$acc data copy(x) copyin(v)
do iter = 1, niter
   call matvec( a, v, x, n )
   call interp( b, x, n )
   !$acc update host( x )
   write(...) x
   call exch( x )
   !$acc update device( x )
enddo
!$acc end data
...

subroutine matvec( m, v, r, n )
   real :: m(:,:), v(:), r(:)
   !$acc parallel copy(r) &
   !$acc& copyin(v,m)
   !$acc loop gang
   do j = 1, n
      sum = 0.0
      !$acc loop vector &
      !$acc& reduction(+:sum)
      do i = 1, n
         sum = sum + m(i,j)*v(i)
      enddo
      r(j) = sum
   enddo
   !$acc end parallel
end subroutine

Page 14

Data Management

Declare directive
- create
  - allocatable: allocated on both host and device
  - static: statically allocated on both host and device (see the sketch below)
- copyin
  - in a procedure: allocate and initialize for the lifetime of the procedure
- present
  - in a procedure: data must be present during the procedure
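
For instance, declare create on a statically sized module array might look like this; a sketch, where the module name wmod and the setw routine are illustrative:

module wmod
   real :: w(1000)             ! static: allocated on both host and device
   !$acc declare create(w)
contains
   subroutine setw
      integer :: i
      do i = 1, 1000
         w(i) = real(i)
      enddo
      !$acc update device(w)   ! declare create allocates but never copies
   end subroutine
end module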

Page 15

!$acc data copy(x) copyin(v)
do iter = 1, niter
   call matvec( a, v, x, n )
   call interp( b, x, n )
   !$acc update host( x )
   write(...) x
   call exch( x )
   !$acc update device( x )
enddo
!$acc end data
...

subroutine matvec( m, v, r, n )
   real :: m(:,:), v(:), r(:)
   !$acc declare copyin(v,m)
   !$acc declare present(r)
   !$acc parallel
   !$acc loop gang
   do j = 1, n
      sum = 0.0
      !$acc loop vector &
      !$acc& reduction(+:sum)
      do i = 1, n
         sum = sum + m(i,j)*v(i)
      enddo
      r(j) = sum
   enddo
   !$acc end parallel
end subroutine

Page 16

Data Management

Update directive
- device(x,y,z)
- host(x,y,z) or self(x,y,z)
- data must be present
- subarrays allowed, even noncontiguous subarrays (see the sketch below)
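
As an illustration of the subarray support (the bounds here are made up), an interior Fortran section is legal even though it is not contiguous in memory:

! interior rows only: noncontiguous within each column of x
!$acc update host( x(2:n-1, 1:m) )
...
!$acc update device( x(2:n-1, 1:m) )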

Page 17

!$acc data copy(x) copyin(v)
do iter = 1, niter
   call matvec( a, v, x, n )
   call interp( b, x, n )
   !$acc update host( x )
   write(...) x
   call exch( x )
   !$acc update device( x )
enddo
!$acc end data
...

subroutine matvec( m, v, r, n )
   real :: m(:,:), v(:), r(:)
   !$acc declare copyin(v,m)
   !$acc declare present(r)
   !$acc parallel
   !$acc loop gang
   do j = 1, n
      sum = 0.0
      !$acc loop vector &
      !$acc& reduction(+:sum)
      do i = 1, n
         sum = sum + m(i,j)*v(i)
      enddo
      r(j) = sum
   enddo
   !$acc end parallel
end subroutine

Page 18

Parallelism Management

Parallel construct – from acc parallel to acc end parallel

Parallel region – dynamic extent of the parallel construct
- may call procedures on the device (acc routine directive)

gang, worker, vector parallelism
launches a kernel with a fixed number of gangs and workers and a fixed vector length
usually written as the combined acc parallel loop directive (sketch below)
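
A minimal sketch of the combined form, folding the parallel construct and the loop directive into one line:

!$acc parallel loop gang vector present(a,b,c)
do i = 1, n
   a(i) = b(i) + c(i)
enddo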

Page 19

!$acc parallel present(a,b,c)
do i = 1, n
   a(i) = b(i) + c(i)
enddo
!$acc end parallel
...

Page 20

!$acc parallel present(a,b,c)
!$acc loop gang vector
do i = 1, n
   a(i) = b(i) + c(i)
enddo
!$acc end parallel
...

Page 21

!$acc parallel present(a,b,c)
!$acc loop gang
do j = 1, n
   !$acc loop vector
   do i = 1, n
      a(i,j) = b(i) + c(j)
   enddo
enddo
!$acc end parallel
...

Page 22

Parallelism Management

Loop directive
- acc loop seq – run this loop sequentially
- acc loop gang – run this loop across gangs
- acc loop vector – run this loop in vector/SIMD mode
- acc loop auto – let the compiler detect whether this loop is parallel
- acc loop independent – assert that this loop IS parallel
- acc loop reduction(+:variable) – sum reduction
- acc loop private(t) – a private copy of t for each loop iteration
- acc loop collapse(2) – schedule two nested loops together (sketch below)
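
A sketch combining several of these clauses on one nest; the loop body is illustrative:

!$acc parallel present(a,b,c)
!$acc loop collapse(2) independent private(t)
do j = 1, n
   do i = 1, n
      t = b(i) + c(j)    ! t is private to each iteration
      a(i,j) = t
   enddo
enddo
!$acc end parallel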

Page 23

Parallelism Management

Kernels construct – from acc kernels to acc end kernels

Kernels region – dynamic extent of the kernels construct
- may call procedures on the device (acc routine directive)

gang, worker, vector parallelism
launches one or more kernels
usually written as the combined acc kernels loop directive (sketch below)
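
And the combined kernels form, analogous to the parallel loop sketch above:

!$acc kernels loop
do i = 1, n
   a(i) = b(i) + c(i)
enddo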

Page 24

!$acc kernels present(a,b,c)
do i = 1, n
   a(i) = b(i) + c(i)
enddo
!$acc end kernels
...

Page 25

!$acc kernels present(a,b,c)
do j = 1, n
   do i = 1, n
      a(i,j) = b(i) + c(j)
   enddo
enddo
!$acc end kernels
...

Page 26

Parallelism Management

acc parallel
- more prescriptive, more like OpenMP parallel
- user-specified parallelism
- acc loop implies acc loop independent

acc kernels
- more descriptive, relies more on compiler analysis
- compiler-discovered parallelism
- acc loop implies acc loop auto
- less useful in C/C++
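
A sketch of the practical difference on the same nest: with kernels the compiler chooses the schedule from its own analysis; with parallel the programmer asserts the parallelism.

! descriptive: the compiler decides how to parallelize each loop
!$acc kernels
do j = 1, n
   do i = 1, n
      a(i,j) = b(i) + c(j)
   enddo
enddo
!$acc end kernels

! prescriptive: the loop directives assert the parallelism
!$acc parallel
!$acc loop gang
do j = 1, n
   !$acc loop vector
   do i = 1, n
      a(i,j) = b(i) + c(j)
   enddo
enddo
!$acc end parallel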

Page 27

Building OpenACC Programs

pgfortran (pgf90, pgf95)
-help (-help -ta, -help -acc)
-acc – enable OpenACC directives
-ta – select the target accelerator (-ta=tesla)
-Minfo or -Minfo=accel
compile, link, and run as normal
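
A typical build and run, using only the flags above; the file name is illustrative:

pgfortran -acc -ta=tesla -Minfo=accel -o matvec matvec.f90
./matvec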

Page 28

Building OpenACC Programs

-acc=sync – ignore async clauses
-acc=noautopar – disable autoparallelization
-ta=tesla:cc20,cc30,cc35 – select compute capability
-ta=tesla:cuda7.0 – select CUDA toolkit version
-ta=tesla:nofma – disable fused multiply-add
-ta=tesla:nordc – disable relocatable device code
-ta=tesla:fastmath – use the fast, lower-precision math library
-ta=tesla:managed – allocate data in managed memory
-ta=multicore – generate parallel multicore (host) code
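
Suboptions stack after the colon, comma-separated; one plausible combination, plus the multicore fallback:

pgfortran -acc -ta=tesla:cc35,cuda7.0,fastmath -Minfo=accel matvec.f90
pgfortran -acc -ta=multicore -Minfo=accel matvec.f90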

Page 29

Building OpenACC Programs

-Minline – enable procedure inlining
-Minline=levels:2 – two levels of inlining
-O – enable optimization
-fast – more optimization
-tp – set the target processor (default is the build processor)

Page 30

Running OpenACC Programs

ACC_DEVICE_NUM – set the device number to use
PGI_ACC_TIME – set to collect profile information
PGI_ACC_NOTIFY – bitmask of activity to report
- 1 – kernel launches
- 2 – data uploads/downloads
- 4 – wait events
- 8 – region entry/exit
- 16 – data allocates/frees
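
The PGI_ACC_NOTIFY values form a bitmask, so they add; for example, 3 = 1 + 2 reports both kernel launches and data traffic (shell syntax assumed):

export ACC_DEVICE_NUM=0    # run on device 0
export PGI_ACC_NOTIFY=3    # 1 (kernel launches) + 2 (uploads/downloads)
./matvec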

Page 31

Performance Tuning

Data Management
- data regions or dynamic data management
- minimize the frequency and volume of data traffic

Parallelism Management
- as many loops running in parallel as possible

Kernel Schedule Tuning
- which loops run in gang mode, which in vector mode

Page 32

Data Management

Profile to find where data movement occurs
Insert data directives to remove data movement
Insert update directives to manage coherence
See async, below

Page 33

Kernel Schedule Tuning

Look at -Minfo messages, the profile, PGI_ACC_TIME
Enough gang parallelism generated?
- gangs = thread blocks
- too few if the gang count << SM count

Too much vector parallelism generated?
- vector = threads
- too much if the vector length >> loop trip count

Loop collapsing
Worker parallelism for intermediate loops (sketch below)
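
A sketch of the last two points, assuming an illustrative triple nest: collapse the outer loops to create more gangs, or give the intermediate loop to workers.

! more gang parallelism: fuse the k and j loops
!$acc parallel present(a,b,c)
!$acc loop collapse(2) gang
do k = 1, nk
   do j = 1, nj
      !$acc loop vector
      do i = 1, ni
         a(i,j,k) = b(i,j,k) + c(i,j,k)
      enddo
   enddo
enddo
!$acc end parallel

! worker parallelism for the intermediate loop
!$acc parallel present(a,b,c)
!$acc loop gang
do k = 1, nk
   !$acc loop worker
   do j = 1, nj
      !$acc loop vector
      do i = 1, ni
         a(i,j,k) = b(i,j,k) + c(i,j,k)
      enddo
   enddo
enddo
!$acc end parallel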

Page 34

!$acc parallel present(a,b,c)
!$acc loop gang vector
do i = 1, n
   a(i) = b(i) + c(i)
enddo
!$acc end parallel
...

Page 35

!$acc parallel present(a,b,c) num_gangs(30) vector_length(64)
!$acc loop gang
do j = 1, n
   !$acc loop vector
   do i = 1, n
      a(i,j) = b(i) + c(j)
   enddo
enddo
!$acc end parallel
...

Page 36

!$acc kernels present(a,b,c)
!$acc loop gang(32)
do j = 1, n
   !$acc loop vector(64)
   do i = 1, n
      a(i,j) = b(i) + c(j)
   enddo
enddo
!$acc end kernels
...

Page 37

!$acc kernels present(a,b,c)
!$acc loop gang vector(4)
do j = 1, n
   !$acc loop gang vector(32)
   do i = 1, n
      a(i,j) = b(i) + c(j)
   enddo
enddo
!$acc end kernels
...

Page 38

Routines

Must tell the compiler which routines to compile for the device
- acc routine

Must tell the compiler what parallelism is used in the routine
- acc routine gang / worker / vector / seq

May also be used to interface to native CUDA C

Page 39

subroutine asub( a, b, x, n )
   real a(*), b(*)
   real x
   integer n
   integer i
   !$acc loop gang vector
   do i = 1, n
      a(i) = x*b(i)
   enddo
end subroutine

Page 40

subroutine asub( a, b, x, n )
   !$acc routine gang
   real a(*), b(*)
   real, value :: x
   integer, value :: n
   integer i
   !$acc loop gang vector
   do i = 1, n
      a(i) = x*b(i)
   enddo
end subroutine

Page 41

!$acc routine(asub) gang

interface
   subroutine asub( a, b, x, n )
      !$acc routine gang
      real a(*), b(*)
      real, value :: x
      integer, value :: n
   end subroutine
end interface

use asub_mod

!$acc parallel present(a,b,x) num_gangs(n/32) vector_length(32)
call asub( a, b, x, n )
!$acc end parallel

Page 42

!$acc parallel present(a,b,c) num_gangs(n) vector_length(64)
!$acc loop gang
do j = 1, n
   call asubv( a(1,j), b, c(j), n )
enddo
!$acc end parallel
...
subroutine asubv( a, b, x, n )
   !$acc routine vector
   ...
   !$acc loop vector
   do i = 1, n
      a(i) = x*b(i)
   enddo
end subroutine

Page 43

!$acc parallel present(a,b,c) num_gangs(n) vector_length(64)
call msub( a, b, c, n )
!$acc end parallel
...
subroutine msub( a, b, c, n )
   !$acc routine gang
   !$acc routine(asubv) vector
   ...
   !$acc loop gang
   do j = 1, n
      call asubv( a(1,j), b, c(j), n )
   enddo
end subroutine

Page 44

Routines

A routine must know it is being compiled for the device
Caller and callee must agree on the level of parallelism
- modules help enforce this agreement

Scalar arguments passed by value are more efficient

Page 45

Asynchronous operation

async clause on parallel, kernels, enter data, exit data, update (and on data – a PGI extension)
The async argument is the queue number to use
- PGI supports 16 queues, which map to CUDA streams
- the default is the "synchronous" queue (not the null queue)

wait directive to synchronize the host with async queue(s)
wait directive to synchronize between async queues
note the behavior of the synchronous queue with -Mcuda[lib]
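
Since async also applies to data directives, transfers can overlap compute; a sketch, assuming a, b, and c are already allocated on the device:

!$acc update device(a) async(1)     ! stage a on queue 1
!$acc parallel loop gang async(2)   ! compute on b meanwhile, queue 2
do i = 1, n
   b(i) = 2.0*b(i)
enddo
!$acc wait(1) async(2)              ! queue 2 now waits for the transfer
!$acc parallel loop gang async(2)
do i = 1, n
   c(i) = a(i) + b(i)
enddo
!$acc wait                          ! host waits for everything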

Page 46

!$acc parallel loop gang ... async
do j = 1, n
   call asubv( a(1,j), b, c(j), n )
enddo
...
!$acc parallel loop gang ... async
do j = 1, n
   call dosomethingelse(...)
enddo
...
!$acc parallel loop gang ... async
do j = 1, n
   call doother(...)
enddo
!$acc wait

Page 47

!$acc parallel loop gang ... async(1)
do j = 1, n
   call asubv( a(1,j), b, c(j), n )
enddo
...
!$acc parallel loop gang ... async(1)
do j = 1, n
   call dosomethingelse(...)
enddo
...
!$acc parallel loop gang ... async(1)
do j = 1, n
   call doother(...)
enddo
!$acc wait

Page 48

!$acc parallel loop gang ... async(1)
do j = 1, n
   call asubv( a(1,j), b, c(j), n )
enddo
...
!$acc parallel loop gang ... async(1)
do j = 1, n
   call dosomethingelse(...)
enddo
...
!$acc parallel loop gang ... async(1)
do j = 1, n
   call doother(...)
enddo
!$acc wait(1)

Page 49

!$acc parallel loop gang ... async(1)
do j = 1, n
   call asubv( a(1,j), b, c(j), n )
enddo
...
!$acc parallel loop gang ... async(2)
do j = 1, n
   call dosomethingelse(...)
enddo
...
!$acc parallel loop gang ... async(3)
do j = 1, n
   call doother(...)
enddo
!$acc wait

Page 50

!$acc parallel loop gang ... async(1)
do j = 1, n
   call asubv( a(1,j), b, c(j), n )
enddo
...
!$acc parallel loop gang ... async(2)
do j = 1, n
   call dosomethingelse(...)
enddo
...
!$acc parallel loop gang ... async(3)
do j = 1, n
   call doother(...)
enddo
!$acc wait(1,2)

Page 51

!$acc parallel loop gang ... async(1)
do j = 1, n
   call asubv( a(1,j), b, c(j), n )
enddo
...
!$acc parallel loop gang ... async(2)
do j = 1, n
   call dosomethingelse(...)
enddo
...
!$acc wait(2) async(1)
!$acc parallel loop gang ... async(1)
do j = 1, n
   call doother(...)
enddo
!$acc wait(1)

Page 52

!$acc parallel loop gang ... async(1)
do j = 1, n
   call asubv( a(1,j), b, c(j), n )
enddo
...
!$acc update host(a) async(1)
...
!$acc parallel loop gang ... async(1)
do j = 1, n
   call doother(...)
enddo
...
!$acc wait(1)

