Programming in Co-Array Fortran
John M. Levesque
CTO Office
Applications
With help from Bob Numrich and Jim Schwarzmeier
Outline
• What is Co-Array Fortran
• Why you need assistance from the compiler
• Co-arrays and the Interconnect
• Why Co-Arrays are better than MPI
• Things to watch out for
• Tricks of the CAF coder
• Results
The Guiding Principle behind Co-Array Fortran
• What is the smallest change required to make Fortran 90 an effective parallel language?
• How can this change be expressed so that it is intuitive and natural for Fortran programmers?
• How can it be expressed so that existing compiler technology can implement it easily and efficiently?
What is Co-Array Syntax?
Co-Array syntax is a simple extension to normal Fortran syntax.
• It uses normal rounded brackets ( ) to point to data in local memory.
• It uses square brackets [ ] to point to data in remote memory.
• Syntactic and semantic rules apply separately but equally to ( ) and [ ].
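As a minimal sketch of the distinction (the co-array x and the image number p here are illustrative, not from a particular code):

   real    :: x(10)[*]    ! one copy of x(1:10) on every image
   integer :: p
   x(3)    = 1.0          ! ( ) alone: the element in this image's local memory
   x(3)[p] = 2.0          ! [ ] added: the same element in image p's memory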
Examples of Co-Array Declarations
real :: s[*]
real :: a(n)[*]
complex :: z[*]
integer :: index(n)[*]
real :: b(n)[p, *]
real :: c(n,m)[0:p, -7:q, 11:*]
real, allocatable :: w(:)[:]
type(field) :: maxwell[p,*]
CAF Memory Model
[Figure: every image holds its own copy of X(1:N); from any image, image P's copy can be addressed as X(1:N)[p] and image Q's as X(1:N)[q].]
One to One Model
[Figure: same memory layout as before, with each image mapped to one physical processor.]
MANY to One Model (OpenMP on Node)
[Figure: same memory layout, with many physical processors (OpenMP threads on a node) executing one image.]
One to One Model (multiple images on Node)
[Figure: same memory layout, with one physical processor per image and multiple images sharing a node.]
What Do Co-Dimensions Mean?
real :: x(n)[p,q,*]
• Replicate an array of length n, one on each image.
• Build a map so each image knows how to find the array on any other image.
• Organize images in a logical (not physical) three-dimensional grid.
• The last co-dimension acts like an assumed-size array: * = num_images()/(p*q)
• A specific implementation could choose to represent memory hierarchy through the co-dimensions.
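As a sketch of how an image can query its place in that logical grid (the variable names are illustrative; this_image() with a co-array argument returning the co-subscripts follows standard coarray semantics):

   real    :: x(n)[p,q,*]
   integer :: me, coords(3)
   me     = this_image()     ! image index, 1 .. num_images()
   coords = this_image(x)    ! this image's co-subscripts in the logical p x q x * grid
   ! images map to co-subscripts in column-major order:
   ! image 1 is [1,1,1], image 2 is [2,1,1], image p+1 is [1,2,1], image p*q+1 is [1,1,2]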
The CAF Execution Model
• The number of images is fixed, and each image has its own index, retrievable at run-time:
     1 <= num_images()
     1 <= this_image() <= num_images()
• Each image executes the same program independently of the others.
• The programmer inserts explicit synchronization and branching as needed.
• An "object" has the same name in each image.
• Each image works on its own local data.
• An image moves remote data to local data through, and only through, explicit CAF syntax.
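A minimal sketch of these rules in one program (the array names and the left-neighbor choice are illustrative; sync_all() is the synchronization call used elsewhere in this deck):

   program caf_model_sketch
   implicit none
   real    :: work(100)[*], halo(100)
   integer :: me, np, left
   me = this_image()              ! my index, fixed for the whole run
   np = num_images()              ! total number of images, also fixed
   work(:) = real(me)             ! each image computes on its own local data
   left = me - 1
   if (left == 0) left = np
   call sync_all()                ! explicit synchronization inserted by the programmer
   halo(:) = work(:)[left]        ! remote data moved only through explicit [ ] syntax
   end program caf_model_sketch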
Co-Array Fortran Extension
Incorporate the SPMD Model into Fortran 90:
• Multiple images of the same program
• Text and data are replicated in each image
Mark some variables with co-dimensions:
• Co-dimensions behave like normal dimensions
• Co-dimensions express a logical problem decomposition
• One-sided data exchange between co-arrays using a Fortran-like syntax
Require the underlying run-time system to map the logical problem decomposition onto specific hardware.
Communication Using CAF Syntax
y(:) = x(:)[p]
myIndex(:) = index(:)
yourIndex(:) = index(:)[you]
x(index(:)) = y[index(:)]
x(:)[q] = x(:) + x(:)[p]

An absent co-dimension defaults to the local object.
Irregular and Changing Data Structures
[Figure: a co-array of a derived type whose pointer component Z%ptr points to local data X on each image; Z[p]%ptr names the structure on image p, whose pointer component addresses data in image p's memory.]
Co-Array Fortran can be implemented:
• Directly in the compiler. On systems where the compiler can issue memory fetches and stores directly to a remote processor's memory, the statement becomes a simple remote store.
     Allows co-array references in a loop to be combined into a vector load or store
     Allows the compiler to use its normal prefetch mechanism to move fetches ahead of the reference
• Via a pre-processor. Rice University is currently working on such a translator, which generates subroutine calls for transferring data to the remote processor.
     Significantly more difficult to get performance better than MPI
Importance of Vectorizing loop with the CAF reference
 7.          iz = this_image(a)
 8.  V----<  do ix = 1, kx
 9.  V r--<    do iy = 1, ky
10.  V r         a(ix,iy) = b(iy,iz)[ix]
11.  V r-->    end do
12.  V---->  end do

ftn-3021 ftn: INLINE File = data_distro.f90, Line = 7
  Routine _THIS_IMAGE3 was not inlined because the compiler was unable to locate the routine to expand it inline.
ftn-6204 ftn: VECTOR File = data_distro.f90, Line = 8
  A loop starting at line 8 was vectorized.
ftn-6005 ftn: SCALAR File = data_distro.f90, Line = 9
  A loop starting at line 9 was unrolled 4 times.
ftn-6208 ftn: VECTOR File = data_distro.f90, Line = 9
  A loop starting at line 9 was vectorized as part of the loop starting at line 8.
Another Example
629.  V------------<  do im = 1, 10000
630.  V                 if(blockid.eq.imon_in(4,im) .and.
631.  V              &      ibegin(sx)  .le.imon_in(1,im) .and.
632.  V              &      ibegin(sx+1).gt.imon_in(1,im) .and.
633.  V              &      jbegin(sy)  .le.imon_in(2,im) .and.
634.  V              &      jbegin(sy+1).gt.imon_in(2,im) .and.
635.  V              &      kbegin(sz)  .le.imon_in(3,im) .and.
636.  V              &      kbegin(sz+1).gt.imon_in(3,im)) then
637.  V                   num_mon_me = num_mon_me+1
638.  V                   lmon(im) = .true.
639.  V                   proc_mon[ioid]%array(im) = procid_global
640.  V                 endif
641.  V------------>  end do

ftn-6375 ftn: VECTOR File = main_3d.f, Line = 629
  A loop starting at line 629 would benefit from "!dir$ safe_address".
ftn-6204 ftn: VECTOR File = main_3d.f, Line = 629
  A loop starting at line 629 was vectorized.
Special features of Baker relating to CAF/UPC
• On the X1, X1E, and 'BlackWidow', the custom processor directly emits addresses for any memory location in the machine. Scalar or vector loads/stores can be done to any global address in the system.
• On Baker, the Gemini NIC is used to 'extend' the address space of Opteron references so they can access memory on remote nodes.
• The Fortran and C compilers recognize CAF references, x(i)[dest_pe], or UPC 'shared' references, x[i][threads], and generate the appropriate ncHT messages to Gemini to load from or store to remote memory.
• Users can stride on local offsets or across processor space with any stride, including Gather/Scatter.
• The compiler should generate vector requests as appropriate.
Things to watch out for
• Typically one must use CAF on symmetric arrays – arrays whose virtual address is the same on all processors.
     This is typically done by allocating an array as a co-array.
• Static arrays can be used.
• Allocatable arrays can be used.*
• Automatic arrays can be used.*
* These can be costly – it takes time to allocate a symmetric array across all processors. See the sketch below.
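A sketch of the allocatable case under the point above (the size nbuf is illustrative): allocating a co-array is effectively a collective operation across all images, which is where the cost comes from.

   integer, parameter :: nbuf = 100000
   real, allocatable  :: buf(:)[:]    ! co-array, symmetric across images
   allocate( buf(nbuf)[*] )           ! every image must allocate together (implied synchronization)
   ! ... use buf locally and as buf(:)[p] from other images ...
   deallocate( buf )                  ! deallocation is likewise collective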
Tricks of the CAF Coder
Since CAF pointer variables are not allowed, one can use a derived type that contains a pointer:

TYPE RB
   real*8, dimension(:,:), pointer :: p_precv_buf
END TYPE RB
TYPE (RB) precv_buf[0:*]

precv_buf%p_precv_buf => recv_buf(1:nx,1:ny)
Using Derived Types
This is particularly useful when modifying a message-passing library where you do not know the sizes of the arrays in advance. Without the pointer trick, you would have to allocate the co-array of the right size on every call and perform an extra copy into it.
By using derived types you transfer only the data that is needed and avoid the extra copy, greatly reducing the overhead of performing the transfer. A sketch of the usage follows.
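A sketch of how the RB example above gets used on the reading side (the image number p, the array local_block, and the sync_all() call are illustrative; the same pattern appears in the library code on the following slides):

   ! every image aims its co-array pointer component at its own existing buffer
   precv_buf%p_precv_buf => recv_buf(1:nx,1:ny)
   call sync_all()                    ! make sure every image has set its pointer
   ! any image may now read image p's buffer directly, with no intermediate copy
   local_block(1:nx,1:ny) = precv_buf[p]%p_precv_buf(1:nx,1:ny)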
Using derived types in an MPI library
!****************************************************************
   subroutine mpigatherv (sendbuf, sendcnt, sendtype, recvbuf, recvcnts, &
                          displs, recvtype, root, comm)
!
! Collects different messages from each thread on masterproc
!
   use shr_kind_mod, only: r8 => shr_kind_r8
   use mpishorthand
   implicit none

   real (r8), intent(in)  :: sendbuf(*)
   real (r8), intent(out) :: recvbuf(*)
   integer, intent(in) :: displs(*)
   integer, intent(in) :: sendcnt
   integer, intent(in) :: sendtype
   integer, intent(in) :: recvcnts(*)
   integer, intent(in) :: recvtype
   integer, intent(in) :: root
   integer, intent(in) :: comm
   integer ier            ! MPI error code
#if ( defined CAF )
   integer i, j, start, end
   integer mytid, nproc, info
   target sendbuf

   TYPE R4
      real(r8), dimension(:), POINTER :: p_ptmp
   END TYPE R4
   TYPE(R4) :: ptmp[*]

   call mpi_comm_rank(MPI_COMM_WORLD, mytid, info)

   i = this_image()
   ptmp[i]%p_ptmp => sendbuf(1:sendcnt)

   CALL mpi_barrier(MPI_COMM_WORLD, info)

   if (mytid .eq. root) then
      do i = 1, num_images()
         start = displs(i) + 1
         end   = start + recvcnts(i) - 1
         recvbuf(start:end) = ptmp[i]%p_ptmp(1:recvcnts(i))
      end do
   end if
CALL mpi_barrier(MPI_COMM_WORLD,info)
#else
   call t_startf ('mpi_gather')
   call mpi_gatherv (sendbuf, sendcnt, sendtype, recvbuf, recvcnts, displs, recvtype, &
                     root, comm, ier)
   if (ier /= mpi_success) then
      write(6,*) 'mpi_gather failed ier=', ier
      call endrun
   end if
   call t_stopf ('mpi_gather')
#endif
   return
   end subroutine mpigatherv
Pointers in Derived Types
TYPE P4
   integer len1
real(REAL8),dimension(:), POINTER :: p_send_low
END TYPE P4
TYPE R4
integer len2
real(REAL8),dimension(:), POINTER :: p_send_scratch
END TYPE R4
TYPE S4
integer len3
integer, dimension(:),POINTER :: p_rsend_index
END TYPE S4
TYPE(P4) :: send_low[*]
TYPE(R4) :: send_scratch[*]
TYPE(S4) :: rsend_index[*]
! set co-array pointers to the locations of the output arrays
send_scratch%p_send_scratch => input(1:length)
rsend_index%p_rsend_index => send_index(1:length)
send_low%p_send_low => send_lo(0:maxpe)
Must Barrier before using pointer
And then use them
do n = 1, recv_num
   pe = recv_pe(n)
   tc = ilenght(recv_length(pe),pe)
   ll = send_low[pe+1]%p_send_low(mype)
   do l = 1, tc
!dir$ concurrent
      do lll = ilength(l,pe), ilenght(l,pe)-1
         rindex = rsend_index[pe+1]%p_rsend_index(ll)
         output(recv_index(lll)) = output(recv_index(lll)) + &
                                   send_scratch[pe+1]%p_send_scratch(rindex)
         ll = ll + 1
      enddo ! lll
   enddo ! l
enddo ! n
Don't buffer messages
One of the tremendous advantages of co-arrays is that one does not have to build message blocks in buffers.

Typical MPI code:
   pack buffer
   send/recv buffer
   unpack buffer

Good CAF code:
   put data directly into the remote processor's memory
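A sketch of the contrast for one halo row (the buffer, neighbor, and status variables are illustrative, not from a specific code):

   ! typical MPI code: pack, send/recv, unpack
   buf_send(1:nx) = u(1:nx,ny)
   call MPI_send(buf_send, nx, MPI_REAL8, north, 1, MPI_COMM_WORLD, ierr)
   call MPI_recv(buf_recv, nx, MPI_REAL8, south, 1, MPI_COMM_WORLD, status, ierr)
   u(1:nx,0) = buf_recv(1:nx)

   ! good CAF code: put the row directly into the remote image's memory
   u(1:nx,0)[north] = u(1:nx,ny)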
How to write a Global_sum using Co-arrays
All processors come into the routine.
• Everyone does a local sum and/or stores its local scalar into a co-array scalar
• Tells the master (processor 1 (or 0)) that it has set its value
• Spins on the master-ready flag
The master reads all scalars and performs the sum
• Broadcasts the scalar to all processors
• Broadcasts the master_ready flag to all processors
What the Children Do
! sum local contributions
reduce_real_local = c0
do j=jphys_b,jphys_e
   do i=iphys_b,iphys_e
      reduce_real_local = reduce_real_local + X(i,j)*MASK(i,j)
   end do
end do
!
! send local sum to master
reduce_real_global(1,me)[1] = reduce_real_local
call sync_memory()
child_ready(1,me)[1] = .true.

if(me.eq.1)then
   ! master code is shown on the next slides
else
   do while (.not. master_ready(1,me))
   enddo
   master_ready(1,me) = .false.
endif

global_sum_caf = reduce_real_global(1,me)
end function global_sum_caf
What the Master does
if(me.eq.1)then
! wait until all local results have arrived
children_ready = .false.
do while (.not. children_ready)
children_ready = .true.
do i = 2,NPROC_X*NPROC_Y
children_ready = children_ready .and. child_ready(1,i)
enddo
enddo
do i = 2,NPROC_X*NPROC_Y
child_ready(1,i) = .false.
enddo
! global sum
global_sum = reduce_real_global(1)
do i = 2,NPROC_X*NPROC_Y
global_sum = global_sum + reduce_real_global(1,i)
enddo
What the Master does (cont)
! broadcast
do i = 1,NPROC_X*NPROC_Y
reduce_real_global(1,i)[i] = global_sum
enddo
call sync_memory()
do i = 2,NPROC_X*NPROC_Y
master_ready(1,i)[i] = .true.
enddo
Make sure that child_ready and master_ready are typed volatile.
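A sketch of the flag declarations implied here (shapes follow the code above; the volatile attribute is what keeps the spin loops re-reading memory rather than a cached register value, in the Cray CAF dialect used on these slides):

   logical, volatile :: child_ready(1,NPROC_X*NPROC_Y)[*]
   logical, volatile :: master_ready(1,NPROC_X*NPROC_Y)[*]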
Taking full advantage of CAF/UPC
CAF/UPC can be used to do lightweight 'message passing'
• CAF/UPC do 'zero-sided' messaging by directly copying data from (local) source arrays to (remote) destination arrays, without intervening buffer copying
• References are generated by the compiler, so there is no library-call overhead
• However, this is still the basic 'compute'/'communicate' approach, so it does not overlap communication with computation

Here we propose that the last step in the 'compute' phase include a direct store of the latest array values to the remote memory of the consumer processor
• That is, just after final array values are stored to local memory, while the values are still in processor registers, also store them directly to the memory locations needed by the remote consumer processor for the next iteration or time step
• Saves re-loading the array values later, as in conventional CAF. Also 'meters out' remote PUTs on the network while other values of the arrays are computed, reducing network contention (*)

Why is this important? Because strong scaling forces small grids per MPI process, hence short messages and more benefit from fine-grain overlapping of communication and computation. For a fixed global problem, we can strongly scale runs on Baker to reduce runtime.

(*) ala Norm Troullier, Cray Inc.
Optimizing short-message communication with CAF/UPC
Illustrate with generic nearest-neighbor explicit algorithms, such as Jacobi iteration of Laplace's equation on the unit square:

   \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0, \qquad u(x,0) = 0.0, \quad u(x,1) = 1.0, \quad u(0,y) = u(1,y) = y

Simple explicit differencing leads to

   u_{i,j}^{n+1} = u_{i,j}^{n} + \tfrac{\alpha}{4}\, du_{i,j}^{n}

where

   du_{i,j}^{n} = u_{i-1,j}^{n} + u_{i+1,j}^{n} + u_{i,j-1}^{n} + u_{i,j+1}^{n} - 4\, u_{i,j}^{n}

Iterate until the global MAX of |du_{i,j}^{n}| is below \epsilon = 10^{-4}. To maintain numerical stability choose \alpha \le 1.0.
Parallelize with domain decomposition, SPMD
Give each processor 'halo' cells; here, a 4x4 processor grid, PX = PY = 4.

[Figure: each processor owns an nx by ny block of the (x,i), (y,j) grid, with halo cells at i = 0, i = nx+1, j = 0, and j = ny+1.]

   real*8 u(0:nx+1, 0:ny+1, 0:1)     ! last dimension holds the 'time' levels n and n+1
High-level structure of code
• INIT
• Use #ifdef MPI, #ifdef CAF, #ifdef CAF_overlap in a single source
• Set global boundary conditions
• DO iter = 1, maxiter
     Communicate halo exchanges via MPI or CAF
     Compute for each processor:
        call Laplace (u, k0, k1, dumax, nx, ny, alpha)
        After the loops in Laplace, MPI and CAF return. For CAF_overlap,
        communicate a) halo data, and b) each PE PUTs its local dumax to the 'master'
     Communicate dumax:
        MPI – Allreduce
        CAF – master reads dumax[source_pe], computes global_dumax, broadcasts global_dumax[dest_pe]
        CAF_overlap – master computes global_dumax from its local dumax array (filled by all PEs while in Laplace), broadcasts global_dumax[dest_pe]
     if(global_dumax < epsilon) go to 1000     ! convergence test
High-level source code
      integer(4), parameter :: nx = 100, ny = 100, maxiter = 20000    ! WEAK scaling
      real(8), parameter :: epsilon = 1.d-4, alpha = 0.95d0
      integer(4) :: iter, k0, k1
      integer(4) :: PPX, PPY, px, py, master, pxmaster, pymaster
#ifdef MPI
      real(8), dimension(0:nx+1, 0:ny+1, 0:1) :: u
      real(8) :: dumax, global_dumax
      call mpi_comm_rank(mpi_comm_world, myrank, ierror)    ! myrank = mype
      call mpi_comm_size(mpi_comm_world, mysize, ierror)    ! mysize = numpes
      mype = myrank
      numpes = mysize
#endif
#ifdef CAF
      real(8), allocatable, dimension(:,:,:)[:,:] :: u
      real(8), allocatable :: dumax[:,:]
      real(8) :: global_dumax
      mype = this_image() - 1
      numpes = num_images()
#endif
High-level source code (cont)
#ifdef CAF_overlap                          ! '_overlap' is the optimized CAF version
      real(8), allocatable, dimension(:,:,:)[:,:] :: u
      real(8), allocatable, dimension(:)[:,:] :: dumax
      real(8) :: global_dumax
      common /CAFstuff/ PPX, PPY, px, py, master, pxmaster, pymaster, mype, numpes
      mype = this_image() - 1
      numpes = num_images()
#endif

      PPX = INT(sqrt(dfloat(numpes))) ; PPY = numpes/PPX
      . . .
#ifdef CAF
      allocate( u(0:nx+1, 0:ny+1, 0:1)[0:PPX-1, 0:*] )
      allocate( dumax[0:PPX-1, 0:*] )
#endif
#ifdef CAF_overlap
      allocate( u(0:nx+1, 0:ny+1, 0:1)[0:PPX-1, 0:*] )
      allocate( dumax(0:numpes-1)[0:PPX-1, 0:*] )
#endif
High-level source code (cont)

!...main iteration loop
      k0 = 0
      DO iter = 1, maxiter
         k1 = mod( 1 + mod(k0, 2), 2 )
         px = MOD(mype, PPX)
         py = mype/PPY

!...before the next compute step, communicate data in NSEW directions

#ifdef MPI
! send to North neighbor
         if(py < PPY-1) then
            dest = px + (py+1)*PPX
            buf_send(1:nx) = u(1:nx,ny,k0)
            call MPI_send(buf_send, nx, MPI_real8, dest, 1, MPI_COMM_WORLD, ierror)
         endif
! recv North message from South neighbor
         if(py > 0) then
            dest = px + (py-1)*PPX
            call MPI_recv(buf_recv, nx, MPI_real8, dest, 1, MPI_COMM_WORLD, status, ierror)
            u(1:nx,0,k0) = buf_recv(1:nx)
         endif
         . . .
High-level source code (cont)
#ifdef CAF
! send to North neighbor with stride 1
         if(py < PPY-1) u(1:nx,0,k0)[px,py+1] = u(1:nx,ny,k0)
! send to South neighbor with stride 1
         if(py > 0)     u(1:nx,ny+1,k0)[px,py-1] = u(1:nx,1,k0)
! send to East neighbor with stride nx+2
         if(px < PPX-1) u(0,1:ny,k0)[px+1,py] = u(nx,1:ny,k0)
! send to West neighbor with stride nx+2
         if(px > 0)     u(nx+1,1:ny,k0)[px-1,py] = u(1,1:ny,k0)
         call sync_all()
#endif
NOTE: NO halo communication for CAF_overlap
High-level source code (cont)
call Laplace (u, k0, k1, dumax, nx, ny, alpha)
      . . .
      subroutine Laplace (u, k0, k1, dumax, nx, ny, alpha)
      integer(4) :: k0, k1, nx, ny, i, j
      real(8) :: du, alpha
#ifndef CAF_overlap
      real(8), dimension(0:nx+1, 0:ny+1, 0:1) :: u      ! MPI and CAF use the same
      real(8) :: dumax                                   ! subroutine declarations
#endif
#ifdef CAF_overlap
      integer(4) :: PPX, PPY, px, py, master, pxmaster, pymaster, mype, numpes
      common /CAFstuff/ PPX, PPY, px, py, master, pxmaster, pymaster, mype, numpes
      real(8), dimension(0:nx+1, 0:ny+1, 0:1)[0:PPX-1,0:*] :: u
      real(8), dimension(0:*)[0:PPX-1, 0:*] :: dumax
#endif
High-level source code (cont)
For MPI, CAF, no communication in Laplace
#ifndef CAF_overlap
!...do five-point iterative update on u(i,j)
      dumax = 0.d0
!dir$ concurrent
      do j = 1, ny
!dir$ concurrent
         do i = 1, nx
            du = u(i-1,j,k0) + u(i+1,j,k0) + u(i,j-1,k0) + u(i,j+1,k0) - 4.d0*u(i,j,k0)
            u(i,j,k1) = u(i,j,k0) + 0.25d0*alpha*du
            if(dabs(du) >= dumax) dumax = dabs(du)
         enddo ! i
      enddo ! j
      return
#endif
High-level source code (cont)
#ifdef CAF_overlap
!...peel off surface layers of the doubly nested loop to overlap communication with computation
!...do five-point iterative update on u(i,j)
      dumax(mype) = 0.d0

!...North + South layers
!dir$ concurrent
      do i = 1, nx
         j = 1
         du = u(i-1,j,k0) + u(i+1,j,k0) + u(i,j-1,k0) + u(i,j+1,k0) - 4.d0*u(i,j,k0)
         u(i,j,k1) = u(i,j,k0) + 0.25d0*alpha*du               ! u(i,j,k1) in vector register
         if(py > 0) u(i,ny+1,k1)[px, py-1] = u(i,1,k1)         ! Vstore the register again
         if(dabs(du) >= dumax(mype)) dumax(mype) = dabs(du)
         j = ny
         du = u(i-1,j,k0) + u(i+1,j,k0) + u(i,j-1,k0) + u(i,j+1,k0) - 4.d0*u(i,j,k0)
         u(i,j,k1) = u(i,j,k0) + 0.25d0*alpha*du
         if(py < PPY-1) u(i,0,k1)[px, py+1] = u(i,ny,k1)
         if(dabs(du) >= dumax(mype)) dumax(mype) = dabs(du)
      enddo ! i
NOTE: PUT-based CAF_overlap more efficient than GET-based if #stores < #loads
High-level source code (cont)

!...East + West layers
!dir$ concurrent
      do j = 1, ny
         i = 1
         du = u(i-1,j,k0) + u(i+1,j,k0) + u(i,j-1,k0) + u(i,j+1,k0) - 4.d0*u(i,j,k0)
         u(i,j,k1) = u(i,j,k0) + 0.25d0*alpha*du
         if(px > 0) u(nx+1,j,k1)[px-1, py] = u(1,j,k1)
         if(dabs(du) >= dumax(mype)) dumax(mype) = dabs(du)
         i = nx
         du = u(i-1,j,k0) + u(i+1,j,k0) + u(i,j-1,k0) + u(i,j+1,k0) - 4.d0*u(i,j,k0)
         u(i,j,k1) = u(i,j,k0) + 0.25d0*alpha*du
         if(px < PPX-1) u(0,j,k1)[px+1, py] = u(nx,j,k1)
         if(dabs(du) >= dumax(mype)) dumax(mype) = dabs(du)
      enddo ! j

!...interior
!dir$ concurrent
      do j = 2, ny-1
!dir$ concurrent
         do i = 2, nx-1
            du = u(i-1,j,k0) + u(i+1,j,k0) + u(i,j-1,k0) + u(i,j+1,k0) - 4.d0*u(i,j,k0)
            u(i,j,k1) = u(i,j,k0) + 0.25d0*alpha*du
            if(dabs(du) >= dumax(mype)) dumax(mype) = dabs(du)   ! dumax(mype) in scalar register
      enddo ; enddo
      dumax(mype)[pxmaster,pymaster] = dumax(mype)               ! PUT dumax to pe = master
#endif
High-level source code (cont)
Finish iteration loop with communication for global convergence test
#ifdef MPI
      call MPI_ALLREDUCE(dumax, global_dumax, 1, MPI_DOUBLE_PRECISION, &
                         MPI_MAX, MPI_COMM_WORLD, ierror)
      if(ierror .ne. 0) go to 999
#endif
#ifdef CAF
      call sync_all()                    ! ensure all PEs have computed their local dumax
      if(mype == master) then
         do ipy = 0,PPY-1 ; do ipx = 0,PPX-1
            tmp = dumax[ipx,ipy]         ! master does remote GETs of dumax with a vector load
            dumax = MAX(dumax, tmp)
         enddo ; enddo
         do ipy = 0,PPY-1 ; do ipx = 0,PPX-1
            dumax[ipx,ipy] = dumax
         enddo ; enddo
      endif ! mype == master
      ierror = 0
      call sync_all()                    ! ensure all PEs have received global_dumax
      global_dumax = dumax
#endif
High-level source code (cont)
#ifdef CAF_overlap
      call sync_all()                    ! ensure all PEs have computed their local dumax
      if(mype == master) then            ! dumax array was filled in routine Laplace
         do i = 0, numpes-1
            dumax(master) = MAX( dumax(master), dumax(i) )      ! read dumax(i) from local memory
         enddo ! i
         do ipy = 0,PPY-1 ; do ipx = 0,PPX-1
            dest_pe = ipx + ipy*PPX
            dumax(dest_pe)[ipx,ipy] = dumax(master)             ! broadcast global_dumax via remote vector stores
         enddo ; enddo
      endif ! mype == master
      ierror = 0
      call sync_all()                    ! ensure all PEs have received global_dumax
      global_dumax = dumax(mype)
#endif
      k0 = k1                            ! interchange old and new iteration copies
      if(global_dumax < epsilon) go to 1000
      enddo ! iter
1000 continue
Weak Scaling Results on X1E
Scaling results in units of GFLOPS/MSP (percentages in parentheses)

                     n = 100                        n = 200                        n = 400
               P=4       P=16      P=64       P=4       P=16      P=64       P=4       P=16      P=64
MPI           .948 (22)  .432 (12) .101 (2.6) 2.72 (42) 1.36 (24) .697 (11)  3.17 (73) 2.09 (60) .986 (27)
CAF           1.62 (75)  1.31 (63) 1.12 (53)  3.60 (85) 3.13 (75) 2.81 (68)  3.59 (92) 2.72 (87) 2.41 (79)
CAF_overlap   2.59 (79)  2.06 (69) 1.71 (57)  5.23 (88) 4.60 (83) 3.99 (68)  3.61 (92) 2.82 (92) 2.67 (87)

Strong scaling: n*sqrt(P) = const (the table diagonal, e.g. P=4/n=400, P=16/n=200, P=64/n=100)
Conclusions

In terms of performance of the 2-D Laplace example on the X1E:
• For small surface-to-volume (P = 4, n = 400), MPI, CAF, and CAF_overlap are within 13% of one another
• For weak scaling, CAF_overlap > CAF > MPI in all cases
• In the strong scaling limit (P = 64 and n = 100), CAF_overlap = 1.5*CAF and CAF = 11*MPI

Baker will have hardware support for efficient use of CAF/UPC/SHMEM (and excellent support for MPI)

Moreover, users can program for even better strong-scaling performance on Baker by using CAF/UPC to do fine-grain overlapping of communication and computation, as illustrated here
References
ISO/IEC JTC1/SC22/WG5 N1762, "Coarrays in the next Fortran Standard", John Reid, JKR Associates, UK, December 8, 2008.