Delivering High Performance to Parallel Applications Using Advanced Scheduling

Nikolaos Drosinos, Georgios Goumas, Maria Athanasaki and Nectarios Koziris
National Technical University of Athens
Computing Systems Laboratory
{ndros,goumas,maria,nkoziris}@cslab.ece.ntua.gr
www.cslab.ece.ntua.gr
Overview

Introduction
Background
Code Generation
• Computation/Data Distribution
• Communication Schemes
• Summary
Experimental Results
Conclusions – Future Work
Introduction

Motivation:
• A lot of theoretical work has been done on arbitrary tiling, but there are no actual experimental results!
• There is no complete method to generate code for non-rectangular tiles
Introduction

Contribution:
• Complete end-to-end SPMD code generation method for arbitrarily tiled iteration spaces
• Simulation of blocking and non-blocking communication primitives
• Experimental evaluation of the proposed scheduling scheme
Overview

Introduction
Background
Code Generation
• Computation/Data Distribution
• Communication Schemes
• Summary
Experimental Results
Conclusions – Future Work
Background

Algorithmic Model:

FOR j1 = min1 TO max1 DO
 …
 FOR jn = minn TO maxn DO
  Computation(j1,…,jn);
 ENDFOR
 …
ENDFOR

• Perfectly nested loops
• Constant flow data dependencies (D)
Background

Tiling:
• Popular loop transformation
• Groups iterations into atomic units
• Enhances locality in uniprocessors
• Enables coarse-grain parallelism in distributed memory systems
• Valid tiling matrix H: $HD \geq 0$
Tiling Transformation

Example:

FOR j1 = 0 TO 11 DO
 FOR j2 = 0 TO 8 DO
  A[j1,j2] := A[j1-1,j2] + A[j1-1,j2-1];
 ENDFOR
ENDFOR

Dependence matrix: $D = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$ (dependence vectors (1,0) and (1,1), one per right-hand-side reference)
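For concreteness, the example loop nest might be written in plain C as below. This is only a sketch: the loops start at 1 so that the j1-1 / j2-1 references stay inside the array, and the boundary row and column are assumed initialized elsewhere.

#include <stdio.h>

int main(void)
{
    static double A[12][9];   /* row 0 and column 0 hold boundary values */

    for (int j1 = 1; j1 < 12; j1++)
        for (int j2 = 1; j2 < 9; j2++)
            /* reads A[j1-1][j2] and A[j1-1][j2-1]:
               dependence vectors (1,0) and (1,1)    */
            A[j1][j2] = A[j1 - 1][j2] + A[j1 - 1][j2 - 1];

    printf("A[11][8] = %g\n", A[11][8]);
    return 0;
}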
Rectangular Tiling Transformation

[Figure: the (j1, j2) iteration space partitioned into rectangular 3x3 tiles]

$P = \begin{pmatrix} 3 & 0 \\ 0 & 3 \end{pmatrix}, \quad H = P^{-1} = \begin{pmatrix} 1/3 & 0 \\ 0 & 1/3 \end{pmatrix}$
Non-rectangular Tiling Transformation

[Figure: the same iteration space partitioned into skewed, non-rectangular tiles]

$P = \begin{pmatrix} 3 & 3 \\ 0 & 3 \end{pmatrix}, \quad H = P^{-1} = \begin{pmatrix} 1/3 & -1/3 \\ 0 & 1/3 \end{pmatrix}$
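A small standalone C sketch (not from the slides) ties the pieces together: it locates the tile containing an iteration point via floor(H*j) and checks the validity condition HD >= 0 against the dependence matrix of the example loop nest; the sample point (7, 5) is arbitrary.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double H[2][2] = { { 1.0 / 3, -1.0 / 3 },   /* non-rectangular example */
                       { 0.0,      1.0 / 3 } };
    double D[2][2] = { { 1, 1 },                /* dependence columns (1,0) */
                       { 0, 1 } };              /* and (1,1)                */

    /* tile coordinates of the iteration point j = (7, 5): floor(H*j) */
    int j[2] = { 7, 5 }, tile[2];
    for (int r = 0; r < 2; r++)
        tile[r] = (int)floor(H[r][0] * j[0] + H[r][1] * j[1]);
    printf("point (7,5) lies in tile (%d,%d)\n", tile[0], tile[1]);

    /* the tiling is valid iff every entry of H*D is non-negative */
    int valid = 1;
    for (int r = 0; r < 2; r++)
        for (int c = 0; c < 2; c++)
            if (H[r][0] * D[0][c] + H[r][1] * D[1][c] < 0.0)
                valid = 0;
    printf("HD >= 0: %s\n", valid ? "yes, valid tiling" : "no");
    return 0;
}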
Why Non-rectangular Tiling?

• Reduces communication: 8 communication points per tile with rectangular tiles vs. 6 with non-rectangular tiles in the example
• Enables more efficient scheduling schemes: 6 time steps vs. 5 time steps

[Figure: rectangular and non-rectangular tilings of the same space, with communication points and time steps annotated]
Overview

Introduction
Background
Code Generation
• Computation/Data Distribution
• Communication Schemes
• Summary
Experimental Results
Conclusions – Future Work
Computation Distribution

We map tiles along the longest dimension to the same processor (sketched below) because:
• It reduces the number of processors required
• It simplifies message-passing
• It reduces total execution time when computation is overlapped with communication
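In concrete terms the rule is simple: the tile coordinate along the longest dimension is traversed in time, and the remaining coordinates select the owning processor. A hedged C sketch (names are illustrative, not from the slides):

/* processor id of tile (t_1, ..., t_n): the first n-1 coordinates are
   linearized into a processor id; t_n is executed sequentially in time */
int tile_owner(const int *tile, int n, const int *procs_per_dim)
{
    int pid = 0;
    for (int d = 0; d < n - 1; d++)
        pid = pid * procs_per_dim[d] + tile[d];
    return pid;
}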
Computation DistributionComputation Distribution
P3
P2
P1
j1
j2
Data Distribution

• Computer-owns rule: each processor owns the data it computes
• Arbitrary convex iteration space, arbitrary tiling
• Rectangular local iteration and data spaces (sketched below)
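A minimal sketch of what the rule implies for memory: each processor allocates a rectangular local data space (LDS) covering only its own tiles, rather than the full global array. Sizes and names here are hypothetical.

#include <stdlib.h>

/* one rectangular block per owned tile, laid out consecutively
   along the mapping (time) dimension */
double *allocate_lds(size_t tile_height, size_t tile_width, size_t tiles_owned)
{
    return calloc(tile_height * tile_width * tiles_owned, sizeof(double));
}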
Data Distribution

[Figure: the Tile Iteration Space (TIS) in (j1, j2) coordinates is mapped to the rectangular Transformed Tile Iteration Space (TTIS) in (j'1, j'2) coordinates via H', and back via P']
Data Distribution

[Figure: the map()/map^-1() functions translate between the TTIS coordinates (j'1, j'2) and the Local Data Space (LDS) coordinates (j''1, j''2); the tiles of successive time steps t = 0, 1, 2, 3 are laid out along the mapping dimension with offsets off1, off2, and the LDS is divided into computation storage, communication storage and unused space]
Data Distribution

[Figure: the loc()/loc^-1() functions translate between the global index space J^n (and the Data Space DS) and the per-processor local data spaces LDS(pid_x), LDS(pid_y); fw() relates corresponding points across processors; w1, w2 denote the tile widths]
Overview

Introduction
Background
Code Generation
• Computation/Data Distribution
• Communication Schemes
• Summary
Experimental Results
Conclusions – Future Work
Communication Schemes

With whom do I communicate?

[Figure: communicating processors P1, P2, P3 in the (j1, j2) tile space]
Communication Schemes

With whom do I communicate?

[Figure: a second example of the communicating processor pairs]
Communication Schemes

What do I send?

[Figure: communication data of a tile in the Tile Iteration Space (TIS) and in the Transformed Tile Iteration Space (TTIS)]

TIS: $D = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}$   TTIS: $D = \begin{pmatrix} 5 & 0 \\ 0 & 5 \end{pmatrix}$
Blocking Scheme

[Figure: blocking schedule of processors P1, P2, P3 over the (j1, j2) tile space: 12 time steps]
Non-blocking Scheme

[Figure: non-blocking (overlapping) schedule of processors P1, P2, P3 over the same tile space: 6 time steps]
Overview

Introduction
Background
Code Generation
• Computation/Data Distribution
• Communication Schemes
• Summary
Experimental Results
Conclusions – Future Work
Code Generation Summary

Advanced Scheduling = Suitable Tiling + Non-blocking Communication Scheme

Sequential Code → [Dependence Analysis, Tiling Transformation] → Sequential Tiled Code → [Parallelization] → Parallel SPMD Code

Parallelization comprises:
• Computation Distribution
• Data Distribution
• Communication Primitives
Code Summary – Blocking Scheme

FORACROSS pid1 = min1^S TO max1^S DO
 …
 FORACROSS pidn-1 = min(n-1)^S TO max(n-1)^S DO
  /* Sequential execution of tiles */
  FOR t^S = minn^S TO maxn^S DO
   /* Receive data from neighbouring tiles */
   BLOCKING_RECV(pid^S, t^S, D, CC);
   /* Traverse the interior of the tile */
   FOR j1 = l1 TO u1 STEP c1 DO
    …
    FOR jn = ln TO un STEP cn DO
     /* Perform computations on LDS */
     t := t^S - ln^S;
     LA[map(j, t)] := F(LA[map(j - d1, t)], …, LA[map(j - dq, t)]);
    ENDFOR
    …
   ENDFOR
   /* Send data to neighbouring processors */
   BLOCKING_SEND(pid^S, t^S, D^m, CC);
  ENDFOR
 ENDFORACROSS
 …
ENDFORACROSS
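As a concrete illustration, the blocking scheme for one processor's row of tiles might look as follows in C with MPI. This is a hedged sketch, not the generated code of the method: compute_tile() and pack_boundary() are hypothetical helpers, one neighbour per side is assumed (prev/next can be MPI_PROC_NULL at the ends), and boundary data is flattened into contiguous buffers.

#include <mpi.h>

void compute_tile(double *local_array, const double *halo, int t);  /* hypothetical */
void pack_boundary(const double *local_array, double *buf, int t);  /* hypothetical */

/* receive, compute, send, strictly in that order: communication and
   computation never overlap, mirroring BLOCKING_RECV/BLOCKING_SEND */
void run_blocking(double *local_array, int num_tiles, int prev, int next,
                  int boundary_len, double *recv_buf, double *send_buf)
{
    for (int t = 0; t < num_tiles; t++) {
        /* receive data from the neighbouring processor for tile t */
        MPI_Recv(recv_buf, boundary_len, MPI_DOUBLE, prev, t,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* traverse the interior of tile t */
        compute_tile(local_array, recv_buf, t);

        /* forward the freshly computed boundary to the next processor */
        pack_boundary(local_array, send_buf, t);
        MPI_Send(send_buf, boundary_len, MPI_DOUBLE, next, t, MPI_COMM_WORLD);
    }
}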
Code Summary – Non-blocking Scheme

FORACROSS pid1 = min1^S TO max1^S DO
 …
 FORACROSS pidn-1 = min(n-1)^S TO max(n-1)^S DO
  /* Sequential execution of tiles */
  FOR t^S = minn^S TO maxn^S DO
   /* Receive data for next tile */
   NON_BLOCKING_RECV(pid^S, t^S + 1, D, CC);
   /* Send data computed at previous tile */
   NON_BLOCKING_SEND(pid^S, t^S - 1, D^m, CC);
   /* Compute current tile */
   FOR j1 = l1 TO u1 STEP c1 DO
    …
    FOR jn = ln TO un STEP cn DO
     /* Perform computations on LDS */
     t := t^S - ln^S;
     LA[map(j, t)] := F(LA[map(j - d1, t)], …, LA[map(j - dq, t)]);
    ENDFOR
    …
   ENDFOR
   /* Wait for communication completion */
   WAIT_ALL;
  ENDFOR
 ENDFORACROSS
 …
ENDFORACROSS
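The corresponding non-blocking variant, under the same hypothetical helpers, prefetches the halo of tile t+1 and ships the boundary of tile t-1 while tile t is being computed; MPI_Waitall() plays the role of WAIT_ALL. Halo buffers are double-buffered in this sketch so the prefetch never overwrites data the current tile is still reading.

#include <mpi.h>

void compute_tile(double *local_array, const double *halo, int t);  /* hypothetical */
void pack_boundary(const double *local_array, double *buf, int t);  /* hypothetical */

void run_non_blocking(double *local_array, int num_tiles, int prev, int next,
                      int boundary_len, double *halo_cur, double *halo_next,
                      double *send_buf)
{
    for (int t = 0; t < num_tiles; t++) {
        MPI_Request reqs[2];
        int nreq = 0;

        if (t + 1 < num_tiles)   /* receive data for the *next* tile */
            MPI_Irecv(halo_next, boundary_len, MPI_DOUBLE, prev, t + 1,
                      MPI_COMM_WORLD, &reqs[nreq++]);

        if (t > 0) {             /* send data computed at the *previous* tile */
            pack_boundary(local_array, send_buf, t - 1);
            MPI_Isend(send_buf, boundary_len, MPI_DOUBLE, next, t - 1,
                      MPI_COMM_WORLD, &reqs[nreq++]);
        }

        /* compute tile t while both transfers are in flight;
           halo_cur was filled in the previous iteration (or initially) */
        compute_tile(local_array, halo_cur, t);

        /* wait for communication completion before reusing the buffers */
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);

        double *tmp = halo_cur; halo_cur = halo_next; halo_next = tmp;
    }
}

Whether the transfers actually overlap computation depends on the MPI implementation and the interconnect, which is why the slides speak of simulating the communication schemes.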
Overview

Introduction
Background
Code Generation
• Computation/Data Distribution
• Communication Schemes
• Summary
Experimental Results
Conclusions – Future Work
Experimental Results

• 8-node SMP Linux Cluster (800 MHz PIII, 128 MB RAM, kernel 2.4.20)
• MPICH v.1.2.5 (--with-device=p4, --with-comm=shared)
• g++ compiler v.2.95.4 (-O3)
• FastEthernet interconnection
• 2 micro-kernel benchmarks (3D):
 • Gauss Successive Over-Relaxation (SOR)
 • Texture Smoothing Code (TSC)
• Simulation of communication schemes
SOR

Iteration space: M x N x N

Dependence matrix:
$D = \begin{pmatrix} 1 & 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 & 0 \end{pmatrix}$

Rectangular Tiling:
$P_r = \begin{pmatrix} x & 0 & 0 \\ 0 & y & 0 \\ 0 & 0 & z \end{pmatrix}$

Non-rectangular Tiling:
$P_{nr} = \begin{pmatrix} x & 0 & 0 \\ 0 & y & 0 \\ -x & 0 & z \end{pmatrix}$
SOR

[Figures: SOR experimental results]
TSC

Iteration space: T x N x N

Dependence matrix:
$D = \begin{pmatrix} 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 & 1 & 1 & 0 & 1 \end{pmatrix}$

Rectangular Tiling:
$P_r = \begin{pmatrix} x & 0 & 0 \\ 0 & y & 0 \\ 0 & 0 & z \end{pmatrix}$

Non-rectangular Tiling:
$P_{nr} = \begin{pmatrix} x & 0 & 0 \\ -x & y & 0 \\ -x & -y & z \end{pmatrix}$
TSC

[Figures: TSC experimental results]
Overview

Introduction
Background
Code Generation
• Computation/Data Distribution
• Communication Schemes
• Summary
Experimental Results
Conclusions – Future Work
Conclusions

• Automatic code generation for arbitrarily tiled iteration spaces can be efficient
• High performance can be achieved by means of:
 • a suitable tiling transformation
 • overlapping computation with communication
Future Work

• Application of the methodology to imperfectly nested loops and non-constant dependencies
• Investigation of hybrid programming models (MPI+OpenMP)
• Performance evaluation on advanced interconnection networks (SCI, Myrinet)
Questions?
http://www.cslab.ece.ntua.gr/~ndros