Matrix Transpose Results with Hybrid OpenMP / MPI
O. Haan
Gesellschaft für wissenschaftliche Datenverarbeitung, Göttingen, Germany
(GWDG)
SCICOMP 2000, SDSC, La Jolla
Overview
• Hybrid Programming Model
• Distributed Matrix Transpose
• Performance Measurements
• Summary of Results
Architecture of Scalable Parallel Computers
Two level hierarchy
• cluster of SMP nodes : distributed memory, high speed interconnect
• SMP nodes with multiple processors : shared memory, bus or switch connected
Programming Models
• message passing over all processors
  MPI implementation for shared memory, multiple access to switch adapters
  SP : 4-way Winterhawk2 + , 8-way Nighthawk -
• shared memory over all processors
  virtual global address space
  SP : -
• hybrid message passing - shared memory
  message passing between nodes, shared memory within nodes
  SP : +
Hybrid Programming Model
• SPMD program with MPI tasks
• OpenMP threads within each task
• communication between MPI tasks
Example of Hybrid Program

      program hybrid_example
      include 'mpif.h'
      integer OMP_GET_NUM_PROCS, OMP_GET_THREAD_NUM
!     declarations added ; at most 64 threads per task assumed
      integer com, ierr, nk, my_task, kp, my_thread, i
      real thread_res( 0:63 ), node_res, glob_res
      call MPI_INIT( ierr )
      com = MPI_COMM_WORLD
      call MPI_COMM_SIZE( com, nk, ierr )
      call MPI_COMM_RANK( com, my_task, ierr )
      kp = OMP_GET_NUM_PROCS()
!$OMP PARALLEL PRIVATE(my_thread)
      my_thread = OMP_GET_THREAD_NUM()
      call work( my_thread, kp, my_task, nk, thread_res )
!$OMP END PARALLEL
      node_res = 0.0
      do i = 0, kp-1
        node_res = node_res + thread_res( i )
      end do
      call MPI_REDUCE( node_res, glob_res, 1,
     :                 MPI_REAL, MPI_SUM, 0, com, ierr )
      call MPI_FINALIZE( ierr )
      stop
      end
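In practice such a program is compiled with a thread-safe MPI compiler driver (on the SP, for instance, mpxlf_r with -qsmp=omp) and started with one MPI task per node, the thread count per task being set with OMP_NUM_THREADS; the exact invocation depends on the installation.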
Hybrid Programming vs. Pure Message Passing
+ • works on all SP configurations
  • coarser internode communication granularity
  • faster intranode communication

- • larger programming effort
  • additional synchronization steps
  • reduced reuse of cached data
the net score depends on the problem
Distributed Matrix Transpose
3-step Transpose
n1 x n2 matrix A( i1 , i2 ) --> n2 x n1 matrix B( i2 , i1 )
decompose n1, n2 into local and global parts : n1 = n1l * np , n2 = n2l * np
write matrices A, B as 4-dim arrays : A( i1l , i1g , i2l ; i2g ) , B( i2l , i2g , i1l ; i1g )
step 1 : local reorder A( i1l , i1g , i2l ; i2g ) -> a1( i1l , i2l , i1g ; i2g )
step 2 : global reorder a1( i1l , i2l , i1g ; i2g ) -> a2( i1l , i2l , i2g ; i1g )
step 3 : local transpose a2( i1l , i2l , i2g ; i1g ) -> B( i2l , i2g , i1l ; i1g )
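A minimal Fortran sketch of the three steps, with array and index names following the slide; as an assumption, the global reorder is written here as a single MPI_ALLTOALL, whereas the talk times an explicit step-wise schedule (sketched after the Global Reorder slide):

      subroutine transpose_3step( A, B, a1, a2, n1l, n2l, np, com )
      include 'mpif.h'
      integer n1l, n2l, np, com, ierr, i1l, i2l, ig
      real A( n1l, np, n2l ), B( n2l, np, n1l )
      real a1( n1l, n2l, np ), a2( n1l, n2l, np )
!     step 1 : local reorder A( i1l , i1g , i2l ) -> a1( i1l , i2l , i1g )
      do ig = 1, np
        do i2l = 1, n2l
          do i1l = 1, n1l
            a1( i1l, i2l, ig ) = A( i1l, ig, i2l )
          end do
        end do
      end do
!     step 2 : global reorder, one n1l x n2l block to / from every task
      call MPI_ALLTOALL( a1, n1l*n2l, MPI_REAL,
     :                   a2, n1l*n2l, MPI_REAL, com, ierr )
!     step 3 : local transpose a2( i1l , i2l , i2g ) -> B( i2l , i2g , i1l )
      do ig = 1, np
        do i2l = 1, n2l
          do i1l = 1, n1l
            B( i2l, ig, i1l ) = a2( i1l, i2l, ig )
          end do
        end do
      end do
      end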
Local Steps: Copy with Reorder
• data in memory : speed limited by the performance of the bus and memory subsystem
  Winterhawk2 : all processors share the same bus, bandwidth : 1.6 GB/s
• data in cache : speed limited by processor performance
  Winterhawk2 : one load plus one store per cycle, i.e. one 8-byte word copied per cycle :
  8 B * 375 MHz = 3 GB/s
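A hedged sketch of how such copy rates can be measured (the array size n and repetition count nit are illustrative, not from the talk; the rate counts 8 bytes per copied word, the talk's convention; n large selects the memory regime, n small the cache regime):

      program copy_rate
      include 'mpif.h'
      parameter ( n = 1000000 , nit = 10 )
      real*8 a( n ), b( n ), t0, rate
      call MPI_INIT( ierr )
      do i = 1, n
        a( i ) = i
      end do
      t0 = MPI_WTIME()
      do it = 1, nit
        do i = 1, n
          b( i ) = a( i )
        end do
      end do
      rate = 8.d0 * n * nit / ( MPI_WTIME() - t0 )
!     printing b(n) keeps the copy loop from being optimized away
      print *, b( n ), ' copy rate : ', rate * 1.d-6, ' MB/s'
      call MPI_FINALIZE( ierr )
      end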
Copy: Data in Memory
Copy : Prefetch
Copy : Data in Cache
Global Reorder
a1( * , * , i1g ; i2g ) -> a2( * , * , i2g ; i1g )
global reorder on np processors in np steps
[diagram : block exchange among processors p0 , p1 , p2 in steps 0 , 1 , 2]
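A sketch of the step-wise exchange with MPI_SENDRECV, using the names my_task, com, np, n1l, n2l, a1, a2 from the slides above; the pairing send-to ( my_task + is ) mod np / receive-from ( my_task - is ) mod np is an assumed schedule that keeps all links busy in every step, the talk's actual schedule may differ:

!     global reorder a1 -> a2 in np steps ( assumed schedule )
      integer st( MPI_STATUS_SIZE )
      do is = 0, np-1
        ip = mod( my_task + is, np )
        iq = mod( my_task - is + np, np )
        if ( is .eq. 0 ) then
!         own block needs no communication
          a2( :, :, my_task+1 ) = a1( :, :, my_task+1 )
        else
          call MPI_SENDRECV(
     :      a1( 1, 1, ip+1 ), n1l*n2l, MPI_REAL, ip, 0,
     :      a2( 1, 1, iq+1 ), n1l*n2l, MPI_REAL, iq, 0,
     :      com, st, ierr )
        end if
      end do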
Performance Modelling
Hardware model : nk nodes with kp procs each
  np = nk * kp is the total number of procs
Switch model : nk concurrent links between nodes
  latency tlat , bandwidth c
Execution model for Hybrid : reorder on nk nodes :
  nk steps with n1*n2 / nk**2 data per node
Execution model for MPI : reorder on np processors :
  np steps with n1*n2 / np**2 data per node
  switch links shared between kp procs
Performance Modelling
Hybrid timing model :

  t_smp = nk * ( tlat + n1*n2 / ( c * nk**2 ) )
        = nk * tlat + n1*n2 / ( c * nk )

MPI timing model :

  t_mpi = np * ( tlat + kp * n1*n2 / ( c * np**2 ) )
        = nk * kp * tlat + n1*n2 / ( c * nk )
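For completeness, the two models as Fortran functions (a direct transcription of the formulas above; tlat in seconds, c in words per second):

      real function t_smp( n1, n2, nk, tlat, c )
      integer n1, n2, nk
      real tlat, c
      t_smp = nk * tlat + real( n1 ) * real( n2 ) / ( c * nk )
      end

      real function t_mpi( n1, n2, nk, kp, tlat, c )
      integer n1, n2, nk, kp
      real tlat, c
      t_mpi = nk * kp * tlat + real( n1 ) * real( n2 ) / ( c * nk )
      end

Both models share the bandwidth term n1*n2 / ( c * nk ) ; the MPI model pays kp times more latency steps and, since kp processors share each link, in practice also a reduced effective c.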
Timing of Global Reorder (internode part)
Timing of Global Reorder (internode part)
Timing of Global Reorder
Timing of Transpose
Scaling of Transpose
Timing of Transpose Steps
Summary of Results: Hardware
• Memory access in Winterhawk2 is not adequate :
  copy rate of 400 MB/s = 50 Mword/s against a peak CPU rate of 6000 Mflop/s,
  a factor of 100 between computational speed and memory speed
• Sharing of a switch link by 4 processors degrades communication speed :
  bandwidth smaller by more than a factor of 4 ( factor of 4 expected )
  latency larger by nearly a factor of 4 ( factor of 1 expected )
Summary of Results: Hybrid vs. MPI
• hybrid OpenMP / MPI programming is profitable for the distributed matrix transpose :
  1000 x 1000 matrix on 16 nodes : 2.3 times faster
  10000 x 10000 matrix on 16 nodes : 1.1 times faster
• competing influences :
  MPI programming enhances the use of cached data
  hybrid programming has lower communication latency and coarser communication granularity
Summary of Results: Use of Transpose in FFT
2-dim complex array of size n = n1 * n2

Execution time on nk nodes :

  t( nk ) = 5 * n * log2( n ) / ( nk * r ) + 2 * n / ( nk * c )

where r : computational speed per node
      c : transpose speed per node

effective execution speed per node :

  r_eff = r / ( 1 + 2 * r / ( 5 * c * log2( n ) ) )
Summary of Results: Use of Transpose in FFT - Example SP
r = 4 * 200 Mflop/s = 800 Mflop/s
c depends on n, nk and the programming model

nk = 16                n = 10**6    n = 10**9
c ( hybrid )              5.6          7.8     Mword/s
c ( MPI )                 2.5          7.0     Mword/s

effective execution speed per node :

r_eff ( hybrid )          208          338     Mflop/s
r_eff ( MPI )             108          317     Mflop/s
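A check of the table with the formula from the previous slide (r in Mflop/s, c in Mword/s, so r_eff comes out in Mflop/s); it reproduces the tabulated values to within rounding:

      real function r_eff( r, c, n )
      real r, c, n, ld_n
!     log2( n ) via natural logarithms
      ld_n = log( n ) / log( 2. )
      r_eff = r / ( 1. + 2. * r / ( 5. * c * ld_n ) )
      end

!     r_eff( 800., 5.6, 1.e6 ) = 207   ( table : 208 )
!     r_eff( 800., 2.5, 1.e6 ) = 108   ( table : 108 )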