Matrix Transpose Results with Hybrid OpenMP / MPI
O. Haan
Gesellschaft für wissenschaftliche Datenverarbeitung, Göttingen, Germany
(GWDG)
SCICOMP 2000, SDSC, La Jolla
Overview
• Hybrid Programming Model
• Distributed Matrix Transpose
• Performance Measurements
• Summary of Results
Architecture of Scalable Parallel Computers
Two level hierarchy
• cluster of SMP nodes : distributed memory, high speed interconnect
• SMP nodes with multiple processors : shared memory, bus or switch connected
Programming Models
• message passing over all processors
  MPI implementation for shared memory, multiple access to switch adapters
  SP : 4-way Winterhawk2 + , 8-way Nighthawk -
• shared memory over all processors
  virtual global address space
  SP : -
• hybrid message passing - shared memory
  message passing between nodes, shared memory within nodes
  SP : +
Hybrid Programming Model
• SPMD program with MPI tasks
• OpenMP threads within each task
• communication between MPI tasks
Example of Hybrid Program

      program hybrid_example
      include 'mpif.h'
      integer OMP_GET_NUM_PROCS, OMP_GET_THREAD_NUM
!     declarations added ; at most 64 threads per task assumed
      integer com, ierr, nk, my_task, kp, my_thread, i
      real thread_res( 0:63 ), node_res, glob_res
      call MPI_INIT( ierr )
      com = MPI_COMM_WORLD
      call MPI_COMM_SIZE( com, nk, ierr )
      call MPI_COMM_RANK( com, my_task, ierr )
      kp = OMP_GET_NUM_PROCS()
!$OMP PARALLEL PRIVATE(my_thread)
      my_thread = OMP_GET_THREAD_NUM()
      call work( my_thread, kp, my_task, nk, thread_res )
!$OMP END PARALLEL
      node_res = 0.0
      do i = 0, kp-1
        node_res = node_res + thread_res( i )
      end do
      call MPI_REDUCE( node_res, glob_res, 1,
     :                 MPI_REAL, MPI_SUM, 0, com, ierr )
      call MPI_FINALIZE( ierr )
      stop
      end
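In practice such a program is compiled with a thread-safe MPI compiler driver (on the SP, for instance, mpxlf_r with -qsmp=omp) and started with one MPI task per node, the thread count per task being set with OMP_NUM_THREADS; the exact invocation depends on the installation.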
Hybrid Programming vs. Pure Message Passing
+ • works on all SP configurations
  • coarser internode communication granularity
  • faster intranode communication

- • larger programming effort
  • additional synchronization steps
  • reduced reuse of cached data
the net score depends on the problem
Distributed Matrix Transpose
3-step Transpose
n1 x n2 matrix A( i1 , i2 ) --> n2 x n1 matrix B( i2 , i1 )
decompose n1, n2 into local and global parts : n1 = n1l * np , n2 = n2l * np
write matrices A, B as 4-dim arrays : A( i1l , i1g , i2l ; i2g ) , B( i2l , i2g , i1l ; i1g )
step 1 : local reorder A( i1l , i1g , i2l ; i2g ) -> a1( i1l , i2l , i1g ; i2g )
step 2 : global reorder a1( i1l , i2l , i1g ; i2g ) -> a2( i1l , i2l , i2g ; i1g )
step 3 : local transpose a2( i1l , i2l , i2g ; i1g ) -> B( i2l , i2g , i1l ; i1g )
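A minimal Fortran sketch of the three steps, with array and index names following the slide; as an assumption, the global reorder is written here as a single MPI_ALLTOALL, whereas the talk times an explicit step-wise schedule (sketched after the Global Reorder slide):

      subroutine transpose_3step( A, B, a1, a2, n1l, n2l, np, com )
      include 'mpif.h'
      integer n1l, n2l, np, com, ierr, i1l, i2l, ig
      real A( n1l, np, n2l ), B( n2l, np, n1l )
      real a1( n1l, n2l, np ), a2( n1l, n2l, np )
!     step 1 : local reorder A( i1l , i1g , i2l ) -> a1( i1l , i2l , i1g )
      do ig = 1, np
        do i2l = 1, n2l
          do i1l = 1, n1l
            a1( i1l, i2l, ig ) = A( i1l, ig, i2l )
          end do
        end do
      end do
!     step 2 : global reorder, one n1l x n2l block to / from every task
      call MPI_ALLTOALL( a1, n1l*n2l, MPI_REAL,
     :                   a2, n1l*n2l, MPI_REAL, com, ierr )
!     step 3 : local transpose a2( i1l , i2l , i2g ) -> B( i2l , i2g , i1l )
      do ig = 1, np
        do i2l = 1, n2l
          do i1l = 1, n1l
            B( i2l, ig, i1l ) = a2( i1l, i2l, ig )
          end do
        end do
      end do
      end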
Local Steps: Copy with Reorder
• data in memory : speed limited by the performance of the bus and memory subsystem
  Winterhawk2 : all processors share the same bus, bandwidth : 1.6 GB/s
• data in cache : speed limited by processor performance
  Winterhawk2 : one load plus one store per cycle, i.e. one 8-byte word copied per cycle :
  8 B * 375 MHz = 3 GB/s
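A hedged sketch of how such copy rates can be measured (the array size n and repetition count nit are illustrative, not from the talk; the rate counts 8 bytes per copied word, the talk's convention; n large selects the memory regime, n small the cache regime):

      program copy_rate
      include 'mpif.h'
      parameter ( n = 1000000 , nit = 10 )
      real*8 a( n ), b( n ), t0, rate
      call MPI_INIT( ierr )
      do i = 1, n
        a( i ) = i
      end do
      t0 = MPI_WTIME()
      do it = 1, nit
        do i = 1, n
          b( i ) = a( i )
        end do
      end do
      rate = 8.d0 * n * nit / ( MPI_WTIME() - t0 )
!     printing b(n) keeps the copy loop from being optimized away
      print *, b( n ), ' copy rate : ', rate * 1.d-6, ' MB/s'
      call MPI_FINALIZE( ierr )
      end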
Copy: Data in Memory
Copy : Prefetch
Copy : Data in Cache
Global Reorder
a1( * , * , i1g ; i2g ) -> a2( * , * , i2g ; i1g )
global reorder on np processors in np steps
[diagram : block exchange among processors p0 , p1 , p2 in steps 0 , 1 , 2]
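A sketch of the step-wise exchange with MPI_SENDRECV, using the names my_task, com, np, n1l, n2l, a1, a2 from the slides above; the pairing send-to ( my_task + is ) mod np / receive-from ( my_task - is ) mod np is an assumed schedule that keeps all links busy in every step, the talk's actual schedule may differ:

!     global reorder a1 -> a2 in np steps ( assumed schedule )
      integer st( MPI_STATUS_SIZE )
      do is = 0, np-1
        ip = mod( my_task + is, np )
        iq = mod( my_task - is + np, np )
        if ( is .eq. 0 ) then
!         own block needs no communication
          a2( :, :, my_task+1 ) = a1( :, :, my_task+1 )
        else
          call MPI_SENDRECV(
     :      a1( 1, 1, ip+1 ), n1l*n2l, MPI_REAL, ip, 0,
     :      a2( 1, 1, iq+1 ), n1l*n2l, MPI_REAL, iq, 0,
     :      com, st, ierr )
        end if
      end do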
Performance Modelling
Hardware model : nk nodes with kp procs each
  np = nk * kp is the total number of procs
Switch model : nk concurrent links between nodes
  latency tlat , bandwidth c
Execution model for Hybrid : reorder on nk nodes :
  nk steps with n1*n2 / nk**2 data per node
Execution model for MPI : reorder on np processors :
  np steps with n1*n2 / np**2 data per node
  switch links shared between kp procs
Performance Modelling
Hybrid timing model :

  t_smp = nk * ( tlat + n1*n2 / ( c * nk**2 ) )
        = nk * tlat + n1*n2 / ( c * nk )

MPI timing model :

  t_mpi = np * ( tlat + kp * n1*n2 / ( c * np**2 ) )
        = nk * kp * tlat + n1*n2 / ( c * nk )
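For completeness, the two models as Fortran functions (a direct transcription of the formulas above; tlat in seconds, c in words per second):

      real function t_smp( n1, n2, nk, tlat, c )
      integer n1, n2, nk
      real tlat, c
      t_smp = nk * tlat + real( n1 ) * real( n2 ) / ( c * nk )
      end

      real function t_mpi( n1, n2, nk, kp, tlat, c )
      integer n1, n2, nk, kp
      real tlat, c
      t_mpi = nk * kp * tlat + real( n1 ) * real( n2 ) / ( c * nk )
      end

Both models share the bandwidth term n1*n2 / ( c * nk ) ; the MPI model pays kp times more latency steps and, since kp processors share each link, in practice also a reduced effective c.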
Timing of Global Reorder (internode part)
Timing of Global Reorder (internode part)
Timing of Global Reorder
Timing of Transpose
Scaling of Transpose
Timing of Transpose Steps
Summary of Results: Hardware
• Memory access in Winterhawk2 is not adequate :
  copy rate of 400 MB/s = 50 Mword/s against a peak CPU rate of 6000 Mflop/s,
  a factor of 100 between computational speed and memory speed
• Sharing of a switch link by 4 processors degrades communication speed :
  bandwidth smaller by more than a factor of 4 ( factor of 4 expected )
  latency larger by nearly a factor of 4 ( factor of 1 expected )
Summary of Results: Hybrid vs. MPI
• hybrid OpenMP / MPI programming is profitable for the distributed matrix transpose :
  1000 x 1000 matrix on 16 nodes : 2.3 times faster
  10000 x 10000 matrix on 16 nodes : 1.1 times faster
• competing influences :
  MPI programming enhances the use of cached data
  hybrid programming has lower communication latency and coarser communication granularity
Summary of Results: Use of Transpose in FFT
2-dim complex array of size n = n1 * n2

Execution time on nk nodes :

  t( nk ) = 5 * n * log2( n ) / ( nk * r ) + 2 * n / ( nk * c )

where r : computational speed per node
      c : transpose speed per node

effective execution speed per node :

  r_eff = r / ( 1 + 2 * r / ( 5 * c * log2( n ) ) )
Summary of Results: Use of Transpose in FFT - Example SP
r = 4 * 200 Mflop/s = 800 Mflop/s
c depends on n, nk and the programming model

nk = 16                n = 10**6    n = 10**9
c ( hybrid )              5.6          7.8     Mword/s
c ( MPI )                 2.5          7.0     Mword/s

effective execution speed per node :

r_eff ( hybrid )          208          338     Mflop/s
r_eff ( MPI )             108          317     Mflop/s
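A check of the table with the formula from the previous slide (r in Mflop/s, c in Mword/s, so r_eff comes out in Mflop/s); it reproduces the tabulated values to within rounding:

      real function r_eff( r, c, n )
      real r, c, n, ld_n
!     log2( n ) via natural logarithms
      ld_n = log( n ) / log( 2. )
      r_eff = r / ( 1. + 2. * r / ( 5. * c * ld_n ) )
      end

!     r_eff( 800., 5.6, 1.e6 ) = 207   ( table : 208 )
!     r_eff( 800., 2.5, 1.e6 ) = 108   ( table : 108 )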