Summary of presentation
Theory
– Architecture of the SP node
– Architecture of the SP switch (detailed)
– The software layer
Practice
– Performance measurement using MPI
– The PUB library
– MPI vs. PUB
SP System Overview
Flexibility: DB server, web server, storage server, “number cruncher”
Scalability: up to 512 nodes
Modularity; building blocks:
– SMP nodes
– 16-port switches
Many different building blocks available
Result: a cluster of SMPs
Scalability
8192 processors
November 2002: #4 on the TOP 500 list
POWER3 SMP Thin Node
4 processors, disks & AIX OS
CPU: 64-bit POWER3-II @ 375 MHz
– 2 FP units, 1500 MFLOPS (peak performance)
– L1 cache: 32 KB inst, 64 KB data
L2 cache: 8 MB @ 250 MHz per CPU
– 256-bit width, dedicated bus
Main memory: up to 16 GB @ 62.5 MHz
– Shared among processors; 512-bit bus
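A quick sanity check on the peak figures above (the fused multiply-add assumption, 2 flops per FP unit per cycle, is standard for POWER3-II but not stated on the slide):

    2 \times 2 \times 375\ \text{MHz} = 1500\ \text{MFLOPS}
    64\ \text{B} \times 62.5\ \text{MHz} = 4\ \text{GB/s peak memory bandwidth}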
Thin node: 6xx-MX bus
64 bits wide; runs @ 60 MHz
Independent of the memory buses
Peak bandwidth: 480 MB/s
6xx-MX bus shared by:
– ultra SCSI disks
– 100 Mbps Ethernet
– PCI expansion slots
– SP Switch adapter
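The quoted peak is exactly the bus width times the clock:

    8\ \text{B} \times 60\ \text{MHz} = 480\ \text{MB/s}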
The SP Switch
Scalable, reconfigurable
Redundant, reliable
High bandwidth
Low latency
Split into 2 parts:
– the SP Switch Board
– the SP Switch Adapter (on each node)
SP Switch Port
Synchronous, bidirectional
Phit size: 1 byte
Flit size: 2 bytes
Flow control: credits and tokens
Peak bandwidth: 150 MB/s per direction
SP Switch Basics
Link-level flow control
Buffered wormhole routing
Cut-through switching
Routing strategy is:
– source-based
– designed to choose shortest paths
– non-adaptive
– “non-deterministic”

[Diagram: packet format; BOP, routing bytes P1 P2 P3, payload, EOP]
SP Switch Chip
16x16 unbuffered crossbar
Conflict resolution: least recently served (LRS)
Packet refused by output port if…
– port is already transmitting
– input port is not the LRS
Refused packets go to a unified buffer
– Dynamic allocation of chunks
– Multiple next-chunk lists
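A minimal software model of the LRS arbitration just described; purely illustrative, not the chip's actual logic, and the per-(input, output) bookkeeping is our assumption:

    #include <limits.h>

    #define NPORTS 16

    /* last_served[i][o]: logical time at which input i was last granted
       output o; a lower value means "served longer ago". */
    static long last_served[NPORTS][NPORTS];
    static long now = 0;

    /* Among the inputs requesting output `out` (requests[i] != 0), grant
       the least recently served one. Every other requester is refused
       and would fall back to the unified buffer. Returns -1 if idle. */
    int lrs_arbitrate(const int requests[NPORTS], int out)
    {
        int winner = -1;
        long oldest = LONG_MAX;
        for (int i = 0; i < NPORTS; i++) {
            if (requests[i] && last_served[i][out] < oldest) {
                oldest = last_served[i][out];
                winner = i;
            }
        }
        if (winner >= 0)
            last_served[winner][out] = ++now;  /* winner is now most recent */
        return winner;
    }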
SP Switch Board
2-stage BMIN
4 shortest paths for each (src, dst) pair
16 ports on the right for (up to) 16 nodes
16 ports on the left for multiple boards
2-dimensional 4-ary butterfly
Multiple Boards
SP Switch Adapter
On the 6xx-MX bus (480 MB/s peak bandwidth)
$12,500 (?!)
SP Switch: the Software
IP or User Space
IP: “easy” but slow
– Kernel extensions to handle IP packets
– Any IP application can use the switch
– TCP/IP and system call overheads
User Space: the high-performance solution
– DirectX-like: low overhead
– Portability/complexity force a protocol stack
User Space Protocols
MPI: industry-standard message passing interface
MPCI: point-to-point transport; hides multi-threading issues
PIPE: byte-stream transport; splits messages into flits; chooses among 4 routes; does buffering/DMA; manages flow control/ECC
MPI implementation is native
Anything faster? LAPI
[Diagram: protocol stack; User application → MPI → MPCI → PIPE → HW Abstraction Layer → SP Switch, with a UDP path (5+ apps) alongside]
Testing the Switch
ANSI C
Non-threaded MPI 1.2 library
Switch tuning: tuning.scientific
Inside a node: shared memory
Outside a node: switch + User Space
Expect hierarchy effects
Beware: extraneous loads beyond LoadLeveler control!
Measuring Latency
Latency measured as “round trip time”
Packet size: 0 bytes
Unloaded latency (best scenario)
Synchronous calls: they give better results

[Diagram: ping-pong between P1 and P2]
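A minimal sketch of this ping-pong in ANSI C / MPI; the repetition count is our choice, and the synchronous MPI_Ssend reflects the observation above that synchronous calls give better results:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        const int reps = 1000;  /* our choice; average out timer jitter */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {            /* 0-byte message, echoed back */
                MPI_Ssend(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Ssend(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double rtt_us = (MPI_Wtime() - t0) / reps * 1e6;

        if (rank == 0)
            printf("average round-trip time: %.2f us\n", rtt_us);
        MPI_Finalize();
        return 0;
    }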
Latency: Results
Same latency for intra- and inter-node comms
Best latency ever seen: 7.13 μs (inter-node!)
Worst latency ever seen: 330.64 μs
“Catastrophic events” (> 9 ms) happen!
What about Ethernet?
[verona:~] bambu% ping -q -r -c 10 pandora
PING pandora.dei.unipd.it (147.162.96.156): 56 data bytes

--- pandora.dei.unipd.it ping statistics ---
10 packets transmitted, 10 packets received, 0% packet loss
round-trip min/avg/max = 0.379/0.477/0.917 ms
Latency: the True Story
Data taken from a 16-CPU job
Average for the job: 81 μs
At least 6 times better than Ethernet…
[Histogram: latency for P1, Job #7571; frequency vs. latency (microseconds), bins from 10 to 160 μs plus “More”]
[Histogram: latency for P4, Job #7571; frequency vs. latency (microseconds), bins from 10 to 270 μs]
2 different events? 3 different events?
Measuring Bandwidth
Big packets (tens of MB) to overcome buffering and latency effects
Only one sender/receiver pair active: P0 sends, Pi receives
Separate buffers for send and receive
Unidirectional BW: MPI_Send & MPI_Recv
Bidirectional BW: MPI_Sendrecv is bad! Use MPI_Isend and MPI_Irecv instead
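A sketch of both tests, assuming exactly two MPI ranks; the 32 MB buffer and single repetition are illustrative choices (the talk only says “tens of MB”):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define SIZE (32 * 1024 * 1024)   /* "tens of MB" */

    int main(int argc, char **argv)
    {
        int rank;
        char *sbuf = malloc(SIZE), *rbuf = malloc(SIZE); /* separate buffers */
        MPI_Request req[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int peer = 1 - rank;           /* assumes exactly 2 ranks */

        /* --- unidirectional: P0 sends, the peer receives --- */
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        if (rank == 0)
            MPI_Send(sbuf, SIZE, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
        else
            MPI_Recv(rbuf, SIZE, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        double uni = SIZE / (MPI_Wtime() - t0) / 1e6;

        /* --- bidirectional: both ranks send and receive at once,
               with MPI_Isend/MPI_Irecv rather than MPI_Sendrecv --- */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        MPI_Irecv(rbuf, SIZE, MPI_BYTE, peer, 1, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sbuf, SIZE, MPI_BYTE, peer, 1, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        double bi = 2.0 * SIZE / (MPI_Wtime() - t0) / 1e6;

        if (rank == 0)
            printf("unidirectional %.1f MB/s, bidirectional %.1f MB/s\n",
                   uni, bi);
        free(sbuf); free(rbuf);
        MPI_Finalize();
        return 0;
    }

Timing each side separately slightly flatters the send path; the talk's exact harness may differ.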
Bandwidth: Results (1)
2-level hierarchy: no advantage for same-switch-chip nodes
Unidirectional results
– Some catastrophic events, as usual
– Intra-node: best 338.8, worst 250.1 MB/s; typically over 333 MB/s
– Inter-node: best 133.9, worst 101.3 MB/s; typically over 133 MB/s (89% of peak)
Bandwidth: Results (2)
Bidirectional results
– Intra-node: best 351.3, worst 274.6 MB/s
– Inter-node: best 183.5, worst 106.7 MB/s; 61% of peak or even less

[Chart: bandwidth oscillation in Job #7571; bandwidth (MB/s, 0 to 400) vs. run # (1 to 15) for P1, P4, P8]
Barrier Synchronization
Data averaged over hundreds of calls
Not influenced by node allocation, but…
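A sketch of the measurement loop; averaging over many consecutive calls is the talk's method, the count of 500 is our choice:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        const int reps = 500;   /* "hundreds of calls" */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);     /* warm-up / alignment */
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++)
            MPI_Barrier(MPI_COMM_WORLD);
        double per_call_us = (MPI_Wtime() - t0) / reps * 1e6;

        if (rank == 0)
            printf("MPI_Barrier: %.1f us per call\n", per_call_us);
        MPI_Finalize();
        return 0;
    }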
[Chart: MPI_Barrier() times in different runs; sync time (microseconds, 0 to 400) vs. processors involved (0 to 24); runs 7736, 7737, 7738, 7741, 7742, 7743, 7744]
Barrier: the True Story
For 24 processors: 325 vs. 534 μs
Which value should we use?
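The “filtered” figures drop the catastrophic outliers before averaging. The talk does not spell out its exact rule; one plausible scheme, with the cutoff as our assumption:

    #include <stddef.h>

    /* Average the samples below `cutoff_us`, discarding outliers
       (e.g. the "catastrophic events"); the cutoff is an assumption,
       not the talk's documented criterion. */
    double filtered_mean(const double *us, size_t n, double cutoff_us)
    {
        double sum = 0.0;
        size_t kept = 0;
        for (size_t i = 0; i < n; i++) {
            if (us[i] <= cutoff_us) {
                sum += us[i];
                kept++;
            }
        }
        return kept ? sum / kept : 0.0;
    }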
[Chart: UNFILTERED MPI_Barrier() times in different runs; sync time (microseconds, 0 to 900) vs. processors involved (0 to 24); runs 7736 to 7744 plus the filtered curve]
PUB Library: Facts
PUB = “Paderborn University BSP”
Alternative to MPI (higher level); GPL’ed
Uses the BSP computational model:
– Put messages in the buffer
– Call barrier synchronization
– Messages are now at their destinations
Native implementation for older architectures; runs over TCP/IP or MPI otherwise
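A sketch of one BSP superstep in PUB style. The function names come from the comparison table on the next slide; the handle-based signatures and the bsp_pid/bsp_nprocs helpers are assumptions here, not the documented PUB API:

    #include "pub.h"   /* assumed header name */

    /* One superstep: buffer a message, then synchronize. */
    void superstep(t_bsp *bsp, int value)
    {
        /* assumed helpers: processor id and count */
        int right = (bsp_pid(bsp) + 1) % bsp_nprocs(bsp);

        /* 1. Put the message in the buffer (nothing is delivered yet). */
        bsp_send(bsp, right, &value, sizeof value);

        /* 2. Barrier synchronization ends the superstep... */
        bsp_sync(bsp);

        /* 3. ...and only now every buffered message is at its
              destination, ready to be read locally. */
    }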
MPI vs. PUB: Interfaces
MPI              PUB
MPI_Send         bsp_send
MPI_Isend        bsp_hpsend
MPI_Barrier      bsp_sync
MPI_Comm_split   bsp_partition
MPI_Reduce       bsp_reduce
MPI_Scan         bsp_scan
MPI_Bcast        bsp_gsend

Only in MPI: scatter/gather, topologies
Only in PUB: bsp_oblsync, virtual processors
MPI vs. PUB: latency
Measuring latency is against the spirit of PUB!
Best is 42 μs, but the average is 86 μs (like MPI)
[Histogram: latency for P4, Job #7835; frequency vs. latency (microseconds), bins from 50 to 90 μs plus “More”]
MPI vs. PUB: bandwidth
Unidirectional results
– Intra-node: best 124.9, worst 74.0 MB/s; typically over 124 MB/s (no shared memory?)
– Inter-node: best 79.4, worst 74.8 MB/s; typically over 79 MB/s (53% of peak)
Bidirectional results
– Intra-node: best 222.4, worst 121.4 MB/s
– Inter-node: best 123.7, worst 82.3 MB/s
Much slower than MPI, but…
…sometimes the measured bandwidth is higher than the link capacity?!
MPI vs. PUB: barrier
PUB is slower for few processors, then faster
PUB also needs filtering for > 16 processors
[Chart: MPI_Barrier() versus bsp_sync(), filtered; sync time (microseconds, 0 to 400) vs. processors involved (0 to 24); runs 7810 to 7813 for both MPI and PUB; annotated gap: 50 μs]
Further Results (1)
Further performance measurements with more complex communication patterns
10% difference in results from job to job
MPI: previous figures still hold if we consider aggregate bandwidth per node
PUB is at least 20% slower than MPI (much more for bidirectional patterns)
Some PUB figures are, again, meaningless
Further Results (2)
Switch topology emerges in complex patterns
Main bottleneck: the node↔switch channel
Other effects present (e.g. concurrency handling)
[Diagram: 3 streams among nodes N1, N3, N5, N6; links annotated with observed bandwidth fractions 1/3, 2/3, 2/3?, 1/2?!, 1/2?!]
Conclusions
Variability in results… due to load effects? Due to software?
Variability makes a switch model impossible
PUB benefits?
If I had more time/resources…
– Higher level: collective communications
– Lower level: the LAPI interface
– Gigabit Ethernet, Myrinet, CINECA’s SP4
– Native PUB library on the SP