IBM SP Switch: Theory and Practice


IBM SP Switch: Theory and Practice

Carlo Fantozzi (carlo.fantozzi@dei.unipd.it)

Summary of presentation

Theory
– Architecture of the SP node
– Architecture of the SP switch (detailed)
– The software layer

Practice
– Performance measurement using MPI
– The PUB library
– MPI vs. PUB

SP System Overview

Flexibility: db server, web server, storage server, “number cruncher”
Scalability: up to 512 nodes
Modularity; building blocks:
– SMP nodes
– 16-port switches

Many different building blocks available
Result: a cluster of SMPs

Scalability

8192 processors
November 2002: #4 on the TOP500 list

POWER3 SMP Thin Node

4 processors, disks & AIX o.s.
CPU: 64-bit POWER3-II @ 375 MHz
– 2 FP units, 1500 MFLOPS (peak performance)
– L1 cache: 32 KB instruction, 64 KB data

L2 cache: 8 MB @ 250 MHz per CPU
– 256-bit width, dedicated bus

Main memory: up to 16 GB @ 62.5 MHz
– Shared among processors; 512-bit bus

Thin node: 6xx-MX bus

64 bits wide; runs @ 60 MHz
Independent from memory buses
Peak bandwidth: 480 MB/s (quick check below)
6xx-MX bus shared by:
– ultra SCSI disks
– 100 Mbps Ethernet
– PCI expansion slots
– SP Switch adapter
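The 480 MB/s figure is simply bus width times clock, taking 1 MB = 10^6 bytes:

480 MB/s = 64 bits × 60 MHz = 8 bytes × 60 × 10^6 transfers per second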

The SP Switch

Scalable, reconfigurable
Redundant, reliable
High bandwidth
Low latency

Split into 2 parts:
– the SP Switch Board
– the SP Switch Adapter (on each node)

SP Switch Port

Synchronous
Bidirectional
Phit size: 1 byte
Flit size: 2 bytes
Flow control: credits and tokens (sketch below)

Peak bandwidth: 150 MB/s per direction
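A minimal sketch of the credit/token idea, in C. This is a generic model of link-level, credit-based flow control, not the SP's actual protocol; the buffer depth is an assumption.

/* Generic credit-based flow control model. The sender may inject a flit only
   while it holds credits; the receiver returns a credit (token) whenever it
   frees a buffer slot. */

#define RECV_BUFFER_FLITS 8            /* assumed receiver buffer depth */

typedef struct {
    int credits;                       /* flits the sender may still inject */
} link_t;

void link_init(link_t *l)
{
    l->credits = RECV_BUFFER_FLITS;
}

/* Sender side: returns 1 if a flit was sent, 0 if the sender must stall. */
int send_flit(link_t *l)
{
    if (l->credits == 0)
        return 0;                      /* no credit: link-level backpressure */
    l->credits--;
    return 1;
}

/* Receiver side: called when a buffered flit is consumed downstream. */
void return_credit(link_t *l)
{
    l->credits++;                      /* credit (token) flows back to the sender */
}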

SP Switch Basics

Link-level flow control
Buffered wormhole routing
Cut-through switching
Routing strategy is:
– source-based
– designed to choose shortest paths
– non-adaptive
– “non-deterministic”

(Packet format diagram: BOP, P1, P2, P3, payload, EOP.)

SP Switch Chip

16x16 unbuffered crossbar
Conflict resolution: least recently served (LRS); see the toy model below
Packet refused by output port if…
– port is already transmitting
– input port is not the LRS

Refused packets go to a unified buffer
– Dynamic allocation of chunks
– Multiple next-chunk lists
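A toy software model of the LRS policy, in C. This is only an illustration of “least recently served” arbitration, not the hardware logic; the bookkeeping structures are assumptions.

/* Toy model: each output port grants the requesting input port that was
   served least recently. Refused inputs would go to the unified buffer. */

#define NPORTS 16

static long last_served[NPORTS][NPORTS];   /* [output][input] service "time" */
static long now = 0;

/* requests[in] is nonzero if input port `in` wants this output port.
   Returns the winning input port, or -1 if there are no requests. */
int lrs_arbitrate(int out, const int requests[NPORTS])
{
    int in, winner = -1;

    for (in = 0; in < NPORTS; in++) {
        if (!requests[in])
            continue;
        if (winner < 0 || last_served[out][in] < last_served[out][winner])
            winner = in;                        /* served longer ago wins the port */
    }
    if (winner >= 0)
        last_served[out][winner] = ++now;       /* winner is now most recently served */
    return winner;                              /* losers are refused and get buffered */
}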

SP Switch Board

2-stage BMIN
4 shortest paths for each (src, dst) pair
16 ports on the right for (up to) 16 nodes
16 ports on the left for multiple boards

2-dimensional 4-ary butterfly

Multiple Boards

SP Switch Adapter

On the 6xx-MX bus (480 MB/s peak bandwidth)
$12,500 (?!)

SP Switch: the Software

IP or User Space

IP: “easy” but slow
– Kernel extensions to handle IP packets
– Any IP application can use the switch
– TCP/IP and system call overheads

User Space: high-performance solution
– DirectX-like: low overhead
– Portability/complexity force a protocol stack

User Space Protocols

MPI: industry-standard message passing interface
MPCI: point-to-point transport; hides multi-thread issues
PIPE: byte-stream transport; splits messages into flits; chooses among 4 routes; does buffering/DMA; manages flow control/ECC

MPI implementation is native
Anything faster? LAPI

(Protocol stack diagram, top to bottom: User application, MPI, MPCI, PIPE, HW Abstraction Layer, SP Switch; UDP (5+ apps) appears as an alternative path.)

Testing the Switch

ANSI C
Non-threaded MPI 1.2 library
Switch tuning: tuning.scientific
Inside a node: shared memory
Outside a node: switch + User Space
Expect hierarchy effects
Beware: extraneous loads beyond LoadLeveler control!


Measuring Latency

Latency measured as “round trip time”

Packet size: 0 bytes
Unloaded latency (best scenario)
Synchronous calls: they give better results (sketch below)

(Diagram: round-trip message exchange between P1 and P2.)
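A minimal round-trip sketch in C and MPI, assuming two processes and an illustrative repetition count; the original benchmark code is not reproduced here.

/* Zero-byte ping-pong between ranks 0 and 1; the average round-trip
   time over REPS iterations approximates the latency. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int REPS = 1000;             /* illustrative, not the original value */
    int rank, i;
    char dummy = 0;                    /* 0-byte messages still need a valid buffer */
    double t0, t1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(&dummy, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&dummy, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(&dummy, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(&dummy, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("average round-trip time: %.2f microseconds\n",
               (t1 - t0) / REPS * 1e6);

    MPI_Finalize();
    return 0;
}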

Latency: Results

Same latency for intra- and inter-node comms
Best latency ever seen: 7.13 µs (inter-node!)

Worst latency ever seen: 330.64 µs
“Catastrophic events” (> 9 ms) happen!
What about Ethernet?

[verona:~] bambu% ping -q -r -c 10 pandora
PING pandora.dei.unipd.it (147.162.96.156): 56 data bytes

--- pandora.dei.unipd.it ping statistics ---
10 packets transmitted, 10 packets received, 0% packet loss
round-trip min/avg/max = 0.379/0.477/0.917 ms

Latency: the True Story

Data taken from a 16-CPU job
Average for the job: 81 µs
At least 6 times better than Ethernet…

(Histogram: latency for P1, Job #7571; x-axis: latency in microseconds, bins from 10 to 160 plus “More”; y-axis: frequency.)

(Histogram: latency for P4, Job #7571; x-axis: latency in microseconds, bins from 10 to 270; y-axis: frequency.)

2 different events? 3 different events?

Measuring Bandwidth

Big packets (tens of MB) to overcome buffers and latency effects
Only one sender/receiver pair active: P0 sends, Pi receives

Separate buffers for send and receive
Unidirectional BW: MPI_Send & MPI_Recv
Bidirectional BW: MPI_Sendrecv is bad!
Use MPI_Isend and MPI_Irecv instead (see the sketch below)
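A bidirectional bandwidth sketch in C and MPI, assuming two processes and an illustrative 32 MB message size; it pairs P0 with P1 only, whereas the actual tests varied the receiver Pi.

/* Both transfers are posted with MPI_Isend/MPI_Irecv before waiting, so the
   two directions proceed concurrently (unlike the MPI_Sendrecv version). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG_BYTES (32 * 1024 * 1024)   /* "tens of MB"; exact size is an assumption */

int main(int argc, char **argv)
{
    int rank, size, peer;
    char *sendbuf, *recvbuf;           /* separate buffers for send and receive */
    double t0 = 0.0, t1 = 1.0;
    MPI_Request req[2];
    MPI_Status  sta[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = malloc(MSG_BYTES);
    recvbuf = malloc(MSG_BYTES);
    memset(sendbuf, 0, MSG_BYTES);
    peer = (rank == 0) ? 1 : 0;        /* pair P0 with P1 in this sketch */

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank < 2 && size >= 2) {
        t0 = MPI_Wtime();
        MPI_Irecv(recvbuf, MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sendbuf, MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, sta);
        t1 = MPI_Wtime();
    }

    if (rank == 0 && size >= 2)
        printf("bandwidth: %.1f MB/s per direction\n",
               MSG_BYTES / (1024.0 * 1024.0) / (t1 - t0));

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}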

Bandwidth: Results (1)

2-level hierarchy: no advantage for same-switch-chip nodes

Unidirectional results
– Some catastrophic events, as usual
– Intra-node: best 338.8, worst 250.1 MB/s; typically over 333 MB/s
– Inter-node: best 133.9, worst 101.3 MB/s; typically over 133 MB/s (89% of p.p.)

Bandwidth: Results (2)

Bidirectional results
– Intra-node: best 351.3, worst 274.6 MB/s
– Inter-node: best 183.5, worst 106.7 MB/s; 61% of p.p. or even less

(Chart: bandwidth oscillation in Job #7571; x-axis: run number, 1 to 15; y-axis: bandwidth in MB/s, 0 to 400; series P1, P4, P8.)

Barrier Synchronization

Data averaged over hundreds of calls (timing sketch below)

Not influenced by node allocation, but…
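A sketch of the averaging loop in C and MPI; the repetition count is an assumption standing in for “hundreds of calls”.

/* Average MPI_Barrier() cost over many calls. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int REPS = 500;              /* illustrative repetition count */
    int rank, i;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);       /* warm up and align the processes */
    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("average barrier time: %.1f microseconds\n",
               (t1 - t0) / REPS * 1e6);

    MPI_Finalize();
    return 0;
}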

(Chart: MPI_Barrier() times in different runs; Runs 7736, 7737, 7738, 7741, 7742, 7743, 7744; x-axis: processors involved, 0 to 24; y-axis: sync time in microseconds, 0 to 400.)

Barrier: the True Story

For 24 processors: 325 vs. 534 µs
Which value should we use?

(Chart: UNFILTERED MPI_Barrier() times in different runs; Runs 7736, 7737, 7738, 7741, 7742, 7743, 7744, plus the filtered curve; x-axis: processors involved, 0 to 24; y-axis: sync time in microseconds, 0 to 900.)

PUB Library: Facts

PUB = “Paderborn University BSP”
Alternative to MPI (higher level); GPL’ed
Uses the BSP computational model (superstep sketch below):
– Put messages in buffer
– Call barrier synchronization
– Messages are now at their destinations

Native implementation for old architectures, runs over TCP/IP or MPI otherwise
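A rough sketch of one BSP superstep in C, built only from the call names on these slides (bsp_send, bsp_sync); the header name, the t_bsp handle, the bsp_nprocs accessor and the exact signatures are assumptions, and the real PUB API may differ.

#include "pub.h"                       /* assumed header name */

/* One superstep: buffer messages, then synchronize. */
void superstep_example(t_bsp *bsp, int value)
{
    int dest;
    int nprocs = bsp_nprocs(bsp);      /* assumed accessor for the number of processors */

    /* Part 1: queue one message per processor. Messages are only buffered
       here; nothing has been delivered yet. */
    for (dest = 0; dest < nprocs; dest++)
        bsp_send(bsp, dest, &value, sizeof(value));   /* assumed signature */

    /* Part 2: barrier synchronization. When bsp_sync() returns, all messages
       queued in this superstep are at their destinations. */
    bsp_sync(bsp);
}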

MPI vs. PUB: Interfaces

MPI_Send ↔ bsp_send
MPI_Isend ↔ bsp_hpsend
MPI_Barrier ↔ bsp_sync
MPI_Comm_split ↔ bsp_partition
MPI_Reduce ↔ bsp_reduce
MPI_Scan ↔ bsp_scan
MPI_Bcast ↔ bsp_gsend

MPI: scatter/gather, topologies
PUB: bsp_oblsync, virtual processors

MPI vs. PUB: latency

Measuring latency is against the spirit of PUB!
Best is 42 µs, but average is 86 µs (like MPI)

(Histogram: latency for P4, Job #7835; x-axis: latency in microseconds, bins from 50 to 90 plus “More”; y-axis: frequency, 0 to 140.)

MPI vs. PUB: bandwidth

Unidirectional results
– Intra-node: best 124.9, worst 74.0 MB/s; typically over 124 MB/s (no shared memory?)
– Inter-node: best 79.4, worst 74.8 MB/s; typically over 79 MB/s (53% of p.p.)

Bidirectional results
– Intra-node: best 222.4, worst 121.4 MB/s
– Inter-node: best 123.7, worst 82.3 MB/s

Much slower than MPI, but…
…sometimes bandwidth is higher than link capacity?!

MPI vs. PUB: barrier

PUB is slower for few processors, then faster
PUB also needs filtering for >16 processors

(Chart: MPI_Barrier() versus bsp_sync(), filtered; Runs 7810, 7811, 7812, 7813 for both MPI and PUB; x-axis: processors involved, 0 to 24; y-axis: sync time in microseconds, 0 to 400; annotation: 50 µs.)

Further Results (1)

Further performance measures with more complex communication patterns

10% difference in results from job to job
MPI: previous figures still hold if we consider aggregate bandwidth per node

PUB is at least 20% slower than MPI (much more for bidirectional patterns)
Some PUB figures are, again, meaningless

Further Results (2)

Switch topology emerges in complex patterns

Main bottleneck: node-switch channel

Other effects present (e.g. concurrency handling)

(Diagram: nodes N1, N3, N5, N6 exchanging 3 streams through a node-switch channel; observed bandwidth fractions of roughly 1/3, 1/2, and 2/3 of the link.)

Conclusions

Variability in results…
…due to load effects? Due to SW?
Variability makes a switch model impossible
PUB benefits?

If I had more time/resources…
– Higher level: collective communications
– Lower level: the LAPI interface
– GigaEthernet, Myrinet, CINECA’s SP4
– Native PUB library on the SP