IBM SP Switch: Theory and Practice. Carlo Fantozzi ([email protected])
Transcript
Page 1: IBM SP Switch: Theory and Practice

IBM SP Switch:Theory and Practice

Carlo Fantozzi([email protected])

Page 2: IBM SP Switch: Theory and Practice

Summary of presentation

Theory
– Architecture of the SP node
– Architecture of the SP switch (detailed)
– The software layer

Practice
– Performance measurement using MPI
– The PUB library
– MPI vs. PUB

Page 3: IBM SP Switch: Theory and Practice

SP System Overview

Flexibility: db server, web server, storage server, “number cruncher”
Scalability: up to 512 nodes
Modularity; building blocks:
– SMP nodes
– 16-port switches

Many different building blocks available
Result: a cluster of SMPs

Page 4: IBM SP Switch: Theory and Practice

Scalability

8192 processors
November 2002: #4 on the TOP500 list

Page 5: IBM SP Switch: Theory and Practice

POWER3 SMP Thin Node

4 processors, disks & AIX O.S.
CPU: 64-bit POWER3-II @ 375 MHz
– 2 FP units, 1500 MFLOPS (peak performance)

– L1 cache: 32 KB inst, 64 KB data

L2 cache: 8 MB @ 250 MHz per CPU
– 256-bit width, dedicated bus

Main memory: up to 16 GB @ 62.5 MHz
– Shared among processors; 512-bit bus

Page 6: IBM SP Switch: Theory and Practice

Thin node: 6xx-MX bus

64 bits wide; runs @ 60 MHz
Independent from memory buses
Peak bandwidth: 480 MB/s
6xx-MX bus shared by:
– ultra SCSI disks
– 100 Mbps Ethernet
– PCI expansion slots
– SP Switch adapter

Page 7: IBM SP Switch: Theory and Practice

The SP Switch

Scalable, reconfigurable
Redundant, reliable
High bandwidth
Low latency

Split into 2 parts:
– the SP Switch Board
– the SP Switch Adapter (on each node)

Page 8: IBM SP Switch: Theory and Practice

SP Switch Port

Synchronous, bidirectional
Phit size: 1 byte
Flit size: 2 bytes
Flow control: credits and tokens

Peak bw: 150 MB/s per direction

Page 9: IBM SP Switch: Theory and Practice

SP Switch Basics

Link-level flow control
Buffered wormhole routing
Cut-through switching
Routing strategy is:
– source-based
– designed to choose shortest paths
– non-adaptive
– “non-deterministic”

(Packet format diagram: BOP, P1, P2, P3, payload, EOP)

Page 10: IBM SP Switch: Theory and Practice

SP Switch Chip

16x16 unbuffered crossbar
Conflict resolution: least recently served (LRS)
Packet refused by output port if…
– port is already transmitting
– input port is not the LRS

Refused packets go to a unified buffer
– Dynamic allocation of chunks
– Multiple next-chunk lists

Page 11: IBM SP Switch: Theory and Practice

SP Switch Board

2-stage BMIN
4 shortest paths for each (src, dst) pair
16 ports on the right for (up to) 16 nodes
16 ports on the left for multiple boards

2-dimensional 4-ary butterfly

Page 12: IBM SP Switch: Theory and Practice

Multiple Boards

Page 13: IBM SP Switch: Theory and Practice

SP Switch Adapter

On the 6xx-MX bus (480 MB/s peak bandwidth)
$12,500 (?!)

Page 14: IBM SP Switch: Theory and Practice

SP Switch: the Software

IP or User Space

IP: “easy” but slow
– Kernel extensions to handle IP packets
– Any IP application can use the switch
– TCP/IP and system call overheads

User Space: high performance solution
– DirectX-like: low overhead
– Portability/complexity force a protocol stack

Page 15: IBM SP Switch: Theory and Practice

User Space Protocols

MPI: industry-standard message passing interface
MPCI: point-to-point transport; hides multi-thread issues
PIPE: byte-stream transport; splits messages into flits; chooses among 4 routes; does buffering/DMA; manages flow control/ECC

MPI implementation is native
Anything faster? LAPI

(Protocol stack diagram: User application → MPI → MPCI → PIPE → HW Abstraction Layer → SP Switch; UDP (5+ apps) also shown)

Page 16: IBM SP Switch: Theory and Practice

Testing the Switch

ANSI C
Non-threaded MPI 1.2 library
Switch tuning: tuning.scientific
Inside a node: shared memory
Outside a node: switch + User Space
Expect hierarchy effects
Beware: extraneous loads beyond LoadLeveler control!


Page 17: IBM SP Switch: Theory and Practice

Measuring Latency

Latency measured as “round trip time”

Packet size: 0 bytes
Unloaded latency (best scenario)
Synchronous calls: they give better results (see the sketch below)

(Diagram: round-trip message between processors P1 and P2)
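For concreteness, a minimal sketch of this kind of round-trip measurement, assuming blocking MPI_Send/MPI_Recv between two ranks; the repetition count and output format are illustrative choices, not the benchmark code actually used for the figures that follow:

/* pingpong.c: round-trip latency with 0-byte messages between ranks 0 and 1 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, i;
    const int reps = 1000;                  /* illustrative repetition count */
    double t0, t1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);            /* align the ranks before timing */
    t0 = MPI_Wtime();
    for (i = 0; i < reps; i++) {
        if (rank == 0) {                    /* send a 0-byte message, wait for the echo */
            MPI_Send(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {             /* echo every message back */
            MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("average round-trip time: %.2f microseconds\n",
               (t1 - t0) / reps * 1e6);
    MPI_Finalize();
    return 0;
}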

Page 18: IBM SP Switch: Theory and Practice

Latency: Results

Same latency for intra- and inter-node comms
Best latency ever seen: 7.13 µs (inter-node!)

Worst latency ever seen: 330.64 µs
“Catastrophic events” (> 9 ms) happen!
What about Ethernet?

[verona:~] bambu% ping -q -r -c 10 pandora
PING pandora.dei.unipd.it (147.162.96.156): 56 data bytes

--- pandora.dei.unipd.it ping statistics ---
10 packets transmitted, 10 packets received, 0% packet loss
round-trip min/avg/max = 0.379/0.477/0.917 ms

Page 19: IBM SP Switch: Theory and Practice

Latency: the True Story

Data taken from a 16-CPU job
Average for the job: 81 µs
At least 6 times better than Ethernet…

(Histograms: “Latency for P1, Job #7571” and “Latency for P4, Job #7571”; x-axis: latency in microseconds, y-axis: frequency)

2 different events? 3 different events?

Page 20: IBM SP Switch: Theory and Practice

Measuring Bandwidth

Big packets (tens of MB) to overcome buffers and latency effects
Only one sender/receiver pair active: P0 sends, Pi receives

Separate buffers for send and receive
Unidirectional BW: MPI_Send & MPI_Recv
Bidirectional BW: MPI_Sendrecv is bad! Use MPI_Isend and MPI_Irecv instead (see the sketch below)
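A minimal sketch of the bidirectional case under these rules (separate buffers, one active pair, MPI_Isend/MPI_Irecv). The 32 MB message size and the aggregate-bandwidth formula are illustrative assumptions, not the original benchmark:

/* bw.c: one-pair bidirectional bandwidth with MPI_Isend/MPI_Irecv */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

#define MSG_BYTES (32 * 1024 * 1024)   /* “big packets”: tens of MB */

int main(int argc, char *argv[])
{
    int rank, peer;
    char *sendbuf, *recvbuf;           /* separate buffers for send and receive */
    double t0, t1;
    MPI_Request req[2];
    MPI_Status  stat[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank < 2) {                    /* only one sender/receiver pair is active */
        peer = 1 - rank;
        sendbuf = (char *) malloc(MSG_BYTES);
        recvbuf = (char *) malloc(MSG_BYTES);
        memset(sendbuf, 0, MSG_BYTES);

        t0 = MPI_Wtime();
        /* post both transfers at once, then wait for completion */
        MPI_Irecv(recvbuf, MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sendbuf, MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, stat);
        t1 = MPI_Wtime();

        if (rank == 0)                 /* bytes sent plus bytes received */
            printf("aggregate bidirectional bandwidth: %.1f MB/s\n",
                   2.0 * MSG_BYTES / (t1 - t0) / 1e6);
        free(sendbuf);
        free(recvbuf);
    }
    MPI_Finalize();
    return 0;
}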

Page 21: IBM SP Switch: Theory and Practice

Bandwidth: Results (1)

2-level hierarchy: no advantage for same-switch-chip nodes

Unidirectional results
– Some catastrophic events, as usual
– Intra-node: best 338.8, worst 250.1 MB/s; typically over 333 MB/s
– Inter-node: best 133.9, worst 101.3 MB/s; typically over 133 MB/s (89% of p.p.)

Page 22: IBM SP Switch: Theory and Practice

Bandwidth: Results (2)

Bidirectional results
– Intra-node: best 351.3, worst 274.6 MB/s
– Inter-node: best 183.5, worst 106.7 MB/s; 61% of p.p. or even less

(Chart: “Bandwidth Oscillation in Job #7571”; x-axis: run # (1–15), y-axis: bandwidth in MB/s; series: P1, P4, P8)

Page 23: IBM SP Switch: Theory and Practice

Barrier Synchronization

Data averaged over hundreds of calls

Not influenced by node allocation, but…
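The timing loop behind such a measurement can be as simple as the sketch below, averaging over hundreds of MPI_Barrier() calls; the repetition count is an illustrative choice, not the original test harness:

/* barrier.c: average MPI_Barrier() time over many calls */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, i;
    const int reps = 500;              /* “hundreds of calls” */
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);       /* warm up and align all processes */
    t0 = MPI_Wtime();
    for (i = 0; i < reps; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("average barrier time: %.1f microseconds\n",
               (t1 - t0) / reps * 1e6);
    MPI_Finalize();
    return 0;
}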

(Chart: “MPI_Barrier() times in different runs”; x-axis: processors involved (0–24), y-axis: sync time in microseconds; series: runs 7736, 7737, 7738, 7741, 7742, 7743, 7744)

Page 24: IBM SP Switch: Theory and Practice

Barrier: the True Story

For 24 processors: 325 vs 534 µs
Which value should we use?

(Chart: “UNFILTERED MPI_Barrier() times in different runs”; x-axis: processors involved (0–24), y-axis: sync time in microseconds; series: runs 7736, 7737, 7738, 7741, 7742, 7743, 7744, plus the filtered curve)

Page 25: IBM SP Switch: Theory and Practice

PUB Library: Facts

PUB = “Paderborn University BSP”
Alternative to MPI (higher level); GPL’ed
Uses the BSP computational model (see the sketch after this list):
– Put messages in buffer
– Call barrier synchronization
– Messages are now at their destinations

Native implementation for old architectures, runs over TCP/IP or MPI otherwise
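To make the superstep discipline concrete, here is a minimal sketch of that three-step pattern written with the MPI primitives already used in this talk; it only mimics what PUB’s buffered sends plus bsp_sync() provide, and is not PUB code or PUB’s API:

/* superstep.c: the BSP pattern (buffer sends, synchronize, then read), sketched with MPI */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, nprocs, right, left, out, in;
    MPI_Request req[2];
    MPI_Status  stat[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    right = (rank + 1) % nprocs;
    left  = (rank + nprocs - 1) % nprocs;

    /* “put messages in buffer”: post this superstep's communication */
    out = rank * rank;
    MPI_Irecv(&in, 1, MPI_INT, left, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(&out, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &req[1]);

    /* “call barrier synchronization”: the superstep boundary */
    MPI_Waitall(2, req, stat);
    MPI_Barrier(MPI_COMM_WORLD);

    /* “messages are now at their destinations” */
    printf("process %d received %d from its left neighbor\n", rank, in);

    MPI_Finalize();
    return 0;
}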

Page 26: IBM SP Switch: Theory and Practice

MPI vs. PUB: Interfaces

MPI_Send        ↔  bsp_send
MPI_Isend       ↔  bsp_hpsend
MPI_Barrier     ↔  bsp_sync
MPI_Comm_split  ↔  bsp_partition
MPI_Reduce      ↔  bsp_reduce
MPI_Scan        ↔  bsp_scan
MPI_Bcast       ↔  bsp_gsend

MPI: scatter/gather, topologies
PUB: bsp_oblsync, virtual processors

Page 27: IBM SP Switch: Theory and Practice

MPI vs. PUB: latency

Measuring latency is against the spirit of PUB!
Best is 42 µs, but average is 86 µs (like MPI)

(Histogram: “Latency for P4, Job #7835”; x-axis: latency in microseconds (bins from 50 to 90 and more), y-axis: frequency)

Page 28: IBM SP Switch: Theory and Practice

MPI vs. PUB: bandwidth

Unidirectional results
– Intra-node: best 124.9, worst 74.0 MB/s; typically over 124 MB/s (no shared memory?)
– Inter-node: best 79.4, worst 74.8 MB/s; typically over 79 MB/s (53% of p.p.)

Bidirectional results
– Intra-node: best 222.4, worst 121.4 MB/s
– Inter-node: best 123.7, worst 82.3 MB/s

Much slower than MPI, but…
…sometimes bw is higher than link capacity?!

Page 29: IBM SP Switch: Theory and Practice

MPI vs PUB: barrier

PUB is slower for few processors, then faster
PUB also needs filtering for >16 processors

(Chart: “MPI_Barrier() versus bsp_sync(), filtered”; x-axis: processors involved (0–24), y-axis: sync time in microseconds; series: runs 7810–7813 for both MPI and PUB; “50 µs” annotation)

Page 30: IBM SP Switch: Theory and Practice

Further Results (1)

Further performance measures with more complex communication patterns

10% difference in results from job to job
MPI: previous figures still hold if we consider aggregate bandwidth per node

PUB is at least 20% slower than MPI (much more for bidirectional patterns)
Some PUB figures are, again, meaningless

Page 31: IBM SP Switch: Theory and Practice

Further Results (2)

Switch topology emerges in complex patterns

Main bottleneck: the node↔switch channel

Other effects present (e.g. concurrency handling)

(Diagram: 3 streams among nodes N1, N3, N5, N6, annotated with bandwidth fractions 1/3, 2/3, “2/3?”, “1/2?!”, “1/2?!”)

Page 32: IBM SP Switch: Theory and Practice

Conclusions

Variability in results…
…due to load effects? Due to SW?
Variability makes a switch model impossible
PUB benefits?

If I had more time/resources…
– Higher level: collective communications
– Lower level: the LAPI interface
– GigaEthernet, Myrinet, CINECA’s SP4
– Native PUB library on the SP

