HPS (High Performance Switch)
Charles Grassl, IBM
ScicomP 11, May 2005
© 2005 IBM
Agenda
• Evolution
• Technology
• Performance characteristics
• RDMA
• Collectives
Evolution
• High Performance Switch (HPS)
  • Also known as “Federation”
• Follow-on to SP Switch2
  • Also known as “Colony”

Generation            Processors
Switch                POWER2
SP Switch             POWER2 → POWER3
SP Switch2 (Colony)   POWER3 → POWER4
HPS (Federation)      POWER4 → POWER5
Technology
• Internal network
  • In lieu of, e.g., Gigabit Ethernet
• Multiple links per node
  • Match the number of links to the number of processors

                 SP Switch2                    HPS
Latency:         15 microsec.                  5 microsec.
Bandwidth:       500 Mbyte/s                   2 Gbyte/s
Configuration:   1 adapter per logical node    1 adapter, 2 links, per MCM;
                                               16 Gbyte/s per 32-processor node
HPS Packaging
• 4U, 24-inch drawer
• 16 ports for server-to-switch connections
• 16 ports for switch-to-switch connections
• Host attachment directly to the server bus via a 2-link or 4-link Switch Network Interface (SNI)
  • Up to two links per pSeries 655
  • Up to eight links per pSeries 690
HPS Switch Configuration
[Diagram: switch board connected over the GX bus to POWER4 and POWER5 servers. Each HPS adapter pairs two “Canopus” chips with RAM and a link driver chip (LDC; copper or fiber-optic link drivers) and attaches to the HPS switch chips on the switch board.]
HPS Software
• MPI-LAPI (PE V4.1)
  • Uses LAPI as the reliable transport
  • Library uses threads, not signals, for asynchronous activities
• Existing applications are binary compatible
• New performance characteristics
  • Eager
  • Bulk transfer
  • Improved collective communication
HPS Software Architecture
[Diagram: HPS software stack spanning user space and kernel space. The application (with PESSL and ESSL) calls MPI, which runs over LAPI and HAL to the SMA3+ adapter; IP traffic uses sockets over TCP/UDP and IF_LS; GPFS runs over VSD; LoadLeveler and CSM manage the cluster; the hypervisor (HYP) and device driver (DD) sit in kernel space.]
Supported Communication Modes
• FIFO mode
  • Message is chopped into 2 Kbyte packet chunks on the host and copied by the CPU
  • Memory bus crossings depend on caching; at least one I/O bus crossing
• Remote Direct Memory Access (RDMA)
  • No slave-side protocol
  • CPU offload
  • Enhanced programming model
  • One I/O bus crossing
[Diagram: in FIFO mode the CPU loads/stores data between the user buffer and a network FIFO, which the adapter accesses by DMA; with RDMA the adapter moves data directly to and from the user buffer.]
Underlying Message Procedures
• Protocols:
  • Rendezvous: “large” messages
  • Eager: “small” messages
    • MP_EAGER_LIMIT (range: 0 - 65536 bytes)
• Mechanisms:
  • Packet
  • Bulk
    • MP_BULK_MIN_MSG_SIZE (range: any non-negative integer)
A short sketch below illustrates the protocol switchover.
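To make the protocol choice concrete, here is a minimal sketch, not from the original material: two point-to-point transfers, one below an assumed MP_EAGER_LIMIT of 4096 bytes and one well above it. The message sizes and the mpcc_r compile command are illustrative assumptions.

    /* Minimal sketch: one message below and one above an assumed
     * MP_EAGER_LIMIT of 4096 bytes.  The small send can complete
     * eagerly; the large one uses the rendezvous protocol and, above
     * MP_BULK_MIN_MSG_SIZE, may move via the bulk (RDMA) mechanism.
     * Build with, e.g., mpcc_r -o eager eager.c; run with 2 tasks. */
    #include <mpi.h>
    #include <string.h>

    static char small_msg[1024];      /* below the eager limit   */
    static char large_msg[1048576];   /* well above it (1 Mbyte) */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            memset(large_msg, 'x', sizeof(large_msg));
            /* Eager: data ships with the header, no handshake */
            MPI_Send(small_msg, sizeof(small_msg), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            /* Rendezvous: header first, data after the receiver is ready */
            MPI_Send(large_msg, sizeof(large_msg), MPI_CHAR, 1, 1, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(small_msg, sizeof(small_msg), MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Recv(large_msg, sizeof(large_msg), MPI_CHAR, 0, 1, MPI_COMM_WORLD, &st);
        }

        MPI_Finalize();
        return 0;
    }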
MPI Transfer Protocols
[Diagram: small messages (eager protocol) ship the message together with its header; large messages (rendezvous protocol) send the header first and the message only after the receiver responds.]
MPI Transfer Mechanisms
[Diagram: small messages cross the switch as individual packets; large messages cross as a single bulk transfer.]
Remote Direct Memory Access (RDMA)
• Overlap of computation and communication (possible); see the sketch below
• Fragmentation and reassembly offloaded to the adapter
• Minimizes packet-arrival interrupts
• Suits asynchronous messaging applications
  • All tasks sharing an adapter need not communicate at the same time
• One-sided programming model
• Zero-copy transport
  • Reduced memory subsystem load
  • Copying would interfere with other tasks; zero copy avoids this
• Striping of very large messages
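As an illustration of the overlap point above, the following minimal sketch (an assumed example, not from the original material) posts a nonblocking transfer large enough to use bulk transfer and computes while it is in flight; with MP_USE_BULK_XFER=yes the adapter can move the data while the CPU works. The buffer sizes and the stand-in compute loop are illustrative.

    /* Minimal sketch: overlap computation with a 2 Mbyte transfer.
     * Run with 2 tasks. */
    #include <mpi.h>

    #define HALO_DOUBLES 262144          /* 2 Mbyte of doubles */

    static double halo[HALO_DOUBLES];
    static double interior[4096];

    static void compute_interior(void)
    {
        int i;
        for (i = 0; i < 4096; i++)       /* stand-in for real work */
            interior[i] = interior[i] * 1.0001 + 1.0;
    }

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Request req;
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank < 2) {
            /* Post the large transfer, then compute while it proceeds */
            if (rank == 0)
                MPI_Isend(halo, HALO_DOUBLES, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
            else
                MPI_Irecv(halo, HALO_DOUBLES, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

            compute_interior();          /* CPU works during the transfer */
            MPI_Wait(&req, &st);
        }

        MPI_Finalize();
        return 0;
    }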
New MPI Performance Models
• Possible striping of a single message
  • Bandwidth ~ n × 2 Gbyte/s (n = number of links)
• Small dependence on user large pages
• Collectives
• Eager limit
MPI Single Messages: Large Page vs. Small Page
[Chart: rate (Mbyte/s) vs. message length (bytes) for large pages (LP) and small pages (SP); p655 1.5 GHz, HPS, RDMA.]
MPI Single Messages: LP vs. SP
[Chart: time (microseconds) vs. message length (bytes) for large pages (LP) and small pages (SP).]
MPI Single Messages
[Chart: rate (Mbyte/s) vs. message length (bytes).]
• Message striping at ~500000 bytes
• Bandwidth:
  • 1.5 Gbyte/s at “modest” sizes
  • 3 Gbyte/s at large sizes
• Small sensitivity to page size
MPI Single Messages: Eager Limits
[Chart: time (microseconds) vs. message length (bytes) for eager limits of 4096 and 0.]
Eager Limit
[Chart: time (microseconds) vs. message length (bytes) for eager limits of 4096 and 0.]
• Raising the eager limit reduces small-message latency from 20 microseconds to 7 microseconds
MPI Single Messages: RDMA vs. no RDMA
[Chart: rate (Mbyte/s) vs. message length (bytes) with and without RDMA; p655 1.5 GHz, HPS.]
RDMA
[Chart: rate (Mbyte/s) vs. message length (bytes).]
• Message striping starts at 500000 bytes
• Adjust with MP_BULK_MIN_MSG_SIZE
MPI Collectives
• Take special advantage of 64-bit addressing
  • More aggressive algorithms
• Example: MPI_Bcast with 32 tasks is 25% faster with 64-bit addressing (a minimal broadcast sketch follows)
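For reference, a minimal broadcast sketch; the 250000-byte length is illustrative, chosen to fall within the range plotted on the following charts. The point is that only a 64-bit build (e.g. compiled with mpcc_r -q64) picks up the optimized collective algorithms.

    /* Minimal sketch: a broadcast that benefits from the optimized
     * collectives when built 64-bit (e.g. mpcc_r -q64 bcast.c). */
    #include <mpi.h>

    #define LEN 250000                 /* illustrative message length */

    static char buf[LEN];

    int main(int argc, char **argv)
    {
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            buf[0] = 'x';              /* root supplies the data */

        /* A 32-bit build falls back to the older, slower algorithm */
        MPI_Bcast(buf, LEN, MPI_CHAR, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }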
MPI_Bcast: 32-bit vs. 64-bit
[Chart: time (microseconds) vs. message length (bytes) for 64- and 32-task MPI_Bcast, each in 64-bit and 32-bit builds; p655 1.5 GHz, HPS, RDMA.]
MPI_Bcast: 32-bit vs. 64-bit (32 tasks)
[Chart: time (microseconds) vs. message length (bytes), 64-bit vs. 32-bit builds; p655 1.5 GHz, HPS, RDMA.]
MPI_Bcast: 32-bit vs. 64-bit (64 tasks)
[Chart: time (microseconds) vs. message length (bytes), 64-bit vs. 32-bit builds; p655 1.5 GHz, HPS, RDMA.]
Application Considerations
• MPI-LAPI has a different architecture than the prior version
• Bulk transfer is used for larger messages (>500 Kbyte)
• Set MP_SINGLE_THREAD=yes (if possible)
• 32-bit applications:
  • Will not use LAPI shared memory for large messages; convert to 64-bit
  • Will not use the MPI collective communication optimizations; convert to 64-bit
• Applications that use signal handlers may require some changes
• MPL is no longer supported
MPI Environment Variables
Environment Variable       Recommended Value
MP_EUILIB                  us
MP_EUIDEVICE               sn_single, sn_all
MP_SHARED_MEMORY           yes
MP_USE_BULK_XFER           yes
MP_SINGLE_THREAD           yes (if possible)
MP_BULK_MIN_MSG_SIZE       128 kbyte (default)
Bandwidth Structure: Performance Aspects
• Shared memory
  • 3 Gbyte/s on POWER4
  • 4 Gbyte/s on POWER5
• Large pages
  • Reduced effect
• Bulk transfer
  • 3 Gbyte/s with two adapters
• Eager limit
  • Latency reduced from 15-20 microsec. to 5-7 microsec.
• Single threaded
  • 1-2 microsec. further latency reduction
HPS Performance Summary
[Chart: rate (Mbyte/s) vs. message length (bytes) for large pages (LP) and small pages (SP).]
• High asymptotic peak bandwidth
  • ~4x vs. Colony
• Extra “kink” in the performance curve
  • Onset of bulk transfer
• Improved small-message performance
• Low latency
  • 5 microsec.
Prescription For Use of RDMA
• Add to the LoadLeveler configuration file:
  • /usr/lpp/LoadL/full/samples/LoadL_config
  • SCHEDULE_BY_RESOURCES = RDMA
• Verification:
  • Run with MP_INFOLEVEL=2
  • stderr must contain the text: “Bulk Transfer is enabled”
• Running: request bulk transfer in the LoadLeveler job file (sample sketch below):
  • # @ bulkxfer = yes
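To tie the pieces together, here is a minimal sketch of a LoadLeveler job command file that requests bulk transfer. The node counts, network statement, file names, and executable are illustrative assumptions and should be adapted to the site's configuration.

    # Minimal sketch of a LoadLeveler job command file (illustrative
    # resource values; adapt to your site).
    # @ job_type       = parallel
    # @ node           = 2
    # @ tasks_per_node = 8
    # @ network.MPI    = sn_all,not_shared,US
    # @ bulkxfer       = yes
    # @ output         = job.$(jobid).out
    # @ error          = job.$(jobid).err
    # @ queue
    export MP_USE_BULK_XFER=yes
    export MP_INFOLEVEL=2    # stderr should report "Bulk Transfer is enabled"
    ./a.out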
Summary
• HPS
  • Bandwidth
    • Bulk transfers: 4x higher bandwidth for large messages
  • Latency: 5 microsec.