High-Performance Clusters, Part 1: Performance
David E. Culler
Computer Science Division
U.C. Berkeley
PODC/SPAA Tutorial
Sunday, June 28, 1998
Clusters have Arrived
• … the SPAA / PODC testbed going forward
Berkeley NOW
• http://now.cs.berkeley.edu/
NOW’s Commercial Version
• 240 processors, Active Messages, Myrinet, ...
Berkeley Massive Storage Cluster
• serving Fine Art at www.thinker.org/imagebase/
• or try
Commercial Scene
What’s a Cluster?
• Collection of independent computer systems working together as if they were a single system.
• Coupled through a scalable, high-bandwidth, low-latency interconnect.
Outline for Part 1
• Why Clusters NOW?
• What is the Key Challenge?
• How is it overcome?
• How much performance?
• Where is it going?
Why Clusters?
• Capacity
• Availability
• Scalability
• Cost-effectiveness
Traditional Availability Clusters
• VAX Clusters => IBM Sysplex => Wolfpack
[Figure: clients connected through an interconnect to Server A and Server B; both servers attach to disk arrays A and B for failover]
Why HP Clusters NOW?
• Time to market => performance
• Technology
• internet services
[Chart: SPECint/SPECfp performance (0-300) vs. year (1986-1994), contrasting raw node performance with node performance in large systems; the gap is the engineering lag time]
Technology Breakthrough
• Killer micro => Killer switch
[Chart: GFLOPS (0.1-1000, log scale) vs. year (1984-1996): MPPs (CM-2, CM-5, Paragon XP/S (6768), Cray T3D, ASCI Red) overtaking Cray vector machines (X-MP, Y-MP, C90, T90)]
• Single-chip building block for scalable networks:
– high bandwidth
– low latency
– very reliable
Opportunity: Rethink System Design
• Remote memory and processor are closer than local disks!
• Networking Stacks ?
• Virtual Memory ?
• File system design ?
• It all looks like parallel programming
• Huge demand for scalable, available, dedicated internet servers
– big I/O, big compute
Example: Traditional File System
[Figure: clients, each with a small local private file cache, connected over a fast channel (HIPPI) to a server holding a global shared file cache and RAID disk storage; the server is the bottleneck]
• Expensive
• Complex
• Non-scalable
• Single point of failure
• Server resources at a premium
• Client resources poorly utilized
Truly Distributed File System
• VM: page to remote memory
[Figure: processor nodes, each with its own file cache, connected by a scalable low-latency communication network]
• Local cache plus cluster caching across nodes
• Network RAID: stripe data across the cluster's disks
• G = node communication BW / disk BW
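– As an illustration with assumed numbers (not from the slide): a 160 MB/s network link and 10 MB/s disks give G = 16, i.e. one node's link can carry the streaming bandwidth of 16 remote disks.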
Fast Communication Challenge
• Fast processors and fast networks
• The time is spent in crossing between them
[Figure: nodes ("killer platforms") attached to a "killer switch"; each message crosses communication software and network interface hardware on both ends, on time scales ranging from ns through µs to ms]
Opening: Intelligent Network Interfaces
• Dedicated Processing power and storage embedded in the Network Interface
• An I/O card today
• Tomorrow on chip?
[Figure: Sun Ultra 170 node: processor, cache, and memory, plus a Myricom NIC (with its own processor and memory) on the 50 MB/s S-Bus I/O bus, attached to the 160 MB/s Myricom network]
Our Attack: Active Messages
• Request / Reply small active messages (RPC)
• Bulk-Transfer (store & get)
• Highly optimized communication layer on a range of HW
[Figure: a request message invokes a handler on arrival at the remote node; the reply message invokes a handler back at the original sender]
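A toy, single-process sketch of the request/reply pattern: a small in-memory queue stands in for the network, and handlers run as messages are drained. All names (am_send, am_poll, etc.) are illustrative, not the actual Berkeley AM interface.

```c
/* Toy sketch of the Active Message request/reply pattern in one process.
 * A tiny queue stands in for the wire; a real AM layer delivers these
 * over the network and runs handlers as messages arrive. */
#include <stdio.h>

typedef void (*am_handler_t)(int src_node, int arg);

struct msg { int src, dst, arg; am_handler_t handler; };
static struct msg queue[16];
static int head, tail;

/* "Send" a small message naming the handler to run at the destination. */
static void am_send(int src, int dst, am_handler_t h, int arg)
{
    queue[tail++] = (struct msg){ src, dst, arg, h };
}

/* "Poll" the network: run each handler as its message arrives. */
static void am_poll(void)
{
    while (head < tail) {
        struct msg m = queue[head++];
        m.handler(m.src, m.arg);
    }
}

static void reply_handler(int src, int value)   /* runs on the requester */
{
    printf("reply from node %d: %d\n", src, value);
}

static void request_handler(int src, int key)   /* runs on the remote node */
{
    am_send(1, src, reply_handler, key * 2);    /* compute and reply */
}

int main(void)
{
    am_send(0, 1, request_handler, 21);  /* node 0 requests from node 1 */
    am_poll();                           /* handlers fire on arrival */
    return 0;
}
```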
NOW System Architecture
• Layered architecture, bottom up:
– fast commercial switch (Myrinet)
– network interface hardware and communication software on each UNIX workstation
– Global-layer UNIX: resource management, Network RAM, distributed files, process migration
– programming interfaces: Sockets, Split-C, MPI, HPF, vSM
– applications: large sequential apps and parallel apps
Cluster Communication Performance
LogP
[Figure: P processor/memory modules attached to an interconnection network; capacity is limited to L/g messages in flight per processor]
• L: latency of sending a (small) message between modules
• o: overhead felt by the processor on sending or receiving a message
• g: gap between successive sends or receives (1/message rate)
• P: number of processor/memory modules
• Round-trip time: 2 × (2o + L)
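– e.g., with assumed illustrative values o = 3 µs and L = 5 µs, a request/reply round trip costs 2 × (2·3 + 5) = 22 µs.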
LogP Comparison
• Direct, user-level network access
• Generic AM, FM (UIUC), PM (RWC), U-Net (Cornell), …
[Chart: measured LogP parameters (send overhead Os, receive overhead Or, latency L, gap g), 0-16 µs, for these layers, with small-message latency and 1/bandwidth marked]
MPI over AM: ping-pong bandwidth
[Chart: ping-pong bandwidth (MB/s, 0-70) vs. message size (10 B to 1 MB) for SGI Challenge, Meiko CS2, NOW, IBM SP2, and Cray T3D]
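For reference, a minimal MPI ping-pong kernel of the kind behind such curves; this is a sketch of the common methodology (run with two ranks), not the actual NOW benchmark code.

```c
/* Ping-pong bandwidth probe: rank 0 bounces size-byte messages off
 * rank 1; bandwidth counts both directions. Run with: mpirun -np 2 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, iters = 100, size = 1 << 20;   /* 1 MB messages */
    char *buf = malloc(size);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double dt = MPI_Wtime() - t0;

    if (rank == 0)   /* two messages cross the wire per iteration */
        printf("bandwidth: %.1f MB/s\n", 2.0 * iters * size / dt / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```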
MPI over AM: start-up
[Chart: MPI message start-up cost (microseconds, 0-90) for SGI Challenge, Meiko, NOW, IBM SP2, and Cray T3D]
Cluster Application Performance: NAS Parallel Benchmarks
NPB2: NOW vs SP2
NPB2: NOW vs SGI Origin
[Charts: speedup (0-40) vs. nodes (0-30) for LU, MG, and SP against ideal, on NOW (left) and SGI Origin 2000 (right)]
Where the Time Goes: LU
[Charts: cumulative time breakdown of LU (seconds, 0-3000) on Origin 2000 and NOW at 1-32 processors: computation vs. communication, against ideal scaling]
Where the time goes: SP
[Charts: cumulative time breakdown of SP (seconds, 0-3500) on SGI Origin and NOW at 1, 4, 9, 16, and 25 processors: computation vs. communication, against ideal scaling]
LU Working Set
[Chart: miss rate (%, 0-14) vs. cache size (KB, 1-10000, log scale) for the 4-node run]
• 4-processor run shows the traditional miss-rate curve for small caches
• Sharp knee above 256 KB per node (1 MB total)
LU Working Set (CPS scaling)
[Chart: miss rate (%, 0-14) vs. cache size (KB, log scale) for 4-, 8-, 16-, and 32-node runs]
• Knee appears once the global (aggregate) cache exceeds 1 MB
• Each machine size therefore sees its drop in miss rate at a specific per-node cache size
Application Sensitivity to Communication Performance
Adjusting L, o, and g (and G) in situ
• Martin et al., ISCA '97
[Figure: AM library on the host workstation and firmware on the Myrinet LANai NIC, instrumented to dial in each parameter:]
• o: stall the Ultra host on each message write (send) and read (receive)
• g: delay the LANai after message injection (after each fragment, for bulk transfers)
• L: defer marking a message as valid at the receiver until Rx + L
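A sketch of how the host-side overhead knob might look (illustrative only; the actual experiment modified the AM library and LANai firmware). The key point: the added cost is a busy-wait, so the processor itself pays it and cannot overlap it, which is what distinguishes overhead from latency.

```c
/* Inflate send overhead o by busy-waiting after each send. Sleeping
 * instead would model latency, since the CPU could overlap the wait. */
#include <stdio.h>
#include <time.h>

static double now_us(void)                    /* wall clock in µs */
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

static double added_overhead_us = 10.0;       /* the dialed-in extra o */

static void stall_for_overhead(void)
{
    double start = now_us();
    while (now_us() - start < added_overhead_us)
        ;                                     /* burn host CPU cycles */
}

int main(void)
{
    double t0 = now_us();
    /* am_send(...) would go here in the instrumented layer */
    stall_for_overhead();
    printf("send cost inflated by %.1f us\n", now_us() - t0);
    return 0;
}
```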
Calibration
[Charts: measured o, g, and L (µs) vs. the desired L, o, and g settings (0-100 µs); each measured parameter tracks its desired value]
Split-C Applications
Program      Description             Input                      P=16   P=32   Interval (µs)  Msg type
Radix        Integer radix sort      16M 32-bit keys            13.7    7.8        6.1       msg
EM3D(write)  Electro-magnetic        80K nodes, 40% remote      88.6   38.0        8.0       write
EM3D(read)   Electro-magnetic        80K nodes, 40% remote     230.0  114.0       13.8       read
Sample       Integer sample sort     32M 32-bit keys            24.7   13.2       13.0       msg
Barnes       Hierarchical N-body     1 million bodies           77.9   43.2       52.8       cached read
P-Ray        Ray tracer              1 million pixel image      23.5   17.9      156.2       cached read
MurPHI       Protocol verification   SCI protocol, 2 proc       67.7   35.3      183.5       bulk
Connect      Connected components    4M nodes, 2-D mesh, 30%     2.3    1.2      212.6       BSP
NOW-sort     Disk-to-disk sort       32M 100-byte records      127.2   56.9      817.4       I/O
RadB         Bulk version of Radix   16M 32-bit keys             7.0    3.7      852.7       bulk
Sensitivity to Overhead
[Chart: slowdown (0-60) vs. added overhead (µs, 0-110) for Barnes, Radix, EM3D(write), EM3D(read), Sample, P-Ray, Murphi, Connect, NOW-sort, and RadB]
Comparative Impact
[Charts: slowdown (0-60) vs. added overhead, gap, and latency (each 0-100 µs), side by side for the same applications]
Sensitivity to bulk BW (1/G)
[Chart: slowdown (0-2.5) vs. bulk bandwidth (MB/s, 0-40) for the same ten applications]
Cluster Communication Performance
• Overhead, overhead, overhead
– applications are hypersensitive: added overhead also increases serialization
• Sensitivity to gap reflects bursty communication
• Surprisingly latency-tolerant
• Plenty of room for overhead improvement
• How sensitive are distributed systems?
Extrapolating to Low Overhead
[Chart: slowdown (0-5) vs. overhead (0-15 µs) for the same applications, extrapolating into the low-overhead regime]
Direct Memory Messaging
• A send region and a receive region for each end of a communication channel
• Writes pass through the local send region into the remote receive region
[Figure: nodes i, j, and k each map per-channel send (S) and receive (R) regions into their virtual address spaces; physical-address and I/O mappings pair each send region with its peer's receive region]
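A toy model of the programming idea, with plain memory standing in for the network mapping (illustrative; a real Memory Channel- or SHRIMP-style system wires the send-to-receive mapping through page tables and the network interface).

```c
/* A "send region" the sender writes through, and a "receive region"
 * where the data appears. Here both alias one buffer in one process;
 * in the real design the mapping crosses the interconnect. */
#include <stdio.h>
#include <string.h>

#define REGION_BYTES 256

static char recv_region[REGION_BYTES];   /* the remote node's R region */
static char *send_region = recv_region;  /* local S region, aliased here */

int main(void)
{
    /* Sender: an ordinary store into the send region... */
    strcpy(send_region, "hello through the memory-mapped channel");

    /* Receiver: ...shows up in its receive region, no explicit recv(). */
    printf("receiver sees: %s\n", recv_region);
    return 0;
}
```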
Direct Memory Interconnects
• DEC Memory Channel
– ~3 µs end-to-end
– ~1 µs o and L
• SCI
• SGI
• SHRIMP (Princeton)
[Figure: AlphaServer SMPs (Alpha processors, caches, memory, and receive DMA on a 33 MHz PCI bus) joined by the 100 MB/s MEMORY CHANNEL interconnect through a link interface with transmit/receive counters and a page control table (PCT)]
Scalability, Availability, and Performance
• Scale disk, memory, and processors independently
• A random node serves each query; all nodes search
• On a (hardware or software) failure, lose random columns of the index
• On overload, drop random rows
[Figure: Inktomi: front-end nodes (FE) and processor nodes connected by Myrinet, serving a 100-million-document index]
Summary
• Performance => Generality (see Part 2)
• From Technology "Shift" to Technology "Trend"
• Cluster communication becoming cheap
– Gigabit Ethernet
• System Area Networks becoming commodity
– Myricom OEM, Tandem/Compaq ServerNet, SGI, HAL, Sun
• Improvements in interconnect BW
– gigabyte per second and beyond
• Bus connections improving
– PCI, ePCI, Pentium II cluster slot, …
• Operating system out of the way
– VIA
Advice
• Clusters are cheap, easy to build, flexible, powerful, general-purpose, and fun
• Everybody doing SPAA or PODC should have one to try out their ideas
• You can use the Berkeley NOW through NPACI – www.npaci.edu