High-Performance Clusters, Part 1: Performance
David E. Culler
Computer Science Division
U.C. Berkeley
PODC/SPAA Tutorial
Sunday, June 28, 1998
Clusters have Arrived
• … the SPAA / PODC testbed going forward
Berkeley NOW
• http://now.cs.berkeley.edu/
NOW’s Commercial Version
• 240 processors, Active Messages, Myrinet, ...
Berkeley Massive Storage Cluster
• serving Fine Art at www.thinker.org/imagebase/
• or try
Commercial Scene
What’s a Cluster?
• Collection of independent computer systems working together as if they were a single system.
• Coupled through a scalable, high-bandwidth, low-latency interconnect.
Outline for Part 1
• Why Clusters NOW?
• What is the Key Challenge?
• How is it overcome?
• How much performance?
• Where is it going?
Why Clusters?
• Capacity
• Availability
• Scalability
• Cost-effectiveness
Traditional Availability Clusters
• VAX Clusters => IBM Sysplex => Wolfpack
[Figure: clients connected through an interconnect to Server A and Server B; both servers attach to disk arrays A and B for failover]
Why HP Clusters NOW?
• Time to market => performance
• Technology
• internet services
[Chart: SPECint/SPECfp performance (0-300) vs. year (1986-1994), contrasting raw node performance with node performance in large systems; the gap is the engineering lag time]
Technology Breakthrough
• Killer micro => Killer switch
[Chart: GFLOPS (0.1-1000, log scale) vs. year (1984-1996): MPPs (CM-2, CM-5, Paragon XP/S (6768), Cray T3D, ASCI Red) overtaking Cray vector machines (X-MP, Y-MP, C90, T90)]
• Single-chip building block for scalable networks:
– high bandwidth
– low latency
– very reliable
Opportunity: Rethink System Design
• Remote memory and processor are closer than local disks!
• Networking Stacks ?
• Virtual Memory ?
• File system design ?
• It all looks like parallel programming
• Huge demand for scalable, available, dedicated internet servers
– big I/O, big compute
Example: Traditional File System
[Figure: clients, each with a small local private file cache, connected over a fast channel (HIPPI) to a server holding a global shared file cache and RAID disk storage; the server is the bottleneck]
• Expensive
• Complex
• Non-scalable
• Single point of failure
• Server resources at a premium
• Client resources poorly utilized
Truly Distributed File System
• VM: page to remote memory
[Figure: processor nodes, each with its own file cache, connected by a scalable low-latency communication network]
• Local cache plus cluster caching across nodes
• Network RAID: stripe data across the cluster's disks
• G = node communication BW / disk BW
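– As an illustration with assumed numbers (not from the slide): a 160 MB/s network link and 10 MB/s disks give G = 16, i.e. one node's link can carry the streaming bandwidth of 16 remote disks.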
Fast Communication Challenge
• Fast processors and fast networks
• The time is spent in crossing between them
[Figure: nodes ("killer platforms") attached to a "killer switch"; each message crosses communication software and network interface hardware on both ends, on time scales ranging from ns through µs to ms]
Opening: Intelligent Network Interfaces
• Dedicated Processing power and storage embedded in the Network Interface
• An I/O card today
• Tomorrow on chip?
[Figure: Sun Ultra 170 node: processor, cache, and memory, plus a Myricom NIC (with its own processor and memory) on the 50 MB/s S-Bus I/O bus, attached to the 160 MB/s Myricom network]
Our Attack: Active Messages
• Request / Reply small active messages (RPC)
• Bulk-Transfer (store & get)
• Highly optimized communication layer on a range of HW
[Figure: a request message invokes a handler on arrival at the remote node; the reply message invokes a handler back at the original sender]
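A toy, single-process sketch of the request/reply pattern: a small in-memory queue stands in for the network, and handlers run as messages are drained. All names (am_send, am_poll, etc.) are illustrative, not the actual Berkeley AM interface.

```c
/* Toy sketch of the Active Message request/reply pattern in one process.
 * A tiny queue stands in for the wire; a real AM layer delivers these
 * over the network and runs handlers as messages arrive. */
#include <stdio.h>

typedef void (*am_handler_t)(int src_node, int arg);

struct msg { int src, dst, arg; am_handler_t handler; };
static struct msg queue[16];
static int head, tail;

/* "Send" a small message naming the handler to run at the destination. */
static void am_send(int src, int dst, am_handler_t h, int arg)
{
    queue[tail++] = (struct msg){ src, dst, arg, h };
}

/* "Poll" the network: run each handler as its message arrives. */
static void am_poll(void)
{
    while (head < tail) {
        struct msg m = queue[head++];
        m.handler(m.src, m.arg);
    }
}

static void reply_handler(int src, int value)   /* runs on the requester */
{
    printf("reply from node %d: %d\n", src, value);
}

static void request_handler(int src, int key)   /* runs on the remote node */
{
    am_send(1, src, reply_handler, key * 2);    /* compute and reply */
}

int main(void)
{
    am_send(0, 1, request_handler, 21);  /* node 0 requests from node 1 */
    am_poll();                           /* handlers fire on arrival */
    return 0;
}
```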
NOW System Architecture
• Layered architecture, bottom up:
– fast commercial switch (Myrinet)
– network interface hardware and communication software on each UNIX workstation
– Global-layer UNIX: resource management, Network RAM, distributed files, process migration
– programming interfaces: Sockets, Split-C, MPI, HPF, vSM
– applications: large sequential apps and parallel apps
Cluster Communication Performance
LogP
[Figure: P processor/memory modules attached to an interconnection network; capacity is limited to L/g messages in flight per processor]
• L: latency of sending a (small) message between modules
• o: overhead felt by the processor on sending or receiving a message
• g: gap between successive sends or receives (1/message rate)
• P: number of processor/memory modules
• Round-trip time: 2 × (2o + L)
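– e.g., with assumed illustrative values o = 3 µs and L = 5 µs, a request/reply round trip costs 2 × (2·3 + 5) = 22 µs.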
LogP Comparison
• Direct, user-level network access
• Generic AM, FM (UIUC), PM (RWC), U-Net (Cornell), …
[Chart: measured LogP parameters (send overhead Os, receive overhead Or, latency L, gap g), 0-16 µs, for these layers, with small-message latency and 1/bandwidth marked]
MPI over AM: ping-pong bandwidth
[Chart: ping-pong bandwidth (MB/s, 0-70) vs. message size (10 B to 1 MB) for SGI Challenge, Meiko CS2, NOW, IBM SP2, and Cray T3D]
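For reference, a minimal MPI ping-pong kernel of the kind behind such curves; this is a sketch of the common methodology (run with two ranks), not the actual NOW benchmark code.

```c
/* Ping-pong bandwidth probe: rank 0 bounces size-byte messages off
 * rank 1; bandwidth counts both directions. Run with: mpirun -np 2 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, iters = 100, size = 1 << 20;   /* 1 MB messages */
    char *buf = malloc(size);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double dt = MPI_Wtime() - t0;

    if (rank == 0)   /* two messages cross the wire per iteration */
        printf("bandwidth: %.1f MB/s\n", 2.0 * iters * size / dt / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```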
MPI over AM: start-up
[Chart: MPI message start-up cost (microseconds, 0-90) for SGI Challenge, Meiko, NOW, IBM SP2, and Cray T3D]
Cluster Application Performance: NAS Parallel Benchmarks
NPB2: NOW vs SP2
NPB2: NOW vs SGI Origin
[Charts: speedup (0-40) vs. nodes (0-30) for LU, MG, and SP against ideal, on NOW (left) and SGI Origin 2000 (right)]
Where the Time Goes: LU
[Charts: cumulative time breakdown of LU (seconds, 0-3000) on Origin 2000 and NOW at 1-32 processors: computation vs. communication, against ideal scaling]
Where the time goes: SP
[Charts: cumulative time breakdown of SP (seconds, 0-3500) on SGI Origin and NOW at 1, 4, 9, 16, and 25 processors: computation vs. communication, against ideal scaling]
LU Working Set
[Chart: miss rate (%, 0-14) vs. cache size (KB, 1-10000, log scale) for the 4-node run]
• 4-processor run shows the traditional miss-rate curve for small caches
• Sharp knee above 256 KB per node (1 MB total)
LU Working Set (CPS scaling)
[Chart: miss rate (%, 0-14) vs. cache size (KB, log scale) for 4-, 8-, 16-, and 32-node runs]
• Knee appears once the global (aggregate) cache exceeds 1 MB
• Each machine size therefore sees its drop in miss rate at a specific per-node cache size
Application Sensitivity to Communication Performance
Adjusting L, o, and g (and G) in situ
• Martin et al., ISCA '97
[Figure: AM library on the host workstation and firmware on the Myrinet LANai NIC, instrumented to dial in each parameter:]
• o: stall the Ultra host on each message write (send) and read (receive)
• g: delay the LANai after message injection (after each fragment, for bulk transfers)
• L: defer marking a message as valid at the receiver until Rx + L
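A sketch of how the host-side overhead knob might look (illustrative only; the actual experiment modified the AM library and LANai firmware). The key point: the added cost is a busy-wait, so the processor itself pays it and cannot overlap it, which is what distinguishes overhead from latency.

```c
/* Inflate send overhead o by busy-waiting after each send. Sleeping
 * instead would model latency, since the CPU could overlap the wait. */
#include <stdio.h>
#include <time.h>

static double now_us(void)                    /* wall clock in µs */
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

static double added_overhead_us = 10.0;       /* the dialed-in extra o */

static void stall_for_overhead(void)
{
    double start = now_us();
    while (now_us() - start < added_overhead_us)
        ;                                     /* burn host CPU cycles */
}

int main(void)
{
    double t0 = now_us();
    /* am_send(...) would go here in the instrumented layer */
    stall_for_overhead();
    printf("send cost inflated by %.1f us\n", now_us() - t0);
    return 0;
}
```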
Calibration
[Charts: measured o, g, and L (µs) vs. the desired L, o, and g settings (0-100 µs); each measured parameter tracks its desired value]
Split-C Applications
Program      Description             Input                      P=16   P=32   Interval (µs)  Msg type
Radix        Integer radix sort      16M 32-bit keys            13.7    7.8        6.1       msg
EM3D(write)  Electro-magnetic        80K nodes, 40% remote      88.6   38.0        8.0       write
EM3D(read)   Electro-magnetic        80K nodes, 40% remote     230.0  114.0       13.8       read
Sample       Integer sample sort     32M 32-bit keys            24.7   13.2       13.0       msg
Barnes       Hierarchical N-body     1 million bodies           77.9   43.2       52.8       cached read
P-Ray        Ray tracer              1 million pixel image      23.5   17.9      156.2       cached read
MurPHI       Protocol verification   SCI protocol, 2 proc       67.7   35.3      183.5       bulk
Connect      Connected components    4M nodes, 2-D mesh, 30%     2.3    1.2      212.6       BSP
NOW-sort     Disk-to-disk sort       32M 100-byte records      127.2   56.9      817.4       I/O
RadB         Bulk version of Radix   16M 32-bit keys             7.0    3.7      852.7       bulk
Sensitivity to Overhead
[Chart: slowdown (0-60) vs. added overhead (µs, 0-110) for Barnes, Radix, EM3D(write), EM3D(read), Sample, P-Ray, Murphi, Connect, NOW-sort, and RadB]
Comparative Impact
[Charts: slowdown (0-60) vs. added overhead, gap, and latency (each 0-100 µs), side by side for the same applications]
Sensitivity to bulk BW (1/G)
[Chart: slowdown (0-2.5) vs. bulk bandwidth (MB/s, 0-40) for the same ten applications]
Cluster Communication Performance
• Overhead, overhead, overhead
– applications are hypersensitive: added overhead also increases serialization
• Sensitivity to gap reflects bursty communication
• Surprisingly latency-tolerant
• Plenty of room for overhead improvement
• How sensitive are distributed systems?
Extrapolating to Low Overhead
[Chart: slowdown (0-5) vs. overhead (0-15 µs) for the same applications, extrapolating into the low-overhead regime]
Direct Memory Messaging
• A send region and a receive region for each end of a communication channel
• Writes pass through the local send region into the remote receive region
[Figure: nodes i, j, and k each map per-channel send (S) and receive (R) regions into their virtual address spaces; physical-address and I/O mappings pair each send region with its peer's receive region]
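A toy model of the programming idea, with plain memory standing in for the network mapping (illustrative; a real Memory Channel- or SHRIMP-style system wires the send-to-receive mapping through page tables and the network interface).

```c
/* A "send region" the sender writes through, and a "receive region"
 * where the data appears. Here both alias one buffer in one process;
 * in the real design the mapping crosses the interconnect. */
#include <stdio.h>
#include <string.h>

#define REGION_BYTES 256

static char recv_region[REGION_BYTES];   /* the remote node's R region */
static char *send_region = recv_region;  /* local S region, aliased here */

int main(void)
{
    /* Sender: an ordinary store into the send region... */
    strcpy(send_region, "hello through the memory-mapped channel");

    /* Receiver: ...shows up in its receive region, no explicit recv(). */
    printf("receiver sees: %s\n", recv_region);
    return 0;
}
```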
Direct Memory Interconnects
• DEC Memory Channel
– ~3 µs end-to-end
– ~1 µs o and L
• SCI
• SGI
• SHRIMP (Princeton)
[Figure: AlphaServer SMPs (Alpha processors, caches, memory, and receive DMA on a 33 MHz PCI bus) joined by the 100 MB/s MEMORY CHANNEL interconnect through a link interface with transmit/receive counters and a page control table (PCT)]
Scalability, Availability, and Performance
• Scale disk, memory, and processors independently
• A random node serves each query; all nodes search
• On a (hardware or software) failure, lose random columns of the index
• On overload, drop random rows
[Figure: Inktomi: front-end nodes (FE) and processor nodes connected by Myrinet, serving a 100-million-document index]
Summary
• Performance => Generality (see Part 2)
• From Technology "Shift" to Technology "Trend"
• Cluster communication becoming cheap
– Gigabit Ethernet
• System Area Networks becoming commodity
– Myricom OEM, Tandem/Compaq ServerNet, SGI, HAL, Sun
• Improvements in interconnect BW
– gigabyte per second and beyond
• Bus connections improving
– PCI, ePCI, Pentium II cluster slot, …
• Operating system out of the way
– VIA
Advice
• Clusters are cheap, easy to build, flexible, powerful, general-purpose, and fun
• Everybody doing SPAA or PODC should have one to try out their ideas
• You can use the Berkeley NOW through NPACI – www.npaci.edu