High-Performance Clusters part 1: Performance

Transcript
Page 1: High-Performance Clusters  part 1: Performance

High-Performance Clusters part 1: Performance

David E. Culler

Computer Science Division

U.C. Berkeley

PODC/SPAA Tutorial

Sunday, June 28, 1998

Page 2: High-Performance Clusters  part 1: Performance


Clusters have Arrived

• … the SPAA / PODC testbed going forward

Page 3: High-Performance Clusters  part 1: Performance


Berkeley NOW

• http://now.cs.berkeley.edu/

Page 4: High-Performance Clusters  part 1: Performance


NOW’s Commercial Version

• 240 processors, Active Messages, Myrinet, ...

Page 5: High-Performance Clusters  part 1: Performance


Berkeley Massive Storage Cluster

• serving Fine Art at www.thinker.org/imagebase/

• or try

Page 6: High-Performance Clusters  part 1: Performance


Commercial Scene

Page 7: High-Performance Clusters  part 1: Performance


What’s a Cluster?

• Collection of independent computer systems working together as if a single system.

• Coupled through a scalable, high bandwidth, low latency interconnect.

Page 8: High-Performance Clusters  part 1: Performance


Outline for Part 1

• Why Clusters NOW?

• What is the Key Challenge?

• How is it overcome?

• How much performance?

• Where is it going?

Page 9: High-Performance Clusters  part 1: Performance


Why Clusters?

• Capacity

• Availability

• Scalability

• Cost-effectiveness

Page 10: High-Performance Clusters  part 1: Performance


Traditional Availability Clusters

• VAX Clusters => IBM sysplex => Wolf Pack

[Diagram: clients connected over an interconnect to Server A and Server B, with both servers attached to disk array A and disk array B]

Page 11: High-Performance Clusters  part 1: Performance


Why HP Clusters NOW?

• Time to market => performance

• Technology

• internet services

[Chart: SpecInt and SpecFP node performance in large systems vs. year (1986-1994), showing the engineering lag time between a processor's release and its appearance in large systems]

Page 12: High-Performance Clusters  part 1: Performance


Technology Breakthrough

• Killer micro => Killer switch

[Chart: peak performance (GFLOPS, log scale 0.1-1000) vs. year (1984-1996) for MPPs (CM-2, CM-5, Paragon XP/S (6768), Cray T3D, ASCI Red) and Cray vector machines (X-MP, Y-MP, C90, T90)]

Single-chip building block for scalable networks:

• high bandwidth

• low latency

• very reliable

Page 13: High-Performance Clusters  part 1: Performance


Opportunity: Rethink System Design

• Remote memory and processor are closer than local disks!

• Networking Stacks ?

• Virtual Memory ?

• File system design ?

• It all looks like parallel programming

• Huge demand for scalable, available, dedicated internet servers

– big I/O, big compute

Page 14: High-Performance Clusters  part 1: Performance


Example: Traditional File System

[Diagram: clients, each with a small local private file cache, connected over a fast channel (HPPI) to a server with a global shared file cache and RAID disk storage; the server is the bottleneck]

• Expensive

• Complex

• Non-Scalable

• Single point of failure

• Server resources at a premium

• Client resources poorly utilized

Page 15: High-Performance Clusters  part 1: Performance


Truly Distributed File System

• VM: page to remote memory

[Diagram: nodes, each with a processor and file cache, connected by a scalable low-latency communication network; callouts show the local cache, cluster caching, and network RAID striping]

• G = Node Comm BW / Disk BW
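To make the G parameter concrete, here is a small illustrative calculation; the bandwidth figures are assumptions chosen for the example, not numbers from the talk.

```latex
G = \frac{\text{node communication BW}}{\text{disk BW}}
  \approx \frac{40\ \text{MB/s}}{5\ \text{MB/s}} = 8
```

With G well above 1, fetching a block from a remote node's file cache is several times faster than reading it from the local disk, which is what makes cluster caching and network RAID striping pay off.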

Page 16: High-Performance Clusters  part 1: Performance


Fast Communication Challenge

• Fast processors and fast networks

• The time is spent in crossing between them

[Diagram: nodes ("killer platform"), each with communication software and network interface hardware, connected to the "killer switch"; the ns/µs/ms markers show where the time goes in crossing between processor and network]

Page 17: High-Performance Clusters  part 1: Performance


Opening: Intelligent Network Interfaces

• Dedicated Processing power and storage embedded in the Network Interface

• An I/O card today

• Tomorrow on chip?

[Diagram: Sun Ultra 170 node (processor, cache, memory) with a Myricom NIC on the 50 MB/s S-Bus I/O bus, attached to the 160 MB/s Myricom network, alongside other nodes]

Page 18: High-Performance Clusters  part 1: Performance


Our Attack: Active Messages

• Request / Reply small active messages (RPC)

• Bulk-Transfer (store & get)

• Highly optimized communication layer on a range of HW

[Diagram: a request message invokes a handler on the receiving node; that handler issues a reply, which in turn invokes a handler back on the requester]
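To make the request/reply discipline concrete, below is a minimal single-process sketch in C. The am_* layer is a hypothetical loopback stand-in (a small FIFO in place of the network), not the actual Generic AM API; the point is only the structure: a small message names a handler that runs on arrival, and a request handler issues exactly one reply.

```c
/* Minimal single-process sketch of the Active Messages request/reply
 * discipline.  The am_* layer is a hypothetical loopback stand-in. */
#include <stdint.h>
#include <stdio.h>

typedef void (*am_handler_t)(int src_node, uint32_t arg);

/* Loopback "network": a tiny FIFO of pending messages. */
struct am_msg { int src, dst; am_handler_t h; uint32_t arg; };
static struct am_msg queue[64];
static int head = 0, tail = 0;

static void am_send(int src, int dst, am_handler_t h, uint32_t arg)
{
    queue[tail++ % 64] = (struct am_msg){ src, dst, h, arg };
}
#define am_request(dst, h, arg) am_send(MY_NODE, (dst), (h), (arg))
#define am_reply(dst, h, arg)   am_send(MY_NODE, (dst), (h), (arg))

static void am_poll(void)                 /* drain the network, run handlers */
{
    while (head < tail) {
        struct am_msg m = queue[head++ % 64];
        m.h(m.src, m.arg);
    }
}

enum { MY_NODE = 0 };
static uint32_t local_table[16] = { [3] = 42 };   /* per-node data */
static volatile int done = 0;

static void reply_handler(int src, uint32_t value)    /* runs on the requester */
{
    printf("node %d answered: %u\n", src, value);
    done = 1;
}

static void fetch_handler(int src, uint32_t index)    /* runs on the remote node */
{
    am_reply(src, reply_handler, local_table[index]);
}

int main(void)
{
    am_request(1, fetch_handler, 3);      /* ask "node 1" for table entry 3 */
    while (!done)
        am_poll();                        /* poll until the reply handler fires */
    return 0;
}
```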

Page 19: High-Performance Clusters  part 1: Performance


NOW System Architecture

[Diagram: NOW system architecture]

• Hardware: UNIX workstations, each with network interface hardware and communication software, connected by a fast commercial switch (Myrinet)

• Global Layer UNIX: resource management, Network RAM, distributed files, process migration

• Programming interfaces: Sockets, Split-C, MPI, HPF, vSM

• Applications: large sequential apps and parallel apps

Page 20: High-Performance Clusters  part 1: Performance


Cluster Communication Performance

Page 21: High-Performance Clusters  part 1: Performance


LogP

[Diagram: P processor/memory modules connected by an interconnection network of limited capacity (at most L/g messages in flight per processor)]

• L (latency): time to send a (small) message between modules

• o (overhead): time the processor is occupied sending or receiving a message

• g (gap): minimum time between successive sends or receives (1/rate)

• P: number of processors

• Round-trip time: 2 x (2o + L)
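Spelling out how the parameters compose (the k-message pipelined expression is the standard LogP identity, assuming g ≥ o; it is not on the slide):

```latex
T_{\text{one-way}} = 2o + L,
\qquad
T_{\text{round trip}} = 2\,(2o + L),
\qquad
T_{k\ \text{msgs, pipelined}} = (k-1)\,g + 2o + L
```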

Page 22: High-Performance Clusters  part 1: Performance


LogP Comparison

• Direct, user-level network access

• Generic AM, FM (UIUC), PM (RWC), U-Net (Cornell), …

[Chart: LogP parameters (Os, Or, L, g) in µs, i.e. latency and 1/bandwidth, for the user-level messaging layers compared]

Page 23: High-Performance Clusters  part 1: Performance


MPI over AM: ping-pong bandwidth

[Chart: ping-pong bandwidth (MB/s, 0-70) vs. message size (10 B to 1 MB) for SGI Challenge, Meiko CS2, NOW, IBM SP2, and Cray T3D]
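A ping-pong bandwidth curve of this kind can be measured with a few lines of MPI. The sketch below is a generic version written for illustration, not the benchmark actually used for the plot: rank 0 bounces messages of increasing size off rank 1 and reports MB/s.

```c
/* Generic MPI ping-pong bandwidth sketch: rank 0 sends a message of a
 * given size to rank 1, which echoes it back;
 * bandwidth = 2 * size * iters / elapsed time. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 100;
    for (int size = 16; size <= (1 << 20); size *= 4) {
        char *buf = malloc(size);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double dt = MPI_Wtime() - t0;
        if (rank == 0)
            printf("%8d bytes  %6.1f MB/s\n",
                   size, 2.0 * size * iters / dt / 1e6);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}
```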

Page 24: High-Performance Clusters  part 1: Performance


MPI over AM: start-up

[Chart: MPI start-up cost (microseconds, 0-90) for SGI Challenge, Meiko, NOW, IBM SP2, and Cray T3D]

Page 25: High-Performance Clusters  part 1: Performance


Cluster Application Performance: NAS Parallel Benchmarks

Page 26: High-Performance Clusters  part 1: Performance


NPB2: NOW vs SP2

Page 27: High-Performance Clusters  part 1: Performance


NPB2: NOW vs SGI Origin

[Charts: speedup vs. number of nodes (0-32) for LU, MG, and SP against ideal speedup, on NOW and on SGI Origin 2000]

Page 28: High-Performance Clusters  part 1: Performance


Where the Time Goes: LU

[Charts: time breakdown of LU (seconds, 0-3000) on Origin 2000 and on NOW for 1-32 processors: cumulative, computation, communication, and ideal time]

Page 29: High-Performance Clusters  part 1: Performance


Where the time goes: SP

[Charts: time breakdown of SP (seconds, 0-3500) on SGI and on NOW for 1-25 processors: cumulative, computation, communication, and ideal time]

Page 30: High-Performance Clusters  part 1: Performance


LU Working Set

[Chart: miss rate (%) vs. cache size (1 KB - 10 MB) for the 4-node run]

• 4-processor: traditional curve for small caches

• Sharp knee > 256 KB (1 MB total)

Page 31: High-Performance Clusters  part 1: Performance


LU Working Set (CPS scaling)

[Chart: miss rate (%) vs. cache size (1 KB - 10 MB) for 4-, 8-, 16-, and 32-node runs]

• Knee at global cache > 1 MB

• Each machine experiences the drop in miss rate at a specific size

Page 32: High-Performance Clusters  part 1: Performance


Application Sensitivity to Communication Performance

Page 33: High-Performance Clusters  part 1: Performance


Adjusting L, o, and g (and G) in situ

• Martin, et al., ISCA 97

[Diagram: host workstation and Myrinet LANai NIC running the AM library, instrumented to vary each parameter: o — stall the Ultra on message write and on message read; g — delay the LANai after message injection (after each fragment for bulk transfers); L — defer marking a message as valid until receive time + L]
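The flavor of this instrumentation can be sketched in plain C as delay insertion around the send path. Everything below is a self-contained stand-in (stubbed injection, host-side spinning); in the instrumented system the o and g delays live in the AM library and the LANai firmware, and L is added on the receive side.

```c
/* Self-contained sketch of the delay-insertion idea: extra overhead (o)
 * is added by spinning on the host around each send, and extra gap (g)
 * by enforcing a minimum interval between injections. */
#include <stdio.h>
#include <time.h>

static double now_usec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

static void spin_until(double t)
{
    while (now_usec() < t)
        ;                               /* busy-wait: burns host CPU time */
}

static double added_o = 5.0;            /* extra send overhead, in us */
static double added_g = 20.0;           /* extra per-message gap, in us */
static double t_start, last_inject;

static void raw_inject(int msg_id)      /* stand-in for handing the msg to the NIC */
{
    printf("inject msg %d at t=%.1f us\n", msg_id, now_usec() - t_start);
}

static void send_with_delays(int msg_id)
{
    spin_until(now_usec() + added_o);   /* o: stall the processor on the send */
    spin_until(last_inject + added_g);  /* g: respect the enforced injection gap */
    raw_inject(msg_id);
    last_inject = now_usec();
    /* L would be added at the receiver, by deferring the point at which
     * an arrived message is marked valid until arrival time + L. */
}

int main(void)
{
    t_start = last_inject = now_usec();
    for (int i = 0; i < 5; i++)
        send_with_delays(i);
    return 0;
}
```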

Page 34: High-Performance Clusters  part 1: Performance


Calibration

[Charts: measured o, g, and L (µs) vs. the desired added O, g, and L (0-100 µs), calibrating the in-situ parameter adjustment]

Page 35: High-Performance Clusters  part 1: Performance


Split-C Applications

Program       Description               Input                        P=16    P=32    Msg Interval (µs)   Msg Type
Radix         Integer radix sort        16M 32-bit keys              13.7    7.8     6.1                 msg
EM3D (write)  Electro-magnetic          80K nodes, 40% remote        88.6    38.0    8.0                 write
EM3D (read)   Electro-magnetic          80K nodes, 40% remote        230.0   114.0   13.8                read
Sample        Integer sample sort       32M 32-bit keys              24.7    13.2    13.0                msg
Barnes        Hierarchical N-body       1 million bodies             77.9    43.2    52.8                cached read
P-Ray         Ray tracer                1 million pixel image        23.5    17.9    156.2               cached read
MurPHI        Protocol verification     SCI protocol, 2 proc         67.7    35.3    183.5               bulk
Connect       Connected components      4M nodes, 2-D mesh, 30%      2.3     1.2     212.6               BSP
NOW-sort      Disk-to-disk sort         32M 100-byte records         127.2   56.9    817.4               I/O
Radb          Bulk-version radix sort   16M 32-bit keys              7.0     3.7     852.7               bulk

Page 36: High-Performance Clusters  part 1: Performance


Sensitivity to Overhead

[Chart: slowdown vs. added overhead (µs, 0-110) for Barnes, Radix, EM3D(write), EM3D(read), Sample, P-Ray, Murphi, Connect, NOWsort, and RadB]
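For reference, the quantity on the y-axis of these sensitivity plots is the slowdown relative to the unperturbed run; in symbols (our notation, not the slide's), for an added overhead Δo:

```latex
\text{slowdown}(\Delta o) =
  \frac{T(o_0 + \Delta o,\; g_0,\; L_0,\; G_0)}
       {T(o_0,\; g_0,\; L_0,\; G_0)}
```

and analogously for added gap, added latency, or reduced bulk bandwidth (1/G) on the following slides.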

Page 37: High-Performance Clusters  part 1: Performance


Comparative Impact

[Charts: slowdown vs. added overhead, gap, and latency (each 0-100 µs), one panel per parameter, for the same applications]

Page 38: High-Performance Clusters  part 1: Performance


Sensitivity to bulk BW (1/G)

[Chart: slowdown vs. bulk bandwidth (MB/s, 0-40) for Barnes, Radix, EM3D(write), EM3D(read), Sample, P-Ray, Murphi, Connect, NOWsort, and RadB]

Page 39: High-Performance Clusters  part 1: Performance


Cluster Communication Performance

• Overhead, Overhead, Overhead

– hypersensitive due to increased serialization

• Sensitivity to gap reflects bursty communication

• Surprisingly latency tolerant

• Plenty of room for overhead improvement

• How sensitive are distributed systems?

Page 40: High-Performance Clusters  part 1: Performance


Extrapolating to Low Overhead

[Chart: slowdown vs. overhead (0-15 µs) for the Split-C applications, extrapolating to low overhead]

Page 41: High-Performance Clusters  part 1: Performance


Direct Memory Messaging

• Send region and receive region for each end of communication channel

• Write through send region into remote rcv region

[Diagram: virtual address spaces on nodes i, j, and k, each containing paired send (S) and receive (R) regions mapped through physical addresses and the I/O bus; a write through a send region appears in the corresponding receive region on the remote node]
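The write-through mechanism can be sketched as follows. This is a single-process stand-in: both "regions" are ordinary buffers and the reflection of stores is an explicit copy, whereas a real memory-mapped interconnect would forward the stores in hardware.

```c
/* Sketch of direct memory messaging: a sender writes through its send
 * region and the words appear in the paired receive region on the remote
 * node, with a flag word marking the message valid last. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define REGION_WORDS 64

struct region {
    volatile uint32_t valid;               /* set last, after the payload */
    uint32_t          data[REGION_WORDS];
};

/* Paired channel endpoints: writes to `send_region` are "reflected"
 * into `recv_region` (by hardware in a real interconnect). */
static struct region send_region, recv_region;

static void channel_write(const uint32_t *msg, int nwords)
{
    memcpy(send_region.data, msg, nwords * sizeof(uint32_t));
    /* Loopback stand-in for the hardware forwarding of the stores. */
    memcpy(recv_region.data, send_region.data, nwords * sizeof(uint32_t));
    recv_region.valid = nwords;            /* mark the message valid last */
}

static void channel_poll(void)
{
    while (recv_region.valid == 0)
        ;                                  /* receiver spins on the flag */
    printf("received %u words, first = %u\n",
           recv_region.valid, recv_region.data[0]);
    recv_region.valid = 0;
}

int main(void)
{
    uint32_t msg[4] = { 7, 8, 9, 10 };
    channel_write(msg, 4);
    channel_poll();
    return 0;
}
```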

Page 42: High-Performance Clusters  part 1: Performance


Direct Memory Interconnects

• DEC Memory Channel

– 3 µs end-to-end

– ~1 µs o, L

• SCI

• SGI

• Shrimp (Princeton)

[Diagram: AlphaServer SMP nodes (P, $, Mem) with Memory Channel adapters on a 33 MHz PCI bus (bus interface, transmit/receive control, receive DMA, link interface), connected by the 100 MB/s MEMORY CHANNEL interconnect]

Page 43: High-Performance Clusters  part 1: Performance


Scalability, Availability, and Performance

• Scale disk, memory, proc independently

• Random node serves query, all search

• On (hw or sw) failure, lose random cols of index

• On overload, lose random rows

[Diagram: Inktomi search cluster — front ends (FE) and index-serving processors (P) connected by Myrinet, collectively holding a 100 million document index]

Page 44: High-Performance Clusters  part 1: Performance


Summary

• Performance => Generality (see Part 2)

• From Technology “Shift” to Technology “Trend”

• Cluster communication becoming cheap

– gigabit Ethernet

• System Area Networks becoming commodity

– Myricom OEM, Tandem/Compaq ServerNet, SGI, HAL, Sun

• Improvements in interconnect BW

– gigabyte per second and beyond

• Bus connections improving

– PCI, ePCI, Pentium II cluster slot, …

• Operating system out of the way

– VIA

Page 45: High-Performance Clusters  part 1: Performance


Advice

• Clusters are cheap, easy to build, flexible, powerful, general purpose and fun

• Everybody doing SPAA or PODC should have one to try out their ideas

• Can use Berkeley NOW through npaci – www.npaci.edu

