NOW and Beyond
Workshop on Clusters and Computational Grids for Scientific Computing
David E. Culler
Computer Science Division
Univ. of California, Berkeley
http://now.cs.berkeley.edu/
HPDC, July 30, 1998
NOW Project Goals
• Make a fundamental change in how we design and construct large-scale systems
– market reality:
» 50%/year performance growth => cannot allow 1-2 year engineering lag
– technological opportunity:
» single-chip “Killer Switch” => fast, scalable communication
• Highly integrated building-wide system
• Explore novel system design concepts in this new “cluster” paradigm
Berkeley NOW
• 100 Sun UltraSparcs
– 200 disks
• Myrinet SAN
– 160 MB/s
• Fast communication: AM, MPI, ...
• Switched Ethernet/ATM external network
• Global OS
• Self-configuration
Minute Sort
[Chart: gigabytes sorted in one minute (0-9) vs. number of processors (0-100), comparing the Berkeley NOW with the SGI Power Challenge and SGI Origin]
Landmarks
• Top 500 Linpack Performance List
• MPI, NPB performance on par with MPPs
• RSA 40-bit Key challenge
• World Leading External Sort
• Inktomi search engine
• NPACI resource site
Taking Stock
• Surprising successes
  – virtual networks
  – implicit co-scheduling
  – reactive I/O
  – service-based applications
  – automatic network mapping
• Surprising disappointments
  – global system layer
  – xFS file system
• New directions for Millennium
  – paranoid construction
  – computational economy
  – smart clients
Fast Communication
• Fast communication on clusters is obtained through direct access to the network, as on MPPs
• The challenge is to make this general purpose
  – the system implementation should not dictate how it can be used (a sketch of the message layer follows the chart)
[Chart: measured communication parameters (gap g, latency L, receive overhead Or, send overhead Os of the LogP model) on a 0-16 µs scale]
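The flavor of the fast-communication layer can be conveyed in a few lines of C. This is a hedged, single-process simulation of an Active Message-style interface (am_request, am_poll, and the handler signature are illustrative, not the real Berkeley AM API): each message names the handler that will consume it, so arrival dispatches directly into user code with no buffering protocol on the critical path.

```c
/* Minimal Active Message-style sketch (hypothetical API, simulated in
 * one process): each message names the handler that consumes it, so
 * delivery is a direct dispatch into user code. */
#include <stdio.h>

typedef void (*am_handler_t)(void *arg, int value);

typedef struct {            /* an in-flight message */
    am_handler_t handler;   /* runs at the destination      */
    void *arg;              /* destination-side context     */
    int value;              /* small payload word           */
} am_msg_t;

/* Stand-in for the NIC: a small ring acting as the "network". */
static am_msg_t network[16];
static int head, tail;

static void am_request(am_handler_t h, void *arg, int value) {
    network[tail++ % 16] = (am_msg_t){h, arg, value};   /* inject */
}

static void am_poll(void) {   /* drain arrivals: dispatch, don't buffer */
    while (head < tail) {
        am_msg_t *m = &network[head++ % 16];
        m->handler(m->arg, m->value);
    }
}

static void add_handler(void *arg, int value) {
    *(int *)arg += value;     /* handler folds payload into app state */
}

int main(void) {
    int sum = 0;
    for (int i = 1; i <= 4; i++)
        am_request(add_handler, &sum, i);
    am_poll();
    printf("sum = %d\n", sum);   /* prints 10 */
    return 0;
}
```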
Virtual Networks
• Endpoint abstracts the notion of “attached to the network”
• Virtual network is a collection of endpoints that can name each other.
• Many processes on a node can each have many endpoints, each with its own protection domain.
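A minimal sketch of the endpoint abstraction in C (types and names are illustrative, not the actual NOW implementation): an endpoint bundles per-process send/receive queues with a protection key, and a virtual network is simply the set of endpoints sharing a key, which is what lets them name each other and nobody else.

```c
/* Sketch of endpoints and virtual networks (illustrative types only). */
#include <stdint.h>

#define QDEPTH 64

typedef struct {
    uint32_t vnet_key;       /* protection domain: only endpoints in the
                                same virtual network share this key     */
    uint16_t node, index;    /* network-wide name of this endpoint      */
    uint64_t sendq[QDEPTH];  /* per-endpoint send descriptor ring       */
    uint64_t recvq[QDEPTH];  /* per-endpoint receive descriptor ring    */
} endpoint_t;

/* A process may open many endpoints, each in a different virtual
 * network; a send is accepted only if the keys match. */
static int ep_can_send(const endpoint_t *src, const endpoint_t *dst) {
    return src->vnet_key == dst->vnet_key;
}

int main(void) {
    endpoint_t a = {.vnet_key = 7, .node = 0, .index = 1};
    endpoint_t b = {.vnet_key = 7, .node = 3, .index = 0};
    return !ep_can_send(&a, &b);   /* exit 0: same virtual network */
}
```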
How are they managed?
• How do you get direct hardware access for performance with a large space of logical resources?
• Just like virtual memory – the active portion of the large logical space is bound to physical resources
[Diagram: processes 1..n on a host, each with multiple endpoints; active endpoints are bound to NIC memory, the rest reside in host memory behind the network interface]
Network Interface Support
• NIC has endpoint frames
• Services active endpoints
• Signals misses to the driver
  – using a system endpoint (a binding sketch follows the diagram)
[Diagram: NIC holding endpoint frames 0-7 with transmit and receive queues; an endpoint miss is signaled to the driver]
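A hedged sketch of the frame-management policy in C (data structures hypothetical): the NIC holds a small, fixed number of endpoint frames; touching an unbound endpoint raises a miss to the driver, which evicts the coldest frame and binds the new endpoint, just as a page fault binds a virtual page to a physical frame.

```c
/* Sketch: binding a large space of logical endpoints to a few NIC
 * frames, with misses handled like page faults (illustrative only). */
#include <stdio.h>

#define NFRAMES   8            /* endpoint frames on the NIC */
#define NOT_BOUND -1

static int frame_owner[NFRAMES];    /* which endpoint holds each frame */
static unsigned frame_age[NFRAMES]; /* for LRU-style eviction          */
static unsigned clock_tick;

static int bind_endpoint(int ep) {  /* the driver's "miss handler" */
    int victim = 0;
    for (int f = 0; f < NFRAMES; f++) {
        if (frame_owner[f] == ep) {          /* hit: already bound */
            frame_age[f] = ++clock_tick;
            return f;
        }
        if (frame_age[f] < frame_age[victim])
            victim = f;                      /* remember coldest frame */
    }
    if (frame_owner[victim] != NOT_BOUND)
        printf("miss: endpoint %d evicts endpoint %d from frame %d\n",
               ep, frame_owner[victim], victim);
    frame_owner[victim] = ep;                /* rebind frame to endpoint */
    frame_age[victim] = ++clock_tick;
    return victim;
}

int main(void) {
    for (int f = 0; f < NFRAMES; f++) frame_owner[f] = NOT_BOUND;
    for (int ep = 0; ep < 12; ep++)          /* 12 endpoints, 8 frames */
        bind_endpoint(ep);
    bind_endpoint(11);                       /* hit: no eviction */
    return 0;
}
```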
Communication under Load
[Chart: aggregate msgs/s (0-80,000) vs. number of virtual networks (1-28) for a client/server message-burst workload; curves for continuous traffic and bursts of 1024, 2048, 4096, 8192, and 16384 msgs]
=> Use of networking resources adapts to demand.
Implicit Coscheduling
• Problem: parallel programs are designed to run in parallel => huge slowdowns under uncoordinated local scheduling
  – gang scheduling is rigid, fault-prone, and complex
• Coordinate schedulers implicitly using the communication already in the program
  – very easy to build, robust to component failures
  – inherently "service on-demand", scalable
  – the local service component can evolve
[Diagram: applications (A) running over per-node local schedulers (LS), contrasted with gang scheduling (GS)]
Why it works
• Infer non-local state from local observations
• React to maintain coordination

  observation         implication              action
  fast response       partner scheduled        spin
  delayed response    partner not scheduled    block

[Diagram: four workstations time-slicing Jobs A and B; on a request, a fast response leads the waiting process to spin, a delayed response leads it to sleep]
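The table above translates almost directly into a two-phase waiting loop. A minimal sketch in C (the simulated network, the blocking primitive, and the 5x round-trip spin threshold are illustrative assumptions, not the published algorithm's exact constants): spin while a fast response is still plausible, and block once the delay implies the partner is descheduled.

```c
/* Sketch of two-phase (spin-then-block) waiting for implicit
 * coscheduling, with a simulated network standing in for the real one. */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static long reply_time_us;   /* simulated arrival time of the reply */

static long now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000L + ts.tv_nsec / 1000;
}

static bool reply_arrived(void) { return now_us() >= reply_time_us; }

static void block_until_reply(void) {    /* stand-in for kernel sleep */
    struct timespec nap = {0, 1000000};  /* 1 ms naps                 */
    while (!reply_arrived())
        nanosleep(&nap, NULL);
}

/* Two-phase wait: spin while a fast response (partner scheduled) is
 * still plausible, then block (partner presumed descheduled). */
static void wait_for_reply(long expected_rtt_us) {
    long start = now_us();
    while (now_us() - start < 5 * expected_rtt_us)
        if (reply_arrived()) {
            puts("fast response -> partner scheduled -> spun");
            return;
        }
    puts("delayed response -> partner not scheduled -> blocked");
    block_until_reply();
}

int main(void) {
    reply_time_us = now_us() + 50;     /* 50 us: fast response    */
    wait_for_reply(20);
    reply_time_us = now_us() + 5000;   /* 5 ms: delayed response  */
    wait_for_reply(20);
    return 0;
}
```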
Example
• Range of granularity and load imbalance
  – spin-waiting alone: 10x slowdown
I/O Lessons from NOW sort
• A complete system on every node is a powerful basis for data-intensive computing
  – complete disk subsystem
  – independent file systems
    » mmap (not read), plus madvise
  – full OS => threads
• Remote I/O (with fast communication) provides the same bandwidth as local I/O
• I/O performance is very temperamental
  – variations in disk speeds
  – variations within a disk
  – variations in processing, interrupts, messaging, ...
Reactive I/O
• Loosen data semantics
  – e.g., an unordered bag of records
• Build flows from producers (e.g., disks) to consumers (e.g., summation)
• Flow data to where it can be consumed, as sketched below
[Diagram: static parallel aggregation (each disk D feeds a fixed aggregator A) vs. adaptive parallel aggregation (disks feed aggregators through a distributed queue)]
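A minimal sketch of the adaptive scheme in C (a single-process simulation with invented costs): producers feed batches into one shared queue, and each batch goes to whichever consumer is free first, so faster consumers naturally take more of the data.

```c
/* Sketch of adaptive parallel aggregation via a shared work queue:
 * whichever consumer is ready next takes the next batch, so data
 * flows to the faster consumers (single-process simulation). */
#include <stdio.h>

#define BATCHES 12

int main(void) {
    /* Per-consumer cost to process one batch (consumer 1 is 3x slower),
     * plus the simulated time at which each consumer becomes free. */
    int cost[2]    = {1, 3};
    int free_at[2] = {0, 0};
    int taken[2]   = {0, 0};

    for (int b = 0; b < BATCHES; b++) {
        /* The distributed queue hands the batch to whichever
         * consumer becomes available first. */
        int c = (free_at[0] <= free_at[1]) ? 0 : 1;
        free_at[c] += cost[c];
        taken[c]++;
    }
    printf("fast consumer took %d batches, slow consumer took %d\n",
           taken[0], taken[1]);   /* roughly a 3:1 split */
    return 0;
}
```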
Performance Scaling
• Allows more data to go to the faster consumers
[Charts: % of peak I/O rate vs. number of nodes (left) and vs. number of nodes perturbed (right), comparing adaptive and static aggregation]
Service Based Applications
• Application provides services to clients
• Grows/shrinks according to demand, availability, and faults (a policy sketch follows the diagram)
[Diagram: Transcend transcoding proxy – service requests arrive at front-end service threads backed by caches, a user-profile database manager, and physical processors]
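A hedged sketch of the grow/shrink policy in C (the backlog thresholds and demand trace are invented for illustration): watch the request backlog per service thread and add or retire threads so capacity tracks demand.

```c
/* Sketch of a service that grows and shrinks its thread pool with
 * demand (policy only; thresholds illustrative). */
#include <stdio.h>

#define MIN_THREADS 1
#define MAX_THREADS 16

/* Resize policy: keep the backlog per thread in a comfortable band. */
static int resize(int threads, int backlog) {
    if (backlog > threads * 4 && threads < MAX_THREADS)
        return threads + 1;      /* demand up: grow */
    if (backlog < threads && threads > MIN_THREADS)
        return threads - 1;      /* demand down (or work migrated away
                                    after a fault): shrink */
    return threads;
}

int main(void) {
    int demand[] = {2, 10, 40, 60, 30, 8, 2, 0};   /* request backlog */
    int threads = MIN_THREADS;
    for (int t = 0; t < 8; t++) {
        threads = resize(threads, demand[t]);
        printf("backlog %2d -> %2d service threads\n", demand[t], threads);
    }
    return 0;
}
```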
On the other hand
• GLUnix
  – offered much that was not available elsewhere
    » interactive use, load balancing, (partial) transparency, ...
  – straightforward master-slave architecture
  – millions of jobs served, reasonable scalability, flexible partitioning
  – crash-prone, inscrutable, unaware, ...
• xFS
  – very sophisticated cooperative caching + network RAID
  – integrated at the vnode layer
  – never robust enough for real use
• Both are hard, outstanding problems
Lessons
• The strength of clusters comes from
  – complete, independent components
– incremental scalability (up and down)
– nodal isolation
• Performance heterogeneity and change are fundamental
• Subsystems and applications need to be reactive and self-tuning
• Local intelligence + simple, flexible composition
Millennium
• Campus-wide cluster of clusters
• PC based (Solaris/x86 and NT)
• Distributed ownership and control
• Computational science and internet systems testbed
[Diagram: campus-wide Gigabit Ethernet connecting clusters in SIMS, C.S., E.E., M.E., BMRC, N.E., IEOR, C.E., MSME, NERSC, Transport, Business, Chemistry, Astro, Physics, Biology, Economy, and Math]
Paranoid Construction
• What must work for RSH, DCOM, RMI, read, ...?
• It takes a page of C to safely read a line from a socket! (see the sketch below)
=> carefully controlled set of cluster system ops
=> non-blocking with timeout and full error checking – even if it needs a watcher thread
=> optimistic, with fail-over of implementation
=> global capability at the physical level
=> indirection used for transparency must track the fault envelope, not just provide a mapping
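Here is roughly that page of C (a sketch; the real NOW/Millennium code surely differs in detail): a line read from a socket that is non-blocking, bounded by a timeout, and checks every error path, including EINTR, EAGAIN, early EOF, and overlong lines.

```c
/* Paranoid line read from a socket: poll with a timeout, handle every
 * error path. Byte-by-byte for simplicity; a real version would buffer
 * and shrink the timeout toward an absolute deadline. */
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns line length >= 0 on success, -1 on error/EOF, -2 on timeout. */
ssize_t read_line(int fd, char *buf, size_t cap, int timeout_ms) {
    size_t n = 0;
    while (n + 1 < cap) {
        struct pollfd p = { .fd = fd, .events = POLLIN };
        int r = poll(&p, 1, timeout_ms);
        if (r == 0)
            return -2;                      /* peer too slow: give up */
        if (r < 0) {
            if (errno == EINTR) continue;   /* interrupted: retry     */
            return -1;                      /* genuine poll failure   */
        }
        ssize_t got = read(fd, buf + n, 1);
        if (got == 0)
            return -1;                      /* EOF before end of line */
        if (got < 0) {
            if (errno == EINTR || errno == EAGAIN) continue;
            return -1;                      /* genuine read failure   */
        }
        if (buf[n++] == '\n')
            break;                          /* complete line          */
    }
    buf[n] = '\0';                          /* overlong lines truncate */
    return (ssize_t)n;
}

int main(void) {                            /* tiny self-test on a pipe */
    int fds[2];
    char line[64];
    if (pipe(fds)) return 1;
    write(fds[1], "hello\n", 6);
    return read_line(fds[0], line, sizeof line, 100) > 0 ? 0 : 1;
}
```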
Computational Economy Approach
• The system has a supply of various resources
• Demand for resources is revealed in price
  – distinct from the cost of acquiring the resources
• Each user has a unique assessment of value
• A client agent negotiates for system resources on the user's behalf
  – submits requests, receives bids, or participates in auctions
  – selects the resources of highest value at least cost (sketched below)
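A minimal sketch of the agent's selection step in C (the bid structure and numbers are invented): among the bids received, take the one that maximizes the user's surplus, i.e., private value minus quoted price, and decline if nothing nets out positive.

```c
/* Sketch of the client agent's choice among bids: maximize the user's
 * surplus (value to the user minus quoted price). Illustrative types. */
#include <stdio.h>

typedef struct {
    const char *resource;   /* e.g., "idle desktop"          */
    double price;           /* quoted by the resource owner  */
    double value;           /* the user's private valuation  */
} bid_t;

static int choose(const bid_t *bids, int n) {
    int best = -1;
    double best_surplus = 0.0;       /* decline if nothing nets > 0 */
    for (int i = 0; i < n; i++) {
        double s = bids[i].value - bids[i].price;
        if (s > best_surplus) { best_surplus = s; best = i; }
    }
    return best;                     /* -1 means "run nothing" */
}

int main(void) {
    bid_t bids[] = {
        {"big-memory node", 9.0, 10.0},
        {"idle desktop",    1.0,  4.0},   /* most surplus: picked */
        {"busy server",     5.0,  3.0},   /* value < price: avoid */
    };
    int k = choose(bids, 3);
    if (k >= 0) printf("agent selects: %s\n", bids[k].resource);
    return 0;
}
```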
Advantages of the Approach
• Decentralized load balancing
  – according to the user's perception of importance, not the system's
  – adapts to system and workload changes
• Creates an incentive to adopt efficient modes of use
  – maintain resources in usable form
  – avoid excessive usage when needed by others
  – exploit under-utilized resources
  – maximize flexibility (e.g., migratable, restartable applications)
• Establishes user-to-user feedback on resource usage
  – basis for an exchange rate across resources
• Powerful framework for system design
  – natural for the client to be watchful, proactive, and wary
  – generalizes from resources to services
• Rich body of theory ready for application
Resource Allocation
• The traditional approach allocates requests to resources to optimize some system utility function
  – e.g., put work on the least-loaded node, the most free memory, the shortest queue, ...
• The economic approach views each user as having a distinct utility function
  – e.g., two users can exchange resources and both be happier! (a toy example follows the diagram)
[Diagram: an allocator mediating between a stream of (incomplete) client requests and a stream of (partial, delayed, or incomplete) resource status information]
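The exchange example can be made concrete with a toy sketch in C (all valuations invented): user A holds memory but is compute-bound, user B holds CPU but is memory-bound; swapping raises both users' utility, an outcome a single system-wide utility function has no way to express.

```c
/* Toy sketch: two users with distinct utility functions can swap
 * resources and both gain (valuations invented for illustration). */
#include <stdio.h>

int main(void) {
    /* util[user][resource]: value of CPU time vs. extra memory */
    double util[2][2] = {
        /* CPU,  MEM */
        { 8.0,  2.0 },   /* user A: compute-bound */
        { 3.0,  9.0 },   /* user B: memory-bound  */
    };
    /* A currently holds MEM, B currently holds CPU. */
    double before = util[0][1] + util[1][0];   /* 2 + 3 = 5  */
    double after  = util[0][0] + util[1][1];   /* 8 + 9 = 17 */
    if (util[0][0] > util[0][1] && util[1][1] > util[1][0])
        printf("swap: both users better off (%.0f -> %.0f total)\n",
               before, after);
    return 0;
}
```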
Pricing and all that
• What's the value of a CPU-minute, a MB-second, a GB-day?
• Many iterative market schemes
  – raise the price until load drops
• Auctions avoid setting a price
  – a Vickrey (second-price sealed-bid) auction causes resources to go where they are most valued, at the lowest price
  – it is in each bidder's self-interest to reveal its true utility function! (see the sketch below)
• Small problem: auctions are awkward for most real allocation problems
• Big problem: people (and their surrogates) don't know what value to place on computation and storage!
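A minimal Vickrey auction in C for a single resource: the highest bidder wins but pays only the second-highest bid, which decouples what you pay from what you bid and makes truthful bidding the dominant strategy.

```c
/* Second-price sealed-bid (Vickrey) auction for one resource:
 * the highest bid wins, and the winner pays the second-highest bid. */
#include <stdio.h>

int main(void) {
    double bids[] = {4.0, 9.5, 7.25, 3.0};   /* sealed bids (example) */
    int n = 4, winner = 0;
    double second = -1.0;

    for (int i = 1; i < n; i++)
        if (bids[i] > bids[winner]) winner = i;
    for (int i = 0; i < n; i++)
        if (i != winner && bids[i] > second) second = bids[i];

    /* Paying the second price decouples what you pay from what you
     * bid, so overstating or understating your value can only hurt:
     * truth-telling is the dominant strategy. */
    printf("bidder %d wins and pays %.2f\n", winner, second);
    return 0;
}
```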
Smart Clients
• Adopt the NT view that "everything is two-tier, at least"
  – the UI stays on the desktop and interacts with computation "in the cluster of clusters" via distributed objects
  – a single-system image is provided by a wrapper
• The client can provide complete functionality
  – resource discovery, load balancing
  – requesting remote execution service
• Flexible applications will monitor availability and adapt (sketched below)
• Higher-level services are a 3-tier optimization
  – directory service, membership, parallel startup
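A hedged sketch of the smart-client idea in C (the node table and probe are stand-ins for a real discovery service): the client itself probes candidate nodes, sends work to the least loaded one, and falls back to local execution when nothing is available, so no central service is a single point of failure.

```c
/* Sketch of a smart client: discover nodes, balance load, fall back
 * to local execution (probe_load() is an illustrative stand-in). */
#include <stdio.h>

#define NNODES 4
#define DOWN   -1

/* Stand-in for probing a node: returns its load, or DOWN. */
static int probe_load(int node) {
    static int load[NNODES] = {7, DOWN, 2, 5};
    return load[node];
}

static void run_locally(void) { puts("fallback: run on the desktop"); }
static void run_on(int node)  { printf("run on node %d\n", node); }

int main(void) {
    int best = -1, best_load = 0;
    for (int node = 0; node < NNODES; node++) {
        int l = probe_load(node);         /* client-side discovery  */
        if (l == DOWN) continue;          /* adapt to availability  */
        if (best < 0 || l < best_load) { best = node; best_load = l; }
    }
    if (best >= 0) run_on(best);          /* client-side balancing  */
    else run_locally();                   /* no service? self-support */
    return 0;
}
```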
Everything is a service
• Load-balancing
• Brokering
• Replication
• Directories
=> these need to be cost-effective, or clients will fall back to "self-support"
  – and if they are cost-effective, competitors may arise
• Useful applications should be packaged as services
  – their value may be greater than the cost of the resources consumed
Conclusions
• We’ve got the building blocks for very interesting clustered systems
– fast communication, authentication, directories, distributed object models
• Transparency and uniform access are convenient, but...
• It is time to focus on exploiting the new characteristics of these systems in novel ways.
• We need to get really serious about availability.
• Agility (wary, reactive, adaptive) is fundamental.
• Gronky "F77 + MPI and no I/O" codes will seriously hold us back.
• We need to provide a better framework for cluster applications.