NOW and Beyond
Workshop on Clusters and Computational Grids for Scientific Computing
David E. Culler
Computer Science Division
Univ. of California, Berkeley
http://now.cs.berkeley.edu/
HPDC, July 30, 1998
NOW Project Goals
• Make a fundamental change in how we design and construct large-scale systems
– market reality:
» 50%/year performance growth => cannot allow 1-2 year engineering lag
– technological opportunity:
» single-chip “Killer Switch” => fast, scalable communication
• Highly integrated building-wide system
• Explore novel system design concepts in this new “cluster” paradigm
Berkeley NOW
• 100 Sun UltraSparcs
– 200 disks
• Myrinet SAN
– 160 MB/s
• Fast communication: AM, MPI, ...
• Switched Ethernet/ATM external network
• Global OS
• Self-configuration
Minute Sort
[Chart: gigabytes sorted in one minute (0-9) vs. number of processors (0-100), comparing the Berkeley NOW with the SGI Power Challenge and SGI Origin]
Landmarks
• Top 500 Linpack Performance List
• MPI, NPB performance on par with MPPs
• RSA 40-bit Key challenge
• World Leading External Sort
• Inktomi search engine
• NPACI resource site
Taking Stock
• Surprising successes
  – virtual networks
  – implicit co-scheduling
  – reactive I/O
  – service-based applications
  – automatic network mapping
• Surprising disappointments
  – global system layer
  – xFS file system
• New directions for Millennium
  – paranoid construction
  – computational economy
  – smart clients
Fast Communication
• Fast communication on clusters is obtained through direct access to the network, as on MPPs
• The challenge is to make this general purpose
  – the system implementation should not dictate how it can be used (a sketch of the message layer follows the chart)
[Chart: measured communication parameters (gap g, latency L, receive overhead Or, send overhead Os of the LogP model) on a 0-16 µs scale]
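The flavor of the fast-communication layer can be conveyed in a few lines of C. This is a hedged, single-process simulation of an Active Message-style interface (am_request, am_poll, and the handler signature are illustrative, not the real Berkeley AM API): each message names the handler that will consume it, so arrival dispatches directly into user code with no buffering protocol on the critical path.

```c
/* Minimal Active Message-style sketch (hypothetical API, simulated in
 * one process): each message names the handler that consumes it, so
 * delivery is a direct dispatch into user code. */
#include <stdio.h>

typedef void (*am_handler_t)(void *arg, int value);

typedef struct {            /* an in-flight message */
    am_handler_t handler;   /* runs at the destination      */
    void *arg;              /* destination-side context     */
    int value;              /* small payload word           */
} am_msg_t;

/* Stand-in for the NIC: a small ring acting as the "network". */
static am_msg_t network[16];
static int head, tail;

static void am_request(am_handler_t h, void *arg, int value) {
    network[tail++ % 16] = (am_msg_t){h, arg, value};   /* inject */
}

static void am_poll(void) {   /* drain arrivals: dispatch, don't buffer */
    while (head < tail) {
        am_msg_t *m = &network[head++ % 16];
        m->handler(m->arg, m->value);
    }
}

static void add_handler(void *arg, int value) {
    *(int *)arg += value;     /* handler folds payload into app state */
}

int main(void) {
    int sum = 0;
    for (int i = 1; i <= 4; i++)
        am_request(add_handler, &sum, i);
    am_poll();
    printf("sum = %d\n", sum);   /* prints 10 */
    return 0;
}
```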
Virtual Networks
• Endpoint abstracts the notion of “attached to the network”
• Virtual network is a collection of endpoints that can name each other.
• Many processes on a node can each have many endpoints, each with its own protection domain.
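A minimal sketch of the endpoint abstraction in C (types and names are illustrative, not the actual NOW implementation): an endpoint bundles per-process send/receive queues with a protection key, and a virtual network is simply the set of endpoints sharing a key, which is what lets them name each other and nobody else.

```c
/* Sketch of endpoints and virtual networks (illustrative types only). */
#include <stdint.h>

#define QDEPTH 64

typedef struct {
    uint32_t vnet_key;       /* protection domain: only endpoints in the
                                same virtual network share this key     */
    uint16_t node, index;    /* network-wide name of this endpoint      */
    uint64_t sendq[QDEPTH];  /* per-endpoint send descriptor ring       */
    uint64_t recvq[QDEPTH];  /* per-endpoint receive descriptor ring    */
} endpoint_t;

/* A process may open many endpoints, each in a different virtual
 * network; a send is accepted only if the keys match. */
static int ep_can_send(const endpoint_t *src, const endpoint_t *dst) {
    return src->vnet_key == dst->vnet_key;
}

int main(void) {
    endpoint_t a = {.vnet_key = 7, .node = 0, .index = 1};
    endpoint_t b = {.vnet_key = 7, .node = 3, .index = 0};
    return !ep_can_send(&a, &b);   /* exit 0: same virtual network */
}
```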
How are they managed?
• How do you get direct hardware access for performance with a large space of logical resources?
• Just like virtual memory – the active portion of the large logical space is bound to physical resources
[Diagram: processes 1..n on a host, each with multiple endpoints; active endpoints are bound to NIC memory, the rest reside in host memory behind the network interface]
Network Interface Support
• NIC has endpoint frames
• Services active endpoints
• Signals misses to the driver
  – using a system endpoint (a binding sketch follows the diagram)
[Diagram: NIC holding endpoint frames 0-7 with transmit and receive queues; an endpoint miss is signaled to the driver]
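A hedged sketch of the frame-management policy in C (data structures hypothetical): the NIC holds a small, fixed number of endpoint frames; touching an unbound endpoint raises a miss to the driver, which evicts the coldest frame and binds the new endpoint, just as a page fault binds a virtual page to a physical frame.

```c
/* Sketch: binding a large space of logical endpoints to a few NIC
 * frames, with misses handled like page faults (illustrative only). */
#include <stdio.h>

#define NFRAMES   8            /* endpoint frames on the NIC */
#define NOT_BOUND -1

static int frame_owner[NFRAMES];    /* which endpoint holds each frame */
static unsigned frame_age[NFRAMES]; /* for LRU-style eviction          */
static unsigned clock_tick;

static int bind_endpoint(int ep) {  /* the driver's "miss handler" */
    int victim = 0;
    for (int f = 0; f < NFRAMES; f++) {
        if (frame_owner[f] == ep) {          /* hit: already bound */
            frame_age[f] = ++clock_tick;
            return f;
        }
        if (frame_age[f] < frame_age[victim])
            victim = f;                      /* remember coldest frame */
    }
    if (frame_owner[victim] != NOT_BOUND)
        printf("miss: endpoint %d evicts endpoint %d from frame %d\n",
               ep, frame_owner[victim], victim);
    frame_owner[victim] = ep;                /* rebind frame to endpoint */
    frame_age[victim] = ++clock_tick;
    return victim;
}

int main(void) {
    for (int f = 0; f < NFRAMES; f++) frame_owner[f] = NOT_BOUND;
    for (int ep = 0; ep < 12; ep++)          /* 12 endpoints, 8 frames */
        bind_endpoint(ep);
    bind_endpoint(11);                       /* hit: no eviction */
    return 0;
}
```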
Communication under Load
[Chart: aggregate msgs/s (0-80,000) vs. number of virtual networks (1-28) for a client/server message-burst workload; curves for continuous traffic and bursts of 1024, 2048, 4096, 8192, and 16384 msgs]
=> Use of networking resources adapts to demand.
Implicit Coscheduling
• Problem: parallel programs are designed to run in parallel => huge slowdowns under uncoordinated local scheduling
  – gang scheduling is rigid, fault-prone, and complex
• Coordinate schedulers implicitly using the communication already in the program
  – very easy to build, robust to component failures
  – inherently "service on-demand", scalable
  – the local service component can evolve
[Diagram: applications (A) running over per-node local schedulers (LS), contrasted with gang scheduling (GS)]
Why it works
• Infer non-local state from local observations
• React to maintain coordination

  observation         implication              action
  fast response       partner scheduled        spin
  delayed response    partner not scheduled    block

[Diagram: four workstations time-slicing Jobs A and B; on a request, a fast response leads the waiting process to spin, a delayed response leads it to sleep]
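The table above translates almost directly into a two-phase waiting loop. A minimal sketch in C (the simulated network, the blocking primitive, and the 5x round-trip spin threshold are illustrative assumptions, not the published algorithm's exact constants): spin while a fast response is still plausible, and block once the delay implies the partner is descheduled.

```c
/* Sketch of two-phase (spin-then-block) waiting for implicit
 * coscheduling, with a simulated network standing in for the real one. */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static long reply_time_us;   /* simulated arrival time of the reply */

static long now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000L + ts.tv_nsec / 1000;
}

static bool reply_arrived(void) { return now_us() >= reply_time_us; }

static void block_until_reply(void) {    /* stand-in for kernel sleep */
    struct timespec nap = {0, 1000000};  /* 1 ms naps                 */
    while (!reply_arrived())
        nanosleep(&nap, NULL);
}

/* Two-phase wait: spin while a fast response (partner scheduled) is
 * still plausible, then block (partner presumed descheduled). */
static void wait_for_reply(long expected_rtt_us) {
    long start = now_us();
    while (now_us() - start < 5 * expected_rtt_us)
        if (reply_arrived()) {
            puts("fast response -> partner scheduled -> spun");
            return;
        }
    puts("delayed response -> partner not scheduled -> blocked");
    block_until_reply();
}

int main(void) {
    reply_time_us = now_us() + 50;     /* 50 us: fast response    */
    wait_for_reply(20);
    reply_time_us = now_us() + 5000;   /* 5 ms: delayed response  */
    wait_for_reply(20);
    return 0;
}
```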
Example
• Range of granularity and load imbalance
  – spin-waiting alone: 10x slowdown
I/O Lessons from NOW sort
• A complete system on every node is a powerful basis for data-intensive computing
  – complete disk subsystem
  – independent file systems
    » mmap (not read), plus madvise
  – full OS => threads
• Remote I/O (with fast communication) provides the same bandwidth as local I/O
• I/O performance is very temperamental
  – variations in disk speeds
  – variations within a disk
  – variations in processing, interrupts, messaging, ...
Reactive I/O
• Loosen data semantics
  – e.g., an unordered bag of records
• Build flows from producers (e.g., disks) to consumers (e.g., summation)
• Flow data to where it can be consumed, as sketched below
[Diagram: static parallel aggregation (each disk D feeds a fixed aggregator A) vs. adaptive parallel aggregation (disks feed aggregators through a distributed queue)]
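A minimal sketch of the adaptive scheme in C (a single-process simulation with invented costs): producers feed batches into one shared queue, and each batch goes to whichever consumer is free first, so faster consumers naturally take more of the data.

```c
/* Sketch of adaptive parallel aggregation via a shared work queue:
 * whichever consumer is ready next takes the next batch, so data
 * flows to the faster consumers (single-process simulation). */
#include <stdio.h>

#define BATCHES 12

int main(void) {
    /* Per-consumer cost to process one batch (consumer 1 is 3x slower),
     * plus the simulated time at which each consumer becomes free. */
    int cost[2]    = {1, 3};
    int free_at[2] = {0, 0};
    int taken[2]   = {0, 0};

    for (int b = 0; b < BATCHES; b++) {
        /* The distributed queue hands the batch to whichever
         * consumer becomes available first. */
        int c = (free_at[0] <= free_at[1]) ? 0 : 1;
        free_at[c] += cost[c];
        taken[c]++;
    }
    printf("fast consumer took %d batches, slow consumer took %d\n",
           taken[0], taken[1]);   /* roughly a 3:1 split */
    return 0;
}
```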
Performance Scaling
• Allows more data to go to the faster consumers
[Charts: % of peak I/O rate vs. number of nodes (left) and vs. number of nodes perturbed (right), comparing adaptive and static aggregation]
Service Based Applications
• Application provides services to clients
• Grows/shrinks according to demand, availability, and faults (a policy sketch follows the diagram)
[Diagram: Transcend transcoding proxy – service requests arrive at front-end service threads backed by caches, a user-profile database manager, and physical processors]
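A hedged sketch of the grow/shrink policy in C (the backlog thresholds and demand trace are invented for illustration): watch the request backlog per service thread and add or retire threads so capacity tracks demand.

```c
/* Sketch of a service that grows and shrinks its thread pool with
 * demand (policy only; thresholds illustrative). */
#include <stdio.h>

#define MIN_THREADS 1
#define MAX_THREADS 16

/* Resize policy: keep the backlog per thread in a comfortable band. */
static int resize(int threads, int backlog) {
    if (backlog > threads * 4 && threads < MAX_THREADS)
        return threads + 1;      /* demand up: grow */
    if (backlog < threads && threads > MIN_THREADS)
        return threads - 1;      /* demand down (or work migrated away
                                    after a fault): shrink */
    return threads;
}

int main(void) {
    int demand[] = {2, 10, 40, 60, 30, 8, 2, 0};   /* request backlog */
    int threads = MIN_THREADS;
    for (int t = 0; t < 8; t++) {
        threads = resize(threads, demand[t]);
        printf("backlog %2d -> %2d service threads\n", demand[t], threads);
    }
    return 0;
}
```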
On the other hand
• GLUnix
  – offered much that was not available elsewhere
    » interactive use, load balancing, (partial) transparency, ...
  – straightforward master-slave architecture
  – millions of jobs served, reasonable scalability, flexible partitioning
  – crash-prone, inscrutable, unaware, ...
• xFS
  – very sophisticated cooperative caching + network RAID
  – integrated at the vnode layer
  – never robust enough for real use
• Both are hard, outstanding problems
Lessons
• The strength of clusters comes from
  – complete, independent components
– incremental scalability (up and down)
– nodal isolation
• Performance heterogeneity and change are fundamental
• Subsystems and applications need to be reactive and self-tuning
• Local intelligence + simple, flexible composition
Millennium
• Campus-wide cluster of clusters
• PC based (Solaris/x86 and NT)
• Distributed ownership and control
• Computational science and internet systems testbed
[Diagram: campus-wide Gigabit Ethernet connecting clusters in SIMS, C.S., E.E., M.E., BMRC, N.E., IEOR, C.E., MSME, NERSC, Transport, Business, Chemistry, Astro, Physics, Biology, Economy, and Math]
Paranoid Construction
• What must work for RSH, DCOM, RMI, read, ...?
• It takes a page of C to safely read a line from a socket! (see the sketch below)
=> carefully controlled set of cluster system ops
=> non-blocking with timeout and full error checking – even if it needs a watcher thread
=> optimistic, with fail-over of implementation
=> global capability at the physical level
=> indirection used for transparency must track the fault envelope, not just provide a mapping
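Here is roughly that page of C (a sketch; the real NOW/Millennium code surely differs in detail): a line read from a socket that is non-blocking, bounded by a timeout, and checks every error path, including EINTR, EAGAIN, early EOF, and overlong lines.

```c
/* Paranoid line read from a socket: poll with a timeout, handle every
 * error path. Byte-by-byte for simplicity; a real version would buffer
 * and shrink the timeout toward an absolute deadline. */
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns line length >= 0 on success, -1 on error/EOF, -2 on timeout. */
ssize_t read_line(int fd, char *buf, size_t cap, int timeout_ms) {
    size_t n = 0;
    while (n + 1 < cap) {
        struct pollfd p = { .fd = fd, .events = POLLIN };
        int r = poll(&p, 1, timeout_ms);
        if (r == 0)
            return -2;                      /* peer too slow: give up */
        if (r < 0) {
            if (errno == EINTR) continue;   /* interrupted: retry     */
            return -1;                      /* genuine poll failure   */
        }
        ssize_t got = read(fd, buf + n, 1);
        if (got == 0)
            return -1;                      /* EOF before end of line */
        if (got < 0) {
            if (errno == EINTR || errno == EAGAIN) continue;
            return -1;                      /* genuine read failure   */
        }
        if (buf[n++] == '\n')
            break;                          /* complete line          */
    }
    buf[n] = '\0';                          /* overlong lines truncate */
    return (ssize_t)n;
}

int main(void) {                            /* tiny self-test on a pipe */
    int fds[2];
    char line[64];
    if (pipe(fds)) return 1;
    write(fds[1], "hello\n", 6);
    return read_line(fds[0], line, sizeof line, 100) > 0 ? 0 : 1;
}
```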
Computational Economy Approach
• The system has a supply of various resources
• Demand for resources is revealed in price
  – distinct from the cost of acquiring the resources
• Each user has a unique assessment of value
• A client agent negotiates for system resources on the user's behalf
  – submits requests, receives bids, or participates in auctions
  – selects the resources of highest value at least cost (sketched below)
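A minimal sketch of the agent's selection step in C (the bid structure and numbers are invented): among the bids received, take the one that maximizes the user's surplus, i.e., private value minus quoted price, and decline if nothing nets out positive.

```c
/* Sketch of the client agent's choice among bids: maximize the user's
 * surplus (value to the user minus quoted price). Illustrative types. */
#include <stdio.h>

typedef struct {
    const char *resource;   /* e.g., "idle desktop"          */
    double price;           /* quoted by the resource owner  */
    double value;           /* the user's private valuation  */
} bid_t;

static int choose(const bid_t *bids, int n) {
    int best = -1;
    double best_surplus = 0.0;       /* decline if nothing nets > 0 */
    for (int i = 0; i < n; i++) {
        double s = bids[i].value - bids[i].price;
        if (s > best_surplus) { best_surplus = s; best = i; }
    }
    return best;                     /* -1 means "run nothing" */
}

int main(void) {
    bid_t bids[] = {
        {"big-memory node", 9.0, 10.0},
        {"idle desktop",    1.0,  4.0},   /* most surplus: picked */
        {"busy server",     5.0,  3.0},   /* value < price: avoid */
    };
    int k = choose(bids, 3);
    if (k >= 0) printf("agent selects: %s\n", bids[k].resource);
    return 0;
}
```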
Advantages of the Approach
• Decentralized load balancing
  – according to the user's perception of importance, not the system's
  – adapts to system and workload changes
• Creates an incentive to adopt efficient modes of use
  – maintain resources in usable form
  – avoid excessive usage when needed by others
  – exploit under-utilized resources
  – maximize flexibility (e.g., migratable, restartable applications)
• Establishes user-to-user feedback on resource usage
  – basis for an exchange rate across resources
• Powerful framework for system design
  – natural for the client to be watchful, proactive, and wary
  – generalizes from resources to services
• Rich body of theory ready for application
Resource Allocation
• The traditional approach allocates requests to resources to optimize some system utility function
  – e.g., put work on the least-loaded node, the most free memory, the shortest queue, ...
• The economic approach views each user as having a distinct utility function
  – e.g., two users can exchange resources and both be happier! (a toy example follows the diagram)
[Diagram: an allocator mediating between a stream of (incomplete) client requests and a stream of (partial, delayed, or incomplete) resource status information]
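The exchange example can be made concrete with a toy sketch in C (all valuations invented): user A holds memory but is compute-bound, user B holds CPU but is memory-bound; swapping raises both users' utility, an outcome a single system-wide utility function has no way to express.

```c
/* Toy sketch: two users with distinct utility functions can swap
 * resources and both gain (valuations invented for illustration). */
#include <stdio.h>

int main(void) {
    /* util[user][resource]: value of CPU time vs. extra memory */
    double util[2][2] = {
        /* CPU,  MEM */
        { 8.0,  2.0 },   /* user A: compute-bound */
        { 3.0,  9.0 },   /* user B: memory-bound  */
    };
    /* A currently holds MEM, B currently holds CPU. */
    double before = util[0][1] + util[1][0];   /* 2 + 3 = 5  */
    double after  = util[0][0] + util[1][1];   /* 8 + 9 = 17 */
    if (util[0][0] > util[0][1] && util[1][1] > util[1][0])
        printf("swap: both users better off (%.0f -> %.0f total)\n",
               before, after);
    return 0;
}
```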
Pricing and all that
• What's the value of a CPU-minute, a MB-second, a GB-day?
• Many iterative market schemes
  – raise the price until load drops
• Auctions avoid setting a price
  – a Vickrey (second-price sealed-bid) auction causes resources to go where they are most valued, at the lowest price
  – it is in each bidder's self-interest to reveal its true utility function! (see the sketch below)
• Small problem: auctions are awkward for most real allocation problems
• Big problem: people (and their surrogates) don't know what value to place on computation and storage!
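A minimal Vickrey auction in C for a single resource: the highest bidder wins but pays only the second-highest bid, which decouples what you pay from what you bid and makes truthful bidding the dominant strategy.

```c
/* Second-price sealed-bid (Vickrey) auction for one resource:
 * the highest bid wins, and the winner pays the second-highest bid. */
#include <stdio.h>

int main(void) {
    double bids[] = {4.0, 9.5, 7.25, 3.0};   /* sealed bids (example) */
    int n = 4, winner = 0;
    double second = -1.0;

    for (int i = 1; i < n; i++)
        if (bids[i] > bids[winner]) winner = i;
    for (int i = 0; i < n; i++)
        if (i != winner && bids[i] > second) second = bids[i];

    /* Paying the second price decouples what you pay from what you
     * bid, so overstating or understating your value can only hurt:
     * truth-telling is the dominant strategy. */
    printf("bidder %d wins and pays %.2f\n", winner, second);
    return 0;
}
```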
Smart Clients
• Adopt the NT view that "everything is two-tier, at least"
  – the UI stays on the desktop and interacts with computation "in the cluster of clusters" via distributed objects
  – a single-system image is provided by a wrapper
• The client can provide complete functionality
  – resource discovery, load balancing
  – requesting remote execution service
• Flexible applications will monitor availability and adapt (sketched below)
• Higher-level services are a 3-tier optimization
  – directory service, membership, parallel startup
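A hedged sketch of the smart-client idea in C (the node table and probe are stand-ins for a real discovery service): the client itself probes candidate nodes, sends work to the least loaded one, and falls back to local execution when nothing is available, so no central service is a single point of failure.

```c
/* Sketch of a smart client: discover nodes, balance load, fall back
 * to local execution (probe_load() is an illustrative stand-in). */
#include <stdio.h>

#define NNODES 4
#define DOWN   -1

/* Stand-in for probing a node: returns its load, or DOWN. */
static int probe_load(int node) {
    static int load[NNODES] = {7, DOWN, 2, 5};
    return load[node];
}

static void run_locally(void) { puts("fallback: run on the desktop"); }
static void run_on(int node)  { printf("run on node %d\n", node); }

int main(void) {
    int best = -1, best_load = 0;
    for (int node = 0; node < NNODES; node++) {
        int l = probe_load(node);         /* client-side discovery  */
        if (l == DOWN) continue;          /* adapt to availability  */
        if (best < 0 || l < best_load) { best = node; best_load = l; }
    }
    if (best >= 0) run_on(best);          /* client-side balancing  */
    else run_locally();                   /* no service? self-support */
    return 0;
}
```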
Everything is a service
• Load-balancing
• Brokering
• Replication
• Directories
=> these need to be cost-effective, or clients will fall back to "self-support"
  – and if they are cost-effective, competitors may arise
• Useful applications should be packaged as services
  – their value may be greater than the cost of the resources consumed
Conclusions
• We’ve got the building blocks for very interesting clustered systems
– fast communication, authentication, directories, distributed object models
• Transparency and uniform access are convenient, but...
• It is time to focus on exploiting the new characteristics of these systems in novel ways.
• We need to get really serious about availability.
• Agility (wary, reactive, adaptive) is fundamental.
• Gronky "F77 + MPI and no I/O" codes will seriously hold us back.
• We need to provide a better framework for cluster applications.