CS294, Yelick Load Balancing, p1
CS 294-8: Distributed Load Balancing
http://www.cs.berkeley.edu/~yelick/294
CS294, Yelick Load Balancing, p2
Load Balancing
• Problem: distribute items into buckets
  – Data to memory locations
  – Files to disks
  – Tasks to processors
  – Web pages to caches
• Goal: even distribution
• Slides stolen from Karger at MIT: http://theory.lcs.mit.edu/~karger
CS294, Yelick Load Balancing, p3
Load Balancing
Enormous and diverse literature on load balancing
• Computer Science systems
  – operating systems
  – parallel computing
  – distributed computing
• Computer Science theory
• Operations research (IEOR)
• Application domains
CS294, Yelick Load Balancing, p4
Agenda
• Overview
• Load Balancing Data
• Load Balancing Computation
  – (if there is time)
CS294, Yelick Load Balancing, p5
The Web
[Diagram: browsers (clients) at CMU, MIT, USC, and UCB connecting to web servers such as CNN]
CS294, Yelick Load Balancing, p6
Hot Spots
[Diagram: many browsers (clients) swamping a single UCB server hosting pages such as OceanStore, BANE, IRAM, and Telegraph]
CS294, Yelick Load Balancing, p7
Temporary Loads
• For permanent loads, use a bigger server
• Must also deal with “flash crowds”
  – IBM chess match
  – Florida election tally
• Inefficient to design for max load
  – Rarely attained
  – Much capacity wasted
• Better to offload peak load elsewhere
CS294, Yelick Load Balancing, p8
Proxy Caches Balance Load
[Diagram: browsers (clients) at CMU, MIT, USC, and UCB reaching servers such as CNN, OceanStore, BANE, IRAM, and Telegraph through proxy caches]
CS294, Yelick Load Balancing, p9
Proxy Caching
• Old: server hit once for each browser
• New: server hit once for each page
• Adapts to changing access patterns
CS294, Yelick Load Balancing, p10
Proxy Caching
• Every server can also be a cache
• Incentives:
  – Provides a social good
  – Reduces load at sites you want to contact
• Costs you little, if done right
  – Few accesses
  – Small amount of storage (times many servers)
CS294, Yelick Load Balancing, p11
Who Caches What?
• Each cache should hold few items
  – Otherwise swamped by clients
• Each item should be in few caches
  – Otherwise server swamped by caches
  – And cache invalidates/updates are expensive
• Browser must know the right cache
  – Could ask the server to redirect
  – But server gets swamped by redirects
CS294, Yelick Load Balancing, p12
Hashing
• Simple and powerful load balancing
• Constant time to find the bucket for an item
• Example: map to n buckets. Pick a, b:
  y = ax + b (mod n)
  (see the sketch below)
• Intuition: hash maps each item to one random bucket
  – No bucket gets many items
CS294, Yelick Load Balancing, p13
Problem: Adding Caches
• Suppose a new cache arrives
• How to work it into the hash function?
• Natural change: y = ax + b (mod n+1)
• Problem: changes the bucket for every item
  – Every cache will be flushed
  – Server swamped with new requests
• Goal: when adding a bucket, few items move
CS294, Yelick Load Balancing, p14
Problem: Inconsistent Views
• Each client knows about a different set of caches: its view
• View affects choice of cache for item
• With many views, each cache will be asked for the item
• Item in all caches – swamps server
• Goal: item in few caches despite views
CS294, Yelick Load Balancing, p15
Problem: Inconsistent Views
[Diagram: my view of the caches, numbered 0 1 2 3; the UCB page hashes to cache 2 via ax + b (mod 4) = 2]
CS294, Yelick Load Balancing, p16
Problem: Inconsistent Views
[Diagram: Joe’s view numbers the same caches 0 3 1 2, so ax + b (mod 4) = 2 sends the UCB page to a different cache]
CS294, Yelick Load Balancing, p17
Problem: Inconsistent Views
[Diagram: Sue’s view numbers the caches 2 0 3 1, so ax + b (mod 4) = 2 sends the UCB page to yet another cache]
CS294, Yelick Load Balancing, p18
Problem: Inconsistent Views
[Diagram: Mike’s view numbers the caches 1 2 0 3, again mapping the UCB page to a different cache under ax + b (mod 4) = 2]
CS294, Yelick Load Balancing, p19
Problem: Inconsistent Views
[Diagram: across the four views, each cache is numbered 2 for someone, so the UCB page ends up in every cache]
CS294, Yelick Load Balancing, p20
Consistent Hashing
• A new kind of hash function
• Maps any item to a bucket in my view
• Computable in constant time, locally
  – 1 standard hash function
• Adding a bucket to the view takes log time
  – Logarithmic # of standard hash functions
• Handles incremental and inconsistent views
CS294, Yelick Load Balancing, p21
Single View Properties
• Balance: all buckets get roughly the same number of items
• Smooth: when the kth bucket is added, only a 1/k fraction of items move
  – And only from O(log n) servers
  – Minimum needed to preserve balance
CS294, Yelick Load Balancing, p22
Multiple View Properties
• Consider n views, each of an arbitrary constant fraction of the buckets
• Load: the number of items a bucket gets from all views is O(log n) times the average
  – Despite views, load balanced
• Spread: over all views, each item appears in O(log n) buckets
  – Despite views, few caches for each item
CS294, Yelick Load Balancing, p23
Implementation
• Use a standard hash function H to map items and caches to the unit circle
• If H maps to [0..M], divide by M
• Map each item to the closest cache (going clockwise)
  – A holds 1, 2, 3
  – B holds 4, 5
[Diagram: unit circle with cache points A and B and item points 1–5]
CS294, Yelick Load Balancing, p24
Implementation
• To add a new cache
  – Hash the cache id
  – Move items that should be assigned to it
• Items do not move between A and B
  – A holds 3
  – B holds 4, 5
  – C holds 1, 2
[Diagram: unit circle after adding cache point C, which captures items 1 and 2 from A]
CS294, Yelick Load Balancing, p25
Implementation
• Cache “points” stored in a pre-computed binary tree
• Lookup for a cached item requires:
  – Hash of the item key (e.g., URL)
  – BST lookup of the successor (an illustrative sketch follows below)
• Consistent hashing with n caches requires O(log n) time
  – An alternative that breaks the unit circle into equal-length intervals can make this constant time
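The sketch below illustrates the ring described on the last three slides; a sorted list plus binary search stands in for the BST of cache points, and all names are illustrative rather than from the Cache Resolver code:

```python
# Sketch of a consistent-hash ring. A sorted list stands in for the
# BST of cache points described on the slide; lookup is a successor
# search (first cache point clockwise from the item).
import bisect
import hashlib

def _point(key, m=2**32):
    # Standard hash H mapping any string onto the circle, represented
    # here as an integer in [0, m).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % m

class ConsistentHashRing:
    def __init__(self, caches=()):
        self._points = []            # sorted cache points on the circle
        self._cache_at = {}          # point -> cache id
        for c in caches:
            self.add(c)

    def add(self, cache_id):
        p = _point(cache_id)
        bisect.insort(self._points, p)
        self._cache_at[p] = cache_id

    def remove(self, cache_id):
        p = _point(cache_id)
        self._points.remove(p)
        del self._cache_at[p]

    def lookup(self, item_key):
        p = _point(item_key)
        i = bisect.bisect_right(self._points, p) % len(self._points)
        return self._cache_at[self._points[i]]

ring = ConsistentHashRing(["cacheA", "cacheB"])
print(ring.lookup("http://www.cnn.com/index.html"))
ring.add("cacheC")   # only items between C and its predecessor move
print(ring.lookup("http://www.cnn.com/index.html"))
```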
CS294, Yelick Load Balancing, p26
Balance
• Cache points uniformly distributed by H
• Each cache “owns” an equal portion of the unit circle
• Item position random by H
• So each cache gets about the same number of items
CS294, Yelick Load Balancing, p27
Smoothness
• To add the kth cache, hash it to the circle
• It captures the items between it and the nearest cache
  – 1/k fraction of total items
  – Only from 1 other bucket
  – O(log n) to find it, as with lookup
CS294, Yelick Load Balancing, p28
Low Spread
• Some views might not see the nearest cache to an item, and will hash it elsewhere
• But every view will have a bucket near the item (on the circle) by “random” placement
• So only buckets near the item will ever have to hold it
• Only a few buckets are near the item, by “random” placement
CS294, Yelick Load Balancing, p29
Low Load
• A cache only gets item I if no other cache is closer to I
• Under any view, some cache is close to I by the random placement of caches
• So a cache only gets items close to it
• But any given item is unlikely to be close
• So a cache doesn’t get many items
CS294, Yelick Load Balancing, p30
Fault Tolerance
• Suppose the contacted cache is down
• Delete it from the cache-set view (BST) and find the next closest cache in the interval (see the snippet below)
• Just a small change in view
• Even with many failures, uniform load and other properties still hold
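Continuing the illustrative ring sketch from the Implementation slide, a failed cache is simply dropped from the local view and the lookup repeated; `is_alive()` is an assumed helper, not part of any real system here:

```python
# Hypothetical failover using the ConsistentHashRing sketch above.
cache = ring.lookup("http://www.cnn.com/index.html")
if not is_alive(cache):              # is_alive() is an assumed helper
    ring.remove(cache)               # small, local change to the view
    cache = ring.lookup("http://www.cnn.com/index.html")
```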
CS294, Yelick Load Balancing, p31
Experimental Setup
• Cache Resolver System
  – Cache machines for content
  – Users’ browsers that direct requests toward virtual caches
  – Resolution units (DNS) that use consistent hashing to map virtual caches to physical machines
• Surge web load generator from BU
• Two modes:
  – Common mode (fixed cache for a set of clients)
  – Cache Resolver mode using consistent hashing
CS294, Yelick Load Balancing, p32
Performance
CS294, Yelick Load Balancing, p33
Summary of Consistent Hashing
• Trivial to implement
• Fast to compute
• Uniformly distributes items
• Can cheaply add/remove caches
• Even with multiple views
  – No cache gets too many items
  – Each item in only a few caches
CS294, Yelick Load Balancing, p34
Consistent Hashing for Caching
• Works well
  – Client maps known caches to the unit circle
  – When an item arrives, hash it to a cache
  – Server gets O(log n) requests for its own pages
• Each server can also be a cache
  – Gets a small number of requests for others’ pages
• Robust to failures
  – Caches can come and go
  – Different browsers can know different caches
CS294, Yelick Load Balancing, p35
Refinement: BW Adaptation
• Browser bandwidth to machines may vary
• If bandwidth to the server is high, unwilling to use a lower-bandwidth cache
• Consistently hash the item only to caches with bandwidth as good as the server’s
• Theorem: all previous properties still hold
  – Uniform cache loads
  – Low server loads (few caches per item)
CS294, Yelick Load Balancing, p36
Refinement: Hot Pages
• What if one page gets popular?
  – The cache responsible for it gets swamped
• Use a tree of caches?
  – The cache at the root gets swamped
• Use a different tree for each page
  – Build it using consistent hashing
• Balances load for hot pages and hot servers
CS294, Yelick Load Balancing, p37
Cache Tree Result
• Using cache trees of log depth, for any set of page accesses, can adaptively balance load such that every server gets at most log times the average load of the system (browser/server ratio)
• Modulo some theory caveats
CS294, Yelick Load Balancing, p38
Agenda
• Overview
• Load Balancing Data
• Load Balancing Computation
  – (if there is time)
CS294, Yelick Load Balancing, p39
Load Balancing Spectrum
• Task costs
  – Do all tasks have equal costs?
  – If not, when are the costs known?
• Task dependencies
  – Can all tasks be run in any order (including in parallel)?
  – If not, when are the dependencies known?
• Locality
  – Is it important for some tasks to be scheduled on the same processor (or nearby)?
  – When is the information about communication known?
• Heterogeneity
  – Are all the machines equally fast?
  – If not, when do we know their performance?
CS294, Yelick Load Balancing, p40
Task cost spectrum
CS294, Yelick Load Balancing, p41
Task Dependency Spectrum
CS294, Yelick Load Balancing, p42
Task Locality Spectrum
CS294, Yelick Load Balancing, p43
Machine Heterogeneity Spectrum
• Easy: All nodes (e.g., processors) are equally powerful
• Harder: Nodes differ, but resources are fixed
  – Different physical characteristics
• Hardest: Nodes change dynamically
  – Other loads on the system (dynamic)
  – Data layout (inner vs. outer track on disks)
CS294, Yelick Load Balancing, p44
Spectrum of Solutions
When is the load balancing information known?
• Static scheduling. All information is available to the scheduling algorithm, which runs before any real computation starts. (Offline algorithms.)
• Semi-static scheduling. Information may be known at program startup, at the beginning of each timestep, or at other points. Offline algorithms may be used.
• Dynamic scheduling. Information is not known until mid-execution. (Online algorithms.)
CS294, Yelick Load Balancing, p45
Approaches
• Static load balancing
• Semi-static load balancing
• Self-scheduling
• Distributed task queues
• Diffusion-based load balancing
• DAG scheduling
Note: these are not all-inclusive, but represent some of the problems for which good solutions exist.
CS294, Yelick Load Balancing, p46
Static Load Balancing
• Static load balancing is used when all information is available in advance, e.g.,
  – dense matrix algorithms, such as LU factorization
    • done using a blocked/cyclic layout
    • blocked for locality, cyclic for load balance (see the sketch after this list)
  – most computations on a regular mesh, e.g., FFT
    • done using a cyclic+transpose+blocked layout for 1D
    • similar for higher dimensions, i.e., with transpose
  – explicit methods and iterative methods on an unstructured mesh
    • use graph partitioning
    • assumes the graph does not change over time (or at least within a timestep during an iterative solve)
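A small sketch of the standard 1-D block-cyclic owner computation mentioned above (function and parameter names are mine, not from any particular library):

```python
# 1-D block-cyclic layout: block i // b is dealt round-robin to the
# p processors (blocked for locality, cyclic for load balance).
def block_cyclic_owner(i, b, p):
    """Owner of element i with block size b over p processors."""
    return (i // b) % p

# e.g. 16 columns, block size 2, 4 processors:
print([block_cyclic_owner(i, b=2, p=4) for i in range(16)])
# -> [0, 0, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2, 3, 3]
```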
CS294, Yelick Load Balancing, p47
Semi-Static Load Balancing
• Used if the domain changes slowly over time and locality is important
• Often used in:
  – particle simulations, particle-in-cell (PIC) methods
    • poor locality may be more of a problem than load imbalance as particles move from one grid partition to another
  – tree-structured computations (Barnes-Hut, etc.)
  – grid computations with a dynamically changing grid, which changes slowly
CS294, Yelick Load Balancing, p48
Self-Scheduling
• Self-scheduling (a minimal sketch follows below):
  – Keep a pool of tasks that are available to run
  – When a processor completes its current task, look at the pool
  – If the computation of one task generates more, add them to the pool
• Originally used for:
  – Scheduling loops by the compiler (really the runtime system)
  – Original paper by Tang and Yew, ICPP 1986
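A minimal shared-memory self-scheduling sketch, assuming independent tasks and a centralized queue; names are illustrative, and real systems add chunking (see the variations on the next slides):

```python
# Minimal self-scheduling sketch: a shared pool of tasks; each worker
# grabs the next task whenever it finishes its current one.
import queue
import threading

def worker(pool, results):
    while True:
        try:
            task = pool.get_nowait()
        except queue.Empty:
            return
        results.append(task())      # run the task and record the result
        # A task that generates more work could call pool.put() here.

pool = queue.Queue()
for i in range(100):
    pool.put(lambda i=i: i * i)     # 100 independent unit tasks

results = []
threads = [threading.Thread(target=worker, args=(pool, results))
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(len(results))                 # 100
```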
CS294, Yelick Load Balancing, p49
When to Use Self-Scheduling
Useful when:
• A batch (or set) of tasks without dependencies
  – can also be used with dependencies, but most analysis has only been done for task sets without dependencies
• The cost of each task is unknown
• Locality is not important
• Using a shared memory multiprocessor, so a centralized solution is fine
CS294, Yelick Load Balancing, p50
Variations on Self-Scheduling
• Typically, don’t want to grab the smallest unit of parallel work.
• Instead, choose a chunk of tasks of size K.
  – If K is large, access overhead for the task queue is small
  – If K is small, we are likely to have even finish times (load balance)
• Variations:
  – Use a fixed chunk size
  – Guided self-scheduling
  – Tapering
  – Weighted factoring
  – Note: there are more
CS294, Yelick Load Balancing, p51
V1: Fixed Chunk Size
• Kruskal and Weiss give a technique for computing the optimal chunk size
• Requires a lot of information about the problem characteristics
  – e.g., task costs and the number of tasks
• Results in an off-line algorithm; not very useful in practice
  – For use in a compiler, for example, the compiler would have to estimate the cost of each task
  – All tasks must be known in advance
CS294, Yelick Load Balancing, p52
V2: Guided Self-Scheduling
• Idea: use larger chunks at the beginning to avoid excessive overhead and smaller chunks near the end to even out the finish times.
• The chunk size Ki at the ith access to the task pool is given by
    Ki = ceiling(Ri/p)
  where Ri is the total number of tasks remaining and p is the number of processors (see the sketch below)
• See Polychronopoulos, “Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers,” IEEE Transactions on Computers, Dec. 1987.
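A small sketch of the chunk-size sequence produced by the formula above (the function name is mine):

```python
# Guided self-scheduling chunk sizes: K_i = ceil(R_i / p), where R_i is
# the number of tasks remaining at the i-th access and p the number of
# processors.
from math import ceil

def gss_chunks(n_tasks, p):
    remaining, chunks = n_tasks, []
    while remaining > 0:
        k = ceil(remaining / p)
        chunks.append(k)
        remaining -= k
    return chunks

print(gss_chunks(100, 4))   # large chunks first, size-1 chunks at the end
```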
CS294, Yelick Load Balancing, p53
V3: Tapering
• Idea: the chunk size Ki is a function of not only the remaining work, but also the task cost variance
  – variance is estimated using history information
  – high variance => a small chunk size should be used
  – low variance => larger chunks are OK
• See S. Lucco, “Adaptive Parallel Programs,” PhD Thesis, UCB, CSD-95-864, 1994.
  – Gives analysis (based on workload distribution)
  – Also gives experimental results: tapering always works at least as well as GSS, although the difference is often small
CS294, Yelick Load Balancing, p54
V4: Weighted Factoring
• Idea: similar to self-scheduling, but divide the task cost by the computational power of the requesting node (a rough sketch follows below)
• Useful for heterogeneous systems
• Also useful for shared-resource NOWs, e.g., built using all the machines in a building
  – as with tapering, historical information is used to predict future speed
  – “speed” may depend on the other loads currently on a given processor
• See Hummel, Schmidt, Uma, and Wein, SPAA ’96
  – includes experimental data and analysis
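A rough sketch of the idea only, not the exact SPAA ’96 scheme: each batch hands out roughly half of the remaining work, and a requesting node’s chunk is scaled by its share of the total measured speed (all names and constants are illustrative):

```python
# Rough weighted-factoring-style sketch: the chunk handed to a node is
# scaled by that node's share of the total measured speed.
def weighted_chunk(remaining, speeds, node, p):
    base = remaining / (2 * p)                 # factoring-style base chunk
    share = speeds[node] / sum(speeds.values())
    return max(1, round(base * p * share))

speeds = {"fast": 2.0, "slow": 0.5, "med1": 1.0, "med2": 1.0}
print(weighted_chunk(remaining=90, speeds=speeds, node="fast", p=4))
```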
CS294, Yelick Load Balancing, p55
V5: Distributed Task Queues
• The obvious extension of self-scheduling to distributed memory is a distributed task queue (or bag)
• When are these a good idea?
  – Distributed memory multiprocessors
  – Or shared memory with significant synchronization overhead
  – Locality is not (very) important
  – Tasks that are either:
    • known in advance, e.g., a bag of independent ones, or
    • computed on the fly, i.e., dependencies exist
  – The cost of each task is not known in advance
CS294, Yelick Load Balancing, p56
Theory of Distributed Queues
Main result: a simple randomized algorithm is optimal with high probability
• Karp and Zhang [88] show this for a tree of unit-cost (equal size) tasks
• Chakrabarti et al. [94] show this for a tree of variable-cost tasks
  – using randomized pushing of tasks
• Blumofe and Leiserson [94] show this for a fixed task tree of variable-cost tasks
  – uses task pulling (stealing), which is better for locality
  – also gives (loose) bounds on the total memory required
CS294, Yelick Load Balancing, p57
Engineering Distributed Queues
There are many papers on engineering these systems on various machines, and on their applications.
• If nothing is known about task costs when they are created (sketch below):
  – organize local tasks as a stack (push/pop from the top)
  – steal from the stack bottom (as if it were a queue)
• If something is known about task costs and communication costs, it can be used as hints. (See Wen, UCB PhD, 1996.)
• Goldstein, Rogers, Grunwald, and others (independent work) have all shown
  – the advantages of integrating into the language framework
  – very lightweight thread creation
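A sketch of the stack-discipline queue described in the first bullet; the class name is mine, and the locking is deliberately simplified compared to real work-stealing runtimes:

```python
# The owner pushes/pops at the top (good locality, small recent tasks);
# a thief steals from the bottom (older, likely larger tasks).
from collections import deque
from threading import Lock

class WorkStealingQueue:
    def __init__(self):
        self._tasks = deque()
        self._lock = Lock()

    def push(self, task):            # owner: push on top
        with self._lock:
            self._tasks.append(task)

    def pop(self):                   # owner: pop from the top (LIFO)
        with self._lock:
            return self._tasks.pop() if self._tasks else None

    def steal(self):                 # thief: take from the bottom (FIFO)
        with self._lock:
            return self._tasks.popleft() if self._tasks else None

q = WorkStealingQueue()
q.push("task-1"); q.push("task-2")
print(q.pop(), q.steal())   # owner gets task-2 (top), thief gets task-1
```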
CS294, Yelick Load Balancing, p58
Diffusion-Based Load Balancing
• In the randomized schemes, the machine is treated as fully connected.
• Diffusion-based load balancing takes topology into account
  – Locality properties better than prior work
  – Load balancing somewhat slower than randomized
  – Cost of tasks must be known at creation time
  – No dependencies between tasks
CS294, Yelick Load Balancing, p59
Diffusion-Based Load Balancing
• The machine is modeled as a graph
• At each step, we compute the weight of the tasks remaining on each processor
  – This is simply the number of tasks if they are unit-cost tasks
• Each processor compares its weight with its neighbors and performs some averaging (a first-order step is sketched below)
• See Ghosh et al., SPAA ’96 for a second-order diffusive load balancing algorithm
  – takes into account the amount of work sent last time
  – avoids some oscillation of first-order schemes
• Note: locality is not directly addressed
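A sketch of one first-order diffusion step on a small graph; the graph, loads, and diffusion parameter alpha are all illustrative:

```python
# One first-order diffusion step: each processor exchanges a fraction
# alpha of the load difference along each edge of the machine graph.
def diffusion_step(loads, neighbors, alpha=0.25):
    new = dict(loads)
    for u in loads:
        for v in neighbors[u]:
            new[u] += alpha * (loads[v] - loads[u])
    return new

neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}   # a 4-cycle
loads = {0: 40, 1: 0, 2: 0, 3: 0}
for _ in range(10):
    loads = diffusion_step(loads, neighbors)
print(loads)   # loads flow toward the average (10 per node)
```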
CS294, Yelick Load Balancing, p60
DAG Scheduling
• Some problems involve a DAG of tasks:
  – nodes represent computation (may be weighted)
  – edges represent orderings and usually communication (may also be weighted)
• Two application domains
  – Digital signal processing computations
  – Sparse direct solvers (mainly Cholesky, since it doesn’t require pivoting)
• The basic strategy: partition the DAG to minimize communication and keep all processors busy (a greedy sketch follows below)
  – NP-complete
  – See Gerasoulis and Yang, IEEE Transactions on Parallel and Distributed Systems, June ’93.
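Since the partitioning problem is NP-complete, heuristics are used in practice. The sketch below shows one common greedy heuristic (list scheduling), not the algorithm from the cited paper; it ignores communication costs and all names are illustrative:

```python
# Greedy list-scheduling sketch: visit tasks in topological order and
# place each on the processor that lets it finish earliest, respecting
# predecessor finish times. Communication costs are ignored.
def list_schedule(order, cost, preds, p):
    proc_free = [0.0] * p            # time each processor becomes free
    finish = {}                      # task -> finish time
    assignment = {}
    for task in order:
        ready = max((finish[q] for q in preds[task]), default=0.0)
        proc = min(range(p), key=lambda i: max(proc_free[i], ready))
        start = max(proc_free[proc], ready)
        finish[task] = start + cost[task]
        proc_free[proc] = finish[task]
        assignment[task] = proc
    return assignment

order = ["a", "b", "c", "d"]                     # a -> {b, c} -> d
preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
cost = {"a": 2.0, "b": 1.0, "c": 3.0, "d": 1.0}
print(list_schedule(order, cost, preds, p=2))
```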
CS294, Yelick Load Balancing, p61
Heterogeneous Machines
• Diffusion-based load balancing for heterogeneous environments
  – Fizzano, Karger, Stein, Wein
• Graduated declustering
  – Remzi Arpaci-Dusseau et al.
• And more…