CS294, Yelick Load Balancing, p1
CS 294-8: Distributed Load Balancing
http://www.cs.berkeley.edu/~yelick/294
CS294, Yelick Load Balancing, p2
Load Balancing
• Problem: distribute items into buckets
  – Data to memory locations
  – Files to disks
  – Tasks to processors
  – Web pages to caches
• Goal: even distribution
• Slides stolen from Karger at MIT: http://theory.lcs.mit.edu/~karger
CS294, Yelick Load Balancing, p3
Load Balancing
Enormous and diverse literature on load balancing
• Computer Science systems
  – operating systems
  – parallel computing
  – distributed computing
• Computer Science theory
• Operations research (IEOR)
• Application domains
CS294, Yelick Load Balancing, p4
Agenda
• Overview
• Load Balancing Data
• Load Balancing Computation
  – (if there is time)
CS294, Yelick Load Balancing, p5
The Web
[Diagram: browsers (clients) at CMU, MIT, USC, and UCB connecting to web servers such as CNN]
CS294, Yelick Load Balancing, p6
Hot Spots
[Diagram: many browsers (clients) swamping a single UCB server hosting pages such as OceanStore, BANE, IRAM, and Telegraph]
CS294, Yelick Load Balancing, p7
Temporary Loads
• For permanent loads, use a bigger server
• Must also deal with “flash crowds”
  – IBM chess match
  – Florida election tally
• Inefficient to design for max load
  – Rarely attained
  – Much capacity wasted
• Better to offload peak load elsewhere
CS294, Yelick Load Balancing, p8
Proxy Caches Balance Load
[Diagram: browsers (clients) at CMU, MIT, USC, and UCB reaching servers such as CNN, OceanStore, BANE, IRAM, and Telegraph through proxy caches]
CS294, Yelick Load Balancing, p9
Proxy Caching
• Old: server hit once for each browser
• New: server hit once for each page
• Adapts to changing access patterns
CS294, Yelick Load Balancing, p10
Proxy Caching
• Every server can also be a cache
• Incentives:
  – Provides a social good
  – Reduces load at sites you want to contact
• Costs you little, if done right
  – Few accesses
  – Small amount of storage (times many servers)
CS294, Yelick Load Balancing, p11
Who Caches What?
• Each cache should hold few items
  – Otherwise swamped by clients
• Each item should be in few caches
  – Otherwise server swamped by caches
  – And cache invalidates/updates are expensive
• Browser must know the right cache
  – Could ask the server to redirect
  – But server gets swamped by redirects
CS294, Yelick Load Balancing, p12
Hashing
• Simple and powerful load balancing
• Constant time to find the bucket for an item
• Example: map to n buckets. Pick a, b:
  y = ax + b (mod n)
  (see the sketch below)
• Intuition: hash maps each item to one random bucket
  – No bucket gets many items
CS294, Yelick Load Balancing, p13
Problem: Adding Caches
• Suppose a new cache arrives
• How to work it into the hash function?
• Natural change: y = ax + b (mod n+1)
• Problem: changes the bucket for every item
  – Every cache will be flushed
  – Server swamped with new requests
• Goal: when adding a bucket, few items move
CS294, Yelick Load Balancing, p14
Problem: Inconsistent Views
• Each client knows about a different set of caches: its view
• View affects choice of cache for item
• With many views, each cache will be asked for the item
• Item in all caches – swamps server
• Goal: item in few caches despite views
CS294, Yelick Load Balancing, p15
Problem: Inconsistent Views
[Diagram: my view of the caches, numbered 0 1 2 3; the UCB page hashes to cache 2 via ax + b (mod 4) = 2]
CS294, Yelick Load Balancing, p16
Problem: Inconsistent Views
[Diagram: Joe’s view numbers the same caches 0 3 1 2, so ax + b (mod 4) = 2 sends the UCB page to a different cache]
CS294, Yelick Load Balancing, p17
Problem: Inconsistent Views
[Diagram: Sue’s view numbers the caches 2 0 3 1, so ax + b (mod 4) = 2 sends the UCB page to yet another cache]
CS294, Yelick Load Balancing, p18
Problem: Inconsistent Views
[Diagram: Mike’s view numbers the caches 1 2 0 3, again mapping the UCB page to a different cache under ax + b (mod 4) = 2]
CS294, Yelick Load Balancing, p19
Problem: Inconsistent Views
[Diagram: across the four views, each cache is numbered 2 for someone, so the UCB page ends up in every cache]
CS294, Yelick Load Balancing, p20
Consistent Hashing
• A new kind of hash function
• Maps any item to a bucket in my view
• Computable in constant time, locally
  – 1 standard hash function
• Adding a bucket to the view takes log time
  – Logarithmic # of standard hash functions
• Handles incremental and inconsistent views
CS294, Yelick Load Balancing, p21
Single View Properties
• Balance: all buckets get roughly the same number of items
• Smooth: when the kth bucket is added, only a 1/k fraction of items move
  – And only from O(log n) servers
  – Minimum needed to preserve balance
CS294, Yelick Load Balancing, p22
Multiple View Properties
• Consider n views, each of an arbitrary constant fraction of the buckets
• Load: the number of items a bucket gets from all views is O(log n) times the average
  – Despite views, load balanced
• Spread: over all views, each item appears in O(log n) buckets
  – Despite views, few caches for each item
CS294, Yelick Load Balancing, p23
Implementation
• Use a standard hash function H to map items and caches to the unit circle
• If H maps to [0..M], divide by M
• Map each item to the closest cache (going clockwise)
  – A holds 1, 2, 3
  – B holds 4, 5
[Diagram: unit circle with cache points A and B and item points 1–5]
CS294, Yelick Load Balancing, p24
Implementation
• To add a new cache
  – Hash the cache id
  – Move items that should be assigned to it
• Items do not move between A and B
  – A holds 3
  – B holds 4, 5
  – C holds 1, 2
[Diagram: unit circle after adding cache point C, which captures items 1 and 2 from A]
CS294, Yelick Load Balancing, p25
Implementation
• Cache “points” stored in a pre-computed binary tree
• Lookup for a cached item requires:
  – Hash of the item key (e.g., URL)
  – BST lookup of the successor (an illustrative sketch follows below)
• Consistent hashing with n caches requires O(log n) time
  – An alternative that breaks the unit circle into equal-length intervals can make this constant time
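The sketch below illustrates the ring described on the last three slides; a sorted list plus binary search stands in for the BST of cache points, and all names are illustrative rather than from the Cache Resolver code:

```python
# Sketch of a consistent-hash ring. A sorted list stands in for the
# BST of cache points described on the slide; lookup is a successor
# search (first cache point clockwise from the item).
import bisect
import hashlib

def _point(key, m=2**32):
    # Standard hash H mapping any string onto the circle, represented
    # here as an integer in [0, m).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % m

class ConsistentHashRing:
    def __init__(self, caches=()):
        self._points = []            # sorted cache points on the circle
        self._cache_at = {}          # point -> cache id
        for c in caches:
            self.add(c)

    def add(self, cache_id):
        p = _point(cache_id)
        bisect.insort(self._points, p)
        self._cache_at[p] = cache_id

    def remove(self, cache_id):
        p = _point(cache_id)
        self._points.remove(p)
        del self._cache_at[p]

    def lookup(self, item_key):
        p = _point(item_key)
        i = bisect.bisect_right(self._points, p) % len(self._points)
        return self._cache_at[self._points[i]]

ring = ConsistentHashRing(["cacheA", "cacheB"])
print(ring.lookup("http://www.cnn.com/index.html"))
ring.add("cacheC")   # only items between C and its predecessor move
print(ring.lookup("http://www.cnn.com/index.html"))
```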
CS294, Yelick Load Balancing, p26
Balance
• Cache points uniformly distributed by H
• Each cache “owns” an equal portion of the unit circle
• Item position random by H
• So each cache gets about the same number of items
CS294, Yelick Load Balancing, p27
Smoothness
• To add the kth cache, hash it to the circle
• It captures the items between it and the nearest cache
  – 1/k fraction of total items
  – Only from 1 other bucket
  – O(log n) to find it, as with lookup
CS294, Yelick Load Balancing, p28
Low Spread
• Some views might not see the nearest cache to an item, and will hash it elsewhere
• But every view will have a bucket near the item (on the circle) by “random” placement
• So only buckets near the item will ever have to hold it
• Only a few buckets are near the item, by “random” placement
CS294, Yelick Load Balancing, p29
Low Load
• A cache only gets item I if no other cache is closer to I
• Under any view, some cache is close to I by the random placement of caches
• So a cache only gets items close to it
• But any given item is unlikely to be close
• So a cache doesn’t get many items
CS294, Yelick Load Balancing, p30
Fault Tolerance
• Suppose the contacted cache is down
• Delete it from the cache-set view (BST) and find the next closest cache in the interval (see the snippet below)
• Just a small change in view
• Even with many failures, uniform load and other properties still hold
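Continuing the illustrative ring sketch from the Implementation slide, a failed cache is simply dropped from the local view and the lookup repeated; `is_alive()` is an assumed helper, not part of any real system here:

```python
# Hypothetical failover using the ConsistentHashRing sketch above.
cache = ring.lookup("http://www.cnn.com/index.html")
if not is_alive(cache):              # is_alive() is an assumed helper
    ring.remove(cache)               # small, local change to the view
    cache = ring.lookup("http://www.cnn.com/index.html")
```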
CS294, Yelick Load Balancing, p31
Experimental Setup
• Cache Resolver System
  – Cache machines for content
  – Users’ browsers that direct requests toward virtual caches
  – Resolution units (DNS) that use consistent hashing to map virtual caches to physical machines
• Surge web load generator from BU
• Two modes:
  – Common mode (fixed cache for a set of clients)
  – Cache Resolver mode using consistent hashing
CS294, Yelick Load Balancing, p32
Performance
CS294, Yelick Load Balancing, p33
Summary of Consistent Hashing
• Trivial to implement
• Fast to compute
• Uniformly distributes items
• Can cheaply add/remove caches
• Even with multiple views
  – No cache gets too many items
  – Each item in only a few caches
CS294, Yelick Load Balancing, p34
Consistent Hashing for Caching
• Works well
  – Client maps known caches to the unit circle
  – When an item arrives, hash it to a cache
  – Server gets O(log n) requests for its own pages
• Each server can also be a cache
  – Gets a small number of requests for others’ pages
• Robust to failures
  – Caches can come and go
  – Different browsers can know different caches
CS294, Yelick Load Balancing, p35
Refinement: BW Adaptation
• Browser bandwidth to machines may vary
• If bandwidth to the server is high, unwilling to use a lower-bandwidth cache
• Consistently hash the item only to caches with bandwidth as good as the server’s
• Theorem: all previous properties still hold
  – Uniform cache loads
  – Low server loads (few caches per item)
CS294, Yelick Load Balancing, p36
Refinement: Hot Pages
• What if one page gets popular?
  – The cache responsible for it gets swamped
• Use a tree of caches?
  – The cache at the root gets swamped
• Use a different tree for each page
  – Build it using consistent hashing
• Balances load for hot pages and hot servers
CS294, Yelick Load Balancing, p37
Cache Tree Result
• Using cache trees of log depth, for any set of page accesses, can adaptively balance load such that every server gets at most log times the average load of the system (browser/server ratio)
• Modulo some theory caveats
CS294, Yelick Load Balancing, p38
Agenda
• Overview
• Load Balancing Data
• Load Balancing Computation
  – (if there is time)
CS294, Yelick Load Balancing, p39
Load Balancing Spectrum
• Task costs
  – Do all tasks have equal costs?
  – If not, when are the costs known?
• Task dependencies
  – Can all tasks be run in any order (including in parallel)?
  – If not, when are the dependencies known?
• Locality
  – Is it important for some tasks to be scheduled on the same processor (or nearby)?
  – When is the information about communication known?
• Heterogeneity
  – Are all the machines equally fast?
  – If not, when do we know their performance?
CS294, Yelick Load Balancing, p40
Task cost spectrum
CS294, Yelick Load Balancing, p41
Task Dependency Spectrum
CS294, Yelick Load Balancing, p42
Task Locality Spectrum
CS294, Yelick Load Balancing, p43
Machine Heterogeneity Spectrum
• Easy: All nodes (e.g., processors) are equally powerful
• Harder: Nodes differ, but resources are fixed
  – Different physical characteristics
• Hardest: Nodes change dynamically
  – Other loads on the system (dynamic)
  – Data layout (inner vs. outer track on disks)
CS294, Yelick Load Balancing, p44
Spectrum of Solutions
When is the load balancing information known?
• Static scheduling. All information is available to the scheduling algorithm, which runs before any real computation starts. (Offline algorithms.)
• Semi-static scheduling. Information may be known at program startup, at the beginning of each timestep, or at other points. Offline algorithms may be used.
• Dynamic scheduling. Information is not known until mid-execution. (Online algorithms.)
CS294, Yelick Load Balancing, p45
Approaches
• Static load balancing
• Semi-static load balancing
• Self-scheduling
• Distributed task queues
• Diffusion-based load balancing
• DAG scheduling
Note: these are not all-inclusive, but represent some of the problems for which good solutions exist.
CS294, Yelick Load Balancing, p46
Static Load Balancing
• Static load balancing is used when all information is available in advance, e.g.,
  – dense matrix algorithms, such as LU factorization
    • done using a blocked/cyclic layout
    • blocked for locality, cyclic for load balance (see the sketch after this list)
  – most computations on a regular mesh, e.g., FFT
    • done using a cyclic+transpose+blocked layout for 1D
    • similar for higher dimensions, i.e., with transpose
  – explicit methods and iterative methods on an unstructured mesh
    • use graph partitioning
    • assumes the graph does not change over time (or at least within a timestep during an iterative solve)
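A small sketch of the standard 1-D block-cyclic owner computation mentioned above (function and parameter names are mine, not from any particular library):

```python
# 1-D block-cyclic layout: block i // b is dealt round-robin to the
# p processors (blocked for locality, cyclic for load balance).
def block_cyclic_owner(i, b, p):
    """Owner of element i with block size b over p processors."""
    return (i // b) % p

# e.g. 16 columns, block size 2, 4 processors:
print([block_cyclic_owner(i, b=2, p=4) for i in range(16)])
# -> [0, 0, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2, 3, 3]
```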
CS294, Yelick Load Balancing, p47
Semi-Static Load Balancing
• Used if the domain changes slowly over time and locality is important
• Often used in:
  – particle simulations, particle-in-cell (PIC) methods
    • poor locality may be more of a problem than load imbalance as particles move from one grid partition to another
  – tree-structured computations (Barnes-Hut, etc.)
  – grid computations with a dynamically changing grid, which changes slowly
CS294, Yelick Load Balancing, p48
Self-Scheduling
• Self-scheduling (a minimal sketch follows below):
  – Keep a pool of tasks that are available to run
  – When a processor completes its current task, look at the pool
  – If the computation of one task generates more, add them to the pool
• Originally used for:
  – Scheduling loops by the compiler (really the runtime system)
  – Original paper by Tang and Yew, ICPP 1986
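A minimal shared-memory self-scheduling sketch, assuming independent tasks and a centralized queue; names are illustrative, and real systems add chunking (see the variations on the next slides):

```python
# Minimal self-scheduling sketch: a shared pool of tasks; each worker
# grabs the next task whenever it finishes its current one.
import queue
import threading

def worker(pool, results):
    while True:
        try:
            task = pool.get_nowait()
        except queue.Empty:
            return
        results.append(task())      # run the task and record the result
        # A task that generates more work could call pool.put() here.

pool = queue.Queue()
for i in range(100):
    pool.put(lambda i=i: i * i)     # 100 independent unit tasks

results = []
threads = [threading.Thread(target=worker, args=(pool, results))
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(len(results))                 # 100
```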
CS294, Yelick Load Balancing, p49
When to Use Self-Scheduling
Useful when:
• A batch (or set) of tasks without dependencies
  – can also be used with dependencies, but most analysis has only been done for task sets without dependencies
• The cost of each task is unknown
• Locality is not important
• Using a shared memory multiprocessor, so a centralized solution is fine
CS294, Yelick Load Balancing, p50
Variations on Self-Scheduling
• Typically, don’t want to grab the smallest unit of parallel work.
• Instead, choose a chunk of tasks of size K.
  – If K is large, access overhead for the task queue is small
  – If K is small, we are likely to have even finish times (load balance)
• Variations:
  – Use a fixed chunk size
  – Guided self-scheduling
  – Tapering
  – Weighted factoring
  – Note: there are more
CS294, Yelick Load Balancing, p51
V1: Fixed Chunk Size
• Kruskal and Weiss give a technique for computing the optimal chunk size
• Requires a lot of information about the problem characteristics
  – e.g., task costs and the number of tasks
• Results in an off-line algorithm; not very useful in practice
  – For use in a compiler, for example, the compiler would have to estimate the cost of each task
  – All tasks must be known in advance
CS294, Yelick Load Balancing, p52
V2: Guided Self-Scheduling
• Idea: use larger chunks at the beginning to avoid excessive overhead and smaller chunks near the end to even out the finish times.
• The chunk size Ki at the ith access to the task pool is given by
    Ki = ceiling(Ri/p)
  where Ri is the total number of tasks remaining and p is the number of processors (see the sketch below)
• See Polychronopoulos, “Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers,” IEEE Transactions on Computers, Dec. 1987.
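A small sketch of the chunk-size sequence produced by the formula above (the function name is mine):

```python
# Guided self-scheduling chunk sizes: K_i = ceil(R_i / p), where R_i is
# the number of tasks remaining at the i-th access and p the number of
# processors.
from math import ceil

def gss_chunks(n_tasks, p):
    remaining, chunks = n_tasks, []
    while remaining > 0:
        k = ceil(remaining / p)
        chunks.append(k)
        remaining -= k
    return chunks

print(gss_chunks(100, 4))   # large chunks first, size-1 chunks at the end
```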
CS294, Yelick Load Balancing, p53
V3: Tapering
• Idea: the chunk size Ki is a function of not only the remaining work, but also the task cost variance
  – variance is estimated using history information
  – high variance => a small chunk size should be used
  – low variance => larger chunks are OK
• See S. Lucco, “Adaptive Parallel Programs,” PhD Thesis, UCB, CSD-95-864, 1994.
  – Gives analysis (based on workload distribution)
  – Also gives experimental results: tapering always works at least as well as GSS, although the difference is often small
CS294, Yelick Load Balancing, p54
V4: Weighted Factoring
• Idea: similar to self-scheduling, but divide the task cost by the computational power of the requesting node (a rough sketch follows below)
• Useful for heterogeneous systems
• Also useful for shared-resource NOWs, e.g., built using all the machines in a building
  – as with tapering, historical information is used to predict future speed
  – “speed” may depend on the other loads currently on a given processor
• See Hummel, Schmidt, Uma, and Wein, SPAA ’96
  – includes experimental data and analysis
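A rough sketch of the idea only, not the exact SPAA ’96 scheme: each batch hands out roughly half of the remaining work, and a requesting node’s chunk is scaled by its share of the total measured speed (all names and constants are illustrative):

```python
# Rough weighted-factoring-style sketch: the chunk handed to a node is
# scaled by that node's share of the total measured speed.
def weighted_chunk(remaining, speeds, node, p):
    base = remaining / (2 * p)                 # factoring-style base chunk
    share = speeds[node] / sum(speeds.values())
    return max(1, round(base * p * share))

speeds = {"fast": 2.0, "slow": 0.5, "med1": 1.0, "med2": 1.0}
print(weighted_chunk(remaining=90, speeds=speeds, node="fast", p=4))
```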
CS294, Yelick Load Balancing, p55
V5: Distributed Task Queues
• The obvious extension of self-scheduling to distributed memory is a distributed task queue (or bag)
• When are these a good idea?
  – Distributed memory multiprocessors
  – Or shared memory with significant synchronization overhead
  – Locality is not (very) important
  – Tasks that are either:
    • known in advance, e.g., a bag of independent ones, or
    • computed on the fly, i.e., dependencies exist
  – The cost of each task is not known in advance
CS294, Yelick Load Balancing, p56
Theory of Distributed Queues
Main result: a simple randomized algorithm is optimal with high probability
• Karp and Zhang [88] show this for a tree of unit-cost (equal size) tasks
• Chakrabarti et al. [94] show this for a tree of variable-cost tasks
  – using randomized pushing of tasks
• Blumofe and Leiserson [94] show this for a fixed task tree of variable-cost tasks
  – uses task pulling (stealing), which is better for locality
  – also gives (loose) bounds on the total memory required
CS294, Yelick Load Balancing, p57
Engineering Distributed Queues
There are many papers on engineering these systems on various machines, and on their applications.
• If nothing is known about task costs when they are created (sketch below):
  – organize local tasks as a stack (push/pop from the top)
  – steal from the stack bottom (as if it were a queue)
• If something is known about task costs and communication costs, it can be used as hints. (See Wen, UCB PhD, 1996.)
• Goldstein, Rogers, Grunwald, and others (independent work) have all shown
  – the advantages of integrating into the language framework
  – very lightweight thread creation
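A sketch of the stack-discipline queue described in the first bullet; the class name is mine, and the locking is deliberately simplified compared to real work-stealing runtimes:

```python
# The owner pushes/pops at the top (good locality, small recent tasks);
# a thief steals from the bottom (older, likely larger tasks).
from collections import deque
from threading import Lock

class WorkStealingQueue:
    def __init__(self):
        self._tasks = deque()
        self._lock = Lock()

    def push(self, task):            # owner: push on top
        with self._lock:
            self._tasks.append(task)

    def pop(self):                   # owner: pop from the top (LIFO)
        with self._lock:
            return self._tasks.pop() if self._tasks else None

    def steal(self):                 # thief: take from the bottom (FIFO)
        with self._lock:
            return self._tasks.popleft() if self._tasks else None

q = WorkStealingQueue()
q.push("task-1"); q.push("task-2")
print(q.pop(), q.steal())   # owner gets task-2 (top), thief gets task-1
```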
CS294, Yelick Load Balancing, p58
Diffusion-Based Load Balancing
• In the randomized schemes, the machine is treated as fully connected.
• Diffusion-based load balancing takes topology into account
  – Locality properties better than prior work
  – Load balancing somewhat slower than randomized
  – Cost of tasks must be known at creation time
  – No dependencies between tasks
CS294, Yelick Load Balancing, p59
Diffusion-Based Load Balancing
• The machine is modeled as a graph
• At each step, we compute the weight of the tasks remaining on each processor
  – This is simply the number of tasks if they are unit-cost tasks
• Each processor compares its weight with its neighbors and performs some averaging (a first-order step is sketched below)
• See Ghosh et al., SPAA ’96 for a second-order diffusive load balancing algorithm
  – takes into account the amount of work sent last time
  – avoids some oscillation of first-order schemes
• Note: locality is not directly addressed
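A sketch of one first-order diffusion step on a small graph; the graph, loads, and diffusion parameter alpha are all illustrative:

```python
# One first-order diffusion step: each processor exchanges a fraction
# alpha of the load difference along each edge of the machine graph.
def diffusion_step(loads, neighbors, alpha=0.25):
    new = dict(loads)
    for u in loads:
        for v in neighbors[u]:
            new[u] += alpha * (loads[v] - loads[u])
    return new

neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}   # a 4-cycle
loads = {0: 40, 1: 0, 2: 0, 3: 0}
for _ in range(10):
    loads = diffusion_step(loads, neighbors)
print(loads)   # loads flow toward the average (10 per node)
```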
CS294, Yelick Load Balancing, p60
DAG Scheduling
• Some problems involve a DAG of tasks:
  – nodes represent computation (may be weighted)
  – edges represent orderings and usually communication (may also be weighted)
• Two application domains
  – Digital signal processing computations
  – Sparse direct solvers (mainly Cholesky, since it doesn’t require pivoting)
• The basic strategy: partition the DAG to minimize communication and keep all processors busy (a greedy sketch follows below)
  – NP-complete
  – See Gerasoulis and Yang, IEEE Transactions on Parallel and Distributed Systems, June ’93.
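Since the partitioning problem is NP-complete, heuristics are used in practice. The sketch below shows one common greedy heuristic (list scheduling), not the algorithm from the cited paper; it ignores communication costs and all names are illustrative:

```python
# Greedy list-scheduling sketch: visit tasks in topological order and
# place each on the processor that lets it finish earliest, respecting
# predecessor finish times. Communication costs are ignored.
def list_schedule(order, cost, preds, p):
    proc_free = [0.0] * p            # time each processor becomes free
    finish = {}                      # task -> finish time
    assignment = {}
    for task in order:
        ready = max((finish[q] for q in preds[task]), default=0.0)
        proc = min(range(p), key=lambda i: max(proc_free[i], ready))
        start = max(proc_free[proc], ready)
        finish[task] = start + cost[task]
        proc_free[proc] = finish[task]
        assignment[task] = proc
    return assignment

order = ["a", "b", "c", "d"]                     # a -> {b, c} -> d
preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
cost = {"a": 2.0, "b": 1.0, "c": 3.0, "d": 1.0}
print(list_schedule(order, cost, preds, p=2))
```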
CS294, Yelick Load Balancing, p61
Heterogeneous Machines
• Diffusion-based load balancing for heterogeneous environments
  – Fizzano, Karger, Stein, Wein
• Graduated declustering
  – Remzi Arpaci-Dusseau et al.
• And more…