CS294, Yelick Load Balancing, p1

CS 294-8 Distributed Load Balancing
http://www.cs.berkeley.edu/~yelick/294

CS294, Yelick Load Balancing, p2

Load Balancing
• Problem: distribute items into buckets
  – Data to memory locations
  – Files to disks
  – Tasks to processors
  – Web pages to caches
• Goal: even distribution
• Slides stolen from Karger at MIT: http://theory.lcs.mit.edu/~karger

CS294, Yelick Load Balancing, p3

Load Balancing
Enormous and diverse literature on load balancing:
• Computer Science systems
  – operating systems
  – parallel computing
  – distributed computing
• Computer Science theory
• Operations research (IEOR)
• Application domains

CS294, Yelick Load Balancing, p4

Agenda
• Overview
• Load Balancing Data
• Load Balancing Computation
  – (if there is time)

CS294, Yelick Load Balancing, p5

The Web
[Diagram: browsers (clients) and servers, including UCB, CMU, MIT, USC, CNN]

CS294, Yelick Load Balancing, p6

Hot Spots
[Diagram: browsers (clients) and servers, with a hot spot at UCB; labels include OceanStore, BANE, IRAM, Telegraph]

CS294, Yelick Load Balancing, p7

Temporary Loads
• For permanent loads, use a bigger server
• Must also deal with “flash crowds”
  – IBM chess match
  – Florida election tally
• Inefficient to design for max load
  – Rarely attained
  – Much capacity wasted
• Better to offload peak load elsewhere

CS294, Yelick Load Balancing, p8

Proxy Caches Balance Load
[Diagram: browsers (clients) reaching the servers (UCB, CMU, MIT, USC, CNN) through proxy caches; labels include OceanStore, BANE, IRAM, Telegraph]

CS294, Yelick Load Balancing, p9

Proxy Caching
• Old: server hit once for each browser
• New: server hit once for each page
• Adapts to changing access patterns

CS294, Yelick Load Balancing, p10

Proxy Caching
• Every server can also be a cache
• Incentives:
  – Provides a social good
  – Reduces load at sites you want to contact
• Costs you little, if done right
  – Few accesses
  – Small amount of storage (times many servers)

CS294, Yelick Load Balancing, p11

Who Caches What?
• Each cache should hold few items
  – Otherwise swamped by clients
• Each item should be in few caches
  – Otherwise the server is swamped by caches
  – And cache invalidates/updates are expensive
• Browser must know the right cache
  – Could ask the server to redirect
  – But the server gets swamped by redirects

CS294, Yelick Load Balancing, p12

Hashing
• Simple and powerful load balancing
• Constant time to find the bucket for an item
• Example: map to n buckets. Pick a, b: y = ax + b (mod n)  (see the sketch below)
• Intuition: the hash maps each item to one random bucket
  – No bucket gets many items
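
As a concrete illustration (not from the slides), here is a minimal Python sketch of this kind of modular hashing; the constants a and b and the use of Python’s built-in hash() for non-integer items are arbitrary choices.

```python
# Minimal sketch of the y = a*x + b (mod n) bucket assignment above.
# The constants a, b and the use of hash() for string items are
# illustrative choices, not part of the original slides.

def make_hash(a: int, b: int, n: int):
    """Return a function mapping an item to one of n buckets."""
    def bucket(item) -> int:
        x = item if isinstance(item, int) else hash(item)
        return (a * x + b) % n
    return bucket

h = make_hash(a=31, b=7, n=4)
for url in ["cnn.com", "mit.edu", "ucb.edu", "usc.edu", "cmu.edu"]:
    print(url, "-> cache", h(url))
```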

CS294, Yelick Load Balancing, p13

Problem: Adding Caches
• Suppose a new cache arrives
• How to work it into the hash function?
• Natural change: y = ax + b (mod n+1)
• Problem: changes the bucket for nearly every item  (see the demonstration below)
  – Every cache will be flushed
  – Server swamped with new requests
• Goal: when a bucket is added, few items move
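
A quick way to see the problem (illustrative, not from the slides): count how many random keys change buckets when the same linear hash goes from mod 4 to mod 5.

```python
import random

# Illustrative check: with y = (a*x + b) mod n, how many items change
# buckets when a 5th bucket is added? Expect roughly n/(n+1) = 80%.
random.seed(0)
a, b = 31, 7
items = [random.randrange(10**9) for _ in range(10000)]

moved = sum(1 for x in items if (a * x + b) % 4 != (a * x + b) % 5)
print(f"{moved / len(items):.0%} of items changed buckets")  # ~80%
```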

CS294, Yelick Load Balancing, p14

Problem: Inconsistent Views
• Each client knows about a different set of caches: its view
• The view affects the choice of cache for an item
• With many views, each cache will be asked for the item
• Item in all caches – swamps the server
• Goal: item in few caches despite views

CS294, Yelick Load Balancing, p15

Problem: Inconsistent Views
[Diagram: my view of the caches, numbered 0 1 2 3; item UCB hashes to ax + b (mod 4) = 2]

CS294, Yelick Load Balancing, p16

Problem: Inconsistent Views
[Diagram: Joe’s view of the same caches in a different order; UCB still hashes to ax + b (mod 4) = 2]

CS294, Yelick Load Balancing, p17

Problem: Inconsistent Views
[Diagram: Sue’s view of the caches in another order; UCB hashes to ax + b (mod 4) = 2]

CS294, Yelick Load Balancing, p18

Problem: Inconsistent Views
[Diagram: Mike’s view of the caches in yet another order; UCB hashes to ax + b (mod 4) = 2]

CS294, Yelick Load Balancing, p19

Problem: Inconsistent Views
[Diagram: in each view, bucket 2 is a different physical cache, so UCB ends up cached in many places]

CS294, Yelick Load Balancing, p20

Consistent Hashing
• A new kind of hash function
• Maps any item to a bucket in my view
• Computable in constant time, locally
  – 1 standard hash function
• Adding a bucket to a view takes log time
  – Logarithmic # of standard hash functions
• Handles incremental and inconsistent views

CS294, Yelick Load Balancing, p21

Single View Properties
• Balance: all buckets get roughly the same number of items
• Smooth: when the kth bucket is added, only a 1/k fraction of the items move
  – And only from O(log n) servers
  – The minimum needed to preserve balance

CS294, Yelick Load Balancing, p22

Multiple View Properties
• Consider n views, each of an arbitrary constant fraction of the buckets
• Load: the number of items a bucket gets from all views is O(log n) times the average
  – Despite views, load is balanced
• Spread: over all views, each item appears in O(log n) buckets
  – Despite views, few caches for each item

CS294, Yelick Load Balancing, p23

Implementation
• Use a standard hash function H to map items and caches to the unit circle
  – If H maps to [0..M], divide by M
• Map each item to the closest cache, going clockwise
• In the figure: A holds items 1, 2, 3; B holds items 4, 5
[Diagram: unit circle with caches A and B and items 1–5]

CS294, Yelick Load Balancing, p24

Implementation
• To add a new cache:
  – Hash the cache id
  – Move the items that should now be assigned to it
• Items do not move between A and B
• In the figure: A holds 3; B holds 4, 5; C holds 1, 2
[Diagram: unit circle with caches A, B, C and items 1–5]

CS294, Yelick Load Balancing, p25

Implementation
• Cache “points” are stored in a pre-computed binary tree (a sketch using a sorted list appears below)
• Lookup for a cached item requires:
  – Hash of the item key (e.g., URL)
  – BST lookup of the successor
• Consistent hashing with n caches requires O(log n) time
  – An alternative that breaks the unit circle into equal-length intervals can make this constant time
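
A minimal sketch of the scheme described on the last three slides (illustrative, not the actual Cache Resolver code): a sorted list plus binary search stands in for the balanced BST, giving the same O(log n) successor lookup, and SHA-1 stands in for the standard hash function H.

```python
import bisect
import hashlib

def circle_point(key: str) -> float:
    """Hash a string to a point on the unit circle [0, 1)."""
    h = int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")
    return h / 2**64

class ConsistentHash:
    """Sorted list of cache points; successor search stands in for the BST."""
    def __init__(self, caches=()):
        self._points = []                     # sorted (point, cache_id) pairs
        for c in caches:
            self.add_cache(c)

    def add_cache(self, cache_id: str) -> None:
        bisect.insort(self._points, (circle_point(cache_id), cache_id))

    def remove_cache(self, cache_id: str) -> None:
        self._points.remove((circle_point(cache_id), cache_id))

    def lookup(self, item_key: str) -> str:
        """Return the first cache clockwise from the item's point."""
        p = circle_point(item_key)
        i = bisect.bisect_right(self._points, (p, ""))    # successor, O(log n)
        return self._points[i % len(self._points)][1]     # wrap around the circle

ring = ConsistentHash(["cache-A", "cache-B"])
print(ring.lookup("http://www.cs.berkeley.edu/~yelick/294"))
ring.add_cache("cache-C")   # only items between C and its predecessor move
```

Removing a failed cache (the fault-tolerance slide below) is the same operation in reverse: delete its point and lookups fall through to the next cache clockwise.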

CS294, Yelick Load Balancing, p26

Balance
• Cache points are uniformly distributed by H
• Each cache “owns” an equal portion of the unit circle
• Item positions are random by H
• So each cache gets about the same number of items

CS294, Yelick Load Balancing, p27

Smoothness
• To add the kth cache, hash it to the circle
• It captures the items between it and the nearest cache
  – A 1/k fraction of the total items
  – Only from 1 other bucket
  – O(log n) to find it, as with lookup

CS294, Yelick Load Balancing, p28

Low Spread
• Some views might not see the nearest cache to an item, and hash it elsewhere
• But every view will have a bucket near the item (on the circle), by “random” placement
• So only buckets near the item will ever have to hold it
• Only a few buckets are near the item, by “random” placement

CS294, Yelick Load Balancing, p29

Low Load
• A cache only gets item I if no other cache is closer to I
• Under any view, some cache is close to I, by random placement of caches
• So a cache only gets items close to it
• But an item is unlikely to be close
• So a cache doesn’t get many items

CS294, Yelick Load Balancing, p30

Fault Tolerance
• Suppose the contacted cache is down
• Delete it from the cache set view (BST) and find the next closest cache in the interval
• Just a small change in view
• Even with many failures, uniform load and the other properties still hold

CS294, Yelick Load Balancing, p31

Experimental Setup
• Cache Resolver system
  – Cache machines for content
  – Users’ browsers that direct requests toward virtual caches
  – Resolution units (DNS) that use consistent hashing to map virtual caches to physical machines
• Surge web load generator from BU
• Two modes:
  – Common mode (fixed cache for a set of clients)
  – Cache Resolver mode using consistent hashing

CS294, Yelick Load Balancing, p32

Performance

CS294, Yelick Load Balancing, p33

Summary of Consistent Hashing
• Trivial to implement
• Fast to compute
• Uniformly distributes items
• Can cheaply add/remove caches
• Even with multiple views:
  – No cache gets too many items
  – Each item is in only a few caches

CS294, Yelick Load Balancing, p34

Consistent Hashing for Caching
• Works well
  – Each client maps the known caches to the unit circle
  – When an item arrives, hash it to a cache
  – A server gets O(log n) requests for its own pages
• Each server can also be a cache
  – Gets a small number of requests for others’ pages
• Robust to failures
  – Caches can come and go
  – Different browsers can know different caches

CS294, Yelick Load Balancing, p35

Refinement: BW Adaptation
• Browser bandwidth to machines may vary
• If bandwidth to the server is high, a browser is unwilling to use a lower-bandwidth cache
• Consistently hash an item only to caches with bandwidth as good as the server’s
• Theorem: all previous properties still hold
  – Uniform cache loads
  – Low server loads (few caches per item)

CS294, Yelick Load Balancing, p36

Refinement: Hot Pages
• What if one page gets popular?
  – The cache responsible for it gets swamped
• Use a tree of caches?
  – The cache at the root gets swamped
• Use a different tree for each page
  – Built using consistent hashing
• Balances load for hot pages and hot servers

CS294, Yelick Load Balancing, p37

Cache Tree Result
• Using cache trees of log depth, for any set of page accesses, load can be adaptively balanced such that every server gets at most log times the average load of the system (browser/server ratio)
• Modulo some theory caveats

CS294, Yelick Load Balancing, p38

Agenda
• Overview
• Load Balancing Data
• Load Balancing Computation
  – (if there is time)

CS294, Yelick Load Balancing, p39

Load Balancing Spectrum
• Task costs
  – Do all tasks have equal costs?
  – If not, when are the costs known?
• Task dependencies
  – Can all tasks be run in any order (including in parallel)?
  – If not, when are the dependencies known?
• Locality
  – Is it important for some tasks to be scheduled on the same processor (or nearby)?
  – When is the information about communication known?
• Heterogeneity
  – Are all the machines equally fast?
  – If not, when do we know their performance?

CS294, Yelick Load Balancing, p40

Task cost spectrum

CS294, Yelick Load Balancing, p41

Task Dependency Spectrum

CS294, Yelick Load Balancing, p42

Task Locality Spectrum

CS294, Yelick Load Balancing, p43

Machine Heterogeneity Spectrum
• Easy: all nodes (e.g., processors) are equally powerful
• Harder: nodes differ, but resources are fixed
  – Different physical characteristics
• Hardest: nodes change dynamically
  – Other loads on the system (dynamic)
  – Data layout (inner vs. outer track on disks)

CS294, Yelick Load Balancing, p44

Spectrum of Solutions
When is the load balancing information known?
• Static scheduling. All information is available to the scheduling algorithm, which runs before any real computation starts. (Offline algorithms.)
• Semi-static scheduling. Information may be known at program startup, at the beginning of each timestep, or at other points. Offline algorithms may be used.
• Dynamic scheduling. Information is not known until mid-execution. (Online algorithms.)

CS294, Yelick Load Balancing, p45

Approaches
• Static load balancing
• Semi-static load balancing
• Self-scheduling
• Distributed task queues
• Diffusion-based load balancing
• DAG scheduling
Note: these are not all-inclusive, but represent some of the problems for which good solutions exist.

CS294, Yelick Load Balancing, p46

Static Load Balancing
• Static load balancing is used when all information is available in advance, e.g.:
  – dense matrix algorithms, such as LU factorization
    • done using a blocked/cyclic layout
    • blocked for locality, cyclic for load balance
  – most computations on a regular mesh, e.g., FFT
    • done using a cyclic+transpose+blocked layout for 1D
    • similar for higher dimensions, i.e., with transpose
  – explicit methods and iterative methods on an unstructured mesh
    • use graph partitioning
    • assumes the graph does not change over time (or at least not within a timestep during an iterative solve)

CS294, Yelick Load Balancing, p47

Semi-Static Load Balance
• If the domain changes slowly over time and locality is important
• Often used in:
  – particle simulations, particle-in-cell (PIC) methods
    • poor locality may be more of a problem than load imbalance as particles move from one grid partition to another
  – tree-structured computations (Barnes-Hut, etc.)
  – grid computations with a dynamically changing grid that changes slowly

CS294, Yelick Load Balancing, p48

Self-Scheduling
• Self-scheduling (see the sketch after this slide):
  – Keep a pool of tasks that are available to run
  – When a processor completes its current task, it looks at the pool
  – If the computation of one task generates more tasks, add them to the pool
• Originally used for:
  – Scheduling loops by the compiler (really the runtime system)
  – Original paper by Tang and Yew, ICPP 1986
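
A minimal sketch of the idea (illustrative, not from the original papers): worker threads pull tasks from a shared pool, and a finished task may push new tasks back into it. The task sizes and the chance of spawning children are made up for the example.

```python
import queue
import random
import threading
import time

# Centralized shared task pool: fine on a shared memory multiprocessor.
pool = queue.Queue()
for size in range(1, 21):
    pool.put(size)                      # hypothetical tasks of varying cost

def run_task(size):
    time.sleep(0.01 * size)             # stand-in for real work
    # A task may generate more work; return any child tasks it creates.
    return [size // 2] if size > 10 and random.random() < 0.3 else []

def worker():
    while True:
        size = pool.get()               # block until a task is available
        for child in run_task(size):
            pool.put(child)             # child tasks go back into the pool
        pool.task_done()

for _ in range(4):                      # four "processors"
    threading.Thread(target=worker, daemon=True).start()

pool.join()                             # returns once every task has finished
print("all tasks done")
```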

CS294, Yelick Load Balancing, p49

When to Use Self-Scheduling
Useful when:
• You have a batch (or set) of tasks without dependencies
  – can also be used with dependencies, but most analysis has only been done for task sets without dependencies
• The cost of each task is unknown
• Locality is not important
• You are using a shared memory multiprocessor, so a centralized solution is fine

CS294, Yelick Load Balancing, p50

Variations on Self-Scheduling
• Typically, you don’t want to grab the smallest unit of parallel work
• Instead, choose a chunk of tasks of size K
  – If K is large, the access overhead for the task queue is small
  – If K is small, we are likely to have even finish times (load balance)
• Variations:
  – Use a fixed chunk size
  – Guided self-scheduling
  – Tapering
  – Weighted factoring
  – Note: there are more

CS294, Yelick Load Balancing, p51

V1: Fixed Chunk Size
• Kruskal and Weiss give a technique for computing the optimal chunk size
• Requires a lot of information about the problem characteristics
  – e.g., task costs, number of tasks
• Results in an offline algorithm; not very useful in practice
  – For use in a compiler, for example, the compiler would have to estimate the cost of each task
  – All tasks must be known in advance

CS294, Yelick Load Balancing, p52

V2: Guided Self-Scheduling
• Idea: use larger chunks at the beginning to avoid excessive overhead, and smaller chunks near the end to even out the finish times
• The chunk size Ki at the ith access to the task pool is given by Ki = ceiling(Ri/p)  (see the sketch below)
  – where Ri is the total number of tasks remaining and p is the number of processors
• See Polychronopoulos, “Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers,” IEEE Transactions on Computers, Dec. 1987.
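
A minimal sketch of the chunk-size rule (illustrative): generate successive chunk sizes Ki = ceiling(Ri/p) until the pool is empty.

```python
from math import ceil

def gss_chunks(total_tasks: int, p: int) -> list:
    """Chunk sizes under guided self-scheduling: K_i = ceil(R_i / p)."""
    chunks, remaining = [], total_tasks
    while remaining > 0:
        k = ceil(remaining / p)
        chunks.append(k)
        remaining -= k
    return chunks

# With 100 tasks and 4 processors, chunks start large and taper off to 1:
# [25, 19, 14, 11, 8, 6, 5, 3, 3, 2, 1, 1, 1, 1]
print(gss_chunks(100, 4))
```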

CS294, Yelick Load Balancing, p53

V3: Tapering
• Idea: the chunk size Ki is a function of not only the remaining work, but also the task cost variance
  – variance is estimated using history information
  – high variance => a small chunk size should be used
  – low variance => larger chunks are OK
• See S. Lucco, “Adaptive Parallel Programs,” PhD Thesis, UCB, CSD-95-864, 1994.
  – Gives analysis (based on workload distribution)
  – Also gives experimental results: tapering always works at least as well as GSS, although the difference is often small

CS294, Yelick Load Balancing, p54

V4: Weighted Factoring
• Idea: similar to self-scheduling, but divide task cost by the computational power of the requesting node (a rough sketch follows below)
• Useful for heterogeneous systems
• Also useful for shared-resource NOWs, e.g., built using all the machines in a building
  – as with tapering, historical information is used to predict future speed
  – “speed” may depend on the other loads currently on a given processor
• See Hummel, Schmidt, Uma, and Wein, SPAA ‘96
  – includes experimental data and analysis
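
A rough sketch of the flavor of the idea, under the assumption (not taken from the SPAA ’96 paper) that each batch hands out about half of the remaining tasks, split among processors in proportion to their relative speeds:

```python
from math import ceil

def weighted_factoring_batches(total_tasks, speeds):
    """Sketch of weighted factoring as commonly described (an assumption,
    not the paper's exact formulation): each batch hands out roughly half
    of the remaining tasks, split in proportion to processor speeds."""
    remaining = total_tasks
    total_speed = sum(speeds)
    batches = []
    while remaining > 0:
        target = max(len(speeds), remaining // 2)  # shrink toward 1 task per processor
        batch = []
        for s in speeds:
            chunk = min(remaining, max(1, ceil(target * s / total_speed)))
            batch.append(chunk)
            remaining -= chunk
            if remaining == 0:
                break
        batches.append(batch)
    return batches

# Two fast processors and two slow ones (relative speeds are illustrative).
for batch in weighted_factoring_batches(100, [2.0, 2.0, 1.0, 1.0]):
    print(batch)
```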

CS294, Yelick Load Balancing, p55

V5: Distributed Task Queues
• The obvious extension of self-scheduling to distributed memory is a distributed task queue (or bag)
• When are these a good idea?
  – Distributed memory multiprocessors
  – Or shared memory with significant synchronization overhead
  – Locality is not (very) important
  – Tasks may be:
    • known in advance, e.g., a bag of independent ones
    • generated on the fly, i.e., dependencies exist
  – The costs of tasks are not known in advance

CS294, Yelick Load Balancing, p56

Theory of Distributed Queues
Main result: a simple randomized algorithm is optimal with high probability
• Karp and Zhang [88] show this for a tree of unit cost (equal size) tasks
• Chakrabarti et al. [94] show this for a tree of variable cost tasks
  – using randomized pushing of tasks
• Blumofe and Leiserson [94] show this for a fixed task tree of variable cost tasks
  – uses task pulling (stealing), which is better for locality
  – also gives (loose) bounds on the total memory required

CS294, Yelick Load Balancing, p57

Engineering Distributed Queues
A lot of papers on engineering these systems on various machines, and their applications
• If nothing is known about task costs when they are created (see the sketch after this slide):
  – organize local tasks as a stack (push/pop from the top)
  – steal from the bottom of the stack (as if it were a queue)
• If something is known about task costs and communication costs, it can be used as hints (see Wen, UCB PhD, 1996)
• Goldstein, Rogers, Grunwald, and others (in independent work) have all shown:
  – advantages of integrating into the language framework
  – very lightweight thread creation
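
A minimal sketch of the stack-plus-steal discipline (illustrative; real work stealers use carefully engineered, often lock-free deques rather than a coarse lock):

```python
import collections
import random
import threading

class WorkStealingDeque:
    """Owner pushes/pops at the top; thieves steal from the bottom.
    A coarse lock stands in for the lock-free protocols used in practice."""
    def __init__(self):
        self._tasks = collections.deque()
        self._lock = threading.Lock()

    def push(self, task):               # owner: newest work goes on top
        with self._lock:
            self._tasks.append(task)

    def pop(self):                      # owner: take newest (good locality)
        with self._lock:
            return self._tasks.pop() if self._tasks else None

    def steal(self):                    # thief: take oldest, likely largest
        with self._lock:
            return self._tasks.popleft() if self._tasks else None

# One deque per "processor"; an idle processor steals from a random victim.
deques = [WorkStealingDeque() for _ in range(4)]
for i in range(20):
    deques[0].push(f"task-{i}")         # all work starts on processor 0

def next_task(my_id):
    task = deques[my_id].pop()
    if task is None:
        victim = random.choice([d for i, d in enumerate(deques) if i != my_id])
        task = victim.steal()
    return task

print(next_task(2))   # processor 2 is idle, so it steals task-0 from the bottom
```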

CS294, Yelick Load Balancing, p58

Diffusion-Based Load Balancing
• In the randomized schemes, the machine is treated as fully connected
• Diffusion-based load balancing takes topology into account
  – Better locality properties than prior work
  – Load balancing somewhat slower than randomized schemes
  – Cost of tasks must be known at creation time
  – No dependencies between tasks

CS294, Yelick Load Balancing, p59

Diffusion-Based Load Balancing
• The machine is modeled as a graph
• At each step, we compute the weight of the tasks remaining on each processor
  – This is simply the number of tasks if they are unit cost
• Each processor compares its weight with its neighbors and performs some averaging (see the sketch after this slide)
• See Ghosh et al., SPAA ’96 for a second-order diffusive load balancing algorithm
  – takes into account the amount of work sent last time
  – avoids some oscillation of first-order schemes
• Note: locality is not directly addressed
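
A minimal sketch of a first-order diffusion step (illustrative; the diffusion coefficient alpha is an assumed parameter, not taken from a particular paper): each processor moves a fraction of its load difference toward each neighbor.

```python
def diffusion_step(load, neighbors, alpha=0.25):
    """One first-order diffusion step on a processor graph.

    load[p] is the work remaining on processor p; neighbors[p] lists its
    neighbors; alpha is an illustrative diffusion coefficient (small enough
    that no processor gives away more than it has).
    """
    new_load = dict(load)
    for p, nbrs in neighbors.items():
        for q in nbrs:
            # Move a fraction of the (signed) difference from p toward q.
            new_load[p] -= alpha * (load[p] - load[q])
    return new_load

# A 4-processor ring with all the work initially on processor 0.
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
load = {0: 100.0, 1: 0.0, 2: 0.0, 3: 0.0}
for step in range(20):
    load = diffusion_step(load, neighbors)
print({p: round(w, 1) for p, w in load.items()})   # loads approach 25 each
```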

CS294, Yelick Load Balancing, p60

DAG Scheduling
• Some problems involve a DAG of tasks:
  – nodes represent computation (may be weighted)
  – edges represent orderings and usually communication (may also be weighted)
• Two application domains:
  – Digital signal processing computations
  – Sparse direct solvers (mainly Cholesky, since it doesn’t require pivoting)
• The basic strategy: partition the DAG to minimize communication and keep all processors busy (a generic list-scheduling sketch follows below)
  – NP-complete
  – See Gerasoulis and Yang, IEEE Transactions on Parallel and Distributed Systems, June ‘93.
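
Since the problem is NP-complete, practical schedulers rely on heuristics. A minimal sketch of a generic list-scheduling heuristic (illustrative, not the Gerasoulis-Yang algorithm): visit tasks in topological order and place each on the processor that can start it earliest, charging a fixed communication delay when a predecessor ran elsewhere.

```python
from collections import defaultdict

def topo_order(cost, deps):
    """Topological order of the tasks in cost, given predecessor lists."""
    indeg = {t: len(deps.get(t, [])) for t in cost}
    succs = defaultdict(list)
    for t, ds in deps.items():
        for d in ds:
            succs[d].append(t)
    ready = [t for t, d in indeg.items() if d == 0]
    order = []
    while ready:
        t = ready.pop()
        order.append(t)
        for s in succs[t]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return order

def list_schedule(cost, deps, p=2, comm=1.0):
    """Greedy list scheduling of a weighted task DAG.

    cost[t]: computation weight of task t; deps[t]: predecessors of t;
    p: processor count; comm: assumed fixed communication delay charged
    when a predecessor ran on a different processor.
    """
    proc_free = [0.0] * p                 # time each processor becomes free
    placed, finish = {}, {}
    for t in topo_order(cost, deps):
        best = None
        for q in range(p):
            ready = max([proc_free[q]] +
                        [finish[d] + (0.0 if placed[d] == q else comm)
                         for d in deps.get(t, [])])
            if best is None or ready < best[0]:
                best = (ready, q)
        start, q = best
        placed[t], finish[t] = q, start + cost[t]
        proc_free[q] = finish[t]
    return placed, finish

# Tiny made-up DAG: a fans out to b and c, which join at d.
cost = {"a": 2.0, "b": 3.0, "c": 1.0, "d": 2.0}
deps = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
print(list_schedule(cost, deps))
```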

CS294, Yelick Load Balancing, p61

Heterogeneous Machines
• Diffusion-based load balancing for heterogeneous environments
  – Fizzano, Karger, Stein, Wein
• Graduated declustering
  – Remzi Arpaci-Dusseau et al.
• And more…

