
Chapter 10

Work-Stealing

10.1 Introduction

In this chapter we look at the problem of scheduling multithreaded computations. This problem is interesting in its own right, and we will see that solutions to the problem require the design and implementation of novel lock-free concurrent objects. A delightful aspect of this topic is that it combines well-established practical results with a non-trivial mathematical foundation.

Figure 10.1 shows one way to decompose the well-known Fibonacci function into a multithreaded program. This implementation is an extremely inefficient way to compute Fibonacci numbers, but we use it here to illustrate the basic principles of multithreaded programming. We use the standard Java thread package.1 To compute the n-th Fibonacci number, we create a new Fib object:

Fib f = new Fib(10);

A Fib object extends java.lang.Thread, so we can start it in parallel with the main computation:

f.start();

Later, when we want the result, we call its join method. This method pauses the caller until the Fib object has completed its computation, and it is safe to pick up the result.

f.join();
System.out.println("10-th Fibonacci number is: " + f.result);

The Fib object’s run method creates two child Fib objects and starts them in parallel. The parent cannot use the results computed by its children until those children join the parent. The parent then sums the children’s results, and halts.

0This chapter is part of the manuscript Multiprocessor Synchronization by Maurice Herlihy and Nir Shavit. The current text is for your personal use and not for distribution outside our classroom.

1For clarity, some exception handling is omitted from figures.



public class Fib extends Thread {
  public int arg;
  public int result;

  public Fib(int n) {
    arg = n;
    result = -1;
  }

  public void run() {
    if (arg < 2) {
      result = arg;
    } else {
      Fib left = new Fib(arg-1);
      Fib right = new Fib(arg-2);
      left.start(); right.start();
      try {
        left.join(); right.join();
      } catch (InterruptedException e) {}
      result = left.result + right.result;
    }
  }
}

Figure 10.1: Multithreaded Fibonacci


Figure 10.2: Multithreaded Fibonacci DAG

Notice that starting and joining threads in Java does not guarantee that any computations actually happen in parallel. Instead, you should think of these methods as advisory: they tell an underlying scheduler program that it may execute these programs in parallel. This chapter is concerned with the design and implementation of effective schedulers.

10.2 Model

It is convenient to model a multithreaded computation as a directed acyclic graph (DAG), where each node represents an atomic step, and edges represent dependencies. For example, a single thread is just a linear chain of nodes. A step corresponding to a start method call has two outgoing edges: one to its successor in the same thread, and one to the first step of the newly-started thread. There is an edge from a child thread’s last step to the parent thread’s step in which it calls the child’s join method. Figure 10.2 shows the DAG corresponding to a short Fibonacci execution. All the DAGs considered here have out-degree at most two.

Clearly, some computations are more parallel than others. We now consider some ways to make such notions precise. Let TP be the minimum number of steps needed to execute a multithreaded program on a system of P dedicated processors. Note that TP is an idealized measure: it may not always be possible for every processor to “find” steps to execute, and actual computation time may be partly determined by other concerns, such as memory usage. Nevertheless, TP is clearly a lower bound on how much parallelism one can extract from a multithreaded computation.

Some values of P are important enough that they have special names. T1, the number of steps needed to execute the program on a single processor, is called the computation’s work. Work is also the total number of steps in the entire computation. In one step, P processors can execute at most P steps, so

TP ≥ T1/P.

The other extreme is also of special importance: T∞, the number of steps to execute the program on an unlimited number of processors, is called the critical-path length. Because finite resources cannot do better than infinite resources,

TP ≥ T∞.

The speedup on P processors is the following ratio:

T1/TP

We say a computation has linear speedup if T1/TP = Θ(P). Finally, the parallelism of a computation is the maximum possible speedup: T1/T∞.

To illustrate these concepts, we now examine a simple multithreaded matrix multiplication program. Matrix multiplication can be decomposed as follows:

( A11 A12 )   ( B11 B12 )   ( C11 C12 )
(         ) = (         ) · (         )
( A21 A22 )   ( B21 B22 )   ( C21 C22 )

              ( B11C11 + B12C21   B11C12 + B12C22 )
            = (                                   )
              ( B21C11 + B22C21   B21C12 + B22C22 ).

To turn this observation into code, assume we have a Matrix class, with put and get methods to access elements. This class also provides a method that splits an n-by-n matrix into four (n/2)-by-(n/2) submatrices:

public Matrix[][] split() { ... }

In Java terminology, the four submatrices are “backed” by the original matrix, meaning that changes to the submatrices are reflected in the matrix, and vice-versa. This method can be implemented to take constant time by appropriate repositioning of indexes (left as an exercise for the reader). The code for multithreaded matrix addition appears in Figure 10.3, and multiplication in Figure 10.4.
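Such a constant-time split can be sketched as follows. This Matrix class is only an illustration of the idea: the backing array, the rowOff/colOff offset fields, and the private view constructor are assumptions of this sketch, not part of the text (which leaves the implementation as an exercise).

```java
// Illustrative sketch: a Matrix whose split() returns "backed" submatrix
// views in constant time. Field names here are assumptions, not the text's.
public class Matrix {
  private final double[][] data;        // storage shared by all views
  private final int rowOff, colOff, dim;

  public Matrix(int dim) {
    this.data = new double[dim][dim];
    this.rowOff = 0; this.colOff = 0; this.dim = dim;
  }
  // private constructor for views: same array, different offsets
  private Matrix(double[][] data, int rowOff, int colOff, int dim) {
    this.data = data; this.rowOff = rowOff; this.colOff = colOff; this.dim = dim;
  }
  public int getDimension() { return dim; }
  public double get(int i, int j) { return data[rowOff + i][colOff + j]; }
  public void put(int i, int j, double v) { data[rowOff + i][colOff + j] = v; }

  // Four (n/2)-by-(n/2) views backed by the same array: no copying.
  public Matrix[][] split() {
    int h = dim / 2;
    return new Matrix[][] {
      { new Matrix(data, rowOff,     colOff, h), new Matrix(data, rowOff,     colOff + h, h) },
      { new Matrix(data, rowOff + h, colOff, h), new Matrix(data, rowOff + h, colOff + h, h) }
    };
  }
}
```

Because all four submatrices share the parent’s array, a put through any view is immediately visible through every other view, which is exactly the backing behavior described above.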

Let AP(n) be the number of steps needed to execute Add on P processors. The work A1(n) is defined by the recurrence:

A1(n) = 4A1(n/2) + Θ(1)
      = Θ(n²).

This work is the same as the conventional doubly-nested loop implementation. The critical-path length is also easy to compute:

A∞(n) = A∞(n/2) + Θ(1)
      = Θ(log n).
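Both closed forms follow by expanding the recursion tree: level i of the Add recursion contains 4^i nodes, each doing Θ(1) work, while along any root-to-leaf path there is one Θ(1) step per level:

```latex
A_1(n) = \sum_{i=0}^{\log_2 n} 4^i\,\Theta(1)
       = \Theta\bigl(4^{\log_2 n}\bigr) = \Theta(n^2),
\qquad
A_\infty(n) = \sum_{i=0}^{\log_2 n} \Theta(1) = \Theta(\log n).
```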


public class Add extends Thread {
  public Matrix sum, arg;

  public Add(Matrix sum, Matrix arg) {
    this.sum = sum;
    this.arg = arg;
  }

  public void run() {
    if (sum.getDimension() == 1) {
      sum.put(0, 0, sum.get(0,0) + arg.get(0,0));
    } else {
      Matrix[][] s = this.sum.split();
      Matrix[][] a = this.arg.split();
      // create children
      Add[] child = {
        new Add(s[0][0], a[0][0]),
        new Add(s[0][1], a[0][1]),
        new Add(s[1][0], a[1][0]),
        new Add(s[1][1], a[1][1])
      };
      // start children
      for (int i = 0; i < child.length; i++)
        child[i].start();
      // join children
      for (int i = 0; i < child.length; i++) {
        try {
          child[i].join();
        } catch (InterruptedException e) {}
      }
    }
  }
}

Figure 10.3: Matrix addition


public class Mult extends Thread {
  public Matrix prod, arg0, arg1;

  public Mult(Matrix prod, Matrix arg0, Matrix arg1) {
    this.prod = prod;
    this.arg0 = arg0;
    this.arg1 = arg1;
  }

  public void run() {
    int n = prod.getDimension();
    if (n == 1) {
      prod.put(0, 0, arg0.get(0,0) * arg1.get(0,0));
    } else {
      Matrix tmp = new Matrix(n,n);
      Matrix[][] r = this.prod.split();
      Matrix[][] a = this.arg0.split();
      Matrix[][] b = this.arg1.split();
      Matrix[][] t = tmp.split();
      // create children
      Mult[] child = {
        new Mult(r[0][0], a[0][0], b[0][0]),
        new Mult(r[0][1], a[0][0], b[0][1]),
        new Mult(r[1][0], a[1][0], b[0][0]),
        new Mult(r[1][1], a[1][0], b[0][1]),
        new Mult(t[0][0], a[0][1], b[1][0]),
        new Mult(t[0][1], a[0][1], b[1][1]),
        new Mult(t[1][0], a[1][1], b[1][0]),
        new Mult(t[1][1], a[1][1], b[1][1])
      };
      // start children
      for (int i = 0; i < child.length; i++)
        child[i].start();
      // join children
      for (int i = 0; i < child.length; i++) {
        try {
          child[i].join();
        } catch (InterruptedException e) {}
      }
      // add the temporary matrix into the product
      Add add = new Add(prod, tmp);
      add.start();
      try {
        add.join();
      } catch (InterruptedException e) {}
    }
  }
}

Figure 10.4: Matrix multiplication


This claim follows because each of the half-size additions is performed in parallel with the others.

Let MP(n) be the number of steps needed to execute Mult on P processors. The work M1(n) is defined by the recurrence:

M1(n) = 8M1(n/2) + A1(n)
      = 8M1(n/2) + Θ(n²)
      = Θ(n³).

This work is also the same as the conventional triply-nested loop implementation. The critical-path length is:

M∞(n) = M∞(n/2) + A∞(n)
      = M∞(n/2) + Θ(log n)
      = Θ(log² n).

This claim follows because the half-size multiplications are performed in parallel, followed by a single addition.

The parallelism for the Mult program is

M1(n)/M∞(n) = Θ(n³/log² n),

which is pretty high. For example, suppose we want to multiply two 1000-by-1000 matrices. Here, n³ = 10⁹, and log n = log 1000 ≈ 10 (logs are base two), so the parallelism is approximately 10⁹/10² = 10⁷. Roughly speaking, this instance of matrix multiplication could, in principle, occupy roughly 10⁷ processors, well beyond the powers of any existing multiprocessor.

You should understand that the parallelism computation given above is a highly idealized upper bound on the performance of any multithreaded matrix multiplication program. For example, when there are idle threads, it may not be easy to assign those threads to idle processors. Moreover, a program that displays less parallelism but consumes less memory may perform better because it encounters fewer page faults. The actual performance of a multithreaded computation remains a complex engineering problem, but the kind of analysis presented in this section is an indispensable first step in understanding the degree to which a problem can be solved in parallel.

10.3 Realistic Multiprocessor Scheduling

Our analysis so far has been based on the assumption that each multithreaded program has P dedicated processors. This assumption, unfortunately, does not correspond to the way shared-memory multiprocessors are used in real life. Multiprocessors typically run a mix of jobs, where jobs come and go dynamically. One might start, say, a matrix multiplication job on P processors. At some point, the operating system decides to download mail, preempting one processor, and our job is now running on P − 1 processors. The mail program pauses waiting for a disk read or write to complete, and in the interim the matrix program has P processors again.

Most operating systems provide user-level processes, where a process consists of a program counter (like a thread) and usually an address space. The operating system kernel includes a scheduler that runs user-level processes on physical processors. The mapping between processes and processors, and when the processes are scheduled, is typically not under the control of the application.

One approach is to set up a one-to-one correspondence between application-level threads and processes: creating a new thread creates a new process, and ending a thread ends that process. This approach, however, is impractical because process creation is expensive. Instead, it makes more sense to create a fixed collection of relatively long-lived processes to execute the varying collection of short-lived threads.

We end up with a three-level model. At the top level, we write multithreaded programs (such as matrix multiplication) that decompose an application into a dynamically-varying number of logical threads. At the middle level, we write a user-level scheduler that maps threads to a fixed number P of processes. At the bottom level, the kernel maps our P user-level processes onto a dynamically-varying number of processors. This last level is not under our control: applications cannot tell the kernel how to schedule them, and most modern operating system kernels are anyway hidden from users. Our challenge here is to define a user-level scheduler that makes the best use of an unknown kernel scheduling policy.

Let us assume for now that the kernel works in discrete steps. (This discrete-step assumption is not required for correctness, but it makes the analysis easier.) At each step i, the kernel chooses an arbitrary subset of 0 ≤ pi ≤ P user-level processes to run for one step. The processor average PA over T steps is defined to be

    PA = (1/T) · Σ_{i=0}^{T−1} pi.    (10.1)

Instead of designing a user-level scheduler to achieve a P-fold speedup, we can try to achieve a PA-fold speedup.

A schedule is greedy if the number of program steps executed at each time step is the minimum of pi, the number of available processors, and the number of ready nodes in the program DAG.
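To see the definition in action, here is a small simulation sketch (illustrative, not from the text): it executes a greedy schedule over a DAG, where the kernel’s processor count for each step is drawn cyclically from pPerStep, and returns the schedule length T. The class name and DAG representation are assumptions of this sketch, and each entry of pPerStep is assumed to be at least 1.

```java
import java.util.*;

// Illustrative greedy-schedule simulation (not from the text).
// Nodes are 0..n-1; edges.get(i) lists the successors of node i.
public class GreedySim {
  public static int run(List<List<Integer>> edges, int[] pPerStep) {
    int n = edges.size();
    int[] indeg = new int[n];
    for (List<Integer> succ : edges)
      for (int v : succ) indeg[v]++;
    Deque<Integer> ready = new ArrayDeque<>();
    for (int v = 0; v < n; v++)
      if (indeg[v] == 0) ready.add(v);       // initially ready nodes
    int done = 0, step = 0;
    while (done < n) {
      int p = pPerStep[step % pPerStep.length]; // kernel's choice this step
      int exec = Math.min(p, ready.size());     // greedy: run as many as possible
      for (int k = 0; k < exec; k++) {
        int v = ready.remove();
        done++;
        for (int w : edges.get(v))              // newly ready successors
          if (--indeg[w] == 0) ready.add(w);
      }
      step++;
    }
    return step; // schedule length T
  }
}
```

On the diamond DAG 0 → {1, 2} → 3 (so T1 = 4 and T∞ = 3), a single processor needs 4 steps while two processors need only 3.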

Theorem 10.3.1 Consider a multithreaded program with work T1, critical-path length T∞, and P user-level processes. Any greedy execution has length at most

    T1/PA + T∞(P − 1)/PA.

Proof: Equation 10.1 implies that:

    T = (1/PA) · Σ_{i=0}^{T−1} pi.


We will bound T by bounding the sum of the pi. At each kernel-level step, imagine placing tokens in one of two buckets. For each user-level process that executes a node at step i, we place a token in a work bucket, and for each process that remains idle at that step, we place a token in an idle bucket. After the last step, the work bucket contains T1 tokens, one for each node of the computation DAG. How many tokens does the idle bucket contain?

An idle step is one in which some process places a token in the idle bucket. Because the schedule is greedy, there is at least one process with a ready node at each idle step, so of the pi processes scheduled at step i, at most pi − 1 ≤ P − 1 can be idle. Let Gi be the sub-DAG of the computation consisting of the nodes that have not been executed at the end of step i.

Every node with in-degree 0 in Gi−1 was ready at the start of step i. If step i is idle, there must be fewer than pi such nodes, because otherwise the greedy schedule would execute pi of them, and step i would not be idle. It follows that at every idle step, the longest directed path in Gi is one shorter than the longest directed path in Gi−1. The longest directed path before step 0 is T∞, so the greedy schedule can have at most T∞ idle steps. Combining these observations shows that the idle bucket contains at most T∞(P − 1) tokens.

The total number of tokens in both buckets is therefore

    Σ_{i=0}^{T−1} pi ≤ T1 + T∞(P − 1),

yielding the desired bound.

It turns out that this bound is within a factor of two of optimal. Actually achieving an optimal schedule is NP-complete, so greedy schedules are a simple and practical way to get performance that is reasonably close to optimal.

10.4 Work Stealing

We now understand that if we keep the user-level processes supplied with work, then the resulting schedule is greedy, and our multithreaded application will achieve a pretty good speedup. Multithreaded computations, however, create and destroy threads dynamically, sometimes in unpredictable ways. We still do not know how to connect ready threads with idle processes.

There are two basic approaches to keeping processes busy. In work sharing, processes distribute surplus work to other processes, with the goal of ensuring that all processes are assigned approximately the same amount of work. In work stealing, a process that runs out of work will try to “steal” work from other processes. For now, we focus on work stealing, which has the attractive feature that no inter-process synchronization is needed if all processes have plenty of work.

Each process keeps a pool of ready threads in the form of a double-ended queue (DEQueue), providing pushBottom, popBottom, and popTop methods (we do not need a pushTop method). The process that owns a DEQueue is


public class DEQueue {
  longRMWregister top; // tag & top thread index
  int bottom;          // bottom thread index
  Thread[] deq;        // array of threads

  public class Abort extends java.lang.Exception {};

  // extract tag field from top
  private static final long TAG_MASK  = 0xFFFF0000L;
  private static final int  TAG_SHIFT = 16;
  private int getTag(long i) {
    return (int)((i & TAG_MASK) >>> TAG_SHIFT);
  }

  // extract index field from top
  private static final long INDEX_MASK  = 0x0000FFFFL;
  private static final int  INDEX_SHIFT = 0;
  private int getIndex(long i) {
    return (int)((i & INDEX_MASK) >>> INDEX_SHIFT);
  }

  // combine tag and index to form new top
  private long makeTop(int tag, int index) {
    return ((long)tag << TAG_SHIFT) | ((long)index << INDEX_SHIFT);
  }
  ...
}

Figure 10.5: Manipulating the top field


public class DEQueue {
  longRMWregister top; // tag & top thread index
  int bottom;          // bottom thread index
  Thread[] deq;        // array of threads
  ...
  /**
   * called by local thread to set aside work
   **/
  public void pushBottom(Thread t) {
    this.deq[this.bottom] = t; // store object
    this.bottom++;             // advance bottom
  }
  ...
}

Figure 10.6: The pushBottom method

called the local process for that object. When the local process creates a new thread, it calls pushBottom to push that thread onto the DEQueue. When the local process needs more work, it calls popBottom to remove a thread from the DEQueue. If the local process discovers its DEQueue is empty, then it becomes a thief: it chooses a victim process at random, and calls that object’s popTop method, attempting to pop a thread from the top of that DEQueue.

Ideally, we would like an efficient, wait-free, linearizable DEQueue implementation. In practice, we have to settle for slightly weaker conditions. Our implementation of popTop may throw an exception if a concurrent popTop call succeeds, or if a concurrent popBottom takes the last thread in the DEQueue.

The DEQueue class has three fields: top, bottom, and deq. The top field is a long integer encompassing two subfields. In the top field, the high-order 16 bits constitute the tag value, while the low-order 16 bits constitute the index value. Figure 10.5 shows how to extract the tag and index values. The tag field is needed to avoid the “ABA” problem examined in the previous section.
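As a sanity check on this encoding, the following standalone sketch (the TopField class name is illustrative) packs and unpacks the two subfields; note that disjoint bit fields must be combined with a bitwise or, and extracted with an unsigned shift:

```java
// Standalone demo of the tag/index packing used by the top field.
// Same layout as Figure 10.5: high 16 bits tag, low 16 bits index.
public class TopField {
  static final long TAG_MASK   = 0xFFFF0000L;
  static final int  TAG_SHIFT  = 16;
  static final long INDEX_MASK = 0x0000FFFFL;

  static int getTag(long top)   { return (int)((top & TAG_MASK) >>> TAG_SHIFT); }
  static int getIndex(long top) { return (int)(top & INDEX_MASK); }

  static long makeTop(int tag, int index) {
    // combine with OR: AND-ing two disjoint fields would always give zero
    return ((long)tag << TAG_SHIFT) | (index & INDEX_MASK);
  }
}
```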

The pushBottom method (Figure 10.6) simply stores the new thread at the bottom queue location and increments the bottom field. The popBottom method (Figure 10.7) is more complex. It tests whether the DEQueue contains more than one thread. If so, then it returns the bottom thread without performing a CAS. Otherwise, there is a danger that the bottom thread may be stolen. The method tries to set both top and bottom fields to zero. If it succeeds, it returns the last remaining thread, and otherwise, that thread has been stolen, and the method returns null. The important aspect of this protocol is that an expensive CAS operation is needed only when the DEQueue is almost empty.


public class DEQueue {
  longRMWregister top; // tag & top thread index
  int bottom;          // bottom thread index
  Thread[] deq;        // array of threads
  ...
  /**
   * Called by local thread to get more work
   **/
  Thread popBottom() {
    // is the queue empty?
    if (this.bottom == 0) // empty
      return null;
    this.bottom--;
    int bottom = this.bottom; // local snapshot of the new bottom
    Thread t = this.deq[bottom];
    long oldTop = this.top.read();
    if (bottom > getIndex(oldTop)) // more than one thread left
      return t;
    long newTop = makeTop(getTag(oldTop) + 1, 0); // bump tag to avoid ABA
    this.bottom = 0;
    if (bottom == getIndex(oldTop)) // contend with thieves for last thread
      if (this.top.CAS(oldTop, newTop))
        return t;
    this.top.write(newTop);
    return null;
  }
}

Figure 10.7: The popBottom method


public class DEQueue {
  longRMWregister top; // tag & top thread index
  int bottom;          // bottom thread index
  Thread[] deq;        // array of threads
  ...
  /**
   * Called by thieves to try to steal a thread
   **/
  Thread popTop() throws Abort {
    long oldTop = this.top.read();
    int bottom = this.bottom;
    if (bottom <= getIndex(oldTop)) // empty
      return null;
    Thread t = this.deq[getIndex(oldTop)];
    long newTop = makeTop(getTag(oldTop), getIndex(oldTop) + 1);
    if (this.top.CAS(oldTop, newTop))
      return t;
    throw new Abort();
  }
  ...
}

Figure 10.8: The popTop method


The popTop method (Figure 10.8) checks whether the DEQueue is empty, and if not, tries to steal the top element by applying CAS to the top field. If the CAS succeeds, the theft is successful, and otherwise the method throws an exception.

10.5 The Steal-Half Protocol

In the work-stealing protocol described in the previous section, each process maintains a local work queue, and steals an item from others if its queue becomes empty. At its core is a lock-free protocol for stealing an individual item from a bounded-size queue, minimizing the need for costly CAS operations when fetching items locally.

We have seen that stealing one item at a time ensures that the time needed to execute a multithreaded computation is within a constant factor of optimal. Nevertheless, there is reason to believe that the scheme can be improved by allowing the thief process to steal multiple items from the victim. The most straightforward way to steal multiple threads is simply for the thief to call popTop multiple times. This solution is unsatisfactory because each such call requires an expensive CAS operation.

This section shows how to generalize the previous section’s algorithm to allow processes to steal up to half of the threads in a given queue at a time. This revised protocol preserves the key properties of the original: it is lock-free and it minimizes the number of CAS operations that the local process needs to perform.

As before, each process has a local work queue, called an extended double-ended queue (EDEQueue), where the local process calls pushBottom and popBottom methods, while thieves call the stealTop method. The stealTop method can return multiple threads, not just one.

The original DEQueue implementation had a nice property: as long as there is more than one item in the DEQueue, the local process can pop from the bottom of the DEQueue without an expensive CAS operation. If there is a single item in the DEQueue, then the process needs to use CAS to synchronize with potential thieves. You should think of this CAS as a consensus protocol in which the processes decide what is to become of the contested item. For any “reasonable” sequence of k pushBottom or k popBottom calls, this protocol requires a constant number of CAS operations.

If a thief process could remove up to half of the items, it may be necessary to reach consensus on the status of each item in the overlap. For any sequence of k pushBottom or k popBottom calls, the protocol that removes items one at a time would require Θ(k) CAS operations, an unacceptable overhead.

The extended deque algorithm we will present manages to steal up to half the items and pay only Θ(log k) CAS operations for any “reasonable” sequence of k pushBottom or k popBottom calls.

The extended DEQueue implementation presented here differs from the original in two ways: (1) it is implemented as a cyclic array, and (2) the top field


public class EDEQueue {
  public longRMWregister stealRange; // where to steal
  int bottom;                        // bottom thread index
  Object[] deq;                      // array of threads

  private static final int QUEUE_SIZE = 32;

  // extract tag field from stealRange
  private static final long TAG_MASK  = 0xFFFF0000L;
  private static final int  TAG_SHIFT = 16;
  private int getTag(long i) {
    return (int)((i & TAG_MASK) >>> TAG_SHIFT);
  }

  // extract top index field from stealRange
  private static final long TOP_MASK  = 0x0000FF00L;
  private static final int  TOP_SHIFT = 8;
  private int getTop(long i) {
    return (int)((i & TOP_MASK) >>> TOP_SHIFT);
  }

  // extract steal index field from stealRange
  private static final long STEAL_MASK  = 0x000000FFL;
  private static final int  STEAL_SHIFT = 0;
  private int getSteal(long i) {
    return (int)((i & STEAL_MASK) >>> STEAL_SHIFT);
  }

  // combine tag, top, and steal to form a new stealRange
  private long makeStealRange(int tag, int top, int steal) {
    return ((long)tag << TAG_SHIFT) | ((long)top << TOP_SHIFT)
         | ((long)steal << STEAL_SHIFT);
  }

  // floor of log base two
  private int log2(int x) {
    int result = 0;
    while (x > 1) {
      x = x >> 1;
      result++;
    }
    return result;
  }

  public int getSize() {
    return this.bottom - getTop(this.stealRange.read());
  }

  /**
   * Adjust steal range if needed
   * @return false if and only if a needed CAS failed
   **/
  private boolean updateStealRange(long oldStealRange) {
    int size = this.getSize();
    int logSize = log2(size); // floor of actual log
    long currentStealRange = this.stealRange.read();
    // is size a power of two, or did someone steal something?
    if (size == (1 << logSize) || oldStealRange != currentStealRange) {
      // try to update the stealRange to contain max(1, 2^(logSize-1)) threads
      int newSize = Math.max(1, (1 << (logSize - 1)));
      int top = getTop(currentStealRange);
      int tag = getTag(currentStealRange);
      return this.stealRange.CAS(currentStealRange,
                                 makeStealRange(tag + 1, top, top + newSize));
    }
    return true;
  }
  ...
}

Figure 10.9: Methods for manipulating stealRange


Figure 10.10: The extended DEQueue

is replaced by a field called stealRange, which defines the range of items that can be stolen atomically by a thief process. The stealRange field has three subfields: tag is used to avoid the ABA problem, top is the index of the thread at the top of the queue, and steal is the index of the last thread to be stolen.

A local process updates the stealRange field only when

• The number of items in the EDEQueue becomes a power of two, or

• A successful steal has occurred since the last time that thread observed the EDEQueue.

Figure 10.9 illustrates methods for manipulating the object’s stealRange field.

The EDEQueue object has the following additional fields: deq is an array of size QUEUE_SIZE that stores threads. The range of occupied entries in the deq array is the half-open range [top · · · bottom) modulo QUEUE_SIZE. For simplicity, the queue’s top and bottom counters are incremented without bound, but are used modulo QUEUE_SIZE when indexing into the array.
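A toy, single-threaded sketch (class name illustrative, no concurrency) shows why unbounded counters are safe: the code only ever uses the difference bottom − top and the counters reduced modulo QUEUE_SIZE:

```java
// Illustrative: unbounded top/bottom counters indexing a small cyclic array.
public class CyclicDemo {
  static final int QUEUE_SIZE = 4;
  final int[] slots = new int[QUEUE_SIZE];
  int top = 0, bottom = 0; // occupied range is [top, bottom), mod QUEUE_SIZE

  void pushBottom(int v) { slots[bottom % QUEUE_SIZE] = v; bottom++; }
  int popTop()           { int v = slots[top % QUEUE_SIZE]; top++; return v; }
  int size()             { return bottom - top; } // counters never wrap back
}
```

After more than QUEUE_SIZE operations, both counters exceed the array length, yet indexing keeps working because only their residues and difference matter.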

The bottom field points to the entry following the last entry containing a thread. If bottom and top are equal, the EDEQueue is empty. Each process


/**
 * called by local thread to set aside work
 **/
public void pushBottom(Thread t, long oldStealRange) throws Full {
  if (this.getSize() == QUEUE_SIZE)
    throw new Full();
  this.deq[this.bottom % QUEUE_SIZE] = t; // store object
  this.bottom++;                          // advance bottom
  updateStealRange(oldStealRange);
}

Figure 10.11: Code for pushBottom

keeps track of its own prevStealRange value, of the same type as stealRange, which the process uses to determine whether a steal has occurred since the last method call.

When and how to steal work is a policy decision best made by the individual application. Reasonable policies include the following:

• Try to steal only when the local DEQueue is empty (steal-on-empty).

• Try to steal probabilistically, with the probability decreasing as the number of items in the DEQueue increases (probabilistic balancing).

• Try to steal whenever the number of items in the DEQueue increases or decreases by a constant factor from the last time a steal attempt was performed.
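To make the first policy concrete, here is a sequential sketch of a steal-on-empty drain loop. Everything here is illustrative: it uses java.util.ArrayDeque in place of the lock-free deque, and picks the first non-empty victim deterministically, where a real scheduler would choose a victim at random.

```java
import java.util.*;

// Illustrative steal-on-empty policy, sequentialized for clarity.
// Real implementations use the lock-free deques of this chapter.
public class StealOnEmpty {
  // Drain tasks, stealing from victims whenever the local deque is empty.
  static List<Integer> drain(Deque<Integer> local, List<Deque<Integer>> victims) {
    List<Integer> executed = new ArrayList<>();
    while (true) {
      Integer t = local.pollLast();            // popBottom: newest local task
      if (t == null) {                         // local deque empty: become a thief
        Deque<Integer> victim = null;
        for (Deque<Integer> v : victims)       // deterministic choice for the demo;
          if (!v.isEmpty()) { victim = v; break; } // real thieves pick at random
        if (victim == null) return executed;   // no work anywhere: stop
        t = victim.pollFirst();                // popTop: steal from the top
      }
      executed.add(t);                         // "execute" the task
    }
  }
}
```

Note the asymmetry the deque is designed for: the owner works LIFO from the bottom, while thieves take the oldest tasks from the top.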

The local process calls the pushBottom method (illustrated in Figure 10.11) whenever it needs to insert a new thread into its local EDEQueue. If the EDEQueue is not full, the new thread is placed in the entry indexed by bottom, bottom is incremented, and the thread checks whether the stealRange field needs to be updated. The updateStealRange method tries to reset the stealRange field if either the queue size is a power of two, or the stealRange field has changed since the last time the process examined the object (meaning that some threads have been stolen).

The local thread calls popBottom (illustrated in Figure 10.12) when it needs to consume another thread. If the queue is empty, the method returns null. It then calls updateStealRange to check the stealRange field. If that method detects that the field should be updated, but fails to complete the update, then the popBottom method throws an exception. Otherwise, the method pops off a thread, and checks stealRange. If that range does not include the thread, then the method returns it, since no thief could have taken the thread. Otherwise, there are two possibilities: the queue is empty and the thread was stolen, or the queue contains a single thread. The method tries to update the stealRange field with an empty value. If it fails, it returns null, and if it succeeds, it returns the thread.


public Object popBottom(long prevStealRange) throws Abort {
  if (this.getSize() == 0)
    return null;
  boolean ok = updateStealRange(prevStealRange);
  if (!ok)
    throw new Abort();

  this.bottom--;
  Object t = this.deq[this.bottom % QUEUE_SIZE];
  long oldStealRange = this.stealRange.read();
  int rangeTop = getTop(oldStealRange);
  int rangeBot = getSteal(oldStealRange);
  if (this.bottom > rangeTop) {
    return t; // no need to synchronize
  } else if (rangeTop == rangeBot) { // oldStealRange is empty
    this.bottom = 0; // last thread already stolen
    return null;
  } else { // try to make stealRange empty
    long currentStealRange = this.stealRange.read();
    int tag = getTag(currentStealRange);
    int bot = getSteal(currentStealRange);
    if (this.stealRange.CAS(currentStealRange,
                            makeStealRange(tag + 1, bot + 1, bot))) {
      return t; // thread was not stolen so far
    } else {
      return null; // thread stolen
    }
  }
}

Figure 10.12: Code for popBottom


public int stealTop(EDEQueue victim) {
  long oldStealRange = victim.stealRange.read();
  int oldSteal = getSteal(oldStealRange);
  int oldTop = getTop(oldStealRange);
  int oldTag = getTag(oldStealRange);
  int rangeLen = oldSteal - oldTop;
  // figure out how much we can steal
  int capacity = QUEUE_SIZE - this.getSize();
  int numToSteal = Math.min(capacity, rangeLen);
  // tentatively copy stolen threads
  for (int i = 0; i < numToSteal; i++)
    this.deq[(this.bottom + i) % QUEUE_SIZE] =
        victim.deq[(oldTop + i) % QUEUE_SIZE];
  // try to make the theft complete
  long newStealRange = makeStealRange(oldTag + 1, oldTop + numToSteal, oldSteal);
  if (victim.stealRange.CAS(oldStealRange, newStealRange)) {
    this.bottom += numToSteal; // make stolen threads visible in thief's deq
    this.updateStealRange(0);  // adjust thief's steal range
    return numToSteal;
  }
  return 0;
}

Figure 10.13: Code for stealTop


A process calls the stealTop method (illustrated in Figure 10.13) to steal threads from another EDEQueue. The method first computes how many threads it can steal by comparing the victim’s stealRange and the excess capacity of the thief’s deq array. It then tentatively copies that many threads from the victim to the thief, but without yet updating either the thief’s bottom field or the victim’s stealRange field. The method then calls CAS to adjust the victim’s stealRange field to reflect the missing items. If it succeeds, then the thief updates its own bottom field, followed by its own stealRange. The method then returns the number of threads stolen. If the CAS fails, the theft has also failed, and the method returns zero.

Note that if another process succeeds in stealing concurrently from the thief, then the thief may fail to update its own stealRange, but it will not be prevented from updating its bottom and completing the theft.

10.6 Course Notes

This chapter adapted material from Leiserson and Prokop’s Minicourse on Multithreaded Programming. The DAG-based model for analysis of multithreaded computation was formalized by Blumofe and Leiserson in 1994. They also gave the first work-stealing deque-based implementation. Their deque, however, was not lock-free. Theorem 10.3.1 and its proof first appeared in [1]. The steal-half protocol is a simpler version of a protocol due to Hendler and Shavit [4].

Bibliography

[1] Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In ACM Symposium on Parallel Algorithms and Architectures, 1998.

[2] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science, pages 356–368, Santa Fe, New Mexico, November 1994.

[3] C. Leiserson and H. Prokop. A Minicourse on Multithreaded Programming. ftp://theory.lcs.mit.edu/pub/cilk/minicourse.ps.gz.

[4] D. Hendler and N. Shavit. Non-blocking steal-half work queues. In Proceedings of the 21st Annual ACM Symposium on Principles of Distributed Computing, 2002.
