Parallel Real-Time Scheduling of DAGs

Abusayeed Saifullah, David Ferry, Jing Li, Kunal Agrawal, Chenyang Lu, Christopher Gill

Abstract—Recently, multi-core processors have become mainstream in processor design. To take full advantage of multi-core processing, computation-intensive real-time systems must exploit intra-task parallelism. In this paper, we address the problem of real-time scheduling for a general model of deterministic parallel tasks, where each task is represented as a directed acyclic graph (DAG) with nodes having arbitrary execution requirements. We prove processor-speed augmentation bounds for both preemptive and non-preemptive real-time scheduling for general DAG tasks on multi-core processors. We first decompose each DAG into sequential tasks with their own release times and deadlines. Then we prove that these decomposed tasks can be scheduled using preemptive global EDF with a resource augmentation bound of 4. This bound is as good as the best known bound for more restrictive models, and is the first for a general DAG model. We also prove that the decomposition has a resource augmentation bound of 4 plus a constant non-preemption overhead for non-preemptive global EDF scheduling. To our knowledge, this is the first resource augmentation bound for non-preemptive scheduling of parallel tasks. Finally, we evaluate our analytical results through simulations that demonstrate that the derived resource augmentation bounds are safe in practice.

Index Terms—parallel task; multi-core processor; real-time scheduling; resource augmentation bound.


1 INTRODUCTION

As the rate of increase of clock frequencies is leveling off, most processor chip manufacturers have recently moved to increasing performance by increasing the number of cores on a chip. Intel's 80-core Polaris [1], Tilera's 100-core TILE-Gx, AMD's 12-core Opteron [2], and ClearSpeed's 96-core processor [3] are some notable examples of multi-core chips. With the rapid evolution of multi-core technology, however, real-time system software and programming models have failed to keep pace. Most classic results in real-time scheduling concentrate on sequential tasks running on multiple processors [4]. While these systems allow many tasks to execute on the same multi-core host, they do not allow an individual task to run any faster on it than on a single-core machine.

To scale the capabilities of individual tasks with the number of cores, it is essential to develop new approaches for tasks with intra-task parallelism, where each real-time task itself is a parallel task that can utilize multiple cores at the same time. Here, we take an autonomous vehicle [5] as a motivating example. Such a system consists of a myriad of real-time tasks such as motion planning, sensor fusion, computer vision, and decision making algorithms that exhibit intra-task parallelism. For example, the decision making subsystem processes massive amounts of data from various types of sensors, where the data processing on different types of sensors can run in parallel. Such intra-task parallelism may enable timing guarantees for many complex real-time systems requiring heavy computation, whose stringent timing constraints are difficult to meet on traditional single-core processors.

• The authors are with the Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO 63130. E-mail: {saifullah, dferry, li.jing, kunal, lu, cdgill}@wustl.edu

There has been some recent work on real-time scheduling for parallel tasks, but it has been mostly restricted to the synchronous task model [6]–[8]. In the synchronous model, each task consists of a sequence of segments with synchronization points at the end of each segment. In addition, each segment of a task contains threads of execution that are of equal length. For synchronous tasks, the results in [6], [8] prove a resource augmentation bound of 4 under global earliest deadline first (EDF) scheduling. A resource augmentation bound ν of a scheduling policy A indicates that if there is any way to schedule a task set on m identical unit-speed processor cores, then A is guaranteed to successfully schedule it on m cores with each core being ν times as fast as the original.

While the synchronous task model represents the tasks generated by the parallel for loop construct common to many parallel languages such as OpenMP [9] and CilkPlus [10], most parallel languages also have other constructs for generating parallel programs, notably fork-join constructs. A program that uses fork-join constructs will generate a non-synchronous task, generally represented as a directed acyclic graph (DAG), where each thread (sequence of instructions) is a node, and the edges represent dependencies between the threads. A node's execution requirement can vary arbitrarily, and different nodes in the same DAG can have different execution requirements.

Another limitation of the state of the art is that all prior work on parallel real-time tasks considers preemptive scheduling, where threads are allowed to preempt each other in the middle of execution. Preemption can incur high overhead since it often involves a system call and a context switch. An alternative scheduling model is to consider node-level non-preemptive scheduling (called non-preemptive scheduling in this paper), where once the execution of a particular node (thread) starts, it cannot be preempted by any other thread. Most parallel languages and libraries have yield points at the end of threads (nodes of a DAG), allowing low-cost, user-space preemption at these yield points. For these, schedulers that switch context only when threads end can be implemented entirely in user space, and therefore have low overheads. In addition, fewer switches imply lower caching overhead. In this model, since a node is never preempted, if it accesses the same memory location multiple times, those locations will be cached, and a node never has to restart on a cold cache.

This paper addresses the hard real-time scheduling of a set of generalized DAGs sharing a multi-core machine. We generalize the previous work in two important directions. First, we consider a general model of deterministic parallel tasks, where each task is represented by a general DAG in which nodes can have arbitrary execution requirements. Second, we address both preemptive and non-preemptive scheduling. In particular, we make the following new contributions.

• We propose a novel task decomposition to transform the nodes of a general DAG into sequential tasks. Since each node of the DAG becomes an individual sequential task, these tasks can be scheduled either preemptively or non-preemptively.

• We prove that any set of parallel tasks of a general DAG model, upon decomposition, can be scheduled using preemptive global EDF with a resource augmentation bound of 4. This bound is as good as the best known bound for more restrictive models [6] and, to our knowledge, is the first bound for a general DAG model.

• We prove that our decomposition requires a resource augmentation bound of 4 + 2ρ for non-preemptive global EDF scheduling, where ρ is the non-preemption overhead of the tasks. To our knowledge, this is the first bound for non-preemptive scheduling of parallel real-time tasks.

• Through simulations, we demonstrate that the derived bounds are safe, and reasonably tight in practice, especially under preemptive EDF, which requires a resource augmentation of 3.2 in simulation as opposed to our analytical bound of 4.

Section 2 reviews related work. Section 3 describes the task model. Section 4 presents the decomposition algorithm. Sections 5 and 6 present analyses for preemptive and non-preemptive global EDF scheduling, respectively. Section 7 presents the simulation results.

2 RELATED WORK

There has been a substantial amount of work on traditional multiprocessor real-time scheduling focused on sequential tasks [4]. Scheduling of parallel tasks without deadlines has been addressed in [11], [12]. Soft real-time scheduling, where the goal is to meet a subset of deadlines based on some application-specific criterion, has been studied for parallel tasks for optimizing cache misses [13], makespan [14], and total work done within the deadlines [15]. In contrast, we address hard real-time scheduling, where the goal is to meet all task deadlines. Hard real-time scheduling is a fundamental requirement in many important application domains such as video surveillance, radar tracking, and autonomous vehicles [5].

An exact (i.e., both sufficient and necessary) schedulability analysis under hard real-time constraints is intractable for most cases of parallel tasks [16]. Early works on hard real-time parallel scheduling make simplifying assumptions about task models. For example, the studies in [17], [18] consider EDF scheduling of parallel tasks where the actual number of processors used by a particular task is determined before starting the system, and remains unchanged.

Recently, preemptive real-time scheduling has been studied [6]–[8] for synchronous parallel tasks with implicit deadlines. In [7], every task is an alternating sequence of parallel and sequential segments, with each parallel segment consisting of multiple threads of equal length that synchronize at the end of the segment. All parallel segments in a task have an equal number of threads, which cannot exceed the number of processor cores. Each thread is transformed into a subtask, and a resource augmentation bound of 3.42 is claimed under partitioned Deadline Monotonic (DM) scheduling. This result was later generalized for the synchronous model with arbitrary numbers of threads in segments, with bounds of 4 and 5 for global EDF and partitioned DM scheduling, respectively [6], and also to minimize the required number of processors [19].

Scheduling and analysis of DAGs introduce a challenging open problem. For this general model, an augmentation bound has been analyzed recently in [20], but it considers a single DAG on a multi-core machine with preemption. Our earlier work [6] proposed a simple extension to a synchronous task scheduling approach that handles unit-node DAGs, where each node has a unit execution requirement. The work in [8] is an implementation of our work in [6]. However, most parallel languages that use fork-join constructs generate a non-synchronous task, generally represented as a DAG where each node's execution requirement can vary arbitrarily, and different nodes in the same DAG can have different execution requirements. The decomposition in [6] for the restrictive model is not applicable to a general DAG. If it is extended to a general DAG, it may split each node of a DAG into multiple subtasks, thereby disallowing node-level non-preemptive scheduling. It would also make preemptive scheduling inefficient and costly due to excessive context switches caused by node splitting and artificially increased synchronization.

Fig. 1. A parallel task τ_i represented as a DAG with ten nodes W_i^1, ..., W_i^10 (the same node in all figures has the same color and shade).

3 PARALLEL TASK MODEL

We consider n periodic parallel tasks to be scheduled on a multi-core platform consisting of m identical cores. The task set is represented by τ = {τ_1, τ_2, ..., τ_n}. Each task τ_i, 1 ≤ i ≤ n, is represented as a Directed Acyclic Graph (DAG), where the nodes stand for different execution requirements, and the edges represent dependencies between the nodes.

A node in τ_i is denoted by W_i^j, 1 ≤ j ≤ n_i, with n_i being the total number of nodes in τ_i. The execution requirement of node W_i^j is denoted by E_i^j. A directed edge from node W_i^j to node W_i^k, denoted as W_i^j → W_i^k, implies that the execution of W_i^k cannot start until W_i^j finishes. W_i^j, in this case, is called a parent of W_i^k, while W_i^k is its child. A node may have 0 or more parents or children, and can start execution only after all of its parents have finished execution. Figure 1 shows a task τ_i with n_i = 10 nodes.

The execution requirement (i.e., work) C_i of task τ_i is the sum of the execution requirements of all nodes in τ_i; that is,

C_i = Σ_{j=1}^{n_i} E_i^j

Thus, C_i is the maximum execution time of τ_i if it was executing on a single processor of speed 1. For task τ_i, the critical path length, denoted by P_i, is the sum of the execution requirements of the nodes on a critical path. A critical path is a directed path that has the maximum execution requirement among all paths in DAG τ_i. Thus, P_i is the minimum execution time of τ_i, meaning that it needs at least P_i time units on unit-speed processor cores even when the number of cores m is infinite. The period of task τ_i is denoted by T_i, and the deadline D_i of each task τ_i is considered implicit, i.e., D_i = T_i. Since P_i is the minimum execution time of task τ_i even on a machine with an infinite number of cores, the condition T_i ≥ P_i must hold for τ_i to be schedulable (i.e., to meet its deadline). A task set is said to be schedulable when all tasks in the set meet their deadlines.
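The work C_i and the critical path length P_i can both be computed in linear time by a longest-path pass over a topological order. The Python sketch below is illustrative: it uses the node execution requirements of the worked example in Section 4.2, and its edge set is an assumption inferred from that example, since Figure 1 itself is not reproduced here.

# Node execution requirements E_i^j of the example in Section 4.2.
E = {1: 4, 2: 2, 3: 4, 4: 5, 5: 3, 6: 4, 7: 2, 8: 4, 9: 1, 10: 1}
# Assumed edge set (parent, child), consistent with the worked example.
edges = [(1, 4), (2, 4), (2, 5), (5, 6), (5, 7), (4, 8), (6, 8), (8, 9), (8, 10)]
parents = {v: [u for (u, w) in edges if w == v] for v in E}

C = sum(E.values())  # work C_i: total execution requirement

# Critical path length P_i: longest-path dynamic program over a
# topological order (node ids here are already topologically sorted).
longest = {}
for v in sorted(E):
    longest[v] = E[v] + max((longest[u] for u in parents[v]), default=0)
P = max(longest.values())

print(C, P)  # 30 14, matching C_i = 30 and P_i = 14 in the example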

4 TASK DECOMPOSITION

We schedule parallel tasks by decomposing each parallel task into smaller sequential tasks. The main intuition for decomposing a parallel task into a set of sequential tasks is that the scheduling of parallel tasks then reduces to the scheduling of sequential tasks, allowing us to leverage existing schedulability analysis for traditional multiprocessor scheduling. In this section, we present a decomposition technique for a parallel task under a general DAG model. Upon decomposition, each node of a DAG becomes an individual sequential task, called a subtask, with its own deadline and with an execution requirement equal to the node's execution requirement. We use the terms 'subtask' and 'node' interchangeably. All nodes of a DAG are assigned appropriate deadlines and release offsets such that, when they execute as individual subtasks, all dependencies among them in the original DAG are preserved. The deadlines of the subtasks of a DAG are assigned by splitting the DAG's deadline. The decomposition ensures that if the subtasks of a DAG are schedulable, then the DAG must be schedulable. Thus, an implicit deadline DAG is decomposed into a set of constrained deadline (i.e., deadline no greater than period) sequential subtasks, with each subtask corresponding to a node of the DAG.

Our schedulability analysis for parallel tasks entails deriving a resource augmentation bound [6], [7]. In particular, our result aims at establishing the following claim: if an optimal algorithm can schedule a task set on a machine of m unit-speed processor cores, then our algorithm can schedule this task set on m processor cores, each of speed ν, where ν is the resource augmentation factor. Since an optimal algorithm is unknown, we pessimistically assume that an optimal scheduler can schedule a task set if each task of the set has a critical-path length no greater than its deadline, and the total utilization of the task set is no greater than m. No algorithm can schedule a task set that does not meet these conditions. Our resource augmentation analysis is based on the densities of the decomposed tasks, where the density of any task is the ratio of its execution requirement to its deadline.

4.1 Terminology

The utilization u_i of a task τ_i, and the total utilization u_sum(τ) for any task set τ of n tasks, are defined as

u_i = C_i / T_i ;  u_sum(τ) = Σ_{i=1}^{n} C_i / T_i

If u_sum is greater than m, then no algorithm can schedule τ on m identical unit-speed processor cores.

The density δ_i of any task τ_i, and the total density δ_sum(τ) and the maximum density δ_max(τ) for any set τ of n tasks, are defined as follows.

δ_i = C_i / D_i ;  δ_sum(τ) = Σ_{i=1}^{n} δ_i ;  δ_max(τ) = max{δ_i | 1 ≤ i ≤ n}   (1)

The demand bound function (DBF) of task τ_i is the largest cumulative execution requirement of all jobs generated by τ_i that have both arrival times and deadlines within a contiguous interval of t time units. For any task τ_i, the DBF is given by

DBF(τ_i, t) = max(0, (⌊(t − D_i)/T_i⌋ + 1) C_i)   (2)

Based on the DBF, the load, denoted by λ(τ), of any task set τ consisting of n tasks is defined as follows.

λ(τ) = max_{t>0} (Σ_{i=1}^{n} DBF(τ_i, t)) / t   (3)
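These definitions translate directly into code. The sketch below is illustrative Python, assuming each task is given as a (C_i, D_i, T_i) triple; the maximum over all t > 0 in the load definition is approximated by sampling a finite horizon, which is an assumption of the sketch rather than part of the definition.

from math import floor

def dbf(C, D, T, t):
    """Demand bound function of Equation (2)."""
    return max(0, (floor((t - D) / T) + 1) * C)

def load(tasks, horizon, step=1.0):
    """Load lambda(tau) of Equation (3), approximated by sampling
    t in (0, horizon]."""
    best, t = 0.0, step
    while t <= horizon:
        best = max(best, sum(dbf(C, D, T, t) for (C, D, T) in tasks) / t)
        t += step
    return best

tasks = [(30, 21, 21)]  # the example DAG treated as one implicit-deadline task
u_sum = sum(C / T for (C, D, T) in tasks)  # total utilization
d_max = max(C / D for (C, D, T) in tasks)  # maximum density, Eq. (1)
print(u_sum, d_max, load(tasks, horizon=100.0))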

4.2 Decomposition Algorithm

In the decomposition, the intermediate subdeadline assigned to a node is called its node deadline. Note that once task τ_i is released, it has a total of T_i time units to finish its execution. The proposed decomposition algorithm splits this deadline T_i into node deadlines by preserving the dependencies in τ_i. For task τ_i, the deadline and the offset assigned to node W_i^j are denoted by D_i^j and Φ_i^j, respectively. Once appropriate values of D_i^j and Φ_i^j are determined for each node W_i^j (respecting the dependencies in the DAG), task τ_i is decomposed into nodes. Upon decomposition, the dependencies in the DAG need not be considered, and each node can execute as a traditional sequential multiprocessor task. Hence, the decomposition technique for τ_i boils down to determining D_i^j and Φ_i^j for each node W_i^j, as presented below. The presentation is accompanied by an example using the DAG τ_i from Figure 1. For the example, we assign the execution requirement of each node W_i^j as follows: E_i^1 = 4, E_i^2 = 2, E_i^3 = 4, E_i^4 = 5, E_i^5 = 3, E_i^6 = 4, E_i^7 = 2, E_i^8 = 4, E_i^9 = 1, E_i^10 = 1. Hence, C_i = 30 and P_i = 14. Let period T_i = 21.

To perform the decomposition, we first represent DAG τ_i as a timing diagram τ_i^∞ (Figure 2(a)) that shows its execution time on an infinite number of unit-speed processor cores. Specifically, τ_i^∞ indicates the earliest start time and the earliest finishing time (of the worst-case execution requirement) of each node when m = ∞. For any node W_i^j that has no parents, the earliest start time and the earliest finishing time are 0 and E_i^j, respectively. For every other node W_i^j, the earliest start time is the latest finishing time among its parents, and the earliest finishing time is E_i^j time units after that. For example, in τ_i of Figure 1, nodes W_i^1, W_i^2, and W_i^3 can start execution at time 0, and their earliest finishing times are 4, 2, and 4, respectively. Node W_i^4 can start after W_i^1 and W_i^2 complete, and finish after 5 time units at its earliest, and so on. Figure 2(a) shows τ_i^∞ for DAG τ_i. Next, based on τ_i^∞, the calculation of D_i^j and Φ_i^j for each node W_i^j involves the following two steps. In Step 1, for each node, we estimate the time requirement at different parts of the node. In Step 2, the total estimated time requirement of the node is assigned as the node's deadline.
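The earliest start and finishing times that define τ_i^∞ come from a single forward pass: a node starts at the latest finishing time among its parents. A minimal Python sketch, reusing the assumed E and edge set of the earlier Section 3 sketch:

E = {1: 4, 2: 2, 3: 4, 4: 5, 5: 3, 6: 4, 7: 2, 8: 4, 9: 1, 10: 1}
edges = [(1, 4), (2, 4), (2, 5), (5, 6), (5, 7), (4, 8), (6, 8), (8, 9), (8, 10)]
parents = {v: [u for (u, w) in edges if w == v] for v in E}

start, finish = {}, {}
for v in sorted(E):  # node ids are already in topological order
    start[v] = max((finish[u] for u in parents[v]), default=0)  # latest parent finish
    finish[v] = start[v] + E[v]                                 # earliest completion

print(start[4], finish[4])  # 4 9: W_i^4 starts after W_i^1 and W_i^2, runs 5 units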

As stated before, our resource augmentation analysis is based on the densities of the decomposed tasks. The efficiency of the analysis largely depends on the total density (δ_sum) and the maximum density (δ_max) of the decomposed tasks. Namely, we need to keep both δ_sum and δ_max bounded and as small as possible to minimize the resource augmentation requirement. Therefore, the objective of the decomposition algorithm is to split the entire task deadline into node deadlines while keeping their densities small, so that each node (subtask) has enough slack. The slack of any task represents the extra time beyond its execution requirement and is defined as the difference between its deadline and its execution requirement.

4.2.1 Estimating Time Requirements of the Nodes

In DAG τ_i, a node can execute with different numbers of nodes in parallel at different times. Such a degree of parallelism can be estimated based on τ_i^∞. For example, in Figure 2(a), node W_i^5 executes with W_i^1 and W_i^3 in parallel for the first 2 time units, and then executes with W_i^4 in parallel for the next time unit. In this way, we first identify the degrees of parallelism at different parts of each node. Intuitively, the parts of a node that may execute with a large number of nodes in parallel demand more time. Therefore, different parts of a node are assigned different amounts of time considering these degrees of parallelism and execution requirements. Later, the total time of all parts of a node is assigned to the node as its deadline.

To identify the degree of parallelism for different portions of a node based on τ_i^∞, we assign time units to a node in different (consecutive) segments. In different segments of a node, the task may have different degrees of parallelism. In τ_i^∞, starting from the beginning, we draw a vertical line at every time instant where a node starts or ends (as shown in Figure 2(b)). This is done in linear time using a breadth-first search over the DAG. The vertical lines now split τ_i^∞ into segments. For example, in Figure 2(b), τ_i is split into 7 segments (numbered from left to right).

Once τ_i^∞ is split into segments, each segment consists of an equal amount of execution by the nodes that lie in the segment. Parts of different nodes in the same segment can now be thought of as threads of execution that run in parallel, and the threads in a segment can start only after those in the preceding segment finish. We denote this synchronous form of τ_i^∞ by τ_i^syn. We first allot time to the segments, and finally add all times allotted to different segments of a node to calculate its deadline.

We split T_i time units among the nodes based on the number of threads and the execution requirement of the segments where a node lies in τ_i^syn. We first estimate the time requirement for each segment. Let τ_i^syn be a sequence of s_i segments numbered 1, 2, ..., s_i. For any segment j, we use m_i^j to denote the number of threads in the segment, and e_i^j to denote the execution requirement of each thread in the segment (see Figure 2(b)). Since τ_i^syn has the same critical path and total execution requirements as those of τ_i,

P_i = Σ_{j=1}^{s_i} e_i^j ;  C_i = Σ_{j=1}^{s_i} m_i^j e_i^j

Fig. 2. τ_i^∞ and τ_i^syn of DAG τ_i (of Figure 1). (a) τ_i^∞: a timing diagram for when τ_i executes on an infinite number of processor cores; the length of each node is proportional to its execution requirement. (b) τ_i^syn: vertical bars mark where a node starts or ends, yielding segments with (m_i^1, e_i^1) = (3, 2), (m_i^2, e_i^2) = (3, 2), (m_i^3, e_i^3) = (2, 1), (m_i^4, e_i^4) = (3, 2), (m_i^5, e_i^5) = (2, 2), (m_i^6, e_i^6) = (1, 4), and (m_i^7, e_i^7) = (2, 1). Since m_i^6 = 1 ≤ θ_i = C_i/(2T_i − P_i) = 30/(42 − 14) ≈ 1.07, segment 6 is a light segment.
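Constructing τ_i^syn is then mechanical: every start or finish instant becomes a segment boundary, and each segment records how many nodes are active (m_i^j) and for how long (e_i^j). An illustrative Python sketch, with the start/finish values computed above for the assumed example DAG inlined:

start = {1: 0, 2: 0, 3: 0, 4: 4, 5: 2, 6: 5, 7: 5, 8: 9, 9: 13, 10: 13}
finish = {1: 4, 2: 2, 3: 4, 4: 9, 5: 5, 6: 9, 7: 7, 8: 13, 9: 14, 10: 14}

bounds = sorted(set(start.values()) | set(finish.values()))  # segment boundaries
m, e = [], []
for lo, hi in zip(bounds, bounds[1:]):
    active = [v for v in start if start[v] <= lo and finish[v] >= hi]
    m.append(len(active))  # m_i^j: number of parallel threads in the segment
    e.append(hi - lo)      # e_i^j: execution requirement of each thread

print(m)  # [3, 3, 2, 3, 2, 1, 2], matching Figure 2(b)
print(e)  # [2, 2, 1, 2, 2, 4, 1]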

For any segment j of τ_i^syn, we calculate a value d_i^j, called the segment deadline, so that the segment is assigned a total of d_i^j time units to finish all its threads. We calculate the value d_i^j so as to minimize both thread density and segment density, which leads to minimizing δ_sum and δ_max upon decomposition.

Since segment j consists of m_i^j parallel threads, each with an execution requirement of e_i^j, the total execution requirement of segment j is m_i^j e_i^j. Thus, segments with larger numbers of threads and with longer threads are computation-intensive, and demand more time to finish execution. Therefore, a reasonable way to assign the segment deadlines is to split T_i proportionally among the segments by considering their total execution requirements. Such a policy assigns a segment deadline of (T_i/C_i) m_i^j e_i^j to segment j. Since this is the deadline for each parallel thread of segment j, by Equation (1), the density of a thread becomes C_i/(m_i^j T_i), which can be as large as m. Hence, such a method does not minimize δ_max, and is not useful. Instead, we classify the segments of τ_i^syn into two groups based on a threshold θ_i on the number of threads per segment: each segment j with m_i^j > θ_i is called a heavy segment, and each segment j with m_i^j ≤ θ_i is called a light segment. Among the heavy segments, we allocate a portion of time T_i that is no less than that allocated among the light ones. Before assigning time among the segments, we determine a value of θ_i and the fraction of time T_i to be split among the heavy and light segments.

We show below that choosing θ_i = C_i/(2T_i − P_i) helps us keep both thread density and segment density bounded. Therefore, each segment j with m_i^j > C_i/(2T_i − P_i) is classified as a heavy segment, while other segments are called light segments. Let H_i denote the set of heavy segments, and L_i the set of light segments, of τ_i^syn. This raises three different cases: when L_i = ∅ (i.e., when τ_i^syn consists of only heavy segments), when H_i = ∅ (i.e., when τ_i^syn consists of only light segments), and when H_i ≠ ∅ and L_i ≠ ∅ (i.e., when τ_i^syn consists of both light and heavy segments). We use three different approaches for these three scenarios.

Case 1: when H_i = ∅. Since each segment has a small number (≤ C_i/(2T_i − P_i)) of threads, we only consider the length of a thread in each segment to assign time for it. Hence, T_i time units are split proportionally among all segments according to the length of each thread. For each segment j, its deadline d_i^j is calculated as follows.

d_i^j = (T_i / P_i) e_i^j   (4)

Since the condition T_i ≥ P_i must hold for every task τ_i to be schedulable,

d_i^j = (T_i / P_i) e_i^j ≥ (T_i / T_i) e_i^j = e_i^j   (5)

Hence, the maximum density of a thread in any segment is at most 1. Since a segment has at most C_i/(2T_i − P_i) threads, and T_i ≥ P_i, the segment's density is at most

C_i/(2T_i − P_i) ≤ C_i/(2T_i − T_i) = C_i/T_i   (6)

Case 2: when L_i = ∅. All segments are heavy, and T_i time units are split proportionally among all segments according to the work (i.e., total execution requirement) of each segment. For each segment j, its deadline d_i^j is given by

d_i^j = (T_i / C_i) m_i^j e_i^j   (7)

Since for every segment j, m_i^j > C_i/(2T_i − P_i), we have

d_i^j = (T_i / C_i) m_i^j e_i^j > (T_i / C_i)(C_i/(2T_i − P_i)) e_i^j = (2T_i/(2(2T_i − P_i))) e_i^j ≥ e_i^j / 2   (8)

Hence, the maximum density of any thread is at most 2. The total density of segment j is at most

m_i^j e_i^j / d_i^j = m_i^j e_i^j / ((T_i / C_i) m_i^j e_i^j) = C_i / T_i   (9)

Case 3: when H_i ≠ ∅ and L_i ≠ ∅. The task has both heavy segments and light segments. A total of (T_i − P_i/2) time units is assigned to heavy segments, and the remaining P_i/2 time units are assigned to light segments. The (T_i − P_i/2) time units are split proportionally among heavy segments according to the work of each segment. The total execution requirement of the heavy segments of τ_i^syn is denoted by C_i^heavy, defined as

C_i^heavy = Σ_{j ∈ H_i} m_i^j e_i^j

For each heavy segment j, the deadline d_i^j is

d_i^j = ((T_i − P_i/2) / C_i^heavy) m_i^j e_i^j   (10)

Since for each heavy segment j, m_i^j > C_i/(2T_i − P_i), we have

d_i^j = (T_i − P_i/2) m_i^j e_i^j / C_i^heavy > (T_i − P_i/2)(C_i/(2T_i − P_i)) e_i^j / C_i^heavy ≥ e_i^j / 2   (11)

Hence, the maximum density of a thread in any heavy segment is at most 2. As T_i ≥ P_i, the total density of a heavy segment becomes

m_i^j e_i^j / d_i^j = C_i^heavy / (T_i − P_i/2) ≤ C_i / (T_i − T_i/2) = 2C_i / T_i   (12)

Now, to distribute time among the light segments, P_i/2 time units are split proportionally among light segments according to the length of each thread. The critical path length of the light segments is denoted by P_i^light, defined as follows.

P_i^light = Σ_{j ∈ L_i} e_i^j

For each light segment j, the deadline d_i^j is

d_i^j = ((P_i/2) / P_i^light) e_i^j   (13)

The density of a thread in any light segment is at most 2 since

d_i^j = ((P_i/2) / P_i^light) e_i^j ≥ ((P_i/2) / P_i) e_i^j = e_i^j / 2   (14)

Since a light segment has at most C_i/(2T_i − P_i) threads, and T_i ≥ P_i, the total density of a light segment is at most

2C_i/(2T_i − P_i) ≤ 2C_i/(2T_i − T_i) = 2C_i/T_i   (15)
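Applied to the running example, the three cases reduce to a few lines of arithmetic. The Python sketch below (illustrative; the segment parameters are the Figure 2(b) values) computes θ_i, classifies segments, and assigns segment deadlines via Equations (4), (7), (10), and (13); by construction the deadlines always sum to T_i.

T, P = 21, 14
m = [3, 3, 2, 3, 2, 1, 2]  # m_i^j from Figure 2(b)
e = [2, 2, 1, 2, 2, 4, 1]  # e_i^j from Figure 2(b)
C = sum(mj * ej for mj, ej in zip(m, e))  # 30

theta = C / (2 * T - P)  # threshold theta_i = 30/28, about 1.07
heavy = [j for j in range(len(m)) if m[j] > theta]
light = [j for j in range(len(m)) if m[j] <= theta]

d = [0.0] * len(m)
if not heavy:        # Case 1: only light segments, Eq. (4)
    for j in light:
        d[j] = T / P * e[j]
elif not light:      # Case 2: only heavy segments, Eq. (7)
    for j in heavy:
        d[j] = T / C * m[j] * e[j]
else:                # Case 3: both kinds, Eqs. (10) and (13)
    C_heavy = sum(m[j] * e[j] for j in heavy)  # work of heavy segments
    P_light = sum(e[j] for j in light)         # critical path of light segments
    for j in heavy:
        d[j] = (T - P / 2) * m[j] * e[j] / C_heavy
    for j in light:
        d[j] = (P / 2) * e[j] / P_light

print([round(dj, 2) for dj in d], round(sum(d), 2))  # deadlines sum to T_i = 21

Here segment 6 is the only light segment and receives d_i^6 = 7 time units, while the heavy segments share the remaining 14 time units in proportion to their work.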

4.2.2 Calculating Deadline and Offset for Nodes

We have assigned segment deadlines to (the threads of) each segment of τ_i^syn in Step 1 (Equations (4), (7), (10), (13)). Since a node may be split into multiple (consecutive) segments in τ_i^syn, we now have to remove all segment deadlines of a node to reconstruct (restore) the node. Namely, we add all segment deadlines of a node, and assign the total as the node's deadline.

Now let a node W_i^j of τ_i belong to segments k to r (1 ≤ k ≤ r ≤ s_i) in τ_i^syn. The deadline D_i^j of node W_i^j is calculated as follows.

D_i^j = d_i^k + d_i^{k+1} + ... + d_i^r   (16)

Note that the execution requirement E_i^j of node W_i^j is

E_i^j = e_i^k + e_i^{k+1} + ... + e_i^r   (17)

Node W_i^j cannot start until all of its parents complete. Hence, its release offset Φ_i^j is determined as follows.

Φ_i^j = 0, if W_i^j has no parent;
Φ_i^j = max{Φ_i^l + D_i^l | W_i^l is a parent of W_i^j}, otherwise.

Now that we have assigned an appropriate deadline D_i^j and release offset Φ_i^j to each node W_i^j of τ_i, the DAG τ_i is decomposed into nodes. Each node W_i^j is now an individual (sequential) multiprocessor subtask with an execution requirement E_i^j, a constrained deadline D_i^j, and a release offset Φ_i^j. Note that the period of W_i^j is still the same as that of the original DAG, which is T_i. The release offset Φ_i^j ensures that node W_i^j can start execution no earlier than Φ_i^j time units following the release time of the original DAG. Our method guarantees that for a general DAG no node is split into smaller subtasks, ensuring node-level non-preemption. Thus, the (node-level) non-preemptive behavior of the original task is preserved in scheduling the nodes as individual tasks, where nodes of the DAG are never preempted. The entire decomposition method is presented as Algorithm 1 in Appendix A, which runs in linear time (in terms of the DAG size, i.e., the number of nodes and edges). Figure 7 in Appendix B shows the complete decomposition of τ_i. Appendix C provides a sketch (Figure 8) of how it can be implemented on a real system.
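Step 2 then sums, per node, the deadlines of the segments the node spans (Equation (16)), and derives release offsets from the parents. The Python sketch below continues the previous sketches; the start/finish times, the parent map, and the (rounded) segment deadlines are the assumed example values.

bounds = [0, 2, 4, 5, 7, 9, 13, 14]            # segment boundaries
d = [3.23, 3.23, 1.08, 3.23, 2.15, 7.0, 1.08]  # rounded segment deadlines
start = {1: 0, 2: 0, 3: 0, 4: 4, 5: 2, 6: 5, 7: 5, 8: 9, 9: 13, 10: 13}
finish = {1: 4, 2: 2, 3: 4, 4: 9, 5: 5, 6: 9, 7: 7, 8: 13, 9: 14, 10: 14}
parents = {1: [], 2: [], 3: [], 4: [1, 2], 5: [2], 6: [5], 7: [5],
           8: [4, 6], 9: [8], 10: [8]}         # assumed edge set, as before

D, Phi = {}, {}
for v in sorted(start):  # topological order
    # Node v spans exactly the segments [lo, hi) inside [start[v], finish[v]).
    D[v] = sum(d[j] for j in range(len(d))
               if start[v] <= bounds[j] and bounds[j + 1] <= finish[v])  # Eq. (16)
    Phi[v] = max((Phi[u] + D[u] for u in parents[v]), default=0)  # release offset

print(round(D[4], 2), round(Phi[4], 2))  # W_i^4: deadline ~6.46, offset ~6.46

For instance, W_i^4 spans segments 3 to 5, so D_i^4 = d_i^3 + d_i^4 + d_i^5 ≈ 6.46, and its offset Φ_i^4 ≈ 6.46 is driven by its parent W_i^1.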

4.3 Density Analysis after Decomposition

After decomposition, let τ_i^dec denote all subtasks (i.e., nodes) that τ_i generates. Note that the densities of all such subtasks comprise the density of τ_i^dec. We now analyze the density of τ_i^dec, which will later be used to analyze schedulability.

Let node W_i^j of τ_i belong to segments k to r (1 ≤ k ≤ r ≤ s_i) in τ_i^syn. Since W_i^j has been assigned deadline D_i^j, by Equations (16) and (17), its density δ_i^j after decomposition is

δ_i^j = E_i^j / D_i^j = (e_i^k + e_i^{k+1} + ... + e_i^r) / (d_i^k + d_i^{k+1} + ... + d_i^r)   (18)

By Equations (5), (8), (11), (14), d_i^k ≥ e_i^k / 2, ∀i, k. Hence, from (18),

δ_i^j = E_i^j / D_i^j ≤ (2e_i^k + 2e_i^{k+1} + ... + 2e_i^r) / (e_i^k + e_i^{k+1} + ... + e_i^r) = 2   (19)

Let τ^dec be the set of all generated subtasks of all original DAG tasks, and δ_max the maximum density among all subtasks in τ^dec. By Equation (19),

δ_max = max{δ_i^j | 1 ≤ j ≤ n_i, 1 ≤ i ≤ n} ≤ 2   (20)

We use D_min to denote the minimum deadline among all subtasks in τ^dec. That is,

D_min = min{D_i^j | 1 ≤ j ≤ n_i, 1 ≤ i ≤ n}   (21)

Theorem 1: Let a DAG τ_i, 1 ≤ i ≤ n, with period T_i, critical path length P_i where T_i ≥ P_i, and maximum execution requirement C_i be decomposed into subtasks (nodes), denoted τ_i^dec, using the decomposition technique (Algorithm 1 in the Appendix). The density of τ_i^dec is at most 2C_i/T_i.

Proof: Since we decompose τ_i into nodes, the densities of all decomposed nodes W_i^j, 1 ≤ j ≤ n_i, comprise the density of τ_i^dec. In Step 1, every node W_i^j of τ_i is split into threads in different segments of τ_i^syn, and each segment is assigned a segment deadline. In Step 2, we remove all segment deadlines in the node, and their total is assigned as the node's deadline. If τ_i is scheduled in the form of τ_i^syn, then each segment is scheduled after its preceding segment is complete. That is, at any time at most one segment is active. By Equations (6), (9), (12), (15), a segment has density at most 2C_i/T_i (considering T_i ≥ P_i). Hence, the overall density of τ_i^syn never exceeds 2C_i/T_i. Therefore, it is sufficient to prove that removing segment deadlines in the nodes does not increase the task's overall density. That is, it is sufficient to prove that the density δ_i^j (Equation (18)) of any node W_i^j after removing its segment deadlines is no greater than the density δ_i^{j,syn} that it had before removing its segment deadlines.

Let node W_i^j of the original DAG task τ_i be split into threads in segments k to r (1 ≤ k ≤ r ≤ s_i) in τ_i^syn. Since the total density of any set of tasks is an upper bound on its load (as proven in [21]), the load of the threads of W_i^j must be no greater than the total density of these threads. Since each of these threads is executed only once in the interval of D_i^j time units, based on Equation (2), the DBF of the thread, thread_i^l, in segment l, k ≤ l ≤ r, in the interval of D_i^j time units is expressed as

DBF(thread_i^l, D_i^j) = e_i^l

Therefore, using Equation (3), the load, denoted by λ_i^{j,syn}, of the threads of W_i^j in τ_i^syn for interval D_i^j is

λ_i^{j,syn} ≥ e_i^k/D_i^j + e_i^{k+1}/D_i^j + ... + e_i^r/D_i^j = E_i^j/D_i^j = δ_i^j

Since δ_i^{j,syn} ≥ λ_i^{j,syn}, for any W_i^j, we have δ_i^{j,syn} ≥ δ_i^j.

Let δ_sum be the total density of all subtasks τ^dec. Since, from Theorem 1, the density of each τ_i^dec is at most 2C_i/T_i where T_i ≥ P_i,

δ_sum ≤ Σ_{i=1}^{n} 2C_i/T_i = 2 Σ_{i=1}^{n} C_i/T_i   (22)

5 PREEMPTIVE EDF SCHEDULING

Once all DAG tasks are decomposed into nodes (i.e., subtasks), we consider scheduling the nodes. Since every node after decomposition becomes a sequential task, we schedule them using traditional multiprocessor scheduling policies. In this section, we consider the preemptive global EDF policy.

Lemma 2: For any set of DAGs τ = {τ_1, ..., τ_n}, let τ^dec be the decomposed task set. If τ^dec is schedulable under some preemptive scheduling, then τ is preemptively schedulable.

Proof: See Appendix D.

To schedule the decomposed subtasks τ^dec, the EDF policy is the same as the traditional global EDF policy, where jobs with earlier absolute deadlines have higher priorities. Due to the preemptive policy, a job can be suspended (preempted) at any time by arriving higher-priority jobs, and is later resumed with (in theory) no cost or penalty. Under preemptive global EDF, we now present a schedulability analysis for τ^dec in terms of a resource augmentation bound which, by Lemma 2, is also a sufficient analysis for the original DAG task set τ. For a task set, a resource augmentation bound ν of a scheduling policy A on an m-core machine is a processor speed-up factor. That is, if there exists any way to schedule the task set on m identical unit-speed processor cores, then A is guaranteed to successfully schedule it on an m-core processor with each core being ν times as fast as the original.

Our analysis hinges on a result (Theorem 3) for preemptive global EDF scheduling of constrained deadline sporadic tasks on a traditional multiprocessor platform [22]. This result is a generalization of the result for implicit deadline tasks [23].

Theorem 3: (From [22]) Any constrained deadline sporadic sequential task set π with total density δ_sum(π) and maximum density δ_max(π) is schedulable using the preemptive global EDF policy on m unit-speed processor cores if

δ_sum(π) ≤ m − (m − 1) δ_max(π)

Note that τ^dec consists of constrained deadline (sub)tasks that are periodic with offsets. If they did not have offsets, the above condition would apply directly. Taking the offsets into account, the execution requirement, the deadline, and the period (which is equal to the period of the original DAG) of each subtask remain unchanged. The release offsets only ensure that some subtasks of the same original DAG are not executed simultaneously, to preserve the precedence relations in the DAG. This implies that both δ_sum and δ_max of the subtasks with offsets are no greater than δ_sum and δ_max, respectively, of the same set of tasks with no offsets. Hence, Theorem 3 holds for τ^dec. We now use the results of the density analysis from Subsection 4.3, and prove that τ^dec is guaranteed to be schedulable with a resource augmentation of at most 4 in Corollary 1, which follows Theorem 4.

Theorem 4: For any set of DAGs τ = {τ_1, τ_2, ..., τ_n}, let τ^dec be the decomposed task set. If every DAG τ_i satisfies the condition T_i ≥ P_i, and the DAG set τ satisfies the condition Σ_{i=1}^{n} C_i/T_i ≤ m on m identical unit-speed processor cores, then the decomposed task set τ^dec is guaranteed to be schedulable under preemptive global EDF on m processor cores, each of speed 4.

Proof: If each DAG τ_i satisfies the condition T_i ≥ P_i, then the total density δ_sum of the decomposed task set τ^dec is at most 2 Σ_{i=1}^{n} C_i/T_i (Equation (22)), and the maximum density δ_max of τ^dec is at most 2 (Equation (20)) on unit-speed processors. To be able to schedule the decomposed tasks τ^dec, let each processor core be of speed ν, where ν > 1. On an m-core platform where each core has speed ν, let the total density and the maximum density of task set τ^dec be denoted by δ_sum,ν and δ_max,ν, respectively.

Considering that the condition Σ_{i=1}^{n} C_i/T_i ≤ m holds for τ, the total density of the decomposed tasks τ^dec from Equation (22) is derived as follows on ν-speed cores.

δ_sum,ν = δ_sum/ν ≤ (2/ν) Σ_{i=1}^{n} C_i/T_i ≤ 2m/ν   (23)

On ν-speed cores, the maximum density of τ^dec is derived from Equation (20) as follows.

δ_max,ν = δ_max/ν ≤ 2/ν   (24)

Using Conditions (24) and (23) in Theorem 3, τ^dec is schedulable under the preemptive EDF policy on m processor cores each of speed ν if

2m/ν ≤ m − (m − 1)(2/ν) ⇔ 4/ν − 2/(mν) ≤ 1

From the above condition, τ^dec must be schedulable if

4/ν ≤ 1 ⇔ ν ≥ 4.

Corollary 1: For any set of DAGs τ = {τ_1, τ_2, ..., τ_n}, let τ^dec be the decomposed task set. If there exists any algorithm that can schedule τ on m unit-speed processor cores, then the decomposed task set τ^dec is guaranteed to be schedulable under preemptive global EDF on m cores, each of speed 4.

Proof: If there exists any algorithm that can schedule τ on m unit-speed processor cores, then the following two conditions must hold.

Σ_{i=1}^{n} C_i/T_i ≤ m   (25)

T_i ≥ P_i, for each τ_i   (26)

Hence, the proof follows from Theorem 4.

Since Theorem 4 holds, we have the following straightforward schedulability test based on the resource augmentation bound of 4 for any set of DAGs: For any set of DAGs τ = {τ_1, τ_2, ..., τ_n}, if the total utilization u_sum(τ) ≤ m/4 and every DAG τ_i individually satisfies the condition P_i ≤ T_i/4, then the task set is schedulable under the preemptive EDF policy upon decomposition.
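This test is directly implementable. A minimal Python sketch, assuming each DAG is summarized by an illustrative (C_i, P_i, T_i) triple; note the test is sufficient, not necessary:

def preemptive_edf_test(tasks, m):
    """Sufficient test above: u_sum <= m/4 and P_i <= T_i/4 for every task."""
    u_sum = sum(C / T for (C, P, T) in tasks)
    return u_sum <= m / 4 and all(P <= T / 4 for (C, P, T) in tasks)

# The example DAG (C=30, P=14, T=21) alone on 8 cores: utilization passes
# (30/21 <= 2) but the critical-path condition fails (14 > 21/4), so this
# simple sufficient test does not admit it.
print(preemptive_edf_test([(30, 14, 21)], m=8))  # False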

6 NON-PREEMPTIVE EDF SCHEDULING

We now address non-preemptive global EDF scheduling, considering that the original task set τ is scheduled with node-level non-preemption. In node-level non-preemptive scheduling, whenever the execution of a node in a DAG starts, the node's execution cannot be preempted by any task.

The decomposition converts each node of a DAG to a traditional multiprocessor (sub)task. Therefore, we consider fully non-preemptive global EDF scheduling of the decomposed tasks. Namely, once a job of a decomposed (sub)task starts execution, it cannot be preempted by any other job.

Lemma 5: For any set of DAGs τ = {τ_1, ..., τ_n}, let τ^dec be the decomposed task set. If τ^dec is schedulable under some fully non-preemptive scheduling, then τ is schedulable under node-level non-preemption.

Proof: See Appendix E.

Under non-preemptive global EDF, we now present a schedulability analysis for τ^dec in terms of a resource augmentation bound which, by Lemma 5, is also a sufficient analysis for the DAG task set τ. This analysis exploits Theorem 6 for non-preemptive global EDF scheduling of constrained deadline periodic tasks on a traditional multiprocessor. The theorem is a generalization of the result for implicit deadline tasks [24].

For a task set π, let C_max(π) and D_min(π) be the maximum execution requirement and the minimum deadline among all tasks in π. In non-preemptive scheduling, C_max(π) represents the maximum blocking time that a task may experience, and plays a major role in schedulability. Hence, a non-preemption overhead, defined in [24], for the task set π is given by ρ(π) = C_max(π)/D_min(π). The value of ρ(π) indicates the added penalty or overhead associated with non-preemptivity. In other words, since preemption is not allowed, the capacity of each processor is reduced (at most) by a factor of ρ(π). In non-preemptive scheduling, this capacity reduction is compensated for by reducing the costs associated with context switches, saving state, etc.

Theorem 6: (From [24]) Any constrained deadline periodic task set π with total density δ_sum(π), maximum density δ_max(π), and a non-preemption overhead ρ(π) is schedulable using non-preemptive global EDF on m unit-speed cores if

δ_sum(π) ≤ m(1 − ρ(π)) − (m − 1) δ_max(π)

Let E_max and E_min be the maximum and minimum execution requirements, respectively, among all nodes of all DAG tasks. That is,

E_max = max{E_i^j | 1 ≤ j ≤ n_i, 1 ≤ i ≤ n}   (27)

E_min = min{E_i^j | 1 ≤ j ≤ n_i, 1 ≤ i ≤ n}   (28)

In node-level non-preemptive scheduling of the DAGs, the processor capacity reduction due to non-preemptivity is at most E_max/E_min. Hence, this value is the non-preemption overhead of the DAGs, denoted by ρ:

ρ = E_max/E_min   (29)

Theorem 7 derives a resource augmentation bound of 4 + 2ρ for non-preemptive global EDF scheduling of the decomposed tasks. A tighter bound analysis is provided in Appendix E.

Theorem 7: For DAG model parallel tasks τ = {τ_1, ..., τ_n}, let τ^dec be the decomposed task set with non-preemption overhead ρ. If there exists any way to schedule τ on m unit-speed processor cores, then τ^dec is schedulable under non-preemptive global EDF on m cores, each of speed 4 + 2ρ.

Proof: After decomposition, D_min (Equation (21)) is the minimum deadline among all subtasks in τ^dec. Since E_max (Equation (27)) represents the maximum blocking time that a subtask may experience, the non-preemption overhead of the decomposed tasks is E_max/D_min. From Equations (19) and (29), the non-preemption overhead of the decomposed tasks is

E_max/D_min ≤ E_max/(E_min/2) = 2E_max/E_min = 2ρ   (30)

Similar to Theorem 4 and Corollary 1, suppose we need each core to be of speed ν to be able to schedule the decomposed tasks τ^dec. From Equation (30), the non-preemption overhead of τ^dec on ν-speed cores is

(E_max/ν)/D_min ≤ 2ρ/ν   (31)

Considering a non-preemption overhead of at most 2ρ/ν on ν-speed processor cores, and using Equations (24) and (23) in Theorem 6, τ^dec is schedulable under non-preemptive EDF on m cores each of speed ν if

2m/ν ≤ m(1 − 2ρ/ν) − (m − 1)(2/ν) ⇔ (4 + 2ρ)/ν − 2/(mν) ≤ 1

From the above condition, task set τ^dec is schedulable if

(4 + 2ρ)/ν ≤ 1 ⇔ ν ≥ 4 + 2ρ.
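As a quick illustration of the bound, the speed-up sufficient for the decomposed tasks follows from the node execution requirements alone. A minimal Python sketch using the example task's values:

def nonpreemptive_speedup(node_execs):
    """Speed-up sufficient by Theorem 7: nu >= 4 + 2*rho, where
    rho = E_max/E_min over all nodes of all DAGs (Eq. (29))."""
    rho = max(node_execs) / min(node_execs)
    return 4 + 2 * rho

E = [4, 2, 4, 5, 3, 4, 2, 4, 1, 1]  # node execution requirements of the example
print(nonpreemptive_speedup(E))     # rho = 5/1 = 5, so speed 4 + 2*5 = 14 suffices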

7 EVALUATION

In this section, we evaluate our analytical results. We simulate the execution of a set of parallel tasks under the scheduling algorithms to observe deadline misses. We developed a simple event-driven simulator, detailed in Appendix F, where task executions are simulated in parallel as if they executed on m cores.

We use the Erdos-Renyi method G(n_i, p) [25] to generate task sets for evaluation. For each value of m (i.e., the number of cores), we generate task sets whose utilization is exactly m, fully loading a machine of unit-speed processors. The complete task generation method is explained in Appendix F. We experiment by varying the following 4 parameters: type of task period (harmonic vs. arbitrary periods), number of cores (m), probability of an edge in the DAG (p), and non-preemption overhead (ρ). The experimental methodology is detailed in Appendix F.
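For illustration, the following is a minimal Python sketch of generating a DAG in the G(n_i, p) style of [25]: order the nodes and include each forward edge independently with probability p. The parameters here are illustrative, not the paper's exact generation settings, which are described in Appendix F.

import random

def random_dag(n, p, rng=random.Random(0)):
    """Return forward edges (u, v), u < v, each present with probability p."""
    return [(u, v) for u in range(1, n) for v in range(u + 1, n + 1)
            if rng.random() < p]

print(random_dag(n=10, p=0.2)[:5])  # a few generated precedence edges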

In all experiments, we simulate 1000 task sets. For each task set, we start by simulating its execution on unit-speed processors, and increase the speed in 0.1 increments until all task sets are schedulable. Using these different task sets, we conduct two sets of experiments. In the first set, we evaluate the scheduler under preemptive global EDF. Hence, we vary the type of period, m, and p, but keep ρ constant at 2, leading to 112 combinations. In the second set, we evaluate under non-preemptive global EDF by varying all four factors, leading to 896 combinations.

Fig. 3. Failure ratio (on a logarithmic scale) in preemptive EDF on 32 cores under different edge probabilities p, as a function of processor speed.

7.1 Results

Effect of harmonic vs. arbitrary periods. This result is discussed in Appendix F.

Effect of p in preemptive scheduling. For each value of p, Figure 3 shows the failure ratio, defined as the ratio of the number of task sets in which some task missed a deadline to the total number of task sets (1000 in our experiment) attempted to be scheduled. To preserve the resolution of the figure, we show the results for only 7 (out of 14) values of p. In these experiments, ρ = 2 and m = 32. Note that the failure ratio increases as p increases from 0.01 to 0.1, and then falls again. We have detailed the reasons in Appendix F.

Fig. 4. Failure ratio (on a logarithmic scale) in preemptive EDF on different numbers of cores (4, 8, 16, 32), as a function of processor speed.

Effect of m in preemptive scheduling. Figure 4 shows that the failure ratio increases as m increases. We have detailed the results in Appendix F.

Fig. 5. Failure ratio (on a logarithmic scale) in non-preemptive EDF on 8 cores under different non-preemption overheads (ρ = 1, 2, 5, 10), as a function of processor speed.

Effect of ρ in non-preemptive scheduling. Figure 5 shows that the failure ratio increases as ρ increases. The results are detailed in Appendix F.

Fig. 6. Required speed in non-preemptive EDF on different numbers of cores with increasing non-preemption overhead.

Effect of m in non-preemptive scheduling. Figure 6 shows the required speed for each combination of m and ρ, with p = 0.2. We have detailed the results in Appendix F.

The simulation results show a maximum speed requirement of 3.2 for preemptive EDF, suggesting that our analytical resource augmentation bound of 4 is reasonably tight. The corresponding bounds for non-preemptive EDF appear relatively looser in our simulation results. This is because, as stated in Section 6, non-preemptivity can cause a processor capacity reduction of up to ρ in the worst case. We discuss this issue in more detail in Appendix F.

8 CONCLUSIONS

As multi-core technology becomes mainstream in processor design, real-time scheduling of parallel tasks is crucial to exploit its potential. In this paper, we consider a general task model and, through a novel task decomposition, we prove a resource augmentation bound of 4 for preemptive EDF, and of 4 plus a non-preemption overhead for non-preemptive EDF scheduling. To our knowledge, these are the first bounds for real-time scheduling of general DAGs.

ACKNOWLEDGEMENTS

This research was supported by NSF under XPS grant (1337218), CPS grant (1136073), and NeTS grant (1017701).

REFERENCES

[1] http://en.wikipedia.org/wiki/Teraflops_Research_Chip
[2] www.amd.com/us/products/server/processors
[3] www.clearspeed.com
[4] R. Davis and A. Burns, "A survey of hard real-time scheduling for multiprocessor systems," ACM Computing Surveys, vol. 43, 2011.
[5] J. Kim, H. Kim, K. Lakshmanan, and R. Rajkumar, "Parallel scheduling for cyber-physical systems: Analysis and case study on a self-driving car," in ICCPS '13.
[6] A. Saifullah, K. Agrawal, C. Lu, and C. Gill, "Multi-core real-time scheduling for generalized parallel task models," in RTSS '11.
[7] K. Lakshmanan, S. Kato, and R. R. Rajkumar, "Scheduling parallel real-time tasks on multi-core processors," in RTSS '10.
[8] D. Ferry, J. Li, M. Mahadevan, K. Agrawal, C. Gill, and C. Lu, "A real-time scheduling service for parallel tasks," in RTAS '13.
[9] "OpenMP," http://openmp.org
[10] http://software.intel.com/en-us/articles/intel-cilk-plus
[11] K. Agrawal, Y. He, W. J. Hsu, and C. E. Leiserson, "Adaptive task scheduling with parallelism feedback," in PPoPP '06.
[12] K. Agrawal, C. E. Leiserson, Y. He, and W. J. Hsu, "Adaptive work-stealing with parallelism feedback," ACM Trans. Comput. Syst., vol. 26, no. 3, 2008.
[13] J. Anderson and J. Calandrino, "Parallel real-time task scheduling on multicore platforms," in RTSS '06.
[14] Q. Wang and K. H. Cheng, "A heuristic of scheduling parallel tasks and its analysis," SIAM J. Comput., vol. 21, no. 2, 1992.
[15] O. Kwon and K. Chwa, "Scheduling parallel tasks with individual deadlines," Theoretical Computer Science, vol. 215, pp. 209–223, 1999.
[16] C.-C. Han and K.-J. Lin, "Scheduling parallelizable jobs on multiprocessors," in RTSS '89.
[17] G. Manimaran, C. Murthy, and K. Ramamritham, "A new approach for scheduling of parallelizable tasks in real-time multiprocessor systems," Real-Time Systems, vol. 15, no. 1, 1998.
[18] S. Kato and Y. Ishikawa, "Gang EDF scheduling of parallel task systems," in RTSS '09.
[19] G. Nelissen, V. Berten, J. Goossens, and D. Milojevic, "Techniques optimizing the number of processors to schedule multi-threaded tasks," in ECRTS '12.
[20] S. Baruah, V. Bonifaci, A. Marchetti-Spaccamela, L. Stougie, and A. Wiese, "A generalized parallel task model for recurrent real-time processes," in RTSS '12.
[21] N. Fisher, T. P. Baker, and S. Baruah, "Algorithms for determining the demand-based load of a sporadic task system," in RTCSA '06.
[22] S. Baruah, "Techniques for multiprocessor global schedulability analysis," in RTSS '07.
[23] J. Goossens, S. Funk, and S. Baruah, "Priority-driven scheduling of periodic task systems on multiprocessors," Real-Time Systems, vol. 25, no. 2-3, pp. 187–205, 2003.
[24] S. Baruah, "The non-preemptive scheduling of periodic tasks upon multiprocessors," Real-Time Systems, vol. 32, pp. 9–20, 2006.
[25] D. Cordeiro, G. Mounie, S. Perarnau, D. Trystram, J.-M. Vincent, and F. Wagner, "Random graph generation for scheduling simulations," in SIMUTools '10.
[26] T. Abdelzaher, B. Andersson, J. Jonsson, V. Sharma, and M. Nguyen, "The aperiodic multiprocessor utilization bound for liquid tasks," in RTAS '02.
[27] http://en.wikipedia.org/wiki/Gamma_distribution

11

Abusayeed Saifullah is a Ph.D. candidatein the Department of Computer Scienceand Engineering at Washington Universityin St Louis. Advised by Chenyang Lu, heis a member of the Cyber-Physical Sys-tems Laboratory at Washington University.Abu’s research focuses on real-time wirelesssensor-actuator networks used in emergingcyber-physical systems, and spans a broadrange of topics in wireless sensor networks,embedded systems, real-time systems, and

multi-core parallel computing. He has received the best studentpaper awards at the 32nd IEEE Real-Time Systems Symposium(RTSS 2011) and at the 5th International Symposium on Parallel andDistributed Processing and Applications (ISPA 2007), and best papernomination at the 18th IEEE Real-Time and Embedded Technologyand Applications Symposium (RTAS 2012).

David Ferry is a PhD student in the depart-ment of Computer Science and Engineeringat Washington University in St. Louis. Hisresearch interests include parallel comput-ing, real-time parallel systems, and cyber-physical systems.

Jing Li is a third year Ph.D. student in Com-puter Science and Engineering departmentat Washington University in St. Louis. Shereceived the Bachelor of Science degreein Computer Science and Engineering fromHarbin Institute of Technology in China in2011. Her research interests include real-time parallel scheduling theory, real-time par-allel system and cyber-physical systems.

Kunal Agrawal is an assistant professor in the Department of Computer Science and Engineering at Washington University in Saint Louis. She completed her Ph.D. in 2009 at MIT on the topic Scheduling and Synchronization for Multicore Concurrency Platforms. Her research interests include scheduling of parallel programs, parallel algorithms and data structures, synchronization mechanisms, transactional memory, and cache-efficient algorithms.

Chenyang Lu is a Professor of Computer Science and Engineering at Washington University in St. Louis. Professor Lu is Editor-in-Chief of ACM Transactions on Sensor Networks, Area Editor of IEEE Internet of Things Journal, and Associate Editor of Real-Time Systems. He also serves as Program Chair of premier conferences such as the IEEE Real-Time Systems Symposium (RTSS 2012), the ACM/IEEE International Conference on Cyber-Physical Systems (ICCPS 2012), and the ACM Conference on Embedded Networked Sensor Systems (SenSys 2014). Professor Lu is the author and co-author of over 100 research papers with over 10000 citations and an h-index of 47. He received the Ph.D. degree from the University of Virginia in 2001, the M.S. degree from the Chinese Academy of Sciences in 1997, and the B.S. degree from the University of Science and Technology of China in 1995, all in computer science. His research interests include real-time systems, wireless sensor networks, and cyber-physical systems.

Christopher D. Gill is a Professor of Computer Science and Engineering at Washington University in St. Louis. His research includes formal modeling, verification, implementation, and empirical evaluation of policies and mechanisms for enforcing timing, concurrency, footprint, fault-tolerance, and security properties in distributed, mobile, embedded, real-time, and cyber-physical systems. Dr. Gill developed the Kokyu real-time scheduling and dispatching framework used in several AFRL and DARPA projects. He led development of the nORB small-footprint real-time object request broker at Washington University. He has also led research projects under which a number of real-time and fault-tolerant services for The ACE ORB (TAO) and the Component Integrated ACE ORB (CIAO) were developed. Dr. Gill has over 50 refereed technical publications and has an extensive record of service in review panels, standards bodies, workshops, and conferences for distributed real-time and embedded computing.


APPENDIX

SUPPLEMENTAL MATERIALS

Parallel Real-Time Scheduling of DAGs

Abusayeed Saifullah, David Ferry, Jing Li, Kunal Agrawal, Chenyang Lu, Christopher Gill

APPENDIX A

Pseudo Code of the Decomposition Algorithm

Algorithm 1: Decomposition Algorithm

Input: a DAG task τi with period and deadline Ti, total execution requirement Ci, critical path length Pi
Output: deadline D_i^j and offset Φ_i^j for each node W_i^j of τi

for each node W_i^j of τi do Φ_i^j ← 0; D_i^j ← 0; end
Represent τi as τ_i^syn;
θi ← Ci/(2Ti − Pi);           /* heavy-or-light threshold */
total_heavy ← 0;              /* number of heavy segments */
total_light ← 0;              /* number of light segments */
C_i^heavy ← 0;                /* total work of heavy segments */
P_i^light ← 0;                /* light segments' critical path length */
for each j-th segment of τ_i^syn do
    if m_i^j > θi then        /* it is a heavy segment */
        total_heavy ← total_heavy + 1;
        C_i^heavy ← C_i^heavy + m_i^j · e_i^j;
    else                      /* it is a light segment */
        total_light ← total_light + 1;
        P_i^light ← P_i^light + e_i^j;
    end
end
if total_heavy = 0 then       /* all segments are light */
    for each j-th segment of τ_i^syn do d_i^j ← (Ti/Pi) · e_i^j;
else if total_light = 0 then  /* all segments are heavy */
    for each j-th segment of τ_i^syn do d_i^j ← (Ti/Ci) · m_i^j · e_i^j;
else                          /* τ_i^syn has both heavy and light segments */
    for each j-th segment of τ_i^syn do
        if m_i^j > θi then    /* heavy segment */
            d_i^j ← ((Ti − Pi/2)/C_i^heavy) · m_i^j · e_i^j;
        else                  /* light segment */
            d_i^j ← ((Pi/2)/P_i^light) · e_i^j;
        end
    end
end
/* Remove segments; assign node deadlines and offsets */
for each node W_i^j of τi in breadth-first search order do
    if W_i^j belongs to segments k to r of τ_i^syn then
        D_i^j ← d_i^k + d_i^(k+1) + · · · + d_i^r;   /* node deadline */
    Φ_i^j ← max{ Φ_i^l + D_i^l : W_i^l is a parent of W_i^j };
end
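To make the pseudocode concrete, the following C++ sketch computes the segment deadlines exactly as Algorithm 1 does. The Segment type and its field names are illustrative, not part of the paper's implementation: width plays the role of m_i^j and length the role of e_i^j.

#include <vector>

struct Segment { int width; double length; };   // m_i^j and e_i^j

// Deadline d_i^j for every segment of tau_i^syn, following Algorithm 1.
std::vector<double> segmentDeadlines(const std::vector<Segment>& segs,
                                     double T, double C, double P) {
    double theta = C / (2 * T - P);   // heavy-or-light threshold
    double Cheavy = 0, Plight = 0;    // heavy work; light critical path
    int nHeavy = 0, nLight = 0;
    for (const Segment& s : segs) {
        if (s.width > theta) { nHeavy++; Cheavy += s.width * s.length; }
        else                 { nLight++; Plight += s.length; }
    }
    std::vector<double> d(segs.size());
    for (std::size_t j = 0; j < segs.size(); j++) {
        const Segment& s = segs[j];
        if (nHeavy == 0)           d[j] = (T / P) * s.length;             // all light
        else if (nLight == 0)      d[j] = (T / C) * s.width * s.length;   // all heavy
        else if (s.width > theta)  d[j] = ((T - P / 2) / Cheavy) * s.width * s.length;
        else                       d[j] = ((P / 2) / Plight) * s.length;
    }
    return d;
}

The node deadlines and offsets then follow by summing the deadlines of the segments a node spans and taking the maximum finish time over its parents, as in the final loop of Algorithm 1.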

APPENDIX B

An Example Decomposition

Figure 7 shows the complete decomposition of τi.

APPENDIX C

Implementation Considerations

This paper provides the algorithmic foundation for building a real-time parallel scheduler for parallel tasks. We now provide a sketch (Figure 8) of how it can be implemented on a real system. In principle, one can use any parallel language, such as OpenMP [9] or CilkPlus [10], that provides parallel programming support through library routines and directives. For example, OpenMP directives are compiler pragma statements that indicate where and how parallelization can occur within a program. One such directive converts a regular for loop to a parallel-for loop, by prefacing the loop with #pragma omp parallel for. The programmer can specify the task set as a set of parallel programs written in such a parallel language. To make these tasks real-time tasks, the programmer must also specify task deadlines and periods. We assume that these are specified in a separate task specification file that is also an input to the scheduler. In addition, the decomposition algorithm needs the execution requirement of each node in a task, which can either be specified in the task specification file or measured using a profiler.
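As a minimal illustration of this programming style, one node of a parallel task might look as follows; the function name and the per-element operation are hypothetical, chosen only to show the directive.

#include <omp.h>

// One node of a parallel task expressed as an OpenMP parallel-for:
// loop iterations may execute concurrently on several cores.
void sensorFusionNode(double* samples, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        samples[i] *= 0.5;   // stand-in for real per-sample processing
    }
}

A companion task specification file would then associate this program with, say, a 21 ms period and deadline; the concrete file format is left to the implementation.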

Using the task specification file and the task set, the scheduler computes the intermediate deadlines and release times for each node. In addition, the compiler decomposes the task into individual nodes/subtasks. Once the intermediate deadlines are known, we can use a global priority queue to keep the subtasks sorted by priority according to EDF. At runtime, the scheduler schedules these subtasks on m processors using OS support for scheduling priorities, runtime preemption, and synchronization. When a subtask becomes available, if a worker is free, it is simply scheduled on this worker; if no worker is free, it is added to the priority queue. In the preemptive scheduler, we can use Linux support for preemption to preempt tasks with lower priorities when a high-priority task becomes available. For non-preemptive scheduling, we disable preemption and add yield points after each node1. When a subtask yields at its yield point, the scheduler can schedule the highest-priority task that is available in the priority queue.
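A minimal sketch of this dispatch logic is given below, assuming a Subtask record carrying an absolute deadline and hypothetical platform hooks (idleWorker, dispatch, latestRunning) that are not part of the paper:

#include <queue>
#include <utility>
#include <vector>

struct Subtask {
    double deadline;   // absolute deadline = release time + node deadline
};

// EDF order: an earlier absolute deadline means a higher priority.
struct LaterDeadline {
    bool operator()(const Subtask* a, const Subtask* b) const {
        return a->deadline > b->deadline;
    }
};

std::priority_queue<Subtask*, std::vector<Subtask*>, LaterDeadline> readyQueue;

int idleWorker();                          // hypothetical: a free core, or -1
void dispatch(int core, Subtask* t);       // hypothetical: run t on a core
std::pair<Subtask*, int> latestRunning();  // hypothetical: latest-deadline runner

// Called when a subtask is released, i.e., its offset has elapsed and
// all of its parents have finished.
void onRelease(Subtask* t, bool preemptive) {
    int core = idleWorker();
    if (core >= 0) { dispatch(core, t); return; }
    if (preemptive) {
        auto [victim, victimCore] = latestRunning();
        if (victim->deadline > t->deadline) {   // t has higher EDF priority
            readyQueue.push(victim);            // preempt the victim ...
            dispatch(victimCore, t);            // ... and run t instead
            return;
        }
    }
    readyQueue.push(t);   // non-preemptive mode waits for a yield point
}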

APPENDIX D

Proof of Lemma 2

Proof: In each τ_i^dec, a node is released only after all of its parents finish execution. Hence, the precedence relations of the original task τi are retained in τ_i^dec (which represents all subtasks of τi). Besides, the time by which the last subtask of τ_i^dec has to finish is equal to the deadline of the original task τi, and the sum of the execution requirements of these subtasks is equal to the execution requirement of the original task τi. Hence, if τ^dec is preemptively schedulable, a preemptive schedule must exist for τ where each task in τ meets its deadline.

1. Most parallel languages already have this support, since this is when control returns to the scheduler. For others, these yield points can be added by the compiler.


Fig. 7. Decomposition of τi (shown in Figure 1) when Ti = 21. (a) Calculating segment deadlines of τ_i^syn; segment 6 is the only light segment. (b) Removing segment deadlines, and calculating node deadlines and offsets: the segment deadlines assigned to the different segments of a node in the previous step are removed, and their sum is assigned as the node's deadline.


APPENDIX E

Proof of Lemma 5

Proof: Since the decomposition converts each node of a DAG to an individual task, a fully non-preemptive scheduling of τ^dec preserves the node-level non-preemptive behavior of task set τ. The rest of the proof follows from Lemma 2.

A Tighter Bound for Non-Preemptive EDF

A resource augmentation of 4 + 2ρ for non-preemptive EDF is looser than the corresponding bound of 4 for preemptive EDF. This is mainly because non-preemptivity can cause a processor capacity reduction of up to ρ. Due to decomposition, this value increases to 2ρ (see Equation (30)). However, we can express the augmentation bound in a tighter form by using a tighter bound on the non-preemption overhead. As shown in Equation (30), the non-preemption overhead of the decomposed task is in fact at most Emax/Dmin. But we used a pessimistic upper bound of this value by replacing Dmin with Emin, since the value of Dmin is unknown before decomposition; Emin is a lower bound on Dmin and is known (from the input) before decomposition. Therefore, if we state the bound after decomposition, we can use Emax/Dmin as the maximum non-preemption overhead. Using this value of the non-preemption overhead in Theorem 7, our bound becomes 4 + Emax/Dmin, which can be much smaller than 4 + 2ρ.
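As an illustrative calculation with hypothetical numbers (not measurements from the paper): if decomposition yields a smallest node deadline Dmin = 50 while the largest node execution requirement is Emax = 5, the post-decomposition bound is 4 + 5/50 = 4.1; if additionally Emin = 0.5 (so ρ = Emax/Emin = 10), the a-priori bound would instead be 4 + 2ρ = 24.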

Notably, the work in [26] has identified a large class of applications, such as high-performance web and data servers, that consist of many real-time tasks, called liquid tasks, in which the smallest deadline of any job in the system is orders of magnitude greater than the largest execution requirement of any job. Upon decomposing liquid tasks, the value of Dmin can be very close to Emax. Thus the value of Emax/Dmin approaches 1, and a resource augmentation of 4 + Emax/Dmin is tight and quite useful in scheduling liquid parallel tasks. Our result provides the first such bound for non-preemptive real-time scheduling of parallel tasks, and provides a basis for future work on deriving tighter bounds for all classes of real-time tasks.


Fig. 8. Scheduler components. Compile-time decomposition takes a parallel task set (in OpenMP or CilkPlus) and a task-set specification, computes intermediate deadlines, assigns node deadlines and offsets, and adds yield points after nodes for non-preemptive scheduling. The run-time scheduler keeps subtasks sorted by EDF in a priority queue and dispatches them to workers 1 through m.


APPENDIX F

Event-Driven Simulation

The derived resource augmentation bounds provide a sufficient condition for schedulability. Namely, if a set of DAG tasks is schedulable on a unit-speed m-core machine by a (potentially non-existing) ideal scheduler, then the tasks, upon our proposed decomposition, are guaranteed to be schedulable under global EDF on an m-core machine where each core has a speed of 4 (with preemption) or 4 + 2ρ (without preemption).

In simulations, we first randomly create tasks and then calculate subtask deadlines using our proposed decomposition method. We then simulate the execution of these subtasks. The environment consists of m cores and a global priority queue that keeps subtasks in order of EDF priority. An event occurs when a subtask is released or completed. When a subtask t is released, preemptive and non-preemptive schedulers behave differently. In a non-preemptive scheduler, two things can occur: (i) if a core is free, then t is scheduled on that core; (ii) if all cores are busy, then t is added to the priority queue. In a preemptive scheduler, if all cores are busy but some subtask s with a deadline later than t's deadline is executing, then s is preempted and placed in the priority queue, and t is scheduled in its place. When a subtask completes, the highest-priority subtask from the queue is executed on the core that has just become free. This is a simple simulator that only simulates task executions, ignoring overheads due to migration, cache misses, preemption, and synchronization.
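The skeleton below sketches such an event-driven loop in C++; the event payload and the scheduling actions (left as comments) follow the description above, and all names are illustrative rather than taken from the paper's simulator.

#include <functional>
#include <queue>
#include <vector>

// Event-driven simulation: time jumps from one event to the next.
// Overheads (migration, cache misses, preemption) are ignored.
enum class Kind { Release, Completion };
struct Event {
    double time; Kind kind; int subtask;
    bool operator>(const Event& e) const { return time > e.time; }
};

void simulate(double horizon) {
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> events;
    // ... seed with the initial release of every subtask ...
    while (!events.empty() && events.top().time <= horizon) {
        Event e = events.top(); events.pop();
        if (e.kind == Kind::Release) {
            // run on a free core, preempt the latest-deadline runner
            // (preemptive mode only), or park in the EDF ready queue;
            // on dispatch, push a Completion at e.time + execution time
        } else {
            // core freed: dispatch the earliest-deadline ready subtask;
            // record a miss if e.time exceeded the finished one's deadline
        }
    }
}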

Task and Task Set Generation

We want to evaluate our scheduler using task sets that an optimal scheduler could schedule on unit-speed processors. However, as we cannot determine this ideal scheduler, we assume that an ideal scheduler can schedule any task set whose total utilization is at most m and in which each individual task is schedulable in isolation (i.e., its critical path length is no greater than its deadline). Therefore, in our experiments, for each value of m (the number of cores), we generate task sets whose utilization is exactly m, fully loading a machine of unit-speed processors.

We use the Erdős-Rényi method G(ni, p) [25], as presented below, to generate task sets for evaluation.

Number of nodes. To generate a DAG τi, we pick the number of nodes ni uniformly at random in the range [50, 350]. These values allow us to generate varied task sets within a reasonable amount of time.

Adding edges. We add edges to the graph using the Erdős-Rényi method G(ni, p) [25]. We scan all possible edges directed from a lower node id to a higher node id, which avoids introducing a cycle into the graph. For each possible edge, we generate a random value in the range [0, 1] and add the edge only if the generated value is less than a predefined probability p. (We vary p in our experiments to explore its effect.) Finally, we add the minimum number of additional edges needed so that each node (except the first and the last) has at least one incoming and one outgoing edge, making the DAG weakly connected. Note that the critical path length of a DAG generated using the pure Erdős-Rényi method increases as p increases. Since our method is slightly modified, the critical path is also large when p is small. Hence, as p increases, the critical path first decreases up to a certain value of p and then increases again.
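A sketch of this generation step, under the stated convention that edges only go from lower to higher node ids (so acyclicity holds by construction):

#include <random>
#include <utility>
#include <vector>

// Erdos-Renyi style G(n, p): each forward edge is kept with probability p.
std::vector<std::pair<int,int>> randomDag(int n, double p, std::mt19937& rng) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::vector<std::pair<int,int>> edges;
    for (int u = 0; u < n; u++)
        for (int v = u + 1; v < n; v++)
            if (coin(rng) < p) edges.push_back({u, v});
    // A post-pass (omitted) adds the minimum number of extra edges so
    // that every interior node has an incoming and an outgoing edge.
    return edges;
}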

Execution time of nodes. We assign every node an execution time chosen randomly from a specified range. The range is based on the value and type (continuous or discrete) of the non-preemption overhead ρ (explained in the next subsection).

At this point, we have the DAG structure and the execution times of its nodes. For each DAG τi, we now assign a period Ti that is no less than the critical path length Pi. We consider two types of task sets:

Task sets with harmonic periods. These periods are carefully picked to be multiples of each other, so as to ensure that we can run our experiments up to the hyper-period of the task sets. In particular, we pick periods that are powers of two. We find the smallest value a such that Pi ≤ 2^a, and randomly set Ti to be one of 2^a, 2^(a+1), or 2^(a+2). We choose such periods because we want both high-utilization and low-utilization tasks. The ratio Pi/Ti of a task lies in the range (1/2, 1], (1/4, 1/2], or (1/8, 1/4] when its period Ti is 2^a, 2^(a+1), or 2^(a+2), respectively.

Task sets with arbitrary periods. We first generate a random number Gamma(2, 1) using the gamma distribution [27]. Then we set the period Ti to be (Pi + Ci/(0.5m)) * (1 + 0.25 * Gamma(2, 1)). We choose this formula for three reasons. First, we want to ensure that the assigned value is a valid period, i.e., Pi ≤ Ti. Second, we want to ensure that each task set contains a reasonable number of tasks even when m is small; at the same time, with more cores, we do not want to limit the average DAG utilization to some small value, hence the minimum period is a function of m. Third, while we want the average period to be close to the minimum valid period (to obtain high-utilization tasks), we also want some tasks with large periods. Table 1 shows the average number of DAGs per task set achieved by this random period generation process.
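Both period-assignment rules are compact enough to state directly in code; the following sketch mirrors the two formulas above (the function names are ours):

#include <cmath>
#include <random>

// Harmonic: smallest a with P <= 2^a, then T in {2^a, 2^(a+1), 2^(a+2)}.
double harmonicPeriod(double P, std::mt19937& rng) {
    int a = static_cast<int>(std::ceil(std::log2(P)));
    std::uniform_int_distribution<int> k(0, 2);
    return std::pow(2.0, a + k(rng));
}

// Arbitrary: T = (P + C/(0.5 m)) * (1 + 0.25 * Gamma(2, 1)).
double arbitraryPeriod(double P, double C, int m, std::mt19937& rng) {
    std::gamma_distribution<double> gamma(2.0, 1.0);   // shape 2, scale 1
    return (P + C / (0.5 * m)) * (1.0 + 0.25 * gamma(rng));
}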

Experimental Methodology

We experiment by varying the following four parameters.

Harmonic vs. arbitrary periods. We want to evaluate whether arbitrary periods are easier or harder to schedule than harmonic ones. For harmonic-period task sets, we run the simulation up to their hyper-period. For arbitrary-period task sets, the hyper-period can be too long to simulate, and hence we run the simulation up to 20 times the maximum period.

Number of cores (m). We want to evaluate whether parallel scheduling becomes easier or harder as the number of cores increases. We run experiments with m = 4, 8, 16, and 32.

Probability of an edge (p). As stated before, p affects the critical path length, the density, and the structure of the DAG. We test 14 values of p: 0.01, 0.02, 0.03, 0.05, 0.07, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9.

Non-preemption overhead (ρ). This is the ratio of the maximum node execution requirement to the minimum node execution requirement. For non-preemptive EDF scheduling, the resource augmentation bound increases as ρ increases. We want to evaluate whether the effect of increased ρ is really that severe in practice. For all of our experiments, we set the minimum node execution requirement to 50 and vary the maximum. To get ρ = 1, 2, 5, and 10, the maximum execution requirements are chosen to be 50, 100, 250, and 500, respectively. In addition, when we evaluate the performance of non-preemptive EDF, we want to maximize the influence of ρ. Therefore, besides generating node execution times uniformly between the minimum and maximum (called continuous ρ), we also generate them by choosing from the discrete values 50, 2·50, ..., ρ·50 (called discrete ρ).
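The two node-execution-time generators can be sketched as follows, with the paper's minimum of 50:

#include <random>

// Continuous rho: execution time uniform over [50, 50 * rho].
double continuousExecTime(int rho, std::mt19937& rng) {
    std::uniform_real_distribution<double> u(50.0, 50.0 * rho);
    return u(rng);
}

// Discrete rho: execution time uniform over {50, 100, ..., 50 * rho}.
double discreteExecTime(int rho, std::mt19937& rng) {
    std::uniform_int_distribution<int> k(1, rho);
    return 50.0 * k(rng);
}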

Detailed Results

Of the 896 combinations of parameters (each with 1000 task sets) we have tested, preemptive EDF has a maximum required speed of 3.2 to meet all deadlines (this data point is not shown in the figures for better resolution), which is close to our analytical resource augmentation bound of 4. In contrast, among the combinations of parameters with ρ = 1, 2, 5, and 10, the maximum required speeds for non-preemptive EDF are 4.0, 5.8, 8.6, and 12.6, respectively, which are much smaller than the corresponding analytical bounds of 6, 8, 14, and 24. We discuss these observations as we present the results.

Effect of harmonic vs. arbitrary periods. We find that it is slightly harder to schedule harmonic-period task sets using preemptive EDF, and vice versa for non-preemptive EDF. However, the difference is minor, and the trends are very similar under both. Hence, we show only the experiments with arbitrary periods.

Effect of p in preemptive scheduling. For each value of p, Figure 3 shows the failure ratio, defined as the ratio of the number of task sets in which some task missed a deadline to the total number of task sets attempted (1000 in our experiments). To preserve the resolution of the figure, we show the results for only 7 (out of 14) values of p. In these experiments, ρ = 2 and m = 32. Note that the failure ratio increases as p increases from 0.01 to 0.1, and then falls again. As explained earlier, as p increases, the critical-path length first decreases (making the tasks more


TABLE 1
Number of tasks per task set

m\p   0.01  0.02  0.03  0.05  0.07  0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
 4     4     4     4     4     4     4     5     6     6     7     7     8     8     8
 8     4     4     4     4     5     5     7     8     9    10    11    12    13    14
16     4     5     5     6     6     7    10    12    15    17    19    20    22    24
32     5     6     7     8     9    11    17    22    26    30    34    37    41    45

"parallel" or "DAG-like") and then increases again (making the tasks more sequential). Therefore, for both small and large p, the tasks are largely sequential. These results seem to conform to our intuition that, in general, parallel tasks are more difficult to schedule than sequential ones. The results for 4, 8, and 16 cores also follow this trend, and hence are omitted.

Effect of m in preemptive scheduling. Figure 4 shows the failure ratio on a logarithmic scale for each value of m, with p = 0.2 and ρ = 2. The failure ratio increases with m, indicating that it is harder to schedule on larger numbers of cores. The trend is similar for other values of p, and hence is not shown.

Effect of ρ in non-preemptive scheduling. The most important factor to evaluate is the effect of ρ. Figure 5 shows the failure ratio for discrete ρ for each value of ρ, with fixed p = 0.2 and m = 8. As ρ increases, the failure ratio becomes much higher, as expected. However, this trend is much weaker for continuous ρ, and we omit those results. The following may explain this anomaly: the maximum value of ρ affects the schedule only if a node with the maximum execution requirement interferes with a node with the minimum execution requirement. Since continuous ρ draws each node's execution requirement from many different values, only a small number of nodes lie at these extremes, reducing the chance of such interference.

Effect of m in non-preemptive scheduling. Figure 6 shows the required speed for each combination of m and ρ, with p = 0.2. This figure differs from the previous ones in that it shows the speed at which all task sets become schedulable. For each value of m, the required speed increases with ρ, as expected. This trend weakens as m increases. One possible reason is that with more cores, the overhead from interference between an executing low-priority subtask and a newly released higher-priority subtask is, on average, smaller: the overhead is the minimum remaining work over all m running lower-priority subtasks, rather than the average or worst-case subtask execution time. When m is larger, this minimum is much smaller than the average, making the system much less sensitive to increases in ρ.

The simulation results show a maximum speed requirement of 3.2 for preemptive EDF, suggesting that our analytical resource augmentation bound of 4 is reasonably tight. While the corresponding bounds for non-preemptive EDF appear looser in our simulation results, we clarify the tightness and practical usefulness of this bound from two points of view. First, considering that the bound of 4 for preemptive EDF is tight, it is unlikely that a bound better than 4 + ρ can be derived for non-preemptive EDF, since non-preemptivity can cause a processor capacity reduction of up to ρ in the worst case. For non-preemptivity in scheduling the decomposed tasks, the processor capacity reduction can be up to 2ρ in extreme cases, requiring a speed increase of 2ρ beyond that for preemptive scheduling. Hence, there may be task sets that require a resource augmentation of 4 + 2ρ, but our simulation does not encounter them; our results may be an artifact of an experimental setup and random task generation that are unlikely to produce worst-case task sets. Second, as explained in Appendix E, there are practical task sets for which such a bound is tight. Specifically, the work in [26] has identified a large class of applications, such as high-performance web and data servers, that consist of many real-time tasks, called liquid tasks, in which the smallest deadline of any job in the system is orders of magnitude greater than the largest execution requirement of any job. As the non-preemption overhead for these tasks approaches a very small value (close to 1), the resulting resource augmentation of 4 + Emax/Dmin is tight and quite useful in scheduling liquid parallel tasks. Deriving tighter bounds for all classes of real-time tasks is important future work.

