
Dynamic Critical-Path Scheduling: An Effective Technique for Allocating Task Graphs to Multiprocessors

Yu-Kwong Kwok and Ishfaq Ahmad1

Department of Computer Science

The Hong Kong University of Science and Technology

Clear Water Bay, Kowloon, Hong Kong

Corresponding Author: Ishfaq Ahmad

Email: [email protected], Phone: (852) 2358-6980

Abstract

In this paper, we propose a static scheduling algorithm for allocating task graphs to fully-connected multiprocessors. We discuss six recently reported scheduling algorithms and show that they possess one drawback or another which can lead to poor performance. The proposed algorithm, which is called the Dynamic Critical-Path (DCP) scheduling algorithm, is different from the previously proposed algorithms in a number of ways. First, it determines the critical path of the task graph and selects the next node to be scheduled in a dynamic fashion. Second, it rearranges the schedule on each processor dynamically, in the sense that the positions of the nodes in the partial schedules are not fixed until all nodes have been considered. Third, it selects a suitable processor for a node by looking ahead at the potential start times of the remaining nodes on that processor, and schedules relatively less important nodes to the processors already in use. A global as well as a pair-wise comparison is carried out for all seven algorithms under various scheduling conditions. The DCP algorithm outperforms the previous algorithms by a considerable margin. Despite having a number of new features, the DCP algorithm has admissible time complexity, is economical in terms of the number of processors used, and is suitable for a wide range of graph structures.

Keywords: Algorithms, Clustering, List Scheduling, Multiprocessors, Processor Allocation, Parallel Scheduling, Task Graphs.

1. This research was supported by the Hong Kong Research Grants Council under contract number HKUST 179/93E.


1 Introduction

An efficient scheduling of a parallel program onto the processors is vital for achieving high performance from a parallel computer system. When the structure of the parallel program, in terms of its task execution times, task dependencies, task communications and synchronization, is known a priori, scheduling can be accomplished statically at compile time. The objective is to minimize the schedule length. It is well known, however, that multiprocessor scheduling for most precedence-constrained task graphs is an NP-complete problem in its general form [12], [21]. To tackle the problem, simplifying assumptions have been made regarding the task graph structure representing the program and the model for the parallel processor systems [7], [14]. However, the problem is NP-complete even in two simple cases: (1) scheduling unit-time tasks to an arbitrary number of processors [15], and (2) scheduling one- or two-time-unit tasks to two processors [9]. There are only two special cases for which optimal polynomial-time algorithms exist. These cases are: scheduling tree-structured task graphs with identical computation costs on an arbitrary number of processors, and scheduling arbitrary task graphs with identical computation costs on two processors [18], [33]. However, even in these cases, no communication is assumed among the tasks of the parallel program. It has been shown that scheduling an arbitrary task graph with inter-task communication onto two processors is NP-complete, and scheduling a tree-structured task graph with inter-task communication onto a system with an arbitrary number of processors is also NP-complete [25].

For more realistic cases, a scheduling algorithm needs to address a number of issues. It should exploit the parallelism by identifying the task graph structure, and take into consideration task granularity and arbitrary computation and communication costs. Moreover, in order to be of practical use, a scheduling algorithm should have low complexity and should be economical in terms of the number of processors used [3], [11]. Because of its vital importance, the scheduling problem continues to be a focus of attention in the research community [4], [5], [8], [13], [16], [17], [19], [20], [22], [23], [24], [27], [28], [29], [30], [31], [34]. In this paper, we propose a new static scheduling algorithm. The proposed algorithm, which is called the Dynamic Critical Path (DCP) algorithm, schedules task graphs with arbitrary computation and communication costs to a multiprocessor system with an unlimited number of fully-connected identical processors. The DCP algorithm tackles the drawbacks of previous approaches and outperforms them by a considerable margin. The algorithm has admissible time complexity. It is also economical in terms of the number of processors used and is suitable for different types of graph structures.

The remainder of this paper is organized as follows. In the next section, we describe the background of the scheduling problem, including some of the major issues involved. In Section 3, we describe six recently reported scheduling algorithms. The merits and limitations of these algorithms are discussed briefly. In Section 4, we describe our DCP scheduling algorithm and discuss its design principles as well as its properties. In Section 5, we use an example to illustrate the functionality of all seven algorithms. In Section 6, we provide the experimental results and a comparison of all algorithms. Section 7 provides the concluding remarks.

2 Background

A parallel program can be represented by a directed acyclic graph G = (V, E), where V


is the set of nodes (|V| = v) and E is the set of edges (|E| = e). A node in the parallel program graph represents a task, which is a set of instructions that must be executed serially on the same processor. Associated with each node is its computation cost, denoted by w(ni), which indicates the task execution time. The edges in the parallel program graph correspond to the communication messages and precedence constraints among the nodes. Associated with each edge is a number indicating the time required for communicating the data from one node to another. This number is called the communication cost of the edge and is denoted by cij. Here, the subscript ij indicates that the directed edge emerges from the source node ni and is incident on the destination node nj. The source node and the destination node of an edge are called the parent node and the child node, respectively. In a task graph, a node which does not have any parent is called an entry node, while a node which does not have any child is called an exit node. A node cannot start execution before it has gathered all of the messages from its parent nodes. The communication-to-computation ratio (CCR) of a parallel program is defined as its average communication cost divided by its average computation cost on a given system. We assume each processor in the system possesses dedicated hardware to handle communication, so that communication can take place simultaneously with computation. The communication cost between two nodes assigned to the same processor is assumed to be zero.

The objective of static scheduling is to assign the nodes of the task graph to the processors such that the schedule length, or makespan, is minimized without violating the precedence constraints. A schedule is considered efficient if the schedule length is short and the number of processors used is reasonable. There are many approaches that can be employed in static scheduling. These include queuing theory, graph-theoretic approaches, mathematical programming and state-space search [6], [14]. In the classical approach [1], [9], which is also called list scheduling, the basic idea is to make an ordered list of nodes by assigning them some priorities, and then repeatedly execute the following two steps until a valid schedule is obtained.

1) Select from the list the node with the highest priority for scheduling.

2) Select a processor to accommodate this node.

The priorities are determined statically before the scheduling process begins. In the scheduling process, the node with the highest priority is chosen for scheduling. In the second step, the best possible processor, that is, the one which allows the earliest start time, is selected to accommodate this node. Most of the reported scheduling algorithms based on this concept [15], [18], [30] employ variations in the priority assignment methods, such as HLF (Highest Level First), LP (Longest Path), LPT (Longest Processing Time) and CP (Critical Path).
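The two-step loop above can be sketched as follows. This is only an illustration, not code from the paper: the task graph mirrors Figure 1(a), with edge communication costs assumed from the example's schedule lengths, and the static priorities are supplied as precomputed numbers.

```python
def list_schedule(w, c, priority, num_pe=2):
    """Classical list scheduling: repeatedly pick the highest-priority ready
    node (step 1), then place it on the processor giving the earliest start
    time (step 2)."""
    preds = {n: [u for (u, v) in c if v == n] for n in w}
    finish, pe_of = {}, {}        # finish time and processor of scheduled nodes
    pe_free = [0] * num_pe        # time at which each processor becomes free
    while len(finish) < len(w):
        ready = [n for n in w if n not in finish
                 and all(p in finish for p in preds[n])]
        n = max(ready, key=lambda x: priority[x])        # step 1
        best_pe, best_st = 0, None
        for pe in range(num_pe):                         # step 2
            st = pe_free[pe]
            for p in preds[n]:
                lag = 0 if pe_of[p] == pe else c[(p, n)] # zero cost on same PE
                st = max(st, finish[p] + lag)
            if best_st is None or st < best_st:
                best_pe, best_st = pe, st
        pe_of[n], finish[n] = best_pe, best_st + w[n]
        pe_free[best_pe] = finish[n]
    return max(finish.values())                          # schedule length

# Figure 1(a) node weights; edge costs are assumed for illustration
w = {"n1": 5, "n2": 20, "n3": 10, "n4": 8}
c = {("n1", "n2"): 1, ("n1", "n3"): 20, ("n2", "n4"): 1, ("n3", "n4"): 10}
# computation-only levels as static priorities (HLFET-style)
makespan = list_schedule(w, c, {"n1": 33, "n2": 28, "n3": 18, "n4": 8})
```

With these priorities the loop reproduces the single-processor schedule of length 43 discussed below.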

The main problem with list scheduling algorithms is that static priority assignment may not always order the nodes for scheduling according to their relative importance. A node is more important than other nodes if timely scheduling of the node can lead to a better schedule eventually. The drawback of a static approach is that an inefficient schedule may be generated if a relatively less important node is chosen for scheduling before the more important ones. Static priority assignment may not capture the variation in the relative importance of nodes during the scheduling process. For example, consider the task graph shown in Figure 1(a). Here, a schedule is produced using the HLFET (Highest Levels First with Estimated Times) algorithm [1], [18], which determines the priority of a node by



computing its level. The level of a node is the largest sum of computation costs along a path from the node to an exit node. The node with a higher level gets a higher priority. The HLFET algorithm schedules the nodes in the order n1, n2, n3, n4. The schedule is shown in Figure 1(b), in which all the nodes are scheduled to one processor (PE denotes a processor); the schedule length is 43 time units. However, the schedule length can be reduced, as shown in Figure 1(c), if we schedule the nodes in the order n1, n3, n2, n4. At the second scheduling step, n3 is a relatively more important node than n2 because if it is not scheduled to start earlier on a processor, the start time of n4 will be delayed due to the large communication costs along the path n1-n3-n4. Thus, the HLFET algorithm does not precisely identify the most important node at each scheduling step, as it orders nodes by assigning each of them a static attribute which does not depend on the communication among nodes.
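The level computation can be written as a short recursion. The sketch below uses the Figure 1(a) node weights; it is an illustration, not the authors' code.

```python
w = {"n1": 5, "n2": 20, "n3": 10, "n4": 8}   # Figure 1(a) computation costs
children = {"n1": ["n2", "n3"], "n2": ["n4"], "n3": ["n4"], "n4": []}

def level(n):
    # largest sum of computation costs on any path from n to an exit node;
    # communication costs are deliberately ignored, which is HLFET's weakness
    return w[n] + max((level(ch) for ch in children[n]), default=0)

levels = {n: level(n) for n in w}
hlfet_order = sorted(w, key=lambda n: -levels[n])   # n1, n2, n3, n4
```

The resulting order n1, n2, n3, n4 matches the HLFET schedule above, even though scheduling n3 before n2 would be better.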

As can be seen from the above simple example, a scheduling algorithm may generate very inefficient schedules if it cannot assign accurate priorities to nodes. One important attribute of a task graph that can be used to determine node priorities accurately is explained in the following definition.

Definition 1: A Critical Path (CP) of a task graph is a set of nodes and edges, forming a path from an entry node to an exit node, of which the sum of computation costs and communication costs is the maximum.

The CP of a task graph potentially determines the schedule length because the cumulative computation cost of the nodes on the CP is the lower bound on the schedule length. Indeed, the final schedule length is the length of the “critical path” of the scheduled task graph. For example, the CP of the task graph shown in Figure 1(a) is the path n1-n3-n4 (shown in thick arrows), while the CP of the scheduled graph shown in Figure 1(d) is the path n1-n2-n4. Thus, generating an efficient schedule requires proper scheduling of the nodes on the CP. We will further elaborate on this issue in Section 4.
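Definition 1 can be turned into a small dynamic program over the DAG: the heaviest path out of a node is its computation cost plus the largest (communication + path) continuation through a child. A sketch with the Figure 1(a) node weights and assumed edge costs:

```python
w = {"n1": 5, "n2": 20, "n3": 10, "n4": 8}
# edge communication costs assumed for illustration
c = {("n1", "n2"): 1, ("n1", "n3"): 20, ("n2", "n4"): 1, ("n3", "n4"): 10}
children = {n: [v for (u, v) in c if u == n] for n in w}

def cp_from(n):
    """Length and node list of the heaviest comp+comm path starting at n."""
    best_len, best_path = 0, []
    for ch in children[n]:
        length, path = cp_from(ch)
        if c[(n, ch)] + length > best_len:
            best_len, best_path = c[(n, ch)] + length, path
    return w[n] + best_len, [n] + best_path

# the CP is the heaviest such path over all entry nodes
entries = [n for n in w if not any(v == n for (u, v) in c)]
cp_len, cp = max(cp_from(n) for n in entries)
```

Under these assumed costs the CP is n1-n3-n4 with length 53, consistent with the figure.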

In order to avoid scheduling less important nodes before the more important ones, node priorities can be determined dynamically during the scheduling process. The priorities of nodes are re-computed after a node has been scheduled in order to capture the changes in the relative importance of nodes. Thus, the following three steps are repeatedly executed in this kind of scheduling algorithm.

1) Determine new priorities of all unscheduled nodes.


Figure 1: (a) A task graph; (b) The schedule generated by the HLFET, MCP, ETF and DLS algorithms (schedule length = 43 time units); (c) The schedule generated by the DSC algorithm (schedule length = 34 time units); (d) The schedule generated by the EZ and MD algorithms (schedule length = 35 time units).


2) Select the node with the highest priority for scheduling.

3) Select the most suitable processor to accommodate this node.

Scheduling algorithms which employ the above three-step approach can potentially generate better schedules [13], [35]. However, this can increase the complexity of the algorithm.

3 Related Work

In this section, six recently reported scheduling algorithms and their characteristics are described. These are the Edge-zeroing (EZ) algorithm [32], the Modified Critical Path (MCP) algorithm [37], the Mobility Directed (MD) algorithm [37], the Earliest Task First (ETF) algorithm [16], the Dynamic Level Scheduling (DLS) algorithm [35] and the Dominant Sequence Clustering (DSC) algorithm [36].

3.1 The EZ Algorithm

As opposed to the CP-based algorithms, the EZ algorithm attempts to reduce the partial schedule length at each step by considering the highest-cost edge in the task graph. At each scheduling step, the algorithm schedules the two nodes incident on the heaviest communication edge to the same processor, provided the partial schedule length does not increase. To do this, the EZ algorithm first constructs a list of edges in decreasing order of communication costs. It then removes the first edge from the list and schedules the two incident nodes to the same processor if the partial schedule length is not increased. If the partial schedule length would be increased by such scheduling, the two nodes are scheduled to two distinct processors. The nodes within the same processor are maintained in decreasing order of their levels (levels are computed with the same method as used by the HLFET algorithm). The process is repeated until all nodes are scheduled. The complexity of the EZ algorithm is shown to be O(e(e + v)).

For the task graph shown in Figure 1(a), the EZ algorithm generates the schedule shown in Figure 1(d). It is apparent from this example that the criterion used by the EZ algorithm to select nodes for scheduling does not properly identify the most important node at each scheduling step. For this task graph, the EZ algorithm schedules the nodes in the order n1, n3, n4, n2. After n1 and n3 are scheduled, the highest-cost edge is (n3, n4). Thus, n4 is scheduled to PE 0. However, n2 cannot subsequently be scheduled to PE 0 without increasing the schedule length. This results in an inefficient schedule.
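The edge-zeroing loop can be sketched compactly: clusters start as singletons, edges are visited heaviest first, and a merge is kept only when the resulting schedule length (with each cluster's nodes run in decreasing-level order) does not grow. The graph below mirrors Figure 1(a) with assumed edge costs; this is an illustration, not the original implementation.

```python
w = {"n1": 5, "n2": 20, "n3": 10, "n4": 8}
# edge communication costs assumed for illustration
c = {("n1", "n2"): 1, ("n1", "n3"): 20, ("n2", "n4"): 1, ("n3", "n4"): 10}

def level(n):
    kids = [v for (u, v) in c if u == n]
    return w[n] + max((level(v) for v in kids), default=0)

def schedule_length(cluster):
    # nodes sharing a cluster run on one processor in decreasing-level order;
    # decreasing level is also a topological order, so parents finish first
    preds = {n: [u for (u, v) in c if v == n] for n in w}
    finish, pe_free = {}, {}
    for n in sorted(w, key=lambda x: -level(x)):
        st = pe_free.get(cluster[n], 0)
        for p in preds[n]:
            st = max(st, finish[p] + (0 if cluster[p] == cluster[n] else c[(p, n)]))
        finish[n] = st + w[n]
        pe_free[cluster[n]] = finish[n]
    return max(finish.values())

cluster = {n: i for i, n in enumerate(w)}       # every node on its own processor
for (u, v) in sorted(c, key=lambda e: -c[e]):   # heaviest edge first
    merged = {n: (cluster[u] if cluster[n] == cluster[v] else cluster[n])
              for n in w}
    if schedule_length(merged) <= schedule_length(cluster):
        cluster = merged                        # "zero" the edge
final_len = schedule_length(cluster)
```

Tracing it reproduces the behaviour described above: edges (n1, n3) and (n3, n4) are zeroed, the merge of n2 is rejected, and the final length is 35 (Figure 1(d)).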

3.2 The MCP Algorithm

The MCP algorithm is designed based on an attribute called the latest possible start time of a node. A node’s latest possible start time is determined through the as-late-as-possible (ALAP) binding, which is done by traversing the task graph upward from the exit nodes to the entry nodes and by pulling the nodes downwards as much as possible, constrained by the length of the CP. The MCP algorithm first computes the latest possible start times of all nodes. Then, each node is associated with a list of latest possible start times, which consists of the latest possible start time of the node itself, followed by the latest possible start times of its children in decreasing order. The MCP algorithm then constructs a list of nodes in increasing lexicographical order of these lists. At each scheduling step, the first node is removed from the list and scheduled to the processor that allows the earliest start time. The MCP algorithm was originally designed for a bounded number of processors. The complexity of the MCP algorithm is shown to be O(v² log v).


Page 6: Dynamic Critical-Path Scheduling: An Effective Technique ...€¦ · Yu-Kwong Kwok and Ishfaq Ahmad1 Department of Computer Science The Hong Kong University of Science and Technology

- 6 -

The MCP algorithm assigns higher priorities to nodes which have smaller latest possible start times. However, the MCP algorithm does not necessarily schedule nodes on the CP first. For example, consider the task graph in Figure 1(a) again. Here, the MCP algorithm schedules the nodes in the same order as the HLFET algorithm and hence generates the same schedule (shown in Figure 1(b)). The MCP algorithm does not assign node priorities accurately even though it takes communication among nodes into account when computing the priorities.
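The ALAP binding admits a short closed form: a node's latest possible start time is the CP length minus the heaviest comp+comm path from that node to an exit. A sketch on the Figure 1(a) graph (edge costs assumed for illustration):

```python
w = {"n1": 5, "n2": 20, "n3": 10, "n4": 8}
# edge communication costs assumed for illustration
c = {("n1", "n2"): 1, ("n1", "n3"): 20, ("n2", "n4"): 1, ("n3", "n4"): 10}
children = {n: [v for (u, v) in c if u == n] for n in w}

def blen(n):
    # heaviest comp+comm path from n down to an exit node, including w[n]
    return w[n] + max((c[(n, v)] + blen(v) for v in children[n]), default=0)

cp_len = max(blen(n) for n in w)            # CP length (53 here)
alap = {n: cp_len - blen(n) for n in w}     # latest possible start times

# MCP's priority lists: own ALAP first, then children's ALAPs in decreasing order
key = {n: [alap[n]] + sorted((alap[v] for v in children[n]), reverse=True)
       for n in w}
mcp_order = sorted(w, key=lambda n: key[n]) # increasing lexicographic order
```

For this graph the ALAP times are n1 = 0, n2 = 24, n3 = 25, n4 = 45, so MCP orders the nodes n1, n2, n3, n4, the same order as HLFET, as noted above.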

3.3 The MD Algorithm

The MD algorithm selects a node at each step for scheduling based on an attribute called the relative mobility. The mobility of a node is defined as the difference between the node’s earliest start time and latest start time. Similar to the ALAP binding mentioned above, the earliest possible start time is assigned to each node through the as-soon-as-possible (ASAP) binding, which is done by traversing the task graph downward from the entry nodes to the exit nodes and by pulling the nodes upward as much as possible. Relative mobility is obtained by dividing the mobility by the node’s computation cost. Essentially, a node with zero mobility is a node on the CP. At each step, the MD algorithm schedules the node with the smallest mobility to the first processor which has a large enough time slot to accommodate the node, without considering the minimization of the node’s start time. After a node is scheduled, all the relative mobilities are updated. The complexity of the MD algorithm is O(v³).

As opposed to the MCP algorithm, the MD algorithm determines node priorities dynamically. Although the MD algorithm can correctly identify the CP nodes for scheduling at each step, the selection of a suitable time slot and processor is not done properly. The major problem with the MD algorithm is that it pushes scheduled nodes downwards to create a large enough time slot to accommodate a new node without paying any regard to the degradation in the schedule length. Pushing down the nodes may increase the final schedule length. The second drawback of the MD algorithm is that it looks for a suitable processor by scanning the processors one by one, starting with the first processor. This processor selection criterion makes no effort to minimize the start times of nodes at each step. Another problem with the MD algorithm is that it inserts a node into an idle time slot on a processor without considering whether the descendants of that node can be scheduled in a timely manner. The schedule generated by the MD algorithm for the task graph in Figure 1(a) is the same as the one generated by the EZ algorithm (shown in Figure 1(d)). When node n4 is considered, it is found that there is a time slot on PE 0 large enough to accommodate it. The MD algorithm schedules n4 to PE 0 without considering other processors. As a result, a longer schedule length is obtained.
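The mobility computation can be sketched as a forward (ASAP) pass and a backward (ALAP) pass; the CP nodes are exactly those with zero mobility. Edge costs below are again assumed for illustration.

```python
w = {"n1": 5, "n2": 20, "n3": 10, "n4": 8}
# edge communication costs assumed for illustration
c = {("n1", "n2"): 1, ("n1", "n3"): 20, ("n2", "n4"): 1, ("n3", "n4"): 10}
children = {n: [v for (u, v) in c if u == n] for n in w}
parents = {n: [u for (u, v) in c if v == n] for n in w}

def asap(n):
    # earliest possible start time: all parent results must have arrived
    return max((asap(p) + w[p] + c[(p, n)] for p in parents[n]), default=0)

def blen(n):
    # heaviest comp+comm path from n down to an exit node, including w[n]
    return w[n] + max((c[(n, v)] + blen(v) for v in children[n]), default=0)

cp_len = max(blen(n) for n in w)
alap = {n: cp_len - blen(n) for n in w}            # latest start times
mobility = {n: alap[n] - asap(n) for n in w}
rel_mobility = {n: mobility[n] / w[n] for n in w}  # MD's selection attribute
cp_nodes = {n for n in w if mobility[n] == 0}      # zero mobility <=> on the CP
```

Here only n2 has nonzero mobility (45 − 27 = 18), so MD correctly identifies n1, n3 and n4 as the CP nodes, while its processor selection remains the weak point.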

3.4 The ETF Algorithm

Similar to the MCP algorithm, the ETF algorithm uses static node priorities and assumes only a bounded number of processors. However, a node with a higher priority may not necessarily get scheduled before nodes with lower priorities. This is because at each scheduling step, the ETF algorithm first computes the earliest start times of all the ready nodes and then selects the one with the smallest earliest start time. A node is ready if all its parent nodes have been scheduled. The earliest start time of a node is computed by examining the start time of the node on all processors exhaustively. When two nodes have the same earliest start time, the ETF algorithm breaks the tie by

O v3( )


scheduling the one with the higher static priority. The static node priorities can be computed based on the node levels, as in the HLFET algorithm, or the latest possible start times, as in the MCP algorithm. The complexity of the ETF algorithm is described to be O(pv²), where p is the number of processors given.

The major deficiency of the ETF algorithm is that it may not be able to reduce the partial schedule length at every scheduling step. This is because a node which has the smallest earliest start time may not necessarily lie on the CP. An adverse effect of scheduling such nodes before the CP nodes is that the earlier time slots on the processors may be occupied, and hence the CP nodes may not get scheduled in a timely manner. It is in this respect that the ETF algorithm works in a similar way to the MCP algorithm. For the task graph shown in Figure 1(a), the ETF algorithm generates the same schedule as the MCP algorithm (shown in Figure 1(b)). This is expected because both algorithms attempt to minimize the start time of a node at each step in a greedy fashion.
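An illustrative ETF loop on the Figure 1(a) graph (p = 2 processors, HLFET levels as the static tie-breaking priority, edge costs assumed):

```python
w = {"n1": 5, "n2": 20, "n3": 10, "n4": 8}
# edge communication costs assumed for illustration
c = {("n1", "n2"): 1, ("n1", "n3"): 20, ("n2", "n4"): 1, ("n3", "n4"): 10}
preds = {n: [u for (u, v) in c if v == n] for n in w}
level = {"n1": 33, "n2": 28, "n3": 18, "n4": 8}  # static priorities (HLFET)

num_pe = 2
finish, pe_of, order = {}, {}, []
pe_free = [0] * num_pe
while len(finish) < len(w):
    # earliest start time of every ready node on every processor
    cands = []
    for n in sorted(w):
        if n in finish or any(p not in finish for p in preds[n]):
            continue  # not ready yet
        for pe in range(num_pe):
            st = pe_free[pe]
            for p in preds[n]:
                st = max(st, finish[p] + (0 if pe_of[p] == pe else c[(p, n)]))
            cands.append((st, -level[n], n, pe))
    st, _, n, pe = min(cands)       # smallest EST; ties go to the higher level
    pe_of[n], finish[n] = pe, st + w[n]
    pe_free[pe] = finish[n]
    order.append(n)
makespan = max(finish.values())
```

At the second step n2 and n3 both have EST 5, so the static level breaks the tie in favour of n2 and the loop reproduces the 43-unit schedule of Figure 1(b).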

3.5 The DLS Algorithm

Similar to the MD algorithm, the DLS algorithm determines node priorities dynamically by assigning an attribute called the dynamic level (DL) to all unscheduled nodes at each scheduling step. The DL is computed from two quantities. The first quantity is the static level SL(ni) of a node ni, which is defined as the maximum sum of computation costs along a path from ni to an exit node. The second quantity is ST(ni, J), the start time of ni on processor J. The dynamic level for the node-processor pair (ni, J) is then defined as DL(ni, J) = SL(ni) − ST(ni, J). At each scheduling step, the DLS algorithm computes the DL for each ready node on every processor. Then, the node-processor pair which constitutes the largest DL among all pairs is selected, and the node is scheduled to that processor. This process is repeated until all the nodes are scheduled. The complexity of the DLS algorithm is shown to be O(v³pf(p)), where p is the number of processors given and f(p) is the complexity of the data routing algorithm used to calculate the ST of a node at each step.

The DLS algorithm does not assign priorities based on the CP. It performs an exhaustive pair matching1 of nodes to processors at each step to find the highest-priority node. The idea of the DLS algorithm is to use the composite parameter DL to select for scheduling a node with a higher static level and a smaller start time. However, it should be noted that the level of the selected node may not be the highest, and its start time may not be the earliest, among all the ready nodes. This is the subtle difference between the DLS algorithm and the ETF algorithm (note that the ETF algorithm tries to schedule a node that can start earlier and breaks ties by using the static levels). At the beginning of the scheduling process, the DLs of ready nodes are dominated by the SLs, because the ready nodes are in higher levels of the task graph and their start times are likely to be small. On the other hand, when scheduling the nodes in a lower level (say, the exit nodes), the DLs of the ready nodes are dominated by their start times on the processors, because their SLs are small whereas their start times are large. This reveals a flaw in the behaviour of the DLS algorithm. A node with a large SL may be scheduled first even though its start time is not small. This may block the early scheduling of more important nodes. For the task graph in Figure 1(a), the DLS algorithm generates the schedule shown in Figure 1(b). The DLS algorithm schedules the nodes in the same order as the MCP

1. It should be noted that lower-complexity versions of the DLS algorithm are reported in [35]. Those versions are streamlined to run faster with degraded performance. However, in our study, we have chosen the version that can give the best performance in terms of schedule lengths.


Page 8: Dynamic Critical-Path Scheduling: An Effective Technique ...€¦ · Yu-Kwong Kwok and Ishfaq Ahmad1 Department of Computer Science The Hong Kong University of Science and Technology

- 8 -

algorithm and therefore the same schedule is produced.
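The DL-based selection can be sketched as an exhaustive node-processor search (p = 2; edge costs assumed; SL values are the computation-only levels). This is an illustration of the selection rule, not the authors' implementation.

```python
w = {"n1": 5, "n2": 20, "n3": 10, "n4": 8}
# edge communication costs assumed for illustration
c = {("n1", "n2"): 1, ("n1", "n3"): 20, ("n2", "n4"): 1, ("n3", "n4"): 10}
preds = {n: [u for (u, v) in c if v == n] for n in w}
SL = {"n1": 33, "n2": 28, "n3": 18, "n4": 8}   # static levels

num_pe = 2
finish, pe_of, order = {}, {}, []
pe_free = [0] * num_pe
while len(finish) < len(w):
    best = None
    for n in sorted(w):
        if n in finish or any(p not in finish for p in preds[n]):
            continue  # not ready yet
        for pe in range(num_pe):
            st = pe_free[pe]
            for p in preds[n]:
                st = max(st, finish[p] + (0 if pe_of[p] == pe else c[(p, n)]))
            dl = SL[n] - st               # DL(ni, J) = SL(ni) - ST(ni, J)
            if best is None or dl > best[0]:
                best = (dl, n, pe, st)    # largest DL over all pairs wins
    dl, n, pe, st = best
    pe_of[n], finish[n] = pe, st + w[n]
    pe_free[pe] = finish[n]
    order.append(n)
makespan = max(finish.values())
```

At the second step DL(n2, PE 0) = 28 − 5 = 23 beats DL(n3, PE 0) = 18 − 5 = 13, so the loop again produces the order n1, n2, n3, n4 and the 43-unit schedule of Figure 1(b).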

3.6 The DSC Algorithm

The DSC algorithm is based on an attribute called the dominant sequence, which is essentially the critical path of the partially scheduled task graph at each step. At each step, the DSC algorithm checks whether the highest CP node is a ready node. If it is, the DSC algorithm schedules it to the processor that allows the minimum start time. Such a minimum start time may be achieved by “re-scheduling” some of the node’s parent nodes to the same processor. On the other hand, if the highest CP node is not a ready node, the DSC algorithm does not select it for scheduling. Instead, the DSC algorithm selects the highest node that lies on a path reaching the CP, and schedules it to the processor that allows its minimum start time, provided that such processor selection will not delay the start time of a not-yet-scheduled CP node. The delayed scheduling of the CP nodes allows the DSC algorithm to incrementally determine the next highest CP node. This strategy also leads to the low complexity of the DSC algorithm.

Although the DSC algorithm can identify the most important node at each scheduling step, it does not schedule a CP node if it is not a ready node. However, delaying the scheduling of a CP node may prevent it from occupying an earlier idle time slot in the subsequent scheduling steps. Another deficiency of the DSC algorithm is that it uses more processors than necessary, because it schedules a node to a new processor if its start time cannot be reduced by scheduling it to any processor already in use. However, it is possible to save processors by scheduling nodes to processors already in use without degrading the schedule length. The complexity of the DSC algorithm is shown to be O((e + v) log v). For the task graph in Figure 1(a), the DSC algorithm generates the schedule shown in Figure 1(c). The deficiencies mentioned above are not revealed by this example.

4 The Proposed Algorithm

In this section, we present the proposed DCP scheduling algorithm. As discussed earlier, although the six scheduling algorithms described above can produce efficient schedules, each of them has its own deficiencies. The proposed algorithm overcomes these deficiencies and has the following features.

• It assigns dynamic priorities to the nodes at each step based on the dynamic critical path (defined below) so that the schedule length can be reduced monotonically.

• It changes the schedule on each processor dynamically, in that the start times of the nodes are not fixed until all nodes have been scheduled.

• It selects a suitable processor for a node by looking ahead at the potential start time of the node’s critical child node on that processor.

• It does not exhaustively examine all processors for a node. Instead, it only considers the processors holding the nodes that communicate with this node.

• It schedules relatively unimportant nodes to the processors already in use in order not to waste processors.

In the following, we discuss some of the principles used in the design of our algorithm. In the first part of the discussion, we describe the techniques used to select a node for scheduling. In the second part, we discuss the criteria used for processor selection. We



formalize the DCP scheduling algorithm at the end of this section. Table I lists some terms, and their meanings, that will be used in the subsequent discussion.

4.1 Node Selection

As described in Definition 1, during the scheduling process, the critical path (CP) of a task graph determines the partial schedule length. Thus, the nodes on the CP have to be scheduled properly in time and space. However, as the scheduling process proceeds, the CP can change dynamically. That is, a node on the CP at one step may not be on the CP at the next step. This is because the communication cost between two nodes is considered zero if the nodes are scheduled to the same processor. In order to distinguish the CP at an intermediate scheduling step from the original CP of the task graph, we call it the dynamic critical path (DCP). To reduce the intermediate schedule length monotonically, and hence achieve a shorter final schedule length, we need to identify the unscheduled nodes on the DCP. In the following theorem, we formalize the condition for reducing the schedule length monotonically.

Theorem 1: Let SLt be the intermediate schedule length at step t of the scheduling process. If ni is the highest unscheduled node on the DCP whose start time is minimized at step t, then SLt+1 ≤ SLt.

Proof: Clearly, SL_t is equal to the length of the DCP at step t. If the start time of ni is minimized, then it cannot be greater than the sum of the computation costs and communication costs (with those between nodes scheduled to the same processor taken as zero) along the DCP from the entry node to ni. It follows that SL_{t+1} ≤ SL_t. ❏

To minimize the final schedule length, we select a node on the DCP for scheduling at each step in the proposed algorithm. In order to identify the nodes on the DCP, we use two attributes for each node: the lower bound and the upper bound on its start time. The computation of the values of these two attributes is explained in the following definitions. In our approach, the start times of nodes on a processor are not fixed until scheduling completes. Thus, in effect, the nodes are simply "clustered" together in a linear order. The first definition described below gives the lower bound on the start time of a node on a processor.

Table I: Symbols and their meanings

Symbol         Meaning
ni             The node number of a task in the parallel program task graph
w(ni)          The computation cost of node ni
cij            The communication cost of the directed edge from node ni to nj
e              The number of edges in the task graph
v              The number of nodes in the task graph
CCR            Communication-to-computation ratio
CP             A critical path of the task graph
DCP            A dynamic critical path of the task graph
DCPL           The dynamic critical path length
SL_t           The schedule length at scheduling step t
PE(ni)         The processor which contains node ni
AEST(ni, J)    The absolute earliest possible start time of ni in processor J
ALST(ni, J)    The absolute latest possible start time of ni in processor J



Definition 2: The absolute earliest start time of a node ni in a processor J, denoted by AEST(ni, J), is recursively defined as follows:

AEST(ni, J) = max_{1 ≤ k ≤ p} { AEST(n_ik, PE(n_ik)) + w(n_ik) + r(PE(n_ik), J) c_ik,i },

where ni has p parent nodes and n_ik is the k-th parent node; AEST(ni, J) = 0 if ni is an entry node; and r(PE(n_ik), J) = 1 if PE(n_ik) ≠ J and zero otherwise.

According to Definition 2, the AEST values can be computed by traversing the task graph in a breadth-first manner beginning from the entry nodes, so that when AEST(ni, J) is to be computed, the AEST values of all of ni's parent nodes are available. The AEST of ni is then simply the latest data arrival time among all its parent nodes. Note that the above definition captures the condition that the communication cost between two nodes is taken to be zero if they are in the same processor.
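For illustration, this forward pass can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the dict-based graph representation (`succ`, `w`, `c`, `proc`) is an assumption made for the example, and a Kahn-style topological traversal stands in for the breadth-first traversal described above.

```python
from collections import defaultdict, deque

def compute_aest(succ, w, c, proc):
    """Forward pass: compute AEST for every node (cf. Definition 2).

    succ: dict node -> list of children
    w:    dict node -> computation cost
    c:    dict (u, v) -> communication cost of edge u -> v
    proc: dict node -> processor; communication between nodes on the
          same processor is taken as zero
    """
    pred, indeg = defaultdict(list), defaultdict(int)
    for u, vs in succ.items():
        for v in vs:
            pred[v].append(u)
            indeg[v] += 1
    aest = {}
    ready = deque(n for n in w if indeg[n] == 0)
    while ready:
        u = ready.popleft()
        # latest data-arrival time over all parents; entry nodes start at 0
        aest[u] = max(
            (aest[p] + w[p] + (0 if proc[p] == proc[u] else c[(p, u)])
             for p in pred[u]),
            default=0)
        for v in succ.get(u, []):
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return aest
```

With all nodes on distinct processors, every edge cost counts; placing a parent and child on the same processor zeroes the edge and pulls the child's AEST in.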

Definition 3: The dynamic critical path length, denoted by DCPL, is defined as:

DCPL = max_i { AEST(ni, PE(ni)) + w(ni) }.

The value of the DCPL is simply the schedule length of the partially scheduled task graph. This is because, according to Definition 3, it is computed by taking the maximum value across all the earliest finish times. The value of the DCPL is useful in that it can be used to determine the upper bound on the start time of a node. This is described in the following definition.

Definition 4: The absolute latest start time of a node ni in a processor J, denoted by ALST(ni, J), is recursively defined as follows:

ALST(ni, J) = min_{1 ≤ m ≤ q} { ALST(n_im, PE(n_im)) − r(PE(n_im), J) c_i,im − w(ni) },

where ni has q children nodes and n_im is the m-th child node; ALST(ni, J) = DCPL − w(ni) if ni is an exit node; and r(PE(n_im), J) = 1 if PE(n_im) ≠ J and zero otherwise.

Similar to the computation of the AEST values, the ALST values can also be computed by traversing the task graph in a breadth-first manner but in the reverse direction. Note that the ALST values should be computed after the DCPL has been computed. With each node assigned an AEST and an ALST, the nodes on the DCP can be easily identified. In the following theorem, we formalize the condition for a node to be a DCP node.

Theorem 2: If AEST(ni, PE(ni)) = ALST(ni, PE(ni)), then ni is a node on the DCP.

Proof: Assume on the contrary that ni is not on the DCP. This implies that it does not lie on any path of which the sum of computation costs and communication costs equals DCPL. Consider the path with the largest sum of computation costs and communication costs, from an entry node np to an exit node nq, going through ni. Then, by the definitions of AEST and ALST, AEST(ni, PE(ni)) is equal to the sum of computation costs and communication costs from np to ni excluding w(ni), and DCPL − ALST(ni, PE(ni)) is equal to the sum of computation costs and communication costs from ni to nq. This can be deduced from the fact that the path from np to ni is the longest path from any entry node to ni, and the path from ni to nq is the longest path from ni to any exit node. Thus, AEST(ni, PE(ni)) + DCPL − ALST(ni, PE(ni)) < DCPL, which in turn implies that AEST(ni, PE(ni)) ≠ ALST(ni, PE(ni)), contradicting the hypothesis. Thus, ni is on the DCP. ❏

Based on Theorem 2, we can identify a DCP node simply by checking for equality of its AEST and ALST. In order to reduce the value of the DCPL at each scheduling step, the DCP node selected for scheduling is the one that has no unscheduled parent node on the DCP. We call this the highest node on the DCP. This gives a well-defined order of scheduling DCP nodes



so that each DCP node is examined for scheduling after its parent DCP node.
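To make the AEST/ALST machinery concrete, the following sketch computes the ALST values by a reverse topological pass, derives the DCPL, and identifies the DCP nodes via Theorem 2. It is an illustrative sketch only: the dict-based graph (`succ`, `w`, `c`, `proc`) and the precomputed `aest` map are assumed inputs, not part of the paper's formalization.

```python
from collections import defaultdict, deque

def dcpl_and_dcp_nodes(succ, w, c, proc, aest):
    """Reverse pass: compute ALST for every node (cf. Definition 4),
    the dynamic critical path length DCPL (cf. Definition 3), and the
    set of DCP nodes, i.e. nodes with AEST == ALST (Theorem 2).

    succ: dict node -> list of children; w: computation costs;
    c: dict (u, v) -> edge cost; proc: node -> processor (clusters);
    aest: precomputed AEST values from the forward pass.
    """
    nodes = list(w)
    indeg = defaultdict(int)
    for u, vs in succ.items():
        for v in vs:
            indeg[v] += 1
    # topological order via Kahn's algorithm
    order, ready = [], deque(n for n in nodes if indeg[n] == 0)
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in succ.get(u, []):
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)

    # DCPL is the maximum earliest finish time over all nodes
    dcpl = max(aest[n] + w[n] for n in nodes)

    alst = {}
    for u in reversed(order):          # children before parents
        kids = succ.get(u, [])
        if not kids:                   # exit node: ALST = DCPL - w
            alst[u] = dcpl - w[u]
        else:
            alst[u] = min(
                alst[v] - (0 if proc[u] == proc[v] else c[(u, v)])
                for v in kids) - w[u]

    dcp = {n for n in nodes if aest[n] == alst[n]}
    return dcpl, alst, dcp
```

On a small fork (a feeds b over a heavy edge and c over a light one), only a and b end up with AEST equal to ALST, so the DCP is {a, b} and c has slack.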

4.2 Processor Selection

While we are able to identify a DCP node, we still need a method to select an appropriate

processor for scheduling that node into the most suitable idle time slot. As discussed earlier, the scheduled nodes are not assigned fixed start times. Rather, they are still assigned values of AEST and ALST. The only constraint on the scheduled nodes on the same processor is that there is a total order among them which will not be affected by subsequent scheduling. The unfixed start times of the nodes allow us to insert an important node considered in later steps into an earlier time slot by adjusting the AESTs and ALSTs of the scheduled nodes on a processor. At each step, the algorithm needs to find the most suitable processor which contains the most suitable place in time for a node. We formalize below a rule governing the selection of a valid place in time within a processor for a node.

Rule I: A node ni can be inserted into a processor J, which contains the sequence of nodes {n_J1, n_J2, …, n_Jm}, if there exists some k such that

min{ ALST(ni, J) + w(ni), ALST(n_J(k+1), J) } − max{ AEST(ni, J), AEST(n_Jk, J) + w(n_Jk) } ≥ w(ni),

where k = 0, …, m, ALST(n_J(m+1), J) = ∞, and AEST(n_J0, J) + w(n_J0) = 0; provided none of the nodes in {n_J1, …, n_Jk} is a descendant node of ni and none of the nodes in {n_J(k+1), …, n_Jm} is an ancestor node of ni.

The above rule states that ni can be inserted into a processor if the processor has a sufficiently large idle time slot, possibly created by delaying the AESTs of some nodes, to accommodate it. In order not to violate the precedence constraints among nodes, a node must not be inserted in a time slot before which there is a child node scheduled, or after which there is an ancestor node scheduled. Note that since the only criterion for a node to be a candidate for scheduling is that it is the highest node on the DCP, it can happen that not all of its parent nodes have been scheduled.
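The slot test of Rule I can be sketched directly. The sketch below checks only the length inequality; the descendant/ancestor precedence checks are deliberately omitted, and the `seq`, `aest`, `alst`, and `w` structures are illustrative assumptions rather than the paper's data structures.

```python
def feasible_slots(seq, aest, alst, w, ni):
    """Return the insertion positions k (0..m) at which node ni fits
    into a processor holding the ordered node sequence seq, per the
    slot inequality of Rule I:

        min(ALST(ni)+w(ni), ALST(seq[k]))
          - max(AEST(ni), AEST(seq[k-1]) + w(seq[k-1])) >= w(ni)

    with the boundary conventions ALST(n_J(m+1)) = +inf and
    AEST(n_J0) + w(n_J0) = 0.  The descendant/ancestor precedence
    checks of Rule I are omitted for brevity.
    """
    slots, m = [], len(seq)
    for k in range(m + 1):
        slot_end = min(alst[ni] + w[ni],
                       alst[seq[k]] if k < m else float('inf'))
        slot_start = max(aest[ni],
                         aest[seq[k - 1]] + w[seq[k - 1]] if k > 0 else 0)
        if slot_end - slot_start >= w[ni]:
            slots.append(k)
    return slots
```

For a processor holding two nodes with a wide gap between them, only the middle position qualifies when the candidate's own AEST/ALST window overlaps that gap.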

After ni is inserted into a processor, the communication costs among the nodes in the processor are set to zero. In addition, to preserve the linearity, a zero-cost edge is added from the preceding node to ni and another zero-cost edge is added from ni to the succeeding node. Thus, ni's AEST and ALST can change due to the linear ordering of the nodes according to the start times within the processor. The determination of their new values is explained in the following rule.

Rule II: If a node ni is inserted into the processor J, then

AEST(ni, J) = max{ AEST(ni, J), AEST(n_Jl, J) + w(n_Jl) }

and

ALST(ni, J) = min{ ALST(ni, J) + w(ni), ALST(n_J(l+1), J) } − w(ni),

where l is a value of k satisfying Rule I.

Using Rule I and Rule II, we can determine the set of processors that can accommodate a node ni at each step. We can create an idle time slot in a processor by delaying the AESTs of the scheduled nodes to accommodate ni. However, this is not done arbitrarily in our proposed algorithm. When finding an idle time slot in a processor to accommodate a node, in order to minimize the length of the DCP, we do not delay the AESTs of the scheduled nodes in a processor if possible. That is, we first search whether there is already a large enough idle time



slot in the processor. This is because delaying the AESTs of the scheduled nodes is likely to increase the final schedule length, since the final DCP may contain previously scheduled nodes. Thus, when we consider inserting a node into a processor, we first check whether there is an idle time slot in the processor under the constraint that all nodes are bound to start at their AESTs. If there is no such time slot, we drop the constraint and search for another time slot.

Given a set of candidate processors that can accommodate a node, we need to choose the best processor. As can be noticed from the earlier discussion, the other scheduling algorithms use a very straightforward criterion: selecting the processor which gives the minimum start time for a node. Although Theorem 1 states that the schedule length decreases monotonically if we minimize the start times of nodes in the scheduling process, this strategy can potentially generate very inefficient schedules. For example, it can happen that after a node is scheduled to the processor which gives the earliest start time, the children nodes with which it communicates heavily cannot be scheduled in time on the same processor, possibly due to the lack of valid idle time slots. In our proposed algorithm, we do not employ this simple start time minimization strategy. Instead, we use a start time looking-ahead strategy, which is given in the following rule.

Rule III: Suppose that ni is being considered for scheduling. Let nc be the child node of ni which has the smallest difference between its ALST and AEST. Then, ni should be scheduled to the processor J which gives the smallest value of

AEST(ni, J) + AEST(nc, J),

where AEST(nc, J) is computed after tentatively inserting ni into J.

Using Rule III, a node may not be inserted into the processor which allows its earliest start time in the scheduling process. This happens when it is found that the start times of its children nodes on that processor would be large. Thus, by using this looking-ahead strategy for examining the start times of critical children, the proposed algorithm can avoid scheduling a node to an inappropriate processor. As a result, it avoids the danger of increasing the schedule length in subsequent steps.
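As a sketch, the lookahead choice reduces to minimizing a two-term sum. The dictionaries in the usage example below are hypothetical stand-ins for the AEST values obtained from tentative insertions; the numbers mirror the scheduling of n10 in the application example of Section 5 (Table III, step 8), where PE 1 offers the earlier start for n10 but PE 0 wins on the composite value.

```python
def pick_processor(candidates, aest_ni, aest_nc):
    """Rule III lookahead: choose the processor J minimizing
    AEST(ni, J) + AEST(nc, J), where nc is ni's most critical child
    and AEST(nc, J) is computed after tentatively inserting ni on J.
    aest_ni and aest_nc map each candidate processor to those values."""
    return min(candidates, key=lambda J: aest_ni[J] + aest_nc[J])

# Scheduling of n10 (cf. Table III, step 8): PE1 gives n10 the earliest
# start (300 vs. 320), but its critical child n14 would start at 410
# there, so PE0 wins with composite value 320 + 350 = 670.
best = pick_processor(['PE0', 'PE1', 'PE2'],
                      {'PE0': 320, 'PE1': 300, 'PE2': 320},
                      {'PE0': 350, 'PE1': 410, 'PE2': 410})
```

A pure start-time-minimization strategy would pick PE1 here; the composite criterion trades a 20-unit delay of n10 for a 60-unit earlier start of n14.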

Exhaustively examining all the processors to select a suitable one can be very time consuming when the task graph is very large (and hence the number of processors to be examined is also large). Observe that the start time of a node can only be reduced by scheduling it to a processor which holds its parent nodes. In addition, in order to reduce the start times of the node's earlier-scheduled children nodes, the processors holding such children nodes are also candidates for examination. Thus, the set of processors to be examined can be restricted to those holding the parent nodes and possibly the children nodes, together with a new processor.

It should be noted that at some scheduling step, there may not be any unscheduled node with equal values of AEST and ALST. This implies that the DCP contains only scheduled nodes and will not change in the subsequent scheduling steps. Thus, in the subsequent scheduling steps, there is no need to delay the AESTs of the scheduled nodes in a processor when considering the insertion of an unscheduled node, because making such a node start earlier will not improve the final schedule length. Consequently, we can schedule each non-DCP node to any processor which can accommodate it without increasing the DCPL. That is, we can insert the non-DCP nodes into any processor provided the schedule length is not increased.



4.3 The DCP Algorithm

The DCP algorithm is formalized in this section. It uses two procedures, Find_Slot() and Select_Processor(), which are described below.

Find_Slot(ni, J, Condition):
(1) Determine AEST(ni, J) and ALST(ni, J) on J by taking all communication costs between ni and its parent nodes and children nodes scheduled on J to be zero
(2) If Condition = DONT_PUSH, then check if there exists k satisfying Rule I without delaying the AEST of any node in processor J; otherwise, check if there exists k satisfying Rule I, possibly by delaying the AESTs of some nodes in processor J
(3) Return max{ AEST(ni, J), AEST(n_Jl, J) + w(n_Jl) } if there exists such k, with l being the smallest one; otherwise return ∞

The procedure Find_Slot() checks whether there is a valid time slot in the processor J to accommodate ni by using Rule I. In addition to a node and a processor number, Find_Slot() takes a boolean parameter Condition which indicates whether delaying the AESTs of the scheduled nodes in processor J is allowed. The complexity of Find_Slot() is O(v), since in the worst case there are O(v) nodes in a processor to be examined in order to find a valid time slot.

Select_Processor(ni, Location):
(1) If Location = On_DCP, then construct Processor_List in the order: processors holding the parent nodes of ni, processors holding the children nodes of ni, and a new processor; otherwise, construct Processor_List by including only the processors already in use
(2) Best_Processor ← NULL
(3) Best_Composite_AEST ← ∞
(4) while Processor_List is not empty do
(5)   J ← remove the first processor from Processor_List
(6)   This_AEST ← Find_Slot(ni, J, DONT_PUSH)
(7)   if This_AEST = ∞ and Location = On_DCP then
(8)     This_AEST ← Find_Slot(ni, J, PUSH)
(9)   end if
(10)  if This_AEST ≠ ∞ then
(11)    nc ← the unscheduled child node of ni with the smallest difference between its ALST and AEST
(12)    Tentatively insert ni into J
(13)    Child_AEST ← Find_Slot(nc, J, DONT_PUSH)
(14)    if Child_AEST + This_AEST < Best_Composite_AEST then
(15)      Best_Processor ← J
(16)      Best_Composite_AEST ← Child_AEST + This_AEST
(17)    end if
(18)  end if
(19) end while
(20) If ni cannot get scheduled (ni is not a DCP node), schedule it to a new processor

The procedure Select_Processor() first constructs a processor list in order to find the most suitable processor to accommodate ni. If ni is on the DCP, only the processors containing its parent nodes and children nodes are considered, because only these processors can possibly satisfy Rule III, as discussed earlier. The processors containing the parent nodes of ni are given higher priority to accommodate ni. This helps reduce the start times of other descendant nodes of ni that are examined later. If ni is not on the DCP, Select_Processor() tries all the processors already in use, because a processor which does not contain any of its parent nodes



and children nodes is also suitable, provided that it has a large enough time slot. However, no delaying is allowed for scheduling such a node, since it is undesirable to increase the start times of the scheduled nodes. In selecting the most suitable processor, a parameter composed of the node's AEST and its critical child node's AEST is used. Using this parameter avoids scheduling the node too early to a processor that cannot accommodate the node's critical child node. The complexity of Select_Processor() is O(v^2). With the procedures Find_Slot() and Select_Processor(), the DCP algorithm is formalized below.

The DCP Algorithm:
(1) Compute AEST and ALST for all nodes
(2) while not all nodes are scheduled do
(3)   ni ← the highest node with the smallest difference between its ALST and AEST; break ties by choosing the one with the smaller AEST
(4)   If ni's ALST is equal to its AEST, then call Select_Processor(ni, On_DCP); otherwise, call Select_Processor(ni, Not_On_DCP)
(5)   Update AEST and ALST for all nodes
(6) end while
(7) Make each node's start time its respective AEST

The DCP algorithm thus schedules all the DCP nodes first. It updates the AEST and ALST values dynamically after each scheduling step in order to determine the next DCP node. Finally, it assigns the actual start time of each node to be its AEST. The complexity of the algorithm is O(v^3), because there are v calls to the procedure Select_Processor().

5 An Application Example

In this section, an example task graph is used to illustrate the effectiveness of the proposed algorithm. For comparison, the schedules generated by the other six scheduling algorithms discussed earlier are also presented. The task graph used is a macro data-flow graph which represents the parallel Gaussian elimination algorithm written in an SPMD style [10], [37] and is shown in Figure 2. Note that the edges on the two CPs in this task graph are shown with thick arrows.

The schedule of the Gaussian elimination task graph generated by the EZ algorithm is shown in Figure 3(a). The EZ algorithm, as mentioned above, creates a list of edges sorted in descending order of communication costs. According to the edge list for this example, it schedules nodes in the order: n1, n7, n6, n5, n4, n3, n2, n12, n11, n10, n9, n8, n16, n15, n14, n13, n18, n17. Note that the nodes scheduled to a processor are ordered by decreasing static level. Thus, for example, although n2 is selected for scheduling earlier than n12, it is eventually scheduled as the last node in the processor because it has the smallest static level (which is equal to its computation cost). It can be seen that most of the CP nodes are not scheduled to occupy earlier time slots. Instead, relatively less important nodes, such as n5 and n6, are scheduled to occupy the important time slots. The task graph in this example is "over-clustered" by the EZ algorithm. The EZ algorithm has a tendency to pack tasks together, reducing parallelism. This is because the EZ algorithm assigns higher priorities to the nodes of the edge having the highest communication cost. These nodes, which may not necessarily be on the CP, may then be scheduled to occupy earlier time slots. It should be noted that the final schedule length may not be reduced by such an edge-zeroing action. When the more important nodes are considered at later steps, they may have to be scheduled within later idle time slots.



This effect propagates downward along the CP and can eventually lead to a longer schedule length.

The schedule of the Gaussian elimination task graph generated by the MCP algorithm is shown in Figure 3(b). The MCP algorithm schedules the task graph in the order: n1, n3, n7, n4, n9, n5, n12, n10, n6, n14, n11, n16, n15, n2, n8, n13, n17, n18. The MCP algorithm schedules nodes properly until it considers node n10. The nodes n1 to n12 are scheduled to start at the earliest possible times. However, when n10 is considered, it is found that its start time on PE 1 is 300 while its start time on PE 0 is 320. Thus, n10 is scheduled to PE 1. Consider the scheduling of the node n14. Since n10 has been scheduled to PE 1, n14 can start only at time 410 on either PE 0 or PE 1 because of the data dependencies from nodes n10 and n12. The MCP algorithm schedules it to PE 0, since it selects processors from left to right. If n10 were scheduled to PE 0 instead of PE 1, n14 could have started earlier. Thus, scheduling n10 to PE 1 is not an intelligent decision, because it delays the start times of n10's descendants. Similar to n10, the node n11 is scheduled to start at its earliest possible time on PE 2. Similarly, n16 is scheduled to PE 0, which gives the smallest start time (as n14 is scheduled to PE 0). Notice that the adverse effect of the inappropriate scheduling of n10 propagates downward. In subsequent steps, n15 is scheduled on PE 1, to which n11 has been scheduled. The nodes n2, n8, n13 and n17 are scheduled to start at the earliest times. However, it should be noted that n17 could start earlier if n16 were properly scheduled. Similar to the case of n10, the improper scheduling of n15 affects the

Figure 2: A parallel Gaussian elimination task graph.


scheduling of n18. It is apparent from this example that the MCP algorithm does not take care of the scheduling of the descendants of a node, due to its straightforward greedy start time minimization strategy.

The DLS algorithm schedules nodes in the order: n1, n3, n7, n4, n5, n6, n9, n2, n12, n10, n11, n8, n13, n14, n15, n16, n17, n18 and generates the same schedule as the MCP algorithm. Similar to the MCP algorithm, the DLS algorithm schedules nodes properly until it considers n10. The critical nodes, such as n10 and n15, are also not scheduled properly. As discussed above, the DLS algorithm schedules a node to the processor which gives the minimum start time without paying any regard to the descendant nodes. In this respect, the DLS algorithm has the same problems as the MCP algorithm, as shown in this example. The ETF algorithm also generates the same schedule as the MCP algorithm with the following order: n1, n3, n7, n4, n5, n6, n2, n9, n12, n8, n10, n11, n13, n14, n15, n16, n17, n18. The ETF algorithm selects nodes for scheduling based on start times only. For example, n2 is selected for scheduling before n9

because it has a smaller start time.

Figure 3: The schedule of the Gaussian elimination task graph generated by (a) the EZ algorithm (schedule length = 600 time units); (b) the ETF, DLS and MCP algorithms (schedule length = 520 time units).


The schedule generated by the DSC algorithm is shown in Figure 4. The schedule length is shorter than those of the MCP, DLS and ETF algorithms. The DSC algorithm schedules the nodes in the order: n1, n3, n7, n4, n9, n12, n13, n8, n6, n5, n10, n14, n16, n17, n11, n15, n18, n2. Similar to the MCP and DLS algorithms, n10 is not properly scheduled, for the same reason: the DSC algorithm tries to minimize the start time of a node at each step without considering the effect on the subsequent scheduling of the descendant nodes. As can be seen, n14 and n16 cannot start earlier due to the inappropriate scheduling of n10. When n15 is considered, it is scheduled to start at time 410 on PE 5. However, when n18 is considered, it is found that its start time can be reduced from 550 (on PE 4) to 480 (also on PE 4) by re-scheduling n15 from PE 5 to PE 0. This re-scheduling process causes the DSC algorithm to generate a better schedule than the MCP, DLS and ETF algorithms. Note that this process is not applicable to the scheduling of n14, because the DSC algorithm finds that n14's start time cannot be reduced even if n10 is re-scheduled. It can be seen that the set of nodes {n14, n15, n16, n17, n18} is scheduled to PE 0, but these nodes still cannot start at the earliest times. This is because n10 has to wait for the data from n5. Although n5 can start at its earliest time, the schedule length cannot be improved. Similar to the MCP, DLS and ETF algorithms, the straightforward start time minimization strategy of the DSC algorithm also makes it unable to reduce the start times of the CP nodes {n14, n16, n17, n18} by delaying the start times of nodes n5 and n10, even though it can avoid the problem of scheduling node n15 too early. Furthermore, the DSC algorithm wastes processors without improving the schedule length.

The schedule generated by the MD algorithm is shown in Figure 5(a). The MD algorithm generates a better schedule compared to the above five algorithms. The MD algorithm

Figure 4: The schedule of the Gaussian elimination task graph generated by the DSC algorithm (schedule length = 460 time units).


schedules nodes in the order: n1, n3, n7, n4, n9, n12, n5, n10, n14, n16, n6, n11, n15, n18, n17, n13, n8, n2. As can be seen, all the CP nodes are scheduled to the same processor, PE 0. The MD algorithm is able to avoid scheduling nodes n10 and n15 too early, so that each node in the set {n14, n16, n17, n18} can start immediately after the previous one finishes. However, it has one major problem which makes the schedule length still longer than the best possible. As mentioned in the earlier discussion, the MD algorithm scans for a suitable processor for a node from left to right. It schedules the node to a processor which has a large enough idle time slot to accommodate that node, without making any effort to minimize the start time. Notice that n10 is scheduled to PE 0 instead of PE 1 because the MD algorithm scans for a suitable processor from left to right. Thus, n10 is "accidentally" scheduled to occupy a proper idle time slot. Consequently, nodes n14 and n16 can start at their earliest possible times. Consider, for example, the scheduling of n8. As there is a large enough slot on PE 0, created by pushing nodes downward, to accommodate n8, it is scheduled to PE 0. Obviously, n8 is a relatively unimportant node compared with the nodes n4, n9, n12, n10, n14 and n16. Pushing these nodes downward leads to an inefficient schedule. The MD algorithm also accidentally schedules n15

to the proper processor so that n18 can also be properly scheduled.

The schedule generated by the DCP algorithm is shown in Figure 5(b). Let us examine the scheduling process of the DCP algorithm step by step. The AEST and ALST values of all

Figure 5: The schedules of the Gaussian elimination task graph generated by (a) the MD algorithm (schedule length = 460 time units); (b) the DCP algorithm (schedule length = 440 time units).


nodes are shown in Table II(a). As can be noticed, the nodes on the CP can be identified as those having equal values of AEST and ALST. At this step, the highest CP node, n1, is selected for scheduling. After scheduling the first three CP nodes {n1, n3, n7}, the AEST and ALST values of all nodes are as shown in Table II(b). The scheduled nodes are marked by asterisks. From this table, we can observe that the CP changes to become {n1, n4, n9, n12, n14, n16, n17}

Table II: The AEST and ALST values of the nodes in the Gaussian elimination task graph (a) before scheduling the first node, (b) after scheduling three nodes. Scheduled nodes are marked by asterisks.

(a)
Node       AEST   ALST
n1 (CP)    0      0
n2         200    980
n3 (CP)    200    200
n4         200    380
n5         200    540
n6         200    680
n7 (CP)    320    320
n8         500    990
n9 (CP)    500    500
n10        500    660
n11        500    800
n12 (CP)   610    610
n13        770    1000
n14 (CP)   770    770
n15        770    910
n16 (CP)   870    870
n17 (CP)   1010   1010
n18 (CP)   1010   1010

(b)
Node       AEST   ALST
*n1 (CP)   0      0
n2         200    800
*n3        80     100
n4 (CP)    200    200
n5         200    360
n6         200    500
*n7        120    140
n8         300    810
n9 (CP)    320    320
n10        320    480
n11        320    620
n12 (CP)   430    430
n13        590    820
n14 (CP)   590    590
n15        590    730
n16 (CP)   690    690
n17 (CP)   830    830
n18 (CP)   830    830

Table III: The scheduling steps of the DCP algorithm for the task graph in Figure 2. The four PE columns give the composite AEST values. (SL denotes schedule length; "N.R." indicates there is "no room" for the node on the processor; "N.C." indicates the processor is "not considered".)

Step  Node  Critical Child  PE0        PE1        PE2        PE3        Sch. to  SL
1     n1    n7              0 + 320    N.C.       N.C.       N.C.       PE0      1020
2     n3    n7              80 + 120   200 + 240  N.C.       N.C.       PE0      900
3     n7    n12             120 + 490  200 + 490  N.C.       N.C.       PE0      840
4     n4    n9              180 + 220  200 + 300  N.C.       N.C.       PE0      820
5     n9    n12             220 + 250  300 + 330  N.C.       N.C.       PE0      740
6     n12   n16             250 + 590  330 + 590  N.C.       N.C.       PE0      680
7     n5    n10             N.R.       200 + 300  N.C.       N.C.       PE1      680
8     n10   n14             320 + 350  300 + 410  320 + 410  N.C.       PE0      680
9     n14   n17             350 + 670  N.C.       430 + 670  N.C.       PE0      600
10    n16   n18             370 + 530  N.C.       450 + 530  N.C.       PE0      540
11    n6    n11             N.R.       N.C.       200 + 300  N.C.       PE2      540
12    n11   n15             N.C.       N.C.       300 + 410  320 + 410  PE2      520
13    n17   NIL             390 + 0    N.C.       N.C.       510 + 0    PE0      520
14    n15   n18             410 + 430  N.C.       410 + 510  410 + 510  PE0      520
15    n18   NIL             430 + 0    N.C.       N.C.       510 + 0    PE0      440
16    n13   NIL             290 + 0    410 + 0    410 + 0    N.C.       PE0      440
17    n8    NIL             N.R.       300 + 0    330 + 0    N.C.       PE1      440
18    n2    NIL             N.R.       240 + 0    240 + 0    N.C.       PE2      440


(note that the last node can also be n18). The DCP algorithm then selects n4 to be the next nodefor scheduling. It is apparent from this scenario that using the AEST and ALST values, theDCP algorithm can always select the most important node for scheduling. The schedulingsteps of the DCP algorithm are depicted in Table III. In the table, we show the node selectedfor scheduling as well as its critical child at each step. There are also four columns showingthe “composite AEST” values with the first number of each entry being the AEST of the nodeto be scheduled, whereas the second number being the AEST of its critical child. Althoughthe DCP algorithm assumes the availability of unlimited number of processors, only one newprocessor is considered at each step. This is because the AEST value of a node cannot beimproved even if more new processors are considered. One related point is that the schedulelength cannot be improved whenever a node is scheduled to a new processor. This is obviousbecause no communication cost of the node can be zeroed. Also note that only the processorsholding the parent and children nodes of a node are considered. Thus, for example, PE 1 isnot considered for scheduling of the node n14. Similarly, PE 1 and PE 2 are not considered forscheduling of the node n16. To see how the “looking ahead” processor selection strategyworks, consider the scheduling of the node n10. As can be noticed, its AEST value is thesmallest if it is scheduled to PE 1. However, since its critical child node n14 has a much largerAEST value on PE 1 than on PE 0, the node n10 is consequently scheduled to PE 0 instead ofPE 1. Finally, it should be noted that after the node n18 is scheduled, all the CP nodes havebeen scheduled and the schedule length cannot be improved any further. 
And the DCP algorithm only examines the three processors already in use (i.e., PE 0, PE 1 and PE 2) for the scheduling of the remaining three relatively unimportant nodes n13, n8 and n2 (if this is not done, the node n2 will be scheduled to a new processor, which can allow a smaller AEST value). Eventually, a schedule length of 440 time units is obtained.
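The look-ahead selection described above can be sketched as follows. This is an illustrative reading rather than the authors' code: we combine the node's AEST and its critical child's AEST on each candidate processor into a single key and pick the processor that minimizes it. The helper name `select_processor`, the callable `aest_on`, and the numeric values in the usage example are our own assumptions.

```python
def select_processor(node, critical_child, candidate_processors, aest_on):
    """Pick the processor minimizing the 'composite AEST': the AEST of the
    node plus the AEST of its critical child on that processor.
    aest_on(node, proc) returns the AEST the node would get on proc."""
    best_proc, best_key = None, None
    for proc in candidate_processors:
        key = aest_on(node, proc) + aest_on(critical_child, proc)
        if best_key is None or key < best_key:
            best_proc, best_key = proc, key
    return best_proc

# Made-up values mimicking the n10/n14 scenario: n10 starts earlier on PE 1,
# but its critical child n14 is much worse off there, so PE 0 is chosen.
aest_table = {
    ("n10", "PE0"): 120, ("n10", "PE1"): 100,
    ("n14", "PE0"): 160, ("n14", "PE1"): 300,
}
choice = select_processor("n10", "n14", ["PE0", "PE1"],
                          lambda n, p: aest_table[(n, p)])
# choice is "PE0": composite 280 beats composite 400.
```

The sum of the two AESTs is one plausible way to realize the look-ahead; the paper's table reports the two values side by side rather than committing to a single combining rule.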

6 Performance and Comparison

In this section we present a performance comparison of all seven algorithms. For this purpose, we consider a large set of task graphs as the workload for testing the algorithms. The set contains regular task graphs representing various parallel algorithms and also synthetic task graphs representing commonly encountered algorithmic structures. The generation of the workload is described in the first subsection. The performance comparison is carried out in four contexts. First, we compare the schedule lengths generated by the algorithms. Second, we present a global pair-wise comparison of all algorithms so that we can rank the algorithms by their performance. Third, we compare the number of processors used by the algorithms. Finally, we compare the average running times of these algorithms on a SUN SPARC IPX workstation.

6.1 Workload

In our study, we first considered the macro data-flow graphs for the Gaussian elimination algorithm of different sizes. We also considered the macro data-flow graphs for three other parallel algorithms: fast Fourier transform (FFT) [2], LU-decomposition [26] and Laplace equation solver [37]. These task graphs correspond to the macro data-flow graphs for the corresponding parallel algorithms written in an SPMD style for distributed-memory systems. In addition, we generated synthetic task graphs of various commonly encountered structures: in-tree, out-tree, fork-join and completely random task graphs [2]. For each category, we generated a number of graphs by varying the number of nodes and values of CCR. For the in-tree, out-tree and fork-join task graphs, edges between two levels were


randomly placed. The cost of each node was randomly selected from a normal distribution with mean equal to the specified average computation cost. The cost of each edge was also randomly selected from a normal distribution with mean equal to the product of the average computation cost and the CCR. Miniature examples of each type of graph are shown in Figure 6.

The Gaussian elimination, LU-decomposition and Laplace equation solver algorithms can be characterized by the size of the input data matrix because the number of nodes and edges in the task graph depends on the size of this matrix. For example, the task graph for the Gaussian elimination algorithm shown in Figure 2 is for a matrix of size 4. The number of nodes in the task graphs of these algorithms is roughly O(N^2), where N is the size of the matrix. For our experiments, we varied the matrix sizes so that the graph size ranged from about 20 to 200 nodes. For FFT task graphs, the graph size is roughly O(M·2^M), where M is the number of input points, which is called the order of the FFT. Again, we varied the orders so that the graph size ranged from about 20 to 200 nodes. For the in-tree, out-tree, fork-join, and completely random task graphs, we varied the number of nodes from 20 to 200 with increments of 20. For each size of the task graph, we generated six different graphs for CCR equal to 0.1, 0.5, 1.0, 2.0, 5.0 and 10.0.
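The completely-random workload construction can be sketched roughly as below. This is a guess at the generation procedure, not the authors' generator: the function name, the edge probability, and the standard deviations are our own assumptions (the paper specifies only the means of the normal distributions).

```python
import random

def make_random_task_graph(n_nodes, avg_cost, ccr, edge_prob=0.3, seed=0):
    """Sketch of a completely random task graph in the spirit of the paper's
    workload: node costs drawn from a normal distribution with mean avg_cost,
    edge costs from a normal distribution with mean avg_cost * ccr.
    Edges only go from lower- to higher-numbered nodes, so the result is a DAG.
    edge_prob and the standard deviations (mean / 4) are assumptions."""
    rng = random.Random(seed)
    node_cost = {i: max(1, round(rng.gauss(avg_cost, avg_cost / 4)))
                 for i in range(n_nodes)}
    edges = {}
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            if rng.random() < edge_prob:
                edges[(i, j)] = max(1, round(rng.gauss(avg_cost * ccr,
                                                       avg_cost * ccr / 4)))
    return node_cost, edges

nodes, edges = make_random_task_graph(30, 50, 1.0, seed=1)
```

Clamping costs to at least 1 avoids the zero or negative samples that a normal distribution can otherwise produce; how the original generator handled this is not stated in the paper.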

6.2 Comparison of Schedule Lengths

For the first comparison, we present the schedule lengths produced by each algorithm for various types of task graphs. The normalized schedule lengths (NSL) for each type of graph structure are given in the charts shown in Figure 7 to Figure 15, which were obtained by normalizing the schedule lengths produced by each algorithm to the lower bound. This lower bound was determined by taking the sum of computation costs of the nodes on the original critical path. It should be noted that the lower bound may not always be possible to achieve, and the optimal schedule length may be greater than this bound. Each of these figures also contains a ranking of the algorithms based on the observed schedule lengths for

Figure 6: Miniature examples of (a) an in-tree task graph, (b) an out-tree task graph, (c) a fork-join task graph, (d) an LU-decomposition task graph, (e) a mean value analysis task graph, (f) a Laplace equation solver task graph, and (g) an FFT task graph.


that particular task graph structure. This ranking indicates how each algorithm performed for that type of task graph.

As can be observed from Figure 7 to Figure 15, the ranking of the DLS, MCP, MD, ETF, DSC and EZ algorithms varies from graph to graph, while the DCP algorithm ranks as the best for all types of graphs. From these charts, we also observe that the values of the NSL for all algorithms show a slightly increasing trend as the task graph size increases. This is due to the fact that the proportion of nodes other than those on the CP increases, which makes it difficult to reach the lower bound. In the following, we discuss the relative performance of all the algorithms for each type of task graph.
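The normalization above is simple enough to state directly. The sketch below assumes the lower bound is supplied as the list of computation costs along the original critical path; the numbers in the example are made up, not taken from the experiments.

```python
def normalized_schedule_length(schedule_length, cp_computation_costs):
    """NSL = schedule length divided by the lower bound, where the lower
    bound is the sum of computation costs of the nodes on the original
    critical path. NSL >= 1 always, but the bound itself may be
    unattainable, so NSL = 1 is not guaranteed to be achievable."""
    return schedule_length / sum(cp_computation_costs)

# Illustrative only: a schedule of length 440 against a CP whose four nodes
# cost 100 each gives NSL = 440 / 400 = 1.1.
nsl = normalized_schedule_length(440, [100, 100, 100, 100])
```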

• Gaussian elimination graphs. For the Gaussian elimination task graphs, the CP-based algorithms (DCP, MD, MCP and DSC) have better performance (see Figure 7). This is because each Gaussian elimination task graph has only one or two dominating CPs. Thus, an efficient scheduling of nodes on the CPs can lead to good schedules. However, we found that the re-scheduling process in DSC can lead to generating very inefficient schedules, especially when the task graph is large. As EZ only attempts to reduce the communication instead of fully exploiting the parallelism, it also generates very inefficient schedules.

Figure 7: Average normalized schedule lengths (with respect to lower bounds) at various graph sizes for Gaussian elimination graphs; algorithm ranking: DCP, (MD, MCP, ETF), DLS, DSC, EZ.

Figure 8: Average normalized schedule lengths (with respect to lower bounds) at various graph sizes for Laplace equation graphs; algorithm ranking: DCP, DLS, ETF, DSC, MCP, MD, EZ.


Figure 9: Average normalized schedule lengths (with respect to lower bounds) at various graph sizes for LU-decomposition graphs; algorithm ranking: DCP, MCP, DLS, MD, ETF, DSC, EZ.

Figure 10: Average normalized schedule lengths (with respect to lower bounds) at various graph sizes for fast Fourier transform graphs; algorithm ranking: DCP, MD, MCP, ETF, DLS, EZ, DSC.

Figure 11: Average normalized schedule lengths (with respect to lower bounds) at various graph sizes for mean value analysis graphs; algorithm ranking: DCP, (ETF, DSC, MCP), (DLS, EZ), MD.


Figure 12: Average normalized schedule lengths (with respect to lower bounds) at various graph sizes for fork-join graphs; algorithm ranking: DCP, MD, ETF, MCP, DSC, DLS, EZ.

Figure 13: Average normalized schedule lengths (with respect to lower bounds) at various graph sizes for in-tree graphs; algorithm ranking: DCP, EZ, DSC, ETF, MD, MCP, DLS.

Figure 14: Average normalized schedule lengths (with respect to lower bounds) at various graph sizes for out-tree graphs; algorithm ranking: DCP, DLS, DSC, (ETF, MCP, EZ), MD.



• Laplace equation solver graphs. For the Laplace equation solver task graphs (see Figure 8), MD, MCP and DSC did not show good performance because these algorithms cannot efficiently schedule nodes on the CPs. The reason is that there are many intervening CPs in every Laplace equation solver task graph. It can be seen that DCP outperformed the other algorithms by a very large margin. This is because DCP can exploit the inherent parallelism in the Laplace equation task graphs: through its looking-ahead strategy, it avoids scheduling some of the CP nodes too early. DSC performed better than both MD and MCP because it schedules CP nodes as soon as possible by making use of more processors. On the other hand, ETF and DLS produced a similar performance.

• LU-decomposition graphs. As can be noticed from Figure 9, in the case of LU-decomposition, DCP again performed the best of all the algorithms. The ranking of the other algorithms is almost the same as in the case of the Gaussian elimination task graphs, probably because the LU-decomposition task graph also has only one CP. We can observe that DLS and MCP yielded a similar performance. Also, DSC, MD and ETF gave a similar performance that is slightly worse than DLS and MCP.

• FFT graphs. For the FFT task graphs, we can see from Figure 10 that the CP-based algorithms performed slightly better. This is due to the fact that although all paths in an FFT task graph are CPs, there is not much intervention among them. DCP and MD performed slightly better than the other CP-based algorithms, probably because both of them do not delay the scheduling of the CP nodes.

• Mean value analysis graphs. For the mean value analysis task graphs, all CP-based algorithms, except MD, performed better (see Figure 11). Similar to the FFT task graphs, all paths are CPs, but here they intervene heavily with each other. MD cannot handle this kind of situation efficiently due to its processor selection criterion. ETF, which gave a similar performance as MCP, performed the best among the non-CP-based algorithms.

• Fork-join graphs. For the fork-join task graphs, all CP-based algorithms performed considerably better than the other algorithms (see Figure 12). As discussed above, the reason is that there is only one CP in every fork-join task graph. Here, ETF once again performed the best among the non-CP-based algorithms.

Figure 15: Average normalized schedule lengths (with respect to lower bounds) at various graph sizes for completely random graphs; algorithm ranking: DCP, MD, (MCP, DLS, ETF), EZ, DSC.



• In-tree graphs. For the in-tree task graphs, with the exception of the DCP algorithm, the CP-based algorithms did not perform well (see Figure 13). This is because, although there is only one CP in each in-tree task graph, there are many inward nodes incident on the CP. The DCP algorithm can schedule different sections of the CP to different processors to avoid delaying their start times by an efficient scheduling of the inward nodes to the CP. The other CP-based algorithms do not handle these cases very well because they try to start every node on the CP as early as possible.

• Out-tree graphs. In each out-tree task graph, there is no inward node but many outward nodes emerging from the single CP. Again, the other CP-based algorithms did not perform well because they tend to schedule the only CP to one processor. The results shown in Figure 14 indicate that the final schedule length does not always depend on the CP only but also on the outward nodes from the CP.

• Completely random graphs. For completely random task graphs, the DSC and EZ algorithms performed worse in general compared with the other algorithms (see Figure 15). DSC performed considerably worse in quite a number of cases due to the fact that its re-scheduling process can severely block the subsequent scheduling of lower-level nodes on the CP.

In general, we can conclude from the above observations that when a task graph has only a few intervening CPs, the CP-based algorithms can perform better. On the other hand, if the task graph contains many CPs, the CP-based algorithms can be “confused” by a particular CP, in that they attempt to start all nodes on that CP as early as possible without noting that the nodes on other intervening CPs are delayed. The DCP algorithm tackles this drawback because it always performs a looking-ahead processor selection, so it can avoid being confused by a particular CP.

6.3 A Global Comparison

In order to rank all the algorithms in terms of schedule lengths, we made a global comparison in which we observed the number of times each algorithm performed better, worse or the same compared to each of the other six algorithms. This comparison is given in graphical form in Figure 16. Here, each box compares two algorithms: the algorithm on the left side and the algorithm on the top. Each comparison is based on a total of 966 task graphs, which were generated by using the combination of all of the graph structures mentioned above with various numbers of nodes and CCRs. Each box contains three numbers preceded by ‘>’, ‘<’ and ‘=’ signs, which indicate the number of times the algorithm on the left performed better, worse, and the same, respectively, compared to the algorithm shown on the top. For example, the DCP algorithm performed better than the MD algorithm in 699 cases, performed worse in 73 cases and performed the same in 194 cases. Similarly, the DSC algorithm performed better than the DLS algorithm in 362 cases, performed worse in 427 cases and performed the same in 177 cases. An additional box for each algorithm compares that algorithm with all other algorithms combined. We can notice that the proposed DCP algorithm outperformed all the other algorithms. Furthermore, the numbers given in Figure 16 indicate that the difference between the performance of DCP and the other algorithms was much higher than the differences among the other six algorithms when compared amongst each other. Based on these experiments, we can rank all seven algorithms in the following order: DCP, MCP, DLS, ETF, MD, DSC and EZ. It should be noted that ETF


gave a performance close to DLS even though the complexity of DLS is higher.
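A tally of this kind can be sketched as follows; `pairwise_comparison` is our own illustrative helper, not the procedure the authors used, and the sample schedule lengths are invented.

```python
from itertools import combinations

def pairwise_comparison(results):
    """Build Figure-16-style counts from per-graph schedule lengths.
    results: dict mapping algorithm name -> list of schedule lengths,
    one per test graph, in the same graph order for every algorithm.
    Returns {(a, b): (better, worse, equal)}, read as 'a versus b':
    a shorter schedule length counts as 'better'."""
    counts = {}
    for a, b in combinations(results, 2):
        pairs = list(zip(results[a], results[b]))
        better = sum(x < y for x, y in pairs)
        worse = sum(x > y for x, y in pairs)
        equal = len(pairs) - better - worse
        counts[(a, b)] = (better, worse, equal)
    return counts

# Invented lengths for three graphs: DCP wins one, loses one, ties one.
results = {"DCP": [10, 12, 9], "MCP": [11, 12, 8]}
counts = pairwise_comparison(results)
```

Note that the three counts in each cell always sum to the number of test graphs, which is a useful sanity check on tables like Figure 16 (each cell there sums to 966).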

6.4 Number of Processors

Another measure of quality for a scheduling algorithm is the number of processors used, because each algorithm “spends” processors in a different way.¹ Figure 17 shows the average number of processors used by each algorithm for different graph sizes. These averages were taken across all types of task graphs and values of CCR. We observe that DSC used a considerably larger number of processors compared to the other algorithms, while DLS and MCP used approximately the same number of processors. Here, MD outperformed all the other algorithms, while DCP used slightly more processors than MD. However, this is due to a deficiency of MD: it tries to cluster tasks on fewer processors. As a result, the schedules generated by MD are not very well load balanced. On the other hand, DCP overcomes this deficiency of MD and produces better schedule lengths by performing some load balancing at the expense of a few more processors.

6.5 Comparison of Running Times

Finally, we compare the running times of these algorithms, which are given in Figure 18. From this figure, we can immediately notice that DLS is slower than the other algorithms. It should be noted that the version of DLS used by us was the one that generates the best solution but has a higher complexity. Both DSC and MCP are low-complexity algorithms. However, they do not always produce short schedule lengths. The running times of DCP were comparable to those of MD but higher than those of DSC and MCP.

1. Although MCP, DLS and ETF assume a limited number of processors, they are given a very large number of processors in our experiments so that this parameter is not a limiting factor to their performance.

Figure 16: A global comparison of the seven algorithms in terms of better, worse and equal performance. Each cell gives the counts for the row algorithm against the column algorithm (>better <worse =equal) over the 966 test graphs; the ALL column compares the row algorithm with all other algorithms combined.

        MD             MCP            DSC            EZ             DLS            ETF            ALL
DCP   >699 <73 =194  >602 <129 =235 >709 <89 =168  >896 <37 =33   >620 <106 =240 >630 <91 =245  >4156 <525 =1115
MD                   >294 <506 =166 >476 <402 =88  >693 <195 =78  >304 <496 =166 >310 <486 =170 >2150 <2784 =862
MCP                                 >535 <312 =119 >807 <127 =32  >199 <183 =584 >205 <171 =590 >2381 <1689 =1726
DSC                                                >719 <213 =34  >362 <427 =177 >364 <422 =180 >2248 <2782 =766
DLS                                                >796 <137 =33                 >190 <170 =606 >2198 <1792 =1806
EZ                                                                               >140 <792 =34  >849 <4703 =244
ETF                                                                                             >2132 <1839 =1825


However, the running times of DCP were admissible. Note that the algorithm ranking shown in Figure 18 is consistent with the given complexities of these algorithms.

7 Conclusions

In this paper, we have presented a new scheduling algorithm which outperforms six other algorithms. The difference between the performance of our algorithm and the other algorithms is also much higher than the differences among the other algorithms when compared against each other. The proposed algorithm works well on various types of graph structures. The number of processors used and the running time of the proposed algorithm make it a viable choice for static compile-time scheduling of macro-dataflow graphs and other task graphs onto multiprocessors. The proposed algorithm in its present form assumes a network of fully-connected processors but can be generalized to other networks such as hypercubes, meshes, etc. In order to accomplish that, the procedure for computing the start times of nodes on the processors will need to be modified to take into account the hop distances of the processors holding the parent nodes.
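One simple way to incorporate hop distances, as suggested above, is to scale each edge's communication cost by the number of hops between the two processors. The linear cost model, the function names, and the mesh indexing below are our own assumptions for illustration; the paper does not specify a particular model.

```python
def mesh_hops(p, q, width):
    """Manhattan hop distance between processor indices p and q laid out
    row-major on a mesh of the given width (an assumed topology)."""
    px, py = p % width, p // width
    qx, qy = q % width, q // width
    return abs(px - qx) + abs(py - qy)

def comm_cost_with_hops(edge_cost, hops, per_hop_factor=1.0):
    """Communication cost of an edge when its endpoints are hops apart.
    Zero hops means both nodes sit on the same processor, so the cost is
    zeroed, exactly as in the fully-connected model; otherwise the cost
    grows linearly with the extra hops (a simple assumed model)."""
    if hops == 0:
        return 0
    return edge_cost * (1 + per_hop_factor * (hops - 1))

# On a 4-wide mesh, processors 0 (0,0) and 5 (1,1) are 2 hops apart,
# so an edge of cost 10 costs 20 under the linear model.
cost = comm_cost_with_hops(10, mesh_hops(0, 5, 4))
```

Under this model, the start-time computation would simply substitute `comm_cost_with_hops` for the flat edge cost when evaluating each candidate processor.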

Figure 17: Average number of processors used by each algorithm; algorithm ranking: MD, DCP, MCP, ETF, DLS, EZ, DSC.

Figure 18: Average running time for each algorithm; algorithm ranking: DSC, ETF, MCP, EZ, DCP, MD, DLS.


Acknowledgements

We would like to thank Min-You Wu for his help in understanding the MD algorithm. We are also very thankful to the referees, whose constructive comments and suggestions have greatly improved the quality of this paper.

References

[1] T.L. Adam, K. Chandy and J. Dickson, “A Comparison of List Scheduling for Parallel Processing Systems,” Communications of the ACM, vol. 17, no. 12, pp. 685-690, Dec. 1974.

[2] V.A.F. Almeida, I.M. Vasconcelos, J.N.C. Arabe and D.A. Menasce, “Using Random Task Graphs to Investigate the Potential Benefits of Heterogeneity in Parallel Systems,” Proc. of Supercomputing ’92, pp. 683-691, 1992.

[3] M.A. Al-Mouhamed, “Lower Bound on the Number of Processors and Time for Scheduling Precedence Graphs with Communication Costs,” IEEE Trans. on Software Engineering, vol. 16, no. 12, pp. 1390-1401, 1990.

[4] F.D. Anger, J.J. Hwang and Y.C. Chow, “Scheduling with Sufficient Loosely Coupled Processors,” Journal of Parallel and Distributed Computing, vol. 9, pp. 87-92, 1990.

[5] A.F. Bashir, V. Susarla and K. Vairavan, “A Statistical Study of the Performance of a Task Scheduling Algorithm,” IEEE Trans. on Computers, vol. C-32, pp. 774-777, Dec. 1975.

[6] S. Bokhari, “A Shortest Tree Algorithm for Optimal Assignments Across Space and Time in Distributed Processor Systems,” IEEE Trans. on Software Engineering, vol. SE-7, no. 6, Nov. 1981.

[7] J. Bruno, E.G. Coffman and R. Sethi, “Scheduling Independent Tasks to Reduce Mean Finishing Time,” Communications of the ACM, vol. 17, no. 7, pp. 382-387, Jul. 1974.

[8] V. Chaudhary and J.K. Aggarwal, “A Generalized Scheme for Mapping Parallel Algorithms,” IEEE Trans. on Parallel and Distributed Systems, vol. 4, no. 3, pp. 328-346, Mar. 1993.

[9] E.G. Coffman, Computer and Job-Shop Scheduling Theory, Wiley, New York, 1976.

[10] M. Cosnard, M. Marrakchi, Y. Robert and D. Trystram, “Parallel Gaussian Elimination on an MIMD Computer,” Parallel Computing, vol. 6, pp. 275-296, 1988.

[11] E.B. Fernandez and B. Bussell, “Bounds on the Number of Processors and Time for Multiprocessor Optimal Schedules,” IEEE Trans. on Computers, vol. C-22, no. 8, pp. 745-751, Aug. 1973.

[12] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman and Company, 1979.

[13] A. Gerasoulis and T. Yang, “A Comparison of Clustering Heuristics for Scheduling DAG’s on Multiprocessors,” Journal of Parallel and Distributed Computing, vol. 16, no. 4, pp. 276-291, Dec. 1992.

[14] M.J. Gonzalez, “Deterministic Processor Scheduling,” ACM Computing Surveys, vol. 9, no. 3, pp. 173-204, Sep. 1977.

[15] R.L. Graham, E.L. Lawler, J.K. Lenstra and A.H.G. Rinnooy Kan, “Optimization and Approximation in Deterministic Sequencing and Scheduling: A Survey,” Annals of Discrete Mathematics, no. 5, pp. 287-326, 1979.

[16] J.J. Hwang, Y.C. Chow, F.D. Anger and C.Y. Lee, “Scheduling Precedence Graphs in Systems with Interprocessor Communication Times,” SIAM Journal of Computing, vol. 18, no. 2, pp. 244-257, Apr. 1989.

[17] D.S. Hochbaum and D.B. Shmoys, “Using Dual Approximation Algorithms for Scheduling Problems: Theoretical and Practical Results,” Journal of the ACM, vol. 34, no. 1, pp. 144-162, Jan. 1987.

[18] T.C. Hu, “Parallel Sequencing and Assembly Line Problems,” Operations Research, vol. 9, no. 6, pp. 841-848, Nov. 1961.

[19] H. Kasahara and S. Narita, “Practical Multiprocessor Scheduling Algorithms for Efficient Parallel Processing,” IEEE Trans. on Computers, vol. C-33, pp. 1023-1029, Nov. 1984.

[20] S.J. Kim and J.C. Browne, “A General Approach to Mapping of Parallel Computation upon Multiprocessor Architectures,” Proc. International Conference on Parallel Processing, pp. 1-8, 1988.

[21] W.H. Kohler and K. Steiglitz, “Characterization and Theoretical Comparison of Branch-and-Bound Algorithms for Permutation Problems,” Journal of the ACM, vol. 21, no. 1, pp. 140-156, Jan. 1974.

[22] W.H. Kohler, “A Preliminary Evaluation of the Critical Path Method for Scheduling Tasks on Multiprocessor Systems,” IEEE Trans. on Computers, vol. C-24, pp. 1235-1238, Dec. 1975.

[23] B. Lee, A.R. Hurson and T.Y. Feng, “A Vertically Layered Allocation Scheme for Data Flow Systems,” Journal of Parallel and Distributed Computing, vol. 11, pp. 175-187, 1991.

[24] S.Y. Lee and J.K. Aggarwal, “A Mapping Strategy for Parallel Processing,” IEEE Trans. on Computers, vol. C-36, pp. 433-442, Apr. 1987.

[25] T.G. Lewis and H. El-Rewini, Introduction to Parallel Computing, Prentice-Hall, New York, 1992.

[26] R.E. Lord, J.S. Kowalik and S.P. Kumar, “Solving Linear Algebraic Equations on an MIMD Computer,” Journal of the ACM, vol. 30, no. 1, pp. 103-117, Jan. 1983.

[27] C. McCreary and H. Gill, “Automatic Determination of Grain Size for Efficient Parallel Processing,” Communications of the ACM, vol. 32, pp. 1073-1078, Sep. 1989.

[28] J.M. Ortega, Introduction to Parallel and Vector Solution of Linear Systems, Plenum, New York, 1988.

[29] C. Papadimitriou and M. Yannakakis, “Towards an Architecture-Independent Analysis of Parallel Algorithms,” SIAM Journal of Computing, vol. 19, pp. 322-328, 1990.

[30] C.V. Ramamoorthy, K.M. Chandy and M.J. Gonzalez, “Optimal Scheduling Strategies in a Multiprocessor System,” IEEE Trans. on Computers, vol. C-21, no. 2, pp. 137-146, Feb. 1972.

[31] H. El-Rewini and T.G. Lewis, “Scheduling Parallel Programs onto Arbitrary Target Machines,” Journal of Parallel and Distributed Computing, vol. 9, no. 2, pp. 138-153, Jun. 1990.

[32] V. Sarkar, Partitioning and Scheduling Parallel Programs for Multiprocessors, MIT Press, Cambridge, MA, 1989.

[33] R. Sethi, “Scheduling Graphs on Two Processors,” SIAM Journal of Computing, vol. 5, no. 1, pp. 73-82, Mar. 1976.

[34] B. Shirazi, M. Wang and G. Pathak, “Analysis and Evaluation of Heuristic Methods for Static Scheduling,” Journal of Parallel and Distributed Computing, no. 10, pp. 222-232, 1990.

[35] G.C. Sih and E.A. Lee, “A Compile-Time Scheduling Heuristic for Interconnection-Constrained Heterogeneous Processor Architectures,” IEEE Trans. on Parallel and Distributed Systems, vol. 4, no. 2, pp. 175-187, Feb. 1993.

[36] T. Yang and A. Gerasoulis, “DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors,” IEEE Trans. on Parallel and Distributed Systems, vol. 5, no. 9, Sep. 1994.

[37] M.Y. Wu and D.D. Gajski, “Hypertool: A Programming Aid for Message-Passing Systems,” IEEE Trans. on Parallel and Distributed Systems, vol. 1, no. 3, pp. 330-343, Jul. 1990.

