


J Supercomput
DOI 10.1007/s11227-013-1022-8

A new parallel algorithm for vertex priorities of data flow acyclic digraphs

Zeyao Mo · Aiqing Zhang · Zhang Yang

© Springer Science+Business Media New York 2013

Abstract  Data flow acyclic directed graphs (digraphs) are widely used to describe the data dependency of mesh-based scientific computing. The parallel execution of such digraphs can approximately depict the flowchart of parallel computing. During parallel execution, vertex priorities are key performance factors. This paper first takes the distributed digraph and its resource-constrained parallel scheduling as the vertex priorities model, and then presents a new parallel algorithm for the solution of vertex priorities using the well-known technique of forward–backward iterations. In particular, a more efficient vertex ranking strategy is proposed for each iteration. In the case of simple digraphs, both theoretical analysis and benchmarks show that the vertex priorities produced by this algorithm make the digraph scheduling time converge non-increasingly with the number of iterations. For non-simple digraphs, benchmarks also show that the new algorithm is superior to many traditional approaches. Embedding the new algorithm into the heuristic framework for the parallel sweeping solution of neutron transport applications, the new vertex priorities improve the performance by about 20 % as the number of processors scales from 32 to 2048.

Keywords  Acyclic digraph · Parallel algorithm · Neutron transport

1 Introduction

Data flow acyclic directed graphs (digraphs) [9] are usually used to describe the data dependency for a wide range of mesh-based scientific computing. Each of these digraphs consists of weighted vertices and arcs: each vertex often refers to a mesh cell, and its weight often represents the workload; each arc often depicts the data dependency between two neighboring cells, and its weight often represents the dependency overhead.

Z. Mo (B) · A. Zhang · Z. Yang
Laboratory of Computational Physics, Institute of Applied Physics and Computational Mathematics, P.O. Box 8009, Beijing 100088, China
e-mail: [email protected]

The parallel sweeping solvers are the numerical kernel for the seven-dimensional radiation or neutron transport equations [20] when the discrete ordinates methods (Sn) are used. They are typical mesh-based scientific computing applications whose data dependency is suitable for digraph description. Baker et al. [2] and Koch et al. [19] addressed these solvers on rectangular meshes on earlier massively parallel computers, Plimpton et al. [29] and Pautz et al. [27] extended this research to unstructured meshes, Mo et al. [33] supplemented these works for cylindrical coordinate systems, and recently Pautz et al. [28] presented another heuristic method to improve the inherent parallelism for long-characteristics Sn discretizations. Besides these parallel sweeping solvers, many other mesh-based applications exist that are suitable for digraph description, for example, the parallel downstream relaxation for the direct solution of upper or lower sparse triangular linear systems arising from the discretization of convection-dominated problems [4, 11, 13], the well-known ILU factorization [12], the dense matrix LU factorization and its multi-threaded versions [3], the patch-based structured mesh AMR simulations [22, 25] and their multi-threaded versions [23], and so on.

The flowchart of parallel computing for the above mesh-based scientific computing applications can be approximately depicted by the parallel execution of the associated digraphs. Nevertheless, the solution for the minimal execution time of the digraphs is still NP-hard [18]. Mo et al. [32] presented a heuristic framework consisting of three components. The first is the partitioning method assigning digraph vertices across processors, the second is the parallel sweeping solver for the execution of the distributed digraph, and the third is the vertex priorities strategy deciding which vertex should be executed when many vertices are executable in each processor. For a given distributed digraph, the vertex priorities approach is the most crucial for parallel efficiency.

There are two types of approaches for the calculation of vertex priorities: local and global. The local approaches only use the data flow locally in each processor, while the global approaches use the data flow of the digraph across processors. The First-In-First-Out (FIFO) strategy, the Geometrical Coordinates KBA strategy [29], the Shortest processor-Boundary Path strategy (SBP) [32], and the Sweeping Direction Upwind strategy [33] are typical local approaches. The Largest End Time strategy [7], the Latest Start Time strategy (LST) [17], the Least Relaxation Time strategy [7], the Maximal Number of Successors strategy [1], the Hybrid strategies [5, 30], the Sampling strategies [6, 8], and the Depth First sweeping strategies (DFHDS) [27] are typical global approaches. Usually, the local approaches are cheaper but less efficient; the global approaches are favorable when the vertex priorities are reusable.

Generally, most of the above vertex priorities approaches can be depicted by the well-known resource-constrained scheduling models widely used for many digraph-based projects or networks [9, 15, 16], except that the constrained resources refer to the number of available processors. Each of these models produces a parallel scheduling for which each vertex has a start and an end time for execution, and the parallel execution time of the scheduling is equal to the difference between the maximal vertex end time and the minimal vertex start time. Taking the start or the


A simple digraph is a digraph whose vertices all have equal weights and whose arcs all have zero weight. Simple digraphs often arise in mesh-based scientific computing. For example, the vertex weights are usually equal to each other because the computational formulae are similar across mesh cells, and the data transfer overheads between neighboring vertices are negligible provided that each vertex has enough computational workload.

A distributed digraph consists of many non-overlapping sub-digraphs, each assigned to a processor. A vertex is local to a sub-digraph if and only if it belongs to this sub-digraph; a vertex is local to another vertex if and only if they belong to the same sub-digraph. An arc is a local arc if and only if both its head and tail belong to the same sub-digraph; otherwise, it is a cut arc. Usually, a sub-digraph includes all its local vertices and all arcs whose head or tail is local. Let R = (r_1, r_2, ..., r_n) be the mapping vector whose element r_i (1 ≤ r_i ≤ P) is the rank of the processor owning vertex v_i, where P is the number of processors.

Given a distributed digraph with P sub-digraphs, the vertex priorities approaches introduced in the first section are equivalent to the solution of the following resource-constrained scheduling model:

$$
\min\,[f]
$$
subject to
$$
f_i + w_{i,j} \le f_j - q_j, \quad i \in D_j,\ j = 1,\dots,n, \qquad
|A_k(t)| \le 1, \quad k = 1,\dots,P,\ \forall t
\tag{1}
$$

Here, f = (f_1, f_2, ..., f_n) is a scheduling represented by a vector of vertex end times, n is the number of vertices, P is the number of processors, and D_j is the set of heads of vertex v_j;

$$
\Pi = \max_{1 \le j \le n} \{f_j\}, \qquad \Lambda = \min_{1 \le j \le n} \{f_j - q_j\}, \qquad [f] = \Pi - \Lambda
\tag{2}
$$

denote the end time, start time, and execution time of the digraph, respectively; and

$$
A_k(t) = \{v_j : f_j - q_j \le t < f_j\} \cap \{v_j : r_j = k\}
\tag{3}
$$

is the set of vertices executing on processor k at time t. Obviously, if each vertex v_i is assigned the priority f_i, then the digraph has the final execution time [f].

Two constraints are proposed in Eq. (1). The first is the sequence constraint: a vertex must not execute until all its predecessors have finished. The second is the resource constraint: at most one vertex executes at a time in each processor. A scheduling is feasible if and only if these constraints are satisfied. A scheduling is optimal if and only if [f] is minimal.
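To make the model in Eqs. (1)–(3) concrete, the sketch below checks both constraints for a given end-time vector f and computes the execution time [f]. It is only a minimal serial illustration; the data layout (plain dictionaries keyed by vertex index) and the function name are assumptions, not part of the paper.

```python
def schedule_span_and_feasibility(f, q, w, preds, rank, eps=1e-9):
    """f[i]: end time, q[i]: vertex weight, w[(i, j)]: arc weight,
    preds[j]: predecessor indices of vertex j, rank[i]: owning processor."""
    # Sequence constraint: f_i + w_{i,j} <= f_j - q_j for every arc (i, j).
    for j, Dj in preds.items():
        for i in Dj:
            if f[i] + w[(i, j)] > f[j] - q[j] + eps:
                return None, False
    # Resource constraint: execution intervals [f_i - q_i, f_i) on the
    # same processor must not overlap (|A_k(t)| <= 1 for all t).
    by_proc = {}
    for i in f:
        by_proc.setdefault(rank[i], []).append((f[i] - q[i], f[i]))
    for intervals in by_proc.values():
        intervals.sort()
        for (s0, e0), (s1, e1) in zip(intervals, intervals[1:]):
            if s1 < e0 - eps:
                return None, False
    # Execution time [f] = Pi - Lambda, as in Eq. (2).
    Pi = max(f.values())
    Lam = min(f[i] - q[i] for i in f)
    return Pi - Lam, True
```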

3 Parallel forward–backward iterations

Li et al. [21] presented a serial technique of forward–backward iterations (FB) to reduce the execution time of a feasible scheduling for projects or networks. Here, we consider its parallel version for a distributed digraph in Algorithm 3.1. Different from the serial forward–backward iterations, a new ranking strategy is introduced that satisfies the sequence constraint of Eq. (1).

Algorithm 3.1  PFB(G, r, q, w, α, β, f^0, M_its, ε, f)

INPUT:
    G     : local sub-digraph of the digraph;
    r     : processor mapping vector for all vertices of the digraph;
    q     : weight vector of local vertices;
    w     : weight matrix of both local arcs and cut arcs;
    α     : forward ranks of local vertices, a smaller rank has higher priority;
    β     : backward ranks of local vertices, a larger rank has higher priority;
    f^0   : initial scheduling of local vertices;
    M_its : maximal number of forward–backward iterations;
    ε     : convergence error threshold.
OUTPUT:
    f     : final scheduling of local vertices.
BEGIN
    h = 0, m = number of local vertices.
    Λ^0 = start time of the initial scheduling, Π^0 = end time of the initial scheduling.
    DO in PARALLEL {
        (1) Execute a backward iteration for schedule f^{h+1/2} from f^h.
            (1.1) Compute ranks for local vertices from f^h and sort.
                (1.1.1) Compute β from f^h using a ranking strategy as discussed in the next section.
                (1.1.2) Order the local vertex indices {i_g}_{g=1,...,m} such that
                        (β_{i_1}, f^h_{i_1}) ≥ (β_{i_2}, f^h_{i_2}) ≥ ··· ≥ (β_{i_m}, f^h_{i_m}).
                        Here, (a_1, b_1) ≥ (a_2, b_2) means (a_1 > a_2) || ((a_1 = a_2) & (b_1 ≥ b_2)).
            (1.2) Π^{h+1/2} = Π^h, I = (−∞, Π^{h+1/2})  // initial available intervals.
            (1.3) FOR each vertex i_g : g = 1, 2, ..., m DO {
                (1.3.1) Let t_{i_g} = f^h_{i_g} and go to step (1.3.4) if v_{i_g} is a sink.
                (1.3.2) Receive every s^{h+1/2}_j from each remote tail v_j.
                (1.3.3) Compute t_{i_g} = min_{(v_{i_g}, v_l) ∈ E} { s^{h+1/2}_l − w_{i_g,l} }.
                (1.3.4) Update f^{h+1/2}_{i_g} = max{ t : (t ≤ t_{i_g}) & ((t − q_{i_g}, t) ⊆ I) }.
                (1.3.5) Update s^{h+1/2}_{i_g} = f^{h+1/2}_{i_g} − q_{i_g}.
                (1.3.6) Update the available intervals: I = I \ (s^{h+1/2}_{i_g}, f^{h+1/2}_{i_g}).
                (1.3.7) Send s^{h+1/2}_{i_g} to each processor owning heads of v_{i_g}.
            } END FOR g = 1, 2, ..., m.
            (1.4) Synchronize the earliest start time across all processors:
                  Λ^{h+1/2} = min{ s^{h+1/2}_l : v_l ∈ V }.
        (2) Execute a forward iteration for schedule f^{h+1} from f^{h+1/2}.
            (2.1) Compute ranks for local vertices using f^{h+1/2} and sort.
                (2.1.1) Compute α from f^{h+1/2} using a ranking strategy as discussed in the next section.
                (2.1.2) Order the local vertex indices {i_g}_{g=1,...,m} such that
                        (α_{i_1}, s^{h+1/2}_{i_1}) ≤ (α_{i_2}, s^{h+1/2}_{i_2}) ≤ ··· ≤ (α_{i_m}, s^{h+1/2}_{i_m}).
            (2.2) Λ^{h+1} = Λ^{h+1/2}, I = (Λ^{h+1}, +∞)  // initial available intervals.
            (2.3) FOR each vertex i_g : g = 1, 2, ..., m DO {
                (2.3.1) Let t_{i_g} = s^{h+1/2}_{i_g} and go to step (2.3.4) if v_{i_g} is a source.
                (2.3.2) Receive every f^{h+1}_l from each remote head v_l.
                (2.3.3) Compute t_{i_g} = max_{(v_j, v_{i_g}) ∈ E} { f^{h+1}_j + w_{j,i_g} }.
                (2.3.4) Update s^{h+1}_{i_g} = min{ t : (t ≥ t_{i_g}) & ([t, t + q_{i_g}) ⊆ I) }.
                (2.3.5) Update f^{h+1}_{i_g} = s^{h+1}_{i_g} + q_{i_g}.
                (2.3.6) Update the available intervals: I = I \ (s^{h+1}_{i_g}, f^{h+1}_{i_g}).
                (2.3.7) Send f^{h+1}_{i_g} to each processor owning tails of v_{i_g}.
            } END FOR g = 1, 2, ..., m.
            (2.4) Synchronize the latest end time across all processors:
                  Π^{h+1} = max{ f^{h+1}_j : v_j ∈ V }.
        (3) h = h + 1.
    } UNTIL (h > M_its or |(Π^{h−1} − Λ^{h−1}) − (Π^h − Λ^h)| < ε).
END

Remark 3.1  The sequence of vertices v_{i_g} (g = 1, 2, ..., m) produced in step (1.1.2) or step (2.1.2) satisfies the sequence constraint of Eq. (1).

Remark 3.2  In step (1.3.4), t_{i_g} is the latest end time satisfying the sequence constraint, and f^{h+1/2}_{i_g} is the latest end time satisfying both constraints.

Remark 3.3  In step (2.3.4), t_{i_g} is the earliest start time satisfying the sequence constraint, and s^{h+1}_{i_g} is the earliest start time satisfying both constraints.

Remark 3.4  The last row of Algorithm 3.1 is the termination condition. If the sequence [f^h] converges non-increasingly with h, the final output is the solution; otherwise, we should take the output having the shortest execution time in the solution history.
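The interval bookkeeping behind steps (1.3.4)–(1.3.6) and (2.3.4)–(2.3.6) can be sketched as follows for the forward direction: keep the available set I as a sorted list of free slots, find the earliest slot that can hold [t, t + q_{i_g}), and remove the chosen execution interval from I. The list-based representation and the helper names are illustrative assumptions rather than the authors' implementation.

```python
import math

def earliest_fit(free, t_min, q):
    """Return the earliest start s >= t_min such that [s, s+q) lies in one
    free slot, as in step (2.3.4); `free` is a sorted list of (lo, hi) slots."""
    for lo, hi in free:
        s = max(lo, t_min)
        if s + q <= hi:
            return s
    raise RuntimeError("no available interval (cannot happen while hi = +inf)")

def reserve(free, s, q):
    """Remove [s, s+q) from the free slots, as in step (2.3.6)."""
    out = []
    for lo, hi in free:
        if s + q <= lo or hi <= s:          # slot untouched by the reservation
            out.append((lo, hi))
        else:                               # split the slot around [s, s+q)
            if lo < s:
                out.append((lo, s))
            if s + q < hi:
                out.append((s + q, hi))
    free[:] = out

# Usage: a forward iteration starts with the single slot (Lambda^{h+1}, +inf).
free = [(0.0, math.inf)]
s = earliest_fit(free, t_min=2.0, q=1.0)    # earliest feasible start time
reserve(free, s, 1.0)                       # f = s + q is the new end time
```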


4 Vertex ranking strategies

In step (1.1) and step (2.1), the vertex ranking strategies are crucial not only for the convergence but also for the quality of the sequence [f^h]. Li et al. [21] assign each vertex the rank of its end time after the last backward iteration and the rank of its start time after the last forward iteration, Ozdamar et al. [26] present a similar but more complex strategy, and Mo et al. [32] use the Shortest processor-Boundary Path strategy (SBP), for which vertices are ranked by their distance from the sub-digraph boundaries. The local approaches for vertex priorities introduced in the first section can also provide rank sequences. However, no matter what strategies are used, one problem remains open: whether and how [f^h] converges.

In the remainder of this section, a new vertex ranking strategy called the Cut Arc Preference strategy (CAP) is proposed. It not only makes the above open problem answerable but also gives better execution time. Moreover, its computational complexity is similar to that of other local approaches for vertex priorities in the literature, such as SBP. This strategy mainly differs from traditional strategies in the viewpoint that cut arcs should be preferentially executed, since their earlier finish indicates larger inherent parallelism.

In step (2.1.1), after a backward iteration, we rank a vertex by the earliest start time of all its downstream cut arcs:

$$
\alpha_i = \min\Bigl(\bigl\{\, s^{h+1/2}_j - w_{k,j} : (v_k, v_j) \in S_i \,\bigr\} \cup \{+\infty\}\Bigr)
\tag{4}
$$

Here, S_i is the set of cut arcs whose heads are the successors of vertex v_i, w_{k,j} is the weight of cut arc (v_k, v_j), and s^{h+1/2}_j is the start time after the last backward iteration; the superscript h represents the iteration step. A smaller α_i indicates a higher priority.

Similarly, in step (1.1.1), after a forward iteration, we rank a vertex by the latest end time of all its upstream cut arcs:

$$
\beta_i = \max\Bigl(\bigl\{\, f^{h}_j + w_{j,k} : (v_j, v_k) \in B_i \,\bigr\} \cup \{-\infty\}\Bigr)
\tag{5}
$$

Here, B_i is the set of local cut arcs whose tails are the predecessors of vertex v_i, w_{j,k} is the weight of cut arc (v_j, v_k), and f^h_j is the end time after the last forward iteration; the superscript h represents this iteration. A larger β_i indicates a higher priority.

Strategy CAP naturally satisfies the sequence constraint stated in step (2.1.2) and step (1.1.2). In fact, assuming vertex v_a is a predecessor of vertex v_b, the set S_b must be a subset of S_a, so α_a ≤ α_b. Similarly, assuming vertex v_c is a successor of vertex v_d, the set B_d must be a subset of B_c, so β_c ≥ β_d.
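A minimal sketch of how the CAP ranks of Eqs. (4) and (5) might be computed on one sub-digraph is given below. It assumes that the downstream and upstream cut-arc sets S_i and B_i have already been gathered, together with the cut-arc start or end times received from neighboring processors; the data structures and names are assumptions for illustration.

```python
import math

def cap_forward_ranks(S, s_half, w):
    """alpha_i = min over downstream cut arcs (v_k, v_j) in S[i] of
    s^{h+1/2}_j - w[(k, j)], or +inf if S[i] is empty (Eq. (4))."""
    return {
        i: min((s_half[j] - w[(k, j)] for (k, j) in arcs), default=math.inf)
        for i, arcs in S.items()
    }

def cap_backward_ranks(B, f_h, w):
    """beta_i = max over upstream cut arcs (v_j, v_k) in B[i] of
    f^h_j + w[(j, k)], or -inf if B[i] is empty (Eq. (5))."""
    return {
        i: max((f_h[j] + w[(j, k)] for (j, k) in arcs), default=-math.inf)
        for i, arcs in B.items()
    }
```

The local vertices are then sorted by (α_i, s^{h+1/2}_i) in increasing order before the forward pass and by (β_i, f^h_i) in decreasing order before the backward pass, as in steps (2.1.2) and (1.1.2).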

Let CAP-F be the forward iteration using formula (4) and CAP-B be the backward iteration using formula (5), and denote the forward–backward iteration by CAP-PFB. Figure 1 gives an example illustrating these iterations for a scheduling. Figure 1a shows a distributed simple digraph with two sub-digraphs divided by the dashed line. Each circle and its central number represent a vertex, and the weight of each vertex is normalized to 1.0. Each arrow represents an arc showing the data dependency between two vertices, and the weight of each arc is equal to zero. Three cut arcs exist along the dashed line. Figure 1b gives an initial scheduling. The horizontal axis is the execution steps representing the execution time, and the vertical axis is the executed vertices across the two processors. This scheduling requires 7 steps in total.

Fig. 1  One backward–forward iteration of CAP-PFB

Taking the set of end times {f^0_i}_{i=1}^{10} in Fig. 1b as the input, Fig. 1c assigns each vertex a rank defined by formula (5). All local vertices in the 0th processor have the rank −∞ since no local upstream cut arcs exist. B_9 = {(v_2, v_9), (v_3, v_7)}, f^0_3 = 3, and f^0_2 = 2, so β_9 = 3. Similarly, we have β_7 = β_10 = 3 and β_8 = f^0_4 = 4. After one backward iteration as stated in Algorithm 3.1, a better scheduling is generated in Fig. 1d, where one step is saved.

Taking the set of start times {s^{1/2}_i}_{i=1}^{10} in Fig. 1d as the input, Fig. 1e updates the vertex ranks defined by formula (4). All local vertices in the 1st processor have the rank +∞ since no local downstream cut arcs exist. S_1 = {(v_2, v_9), (v_4, v_8)}, s^{1/2}_9 = 4, and s^{1/2}_8 = 6, so α_1 = 4. Similarly, we have α_2 = 4, α_3 = 3, and α_4 = 6. After one forward iteration as stated in Algorithm 3.1, a better scheduling is generated again in Fig. 1f, where another step is saved.

At this point one backward–forward iteration finishes, and an optimal scheduling with 5 steps is obtained. In contrast, the ranking strategies presented in [21, 26] cannot improve the result in Fig. 1b.

5 Convergence analysis for CAP-PFB

This section gives the convergence analysis of algorithm CAP-PFB for simple digraphs. Without loss of generality, we normalize the vertex weights so that q_1 = q_2 = ··· = q_n = 1.


so vertex v_{k′} does not belong to the set K and (v_{k′}, v_k) is a cut arc whose head belongs to the list J. Moreover, k′ < C because α_{k′} > α_k ≥ α_{j_C}. Therefore,

$$
f^{h+1}_{k'} \le s^{h+1/2}_{k'}
\tag{15}
$$

from the induction assumption in Eq. (9). So we have f^{h+1}_{j_C} ≤ s^{h+1/2}_{k′} + |K| from the above five equations. Let j = argmax{f^{h+1/2}_i : i ∈ K}; then s^{h+1/2}_{k′} + |K| ≤ f^{h+1/2}_j, so f^{h+1}_{j_C} ≤ f^{h+1/2}_j. Equation (10) is obtained.

From Eq. (10) and the inequality α_j ≤ α_{j_C}, we can further conclude that

$$
f^{h+1}_{j_C} \le f^{h+1/2}_{j} \le \alpha_j \le \alpha_{j_C} \le s^{h+1/2}_{k'}
\tag{16}
$$

Hence inequality (6) and Lemma 5.1 hold. □

Lemma 5.2  If the distributed acyclic digraph is simple, the scheduling f^{h+1} generated by the forward iteration CAP-F is non-increasing such that

$$
\bigl[f^{h+1}\bigr] \le \bigl[f^{h+1/2}\bigr]
\tag{17}
$$

Here, [f] is the parallel execution time as defined in Eq. (2).

Proof  Given a vertex v_k, assume that one of its successors is the head of a cut arc (v_{j_c}, v_t). Then we have

$$
f^{h+1}_k \le f^{h+1}_{j_c} \le s^{h+1/2}_t < f^{h+1/2}_t \le \Pi^{h+1/2}
\tag{18}
$$

by Lemma 5.1 and the sequence constraint. Otherwise, v_k is a sink or a local predecessor of a sink. Let vertex v_s be that sink; we can also conclude that f^{h+1}_k ≤ f^{h+1}_s by the sequence constraint, and

$$
f^{h+1}_s \le f^{h+1/2}_{j} \le \Pi^{h+1/2}
\tag{19}
$$

by Eq. (10). Therefore, we always have

$$
\bigl[f^{h+1}\bigr] = \Pi^{h+1} - \Lambda^{h+1} \le \Pi^{h+1/2} - \Lambda^{h+1/2} = \bigl[f^{h+1/2}\bigr]
\tag{20}
$$

Lemma 5.2 is obtained. □

Similarly, we have the following conclusion.

Lemma 5.3  If the distributed acyclic digraph is simple, the scheduling f^{h+1/2} generated by the backward iteration CAP-B is non-increasing such that

$$
\bigl[f^{h+1/2}\bigr] \le \bigl[f^{h}\bigr]
\tag{21}
$$

Here, [f] is the parallel execution time as defined in Eq. (2).

From Lemmas 5.1 and 5.2, we have the following conclusion.


Theorem 5.1  The sequence {[f^h]}_{h=0}^{+∞} generated by CAP-PFB of Algorithm 3.1 converges non-increasingly for simple acyclic digraphs, starting from any initial scheduling f^0.

6 Complexity analysis for CAP-PFB

The calculation complexity of strategy CAP-PFB is the same as that of strategy FB apart from the vertex ranking strategies. Let N be the total number of vertices, let P be the number of processors, and let each sub-digraph G have an equal number of vertices. Then step (1.1.1) and step (2.1.1) have complexity O(N/P), while step (1.1.2) and step (2.1.2) have complexity O((N/P) log(N/P)). The calculation complexity of step (1.3) and step (2.3) is the same as that of step (1.1.1) and step (2.1.1). The global reduction in step (1.4) or step (2.4) has calculation complexity O(N/P) + O(log P). So Algorithm 3.1 has the following calculation complexity:

$$
T_{\mathrm{cal}} \cong O\!\left(\frac{N}{P}\right) + O\!\left(\frac{N}{P}\log\frac{N}{P}\right) + O\!\left(\frac{N}{P}\right) + O(\log P)
\cong O\!\left(\frac{N}{P}\log\frac{N}{P}\right) + O(\log P)
\tag{22}
$$

Assume that the average degree of the vertices is O(D) and that the digraph is uniformly partitioned; then each processor has O((N/P)^{1/2}) cut arcs in the two-dimensional case. So the message passing complexity is about

$$
T_{\mathrm{mes}} \cong O\!\left(D \times \left(\frac{N}{P}\right)^{1/2}\right)
\tag{23}
$$

Similarly, the message passing complexity can be estimated in the three-dimensional case as

$$
T_{\mathrm{mes}} \cong O\!\left(D \times \left(\frac{N}{P}\right)^{2/3}\right)
\tag{24}
$$

It is difficult to evaluate the overheads of steps (1.3.2) and (1.3.7) because of the uncertainty of message passing. However, they can be estimated as

$$
T_{\mathrm{pas}} \cong O(T_{\mathrm{mes}} \times L)
\tag{25}
$$

Here, L is the message passing latency. The same estimation applies to steps (2.3.2) and (2.3.7).

The complexity of Algorithm 3.1 is the sum of T_cal and T_pas. In the case of N ≫ P, T_cal dominates. If we fix N/P and increase P, the global reduction dominates. If the digraph parallelism is small, T_pas, i.e. the message passing latency, dominates.

In fact, Algorithm 3.1 is a sequence of iterations, and each iteration has twice the overhead of one parallel sweeping as introduced in the first section. So Algorithm 3.1 has the complexity of 2M parallel sweepings provided that it converges within M iterations. This implies that Algorithm 3.1 has the disadvantage that it is useful only if the vertex priorities are reusable for the parallel sweeping of a series of digraphs or the parallel sweeping is repeated in real applications.
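To make Eqs. (22)–(25) concrete, the short sketch below plugs illustrative numbers into the dominant terms; the vertex count matches the benchmark digraph of the next section, while the degree, latency value, and dropped constant factors are assumptions, so the result is only an order-of-magnitude comparison of the terms.

```python
import math

def cap_pfb_cost_estimate(N, P, D, latency, dim=2):
    """Order-of-magnitude terms of Eqs. (22)-(25); constant factors omitted."""
    n_local = N / P
    t_cal = n_local * math.log2(n_local) + math.log2(P)      # Eq. (22)
    exponent = 0.5 if dim == 2 else 2.0 / 3.0                 # Eq. (23) or (24)
    t_mes = D * n_local ** exponent
    t_pas = t_mes * latency                                    # Eq. (25)
    return t_cal, t_mes, t_pas

# Example: ~140K vertices on 500 processors, assumed average degree 4 and
# latency expressed in the same abstract time unit (hypothetical value).
print(cap_pfb_cost_estimate(N=140_000, P=500, D=4, latency=10.0))
```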


Fig. 2  A structured mesh generated by the software [10] in the scale of 120 × 49 across an airfoil

7 Theoretical benchmarks

This section validates Theorem 5.1 for Algorithm 3.1. Given a distributed digraph, we define the speedup S_P of a feasible scheduling as the serial execution time over the parallel execution time on P processors. Here, the execution time is measured in steps. For example, for a simple digraph, the vertex weight is equal to one step and the arc weight is equal to zero. The number of steps can be accurately calculated by symbolic sweeping. Obviously, a larger speedup means a shorter parallel execution time and implies superior vertex priorities.
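Such a symbolic sweep can be emulated by priority-driven list scheduling on the distributed digraph: in every step each processor executes its highest-priority ready vertex, and vertices released in a step become executable in the next one. The sketch below counts the resulting number of steps for a simple digraph; it is a serial emulation under assumed data structures (rank values in 0..P−1, smaller priority value executes first), not the benchmark code of the paper.

```python
import heapq

def symbolic_sweep_steps(succ, preds_count, rank, priority, n_procs):
    """Count parallel steps: each step, every processor runs its best ready vertex."""
    indeg = dict(preds_count)                      # remaining unfinished predecessors
    ready = {p: [] for p in range(n_procs)}        # per-processor priority queues
    for v, d in indeg.items():
        if d == 0:
            heapq.heappush(ready[rank[v]], (priority[v], v))
    remaining, steps = len(indeg), 0
    while remaining > 0:
        steps += 1
        finished = []
        for p in range(n_procs):                   # at most one vertex per processor
            if ready[p]:
                finished.append(heapq.heappop(ready[p])[1])
        for v in finished:                         # release successors for the next step
            remaining -= 1
            for u in succ.get(v, []):
                indeg[u] -= 1
                if indeg[u] == 0:
                    heapq.heappush(ready[rank[u]], (priority[u], u))
    return steps
```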

7.1 Simple digraph benchmarks

The simple digraphs come from one flux sweeping of the discrete ordinates solution of the neutron transport applications studied in [32] and [33]. The geometry is a two-dimensional structured mesh in the scale of 120 × 49 across an airfoil, as shown in Fig. 2; the discrete ordinates include 24 angles arising from the non-overlapping partitioning of a unit spherical surface. The digraph includes about 140 K vertices and 280 K arcs. Each vertex represents the task of a pair ⟨c_i, a_m⟩ where flux is swept across cell c_i for angle a_m, and its weight is equal to one step. Each arc represents the data dependence between two tasks (⟨c_i, a_m⟩, ⟨c_j, a_m⟩) where cell c_i is the upstream neighbor of cell c_j for angle a_m, and its weight is equal to zero. It is easily concluded that the digraph is acyclic if and only if each cell is convex [32]. This digraph has a maximal speedup of 666 provided that an unlimited number of processors is available.
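For a structured mesh, the ⟨cell, angle⟩ task digraph described above can be assembled along the following lines: for every sweep angle, each cell depends on its upstream neighbors in the x and y directions. The sweep-direction encoding and index layout are simplified assumptions for illustration; the unstructured and cylindrical cases in [32, 33] require a geometric upwind test instead.

```python
def build_sweep_digraph(nx, ny, angles):
    """Vertices are (cell, angle) triples (i, j, m); arcs go from a cell's
    upstream neighbors to the cell, separately for every angle."""
    succ = {}
    for m, (sx, sy) in enumerate(angles):          # sx, sy in {+1, -1}: sweep direction
        for j in range(ny):
            for i in range(nx):
                v = (i, j, m)
                succ.setdefault(v, [])
                iu, ju = i - sx, j - sy            # upstream neighbors for this angle
                if 0 <= iu < nx:
                    succ.setdefault((iu, j, m), []).append(v)
                if 0 <= ju < ny:
                    succ.setdefault((i, ju, m), []).append(v)
    return succ

# Example: a 120 x 49 mesh with four diagonal sweep directions (one per quadrant).
graph = build_sweep_digraph(120, 49, [(1, 1), (1, -1), (-1, 1), (-1, -1)])
```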

The distributed digraph with P sub-digraphs is generated by the non-overlapping mesh partitioning method Inertial Kernighan–Lin (IKL) implemented in the tool Chaco [14]. For each partitioning, the above tasks defined on each cell are assigned to the same processor.

Figure 3 shows the speedup convergence history with the number of iterations for 400 or 500 processors, respectively. Here, the legend CAP-FB represents Algorithm 3.1 coupled with the vertex ranking strategy CAP, and the legend FB represents the traditional forward–backward iterations [21] coupled with the vertex ranking strategy LST. On the horizontal axis, which gives the number of iterations, a half means a backward iteration and an integer means a backward–forward iteration.


Fig. 3  Speedup convergence history for forward–backward iterations

Fig. 4  Speedup comparison for six vertex priorities approaches

Fig. 5  Speedup convergence history for forward–backward iterations

Figure 3 shows the non-decreasing speedup convergence. These results coincide with Theorem 5.1. Moreover, the speedup almost converges after one or two iterations. Figure 3 also shows that the vertex ranking strategy CAP is superior to LST.

Figure 4 shows the speedup of six vertex priorities approaches using hundreds of processors. The legends LST, DFHDS, and SBP represent the three traditional approaches introduced in the first section. The legends CAP-FB+LST, CAP-FB+DFHDS, and CAP-FB+SBP refer to the approach CAP-PFB taking the output of LST, DFHDS, and SBP as its input, respectively. These curves show that CAP-PFB can significantly improve the speedup in each case. Most notably, CAP-PFB increases the speedup for SBP from 201 to 302 when 500 processors are used.

7.2 Non-simple digraphs

Assigning each cut arc a weight of one third of a step, the above digraph is no longer simple. Figure 5 shows the speedup convergence history for Algorithm 3.1. Though convergence is achieved within 8 iterations, the non-increasing behavior is broken after the 7th iteration. Of course, such breaks may challenge the stopping criterion of Algorithm 3.1. Nevertheless, similar to Fig. 3, Fig. 5 also shows that 2 iterations are enough for satisfactory vertex priorities.


8 Real applications

Embedding the vertex priorities approach given by Algorithm 3.1 into the heuristic sweeping framework [32] for a distributed digraph, we can apply it to solve the neutron transport applications studied in a series of publications such as [27, 29, 32, 33]. The traditional forward–backward approach FB [21] and the local approach SBP [32] are used for comparison.

The multi-group neutron transport equations [32] are discretized by the methods of both discrete ordinates and discontinuous finite elements on a two-dimensional unstructured quadrilateral mesh with 57,600 cells. 48 angles are used to evenly partition the unit spherical surface, and 24 groups are used to partition the energy distribution. A digraph is constructed from the tasks of pairs ⟨c_i, a_m⟩ where 24 group-fluxes are swept across cell c_i for angle a_m. Similarly, each arc represents the data dependence between two tasks (⟨c_i, a_m⟩, ⟨c_j, a_m⟩) where cell c_i is the upstream neighbor of cell c_j for angle a_m.

The parallel computer is TianHe-1A [31]. It is a distributed memory machine with 1024 nodes; each node includes two Intel Xeon X5675 2.93 GHz CPUs, each CPU has 6 cores, and each core has a peak performance of 11.72 GFLOPS. This machine has a fat-tree crossbar interconnect network, the dual-direction bandwidth is 160 Gbps, and the message passing MPI [24] latency is about 1.57 microseconds. Though multi-threaded parallelization is supported within each node, only MPI parallelization is considered.

Table 1 lists the elapsed time and the real speedup on TianHe-1A for the heuristic sweeping framework [32] using the three vertex priorities approaches SBP, FB, and CAP-PFB, respectively. The number of processors scales from 32 to 2048. Here, each processor means a CPU core, and four cores are used in each CPU. In total, 100 physical time steps are performed and 12 parallel sweepings are executed for each time step. Meanwhile, two forward–backward iterations are executed for the vertex priorities of CAP-PFB, and the vertex priorities are reused within each time step.

In the case of 2048 processors, CAP-PFB reduces the sweeping time by 24 % and by 14 % compared to SBP and FB, respectively. In the case of hundreds of processors, superlinear speedup appears. This phenomenon mainly benefits from

Table 1  The performance results for parallel sweeping using three strategies

    P                             32     64     128    256    512    1024   2048
    elapsed time   SBP            1781   908    413    193    103    58     41
    (seconds)      FB             1552   762    362    167    93     51     36
                   CAP-PFB        1380   702    331    151    81     45     31
    speedup        SBP            1.0    1.87   4.31   9.22   17.24  30.70  43.41
                   FB             1.0    2.04   4.29   9.29   16.69  30.43  43.11
                   CAP-PFB        1.0    1.97   4.17   9.14   17.04  30.67  44.52
    ratio          CAP-PFB v SBP  22 %   23 %   17 %   22 %   22 %   21 %   24 %
                   CAP-PFB v FB   11 %   8.0 %  8.5 %  9.6 %  13 %   12 %   14 %


8. Drexl A (1991) Scheduling of project networks by job assignment. Manag Sci 37(12):1590–1602
9. Gross JL, Yellen J (eds) (2003) Handbook of graph theory. Series: discrete mathematics and its applications, vol 25. CRC Press, Boca Raton
10. Gridgen (2012) User's manual for version 15. http://www.pointwise.com/gridgen
11. Hackbusch W, Probst T (1997) Downwind Gauss–Seidel smoothing for convection dominated problems. Numer Linear Algebra Appl 4:85–102
12. Hackbusch W, Wittum G (eds) (1993) Incomplete decompositions (ILU)—algorithms, theory and applications. Notes on numerical fluid mechanics, vol 41. Vieweg, Wiesbaden
13. Han H, Ilin VP, Kellogg RB, Yuan W (1992) Analysis of flow directed iterations. J Comput Math 10(1):57–76
14. Hendrickson B, Leland R (1994) The Chaco user's guide: version 2.0. Technical report SAND94-2692, Sandia National Laboratories, Albuquerque, NM
15. Kolisch R, Hartmann S (1999) Heuristic algorithms for solving the resource-constrained project scheduling problem: classification and computational analysis. In: Weglarz J (ed) Project scheduling—recent models, algorithms and applications. Kluwer Academic, Boston, pp 147–178
16. Kolisch R, Hartmann E (2006) Experimental investigation of heuristics for resource-constrained project scheduling: an update. Eur J Oper Res 174(1):23–37
17. Kolisch R (1995) Project scheduling under resource constraints—efficient heuristics for several problem classes. Physica-Verlag, Heidelberg
18. Bang-Jensen J, Gutin G (2001) Digraphs: theory, algorithms and applications. Springer, London
19. Koch KR, Baker RS, Alcouffe RE (1997) Parallel 3-D Sn performance for MPI on Cray-T3D. In: Proc joint intl conference on mathematical methods and supercomputing for nuclear applications, vol 1, pp 377–393
20. Lewis EE, Miller WF (1984) Computational methods of neutron transport. Wiley, New York
21. Li KY, Willis RJ (1992) An iterative scheduling technique for resource-constrained project scheduling. Eur J Oper Res 56:370–379
22. Meng Q, Luitjens J, Berzins M (2010) Dynamic task scheduling for the Uintah framework. In: Proceedings of the 3rd IEEE workshop on many-task computing on grids and supercomputers (MTAGS10)
23. Meng Q, Berzins M, Schmidt J (2011) Using hybrid parallelism to improve memory use in the Uintah framework. In: TeraGrid'11, Salt Lake City, Utah, USA, 18–21 July
24. Gropp W, Lusk E, Skjellum A (1999) Using MPI: portable parallel programming with the message-passing interface, 2nd edn. MIT Press, Cambridge
25. Notz PK, Pawlowski RP, Sutherland JC (2012) Graph-based software design for managing complexity and enabling concurrency in multiphysics PDE software. ACM Trans Math Softw 39(3):1
26. Ozdamar L, Ulusoy G (1996) An iterative local constraint based analysis for solving the resource constrained project scheduling problem. J Oper Manag 14(3):193–208
27. Pautz SD (2002) An algorithm for parallel Sn sweeps on unstructured meshes. Nucl Sci Eng 140:111–136
28. Pautz SD, Pandya T, Adams ML (2011) Scalable parallel prefix solvers for discrete ordinates transport. Nucl Sci Eng 169:245–261
29. Plimpton S, Hendrickson B, Burns S, McLendon W (2000) Parallel algorithms for radiation transport on unstructured grids. In: Proceedings of SuperComputing'2000
30. Thomas P, Salhi S (1997) An investigation into the relationship of heuristic performance with network-resource characteristics. J Oper Res Soc 48(1):34–43
31. Yang X, Liao X, Lu K, Hu Q, Song J, Su J (2011) The TianHe-1A supercomputer: its hardware and software. J Comput Sci Technol 26(3):344–351
32. Mo Z, Zhang A, Wittum G (2009) Scalable heuristic algorithms for the parallel execution of data flow acyclic digraphs. SIAM J Sci Comput 31(5):3626–3642
33. Mo Z, Fu L (2004) Parallel flux sweep algorithm for neutron transport on unstructured grid. J Supercomput 30(1):5–17

