J Supercomput
DOI 10.1007/s11227-013-1022-8

A new parallel algorithm for vertex priorities of data flow acyclic digraphs

Zeyao Mo · Aiqing Zhang · Zhang Yang

Laboratory of Computational Physics, Institute of Applied Physics and Computational Mathematics, P.O. Box 8009, Beijing 100088, China
e-mail: [email protected]

© Springer Science+Business Media New York 2013
Abstract Data flow acyclic directed graphs (digraphs) are widely used to describe the data dependency of mesh-based scientific computing. The parallel execution of such digraphs approximately depicts the flowchart of parallel computing. During parallel execution, vertex priorities are key performance factors. This paper first takes the distributed digraph and its resource-constrained parallel scheduling as the vertex priorities model, and then presents a new parallel algorithm for the solution of vertex priorities using the well-known technique of forward–backward iterations. In particular, a more efficient vertex ranking strategy is proposed for each iteration. In the case of simple digraphs, both theoretical analysis and benchmarks show that the vertex priorities produced by this algorithm make the digraph scheduling time converge non-increasingly with the number of iterations. For non-simple digraphs, benchmarks also show that the new algorithm is superior to many traditional approaches. Embedding the new algorithm into the heuristic framework for the parallel sweeping solution of neutron transport applications, the new vertex priorities improve the performance by about 20 % as the number of processors scales from 32 to 2048.
Keywords Acyclic digraph · Parallel algorithm · Neutron transport
1 Introduction
Data flow acyclic directed graphs (digraphs) [9] are widely used to describe the data dependency for a wide range of mesh-based scientific computing. Each of these digraphs consists of weighted vertices and arcs: each vertex often refers to a mesh cell and its weight often represents the workload; each arc often depicts the data dependency between two neighboring cells and its weight often represents the dependency overhead.
The parallel sweeping solvers are the numerical kernel of the seven-dimensional radiation or neutron transport equations [20] when the discrete ordinates methods ($S_n$) are used. They are typical mesh-based scientific computing applications whose data dependencies are suitable for digraph description. Baker et al. [2] and Koch et al. [19] addressed these solvers on rectangular meshes on earlier massively parallel computers, Plimpton et al. [29] and Pautz et al. [27] extended this research to unstructured meshes, Mo et al. [33] supplemented these works for cylindrical coordinate systems, and recently Pautz et al. [28] presented another heuristic method to improve the inherent parallelism of long-characteristics $S_n$ discretizations. Besides these parallel sweeping solvers, many other mesh-based applications are suitable for digraph description, for example, the parallel downstream relaxation for the direct solution of the upper or lower sparse triangular linear systems arising from the discretization of convection-dominated problems [4, 11, 13], the well-known ILU factorization [12], the dense matrix LU factorization and their multi-threaded versions [3], the patch-based structured mesh AMR simulations [22, 25] and their multi-threaded versions [23], and so on.
The flowchart of parallel computing for the above mesh-based scientific computing applications can be approximately depicted by the parallel execution of the associated digraphs. Nevertheless, finding the minimal execution time of such digraphs is NP-hard [18]. Mo et al. [32] present a heuristic framework consisting of three components. The first is the partitioning method assigning digraph vertices across processors, the second is the parallel sweeping solver for the execution of the distributed digraph, and the third is the vertex priorities strategy deciding which vertex should be executed when many vertices are executable in a processor. For a given distributed digraph, the vertex priorities approach is the most crucial for parallel efficiency.
There are two types of approaches for the calculation of vertex priorities: local and global. Local approaches only use the data flow within each processor, whereas global approaches use the data flow of the digraph across processors. The First-In-First-Out strategy (FIFO), the Geometrical Coordinates KBA strategy [29], the Shortest processor-Boundary Path strategy (SBP) [32], and the Sweeping Direction Upwind strategy [33] are typical local approaches. The Largest End Time strategy [7], the Latest Start Time strategy (LST) [17], the Least Relaxation Time strategy [7], the Maximal Number of Successors strategy [1], the Hybrid strategies [5, 30], the Sampling strategies [6, 8], and the Depth First sweeping strategies (DFHDS) [27] are typical global approaches. Usually, the local approaches are cheaper but less efficient; the global approaches are favorable when the vertex priorities can be reused.
Generally, most of the above vertex priorities approaches can be described by the well-known resource-constrained scheduling models widely used for digraph-based projects or networks [9, 15, 16], except that the constrained resources refer to the number of available processors. Each of these models produces a parallel scheduling in which each vertex has a start time and an end time of execution, and the parallel execution time of the scheduling equals the difference between the maximal vertex end time and the minimal vertex start time. Taking the start or the
A simple digraph is a digraph in which all vertices have equal weights and all arcs have zero weights. Simple digraphs often arise in mesh-based scientific computing: the weights of the vertices are usually equal to each other because the computational formulae are similar across mesh cells, and the data transfer overheads between neighboring vertices are negligible provided that each vertex has enough computational workload.
A distributed digraph consists of many non-overlapping sub-digraphs, each assigned to a processor. A vertex is local to a sub-digraph if and only if it belongs to this sub-digraph; a vertex is local to another vertex if and only if they belong to the same sub-digraph. An arc is a local arc if and only if both its head and its tail belong to the same sub-digraph; otherwise, it is a cut arc. Usually, a sub-digraph includes all its local vertices and all arcs whose head or tail is local. Denote by $R = (r_1, r_2, \dots, r_n)$ the mapping vector whose element $r_i$ ($1 \le i \le n$) is the rank of the processor owning vertex $v_i$, where $P$ is the number of processors.

Given a distributed digraph with $P$ sub-digraphs, the vertex priorities approaches introduced in the first section are equivalent to the solution of the following resource-constrained scheduling model:
$$
\begin{aligned}
&\min\ [f] \\
&\text{subject to} \\
&\qquad f_i + w_{i,j} \le f_j - q_j, \quad i \in D_j,\ j = 1, \dots, n \\
&\qquad |A_k(t)| \le 1, \quad k = 1, \dots, P,\ \forall t
\end{aligned}
\tag{1}
$$
Here, $f = (f_1, f_2, \dots, f_n)$ is a scheduling, represented by a vector of vertex end times; $n$ is the number of vertices; $P$ is the number of processors; $D_j$ is the set of predecessors of vertex $v_j$ (the tails of the arcs entering $v_j$);

$$
\Pi = \max_{1 \le j \le n} \{f_j\}, \qquad \Lambda = \min_{1 \le j \le n} \{f_j - q_j\}, \qquad [f] = \Pi - \Lambda
\tag{2}
$$

denote the end time, the start time, and the execution time of the digraph, respectively; and

$$
A_k(t) = \{v_j : f_j - q_j \le t < f_j\} \cap \{v_j : r_j = k\}
\tag{3}
$$

is the set of vertices executing on processor $k$ at time $t$. Obviously, if each vertex $v_i$ is assigned the priority $f_i$, then the digraph has the final execution time $[f]$.
Two constraints appear in Eq. (1). The first is the sequence constraint: a vertex must not execute until all its predecessors have finished. The second is the resource constraint: at most one vertex executes at a time on each processor. A scheduling is feasible if and only if both constraints are satisfied, and optimal if and only if $[f]$ is minimal.
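As a sanity check of the model, the following minimal sketch evaluates Eqs. (1)–(3) for a given scheduling. The dictionary-based encoding (end times `f` and weights `q` keyed by vertex index, arc weights keyed by `(tail, head)`) and the function names are our own assumptions.

```python
def execution_time(f, q):
    """[f] = Pi - Lambda of Eq. (2): maximal end time minus minimal start time."""
    Pi = max(f.values())
    Lam = min(f[j] - q[j] for j in f)
    return Pi - Lam

def is_feasible(f, q, arcs, R, eps=1e-12):
    """Check both constraints of Eq. (1) for a scheduling f (vertex end times).
    arcs maps (i, j) -> w_ij; R[j] is the rank of the processor owning v_j."""
    # Sequence constraint: f_i + w_ij <= f_j - q_j for every arc (v_i, v_j).
    for (i, j), w in arcs.items():
        if f[i] + w > f[j] - q[j] + eps:
            return False
    # Resource constraint |A_k(t)| <= 1: on each processor, the execution
    # intervals [f_j - q_j, f_j) must be pairwise disjoint.
    by_proc = {}
    for j in f:
        by_proc.setdefault(R[j], []).append((f[j] - q[j], f[j]))
    for intervals in by_proc.values():
        intervals.sort()
        for (s1, e1), (s2, e2) in zip(intervals, intervals[1:]):
            if s2 < e1 - eps:
                return False
    return True
```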
3 Parallel forward–backward iterations
Li et al. [21] present a serial technique of forward–backward iterations (FB) to reduce the execution time of a feasible scheduling for projects or networks. Here, we consider its parallel version for a distributed digraph in Algorithm 3.1. Unlike the serial forward–backward iterations, a new ranking strategy is introduced that satisfies the sequence constraint of Eq. (1).
Algorithm 3.1 PFB($G$, $r$, $q$, $w$, $\alpha$, $\beta$, $f^0$, $M_{its}$, $\varepsilon$, $f$)

INPUT:
$G$: local sub-digraph of the digraph;
$r$: processor mapping vector for all vertices of the digraph;
$q$: weight vector of the local vertices;
$w$: weight matrix of both local arcs and cut arcs;
$\alpha$: forward ranks of the local vertices; a smaller rank means a higher priority;
$\beta$: backward ranks of the local vertices; a larger rank means a higher priority;
$M_{its}$: maximal number of forward–backward iterations;
$\varepsilon$: convergence error threshold;
$f^0$: initial scheduling of the local vertices.

OUTPUT:
$f$: final scheduling of the local vertices.

BEGIN
$h = 0$, $m$ = number of local vertices.
$\Lambda^0$ = start time of the initial scheduling, $\Pi^0$ = end time of the initial scheduling.
DO in PARALLEL {
(1) Execute a backward iteration producing schedule $f^{h+1/2}$ from $f^h$.
  (1.1) Compute ranks for the local vertices from $f^h$ and sort.
    (1.1.1) Compute $\beta$ from $f^h$ using a ranking strategy as discussed in the next section.
    (1.1.2) Order the local vertex indices $\{i_g\}_{g=1,\dots,m}$ such that
      $(\beta_{i_1}, f^h_{i_1}) \ge (\beta_{i_2}, f^h_{i_2}) \ge \cdots \ge (\beta_{i_m}, f^h_{i_m})$.
      Here, $(a_1, b_1) \ge (a_2, b_2)$ means $(a_1 > a_2)$ or $((a_1 = a_2)$ and $(b_1 \ge b_2))$.
  (1.2) $\Pi^{h+1/2} = \Pi^h$, $I = (-\infty, \Pi^{h+1/2})$ // initially available intervals.
  (1.3) FOR each vertex $i_g$: $g = 1, 2, \dots, m$ DO {
    (1.3.1) Let $t_{i_g} = f^h_{i_g}$ and go to step (1.3.4) if $v_{i_g}$ is a sink.
    (1.3.2) Receive every $s^{h+1/2}_l$ from each remote successor $v_l$.
    (1.3.3) Compute $t_{i_g} = \min_{(v_{i_g}, v_l) \in E} \{ s^{h+1/2}_l - w_{i_g,l} \}$.
    (1.3.4) Update $f^{h+1/2}_{i_g} = \max\{ t : (t \le t_{i_g})\ \text{and}\ ((t - q_{i_g}, t) \subseteq I) \}$.
    (1.3.5) Update $s^{h+1/2}_{i_g} = f^{h+1/2}_{i_g} - q_{i_g}$.
    (1.3.6) Update the available intervals $I = I \setminus (s^{h+1/2}_{i_g}, f^{h+1/2}_{i_g})$.
    (1.3.7) Send $s^{h+1/2}_{i_g}$ to each processor owning a predecessor of $v_{i_g}$.
  } END FOR $g = 1, 2, \dots, m$.
  (1.4) Synchronize the earliest start time across all processors:
    $\Lambda^{h+1/2} = \min\{ s^{h+1/2}_l : v_l \in V \}$.
(2) Execute a forward iteration producing schedule $f^{h+1}$ from $f^{h+1/2}$.
  (2.1) Compute ranks for the local vertices from $f^{h+1/2}$ and sort.
    (2.1.1) Compute $\alpha$ from $f^{h+1/2}$ using a ranking strategy as discussed in the next section.
    (2.1.2) Order the local vertex indices $\{i_g\}_{g=1,\dots,m}$ such that
      $(\alpha_{i_1}, s^{h+1/2}_{i_1}) \le (\alpha_{i_2}, s^{h+1/2}_{i_2}) \le \cdots \le (\alpha_{i_m}, s^{h+1/2}_{i_m})$.
  (2.2) $\Lambda^{h+1} = \Lambda^{h+1/2}$, $I = (\Lambda^{h+1}, +\infty)$ // initially available intervals.
  (2.3) FOR each vertex $i_g$: $g = 1, 2, \dots, m$ DO {
    (2.3.1) Let $t_{i_g} = s^{h+1/2}_{i_g}$ and go to step (2.3.4) if $v_{i_g}$ is a source.
    (2.3.2) Receive every $f^{h+1}_j$ from each remote predecessor $v_j$.
    (2.3.3) Compute $t_{i_g} = \max_{(v_j, v_{i_g}) \in E} \{ f^{h+1}_j + w_{j,i_g} \}$.
    (2.3.4) Update $s^{h+1}_{i_g} = \min\{ t : (t \ge t_{i_g})\ \text{and}\ ([t, t + q_{i_g}) \subseteq I) \}$.
    (2.3.5) Update $f^{h+1}_{i_g} = s^{h+1}_{i_g} + q_{i_g}$.
    (2.3.6) Update the available intervals $I = I \setminus (s^{h+1}_{i_g}, f^{h+1}_{i_g})$.
    (2.3.7) Send $f^{h+1}_{i_g}$ to each processor owning a successor of $v_{i_g}$.
  } END FOR $g = 1, 2, \dots, m$.
  (2.4) Synchronize the latest end time across all processors:
    $\Pi^{h+1} = \max\{ f^{h+1}_j : v_j \in V \}$.
(3) $h = h + 1$.
} UNTIL ($h > M_{its}$ or $|(\Pi^{h-1} - \Lambda^{h-1}) - (\Pi^h - \Lambda^h)| < \varepsilon$).
Remark 3.1 The sequence of vertices $v_{i_g}$ ($g = 1, 2, \dots, m$) produced in step (1.1.2) or step (2.1.2) satisfies the sequence constraint of Eq. (1).

Remark 3.2 In step (1.3.4), $t_{i_g}$ is the latest end time satisfying the sequence constraint, and $f^{h+1/2}_{i_g}$ is the latest end time satisfying both constraints.

Remark 3.3 In step (2.3.4), $t_{i_g}$ is the earliest start time satisfying the sequence constraint, and $s^{h+1}_{i_g}$ is the earliest start time satisfying both constraints.

Remark 3.4 The last row of Algorithm 3.1 is the termination condition. If the sequence $[f^h]$ converges non-increasingly in $h$, the final output is the solution; otherwise, we should take the output with the shortest execution time in the iteration history.
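To illustrate the placement rule of steps (2.3.1)–(2.3.6), here is a serial, single-address-space sketch of one forward iteration in Python. Message passing is omitted (all predecessor end times are assumed to be directly readable), and the helper names are our own, not the paper's.

```python
def forward_iteration(order, q, arcs_in, R, prev_s):
    """Serial sketch of step (2) of Algorithm 3.1. `order` lists all vertices
    sorted as in step (2.1.2), so every predecessor precedes its successors.
    arcs_in[j] holds (i, w) pairs for the arcs (v_i, v_j); prev_s is the
    start-time vector s^{h+1/2} from the preceding backward iteration."""
    s, f = {}, {}
    busy = {}  # processor rank -> sorted list of occupied [start, end) intervals
    for j in order:
        if not arcs_in.get(j):
            t = prev_s[j]  # step (2.3.1): v_j is a source
        else:
            # step (2.3.3): earliest start satisfying the sequence constraint
            t = max(f[i] + w for i, w in arcs_in[j])
        # step (2.3.4): earliest gap of length q[j] on processor R[j]
        for (bs, be) in busy.get(R[j], []):
            if t + q[j] <= bs:
                break              # [t, t + q_j) fits before this interval
            t = max(t, be)         # otherwise start no earlier than its end
        s[j], f[j] = t, t + q[j]   # step (2.3.5)
        busy.setdefault(R[j], []).append((s[j], f[j]))
        busy[R[j]].sort()          # step (2.3.6): mark the interval occupied
    return s, f
```

The backward iteration of step (1) is the mirror image: vertices are visited in decreasing rank order and each is placed at the latest feasible end time instead.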
4 Vertex ranking strategies
In step (1.1) and step (2.1), the vertex ranking strategies are crucial not only for the convergence but also for the quality of the sequence $[f^h]$. Li et al. [21] assign each vertex its end time after the last backward iteration and its start time after the last forward iteration as ranks; Ozdamar et al. [26] present a similar but more complex strategy; Mo et al. [32] use the Shortest processor-Boundary Path strategy (SBP), in which vertices are ranked by their distance from the sub-digraph boundaries. The local approaches for vertex priorities introduced in the first section can also supply rank sequences. However, no matter which strategy is used, one problem remains open: whether and how $[f^h]$ converges.

In the remainder of this section, a new vertex ranking strategy called the Cut Arc Preference strategy (CAP) is proposed. It not only makes the above open problem answerable but also yields shorter execution times. Moreover, its computational complexity is similar to that of local vertex priority approaches in the literature such as SBP. This strategy differs from traditional strategies mainly in its viewpoint that cut arcs should be executed preferentially, since their earlier completion indicates larger inherent parallelism.
In step (2.1.1), after a backward iteration, we rank a vertex by the earliest start time of all its downstream cut arcs:

$$
\alpha_i = \min_{(v_k, v_j) \in S_i} \left( \{ s^{h+1/2}_j - w_{k,j} \} \cup \{ +\infty \} \right)
\tag{4}
$$

Here, $S_i$ is the set of cut arcs whose heads are successors of vertex $v_i$, $w_{k,j}$ is the weight of cut arc $(v_k, v_j)$, and $s^{h+1/2}_j$ is the start time after the last backward iteration; the superscript $h$ denotes the iteration step. A smaller $\alpha_i$ indicates a higher priority.

Similarly, in step (1.1.1), after a forward iteration, we rank a vertex by the latest end time of all its upstream cut arcs:

$$
\beta_i = \max_{(v_j, v_k) \in B_i} \left( \{ f^h_j + w_{j,k} \} \cup \{ -\infty \} \right)
\tag{5}
$$

Here, $B_i$ is the set of cut arcs whose tails are predecessors of vertex $v_i$, $w_{j,k}$ is the weight of cut arc $(v_j, v_k)$, and $f^h_j$ is the end time after the last forward iteration. A larger $\beta_i$ indicates a higher priority.
Strategy CAP naturally satisfies the sequence constraint required in step (2.1.2) and step (1.1.2). In fact, if vertex $v_a$ is a predecessor of vertex $v_b$, the set $S_b$ must be a subset of $S_a$, so $\alpha_a \le \alpha_b$. Similarly, if vertex $v_c$ is a successor of vertex $v_d$, the set $B_d$ must be a subset of $B_c$, so $\beta_c \ge \beta_d$.
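The monotonicity argument above also suggests a simple way to evaluate formula (4) by dynamic programming over a topological order: a vertex sees its own outgoing cut arcs and inherits the ranks of its local successors. The sketch below computes $\alpha$ this way in a single address space; the recursion, the helper names, and the treatment of remote start times as directly readable are our assumptions (in Algorithm 3.1 they arrive by message passing). $\beta$ is computed symmetrically from formula (5), with max, $-\infty$, and predecessor sets.

```python
import math

def topo_order(vertices, succ):
    """Topological order by Kahn's algorithm; succ[u] lists direct successors."""
    indeg = {v: 0 for v in vertices}
    for u in vertices:
        for v in succ.get(u, []):
            indeg[v] += 1
    frontier = [v for v in vertices if indeg[v] == 0]
    order = []
    while frontier:
        u = frontier.pop()
        order.append(u)
        for v in succ.get(u, []):
            indeg[v] -= 1
            if indeg[v] == 0:
                frontier.append(v)
    return order

def cap_alpha(vertices, succ, w, R, s):
    """Sketch of formula (4): alpha_i is the earliest start time (minus arc
    weight) over the downstream cut arcs of v_i, or +inf if there is none.
    Sweeping in reverse topological order guarantees alpha_a <= alpha_b
    whenever v_a is a predecessor of v_b, as the sort in step (2.1.2) needs."""
    alpha = {}
    for i in reversed(topo_order(vertices, succ)):
        best = math.inf
        for j in succ.get(i, []):
            if R[i] != R[j]:                      # (v_i, v_j) is a cut arc
                best = min(best, s[j] - w[(i, j)])
            else:                                 # local arc: inherit alpha_j
                best = min(best, alpha[j])
        alpha[i] = best
    return alpha
```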
Let CAP-F be the forward iteration using formula (4) and CAP-B be the backward iteration using formula (5), and denote the combined forward–backward iteration by CAP-PFB.

Fig. 1 One backward–forward iteration of CAP-PFB

Figure 1 gives an example illustrating these iterations for a scheduling. Figure 1a shows a distributed simple digraph with two sub-digraphs divided by the dashed line. Each circle and its central number represent a vertex; the weight of each vertex is normalized to 1.0. Each arrow represents an arc showing the data dependency between two vertices; the weight of each arc is zero. Three cut arcs exist along the dashed line. Figure 1b gives an initial scheduling. The horizontal axis is the execution steps representing the execution time; the vertical axis is the executed vertices across the two processors. This scheduling requires 7 steps in total.

Taking the set of end times $\{f^0_i\}_{i=1}^{10}$ in Fig. 1b as the input, Fig. 1c assigns each vertex a rank defined by formula (5). All local vertices in the 0th processor have the rank $-\infty$ since no local upstream cut arcs exist. $B_9 = \{(v_2, v_9), (v_3, v_7)\}$, $f^0_3 = 3$, $f^0_2 = 2$, so $\beta_9 = 3$. Similarly, we have $\beta_7 = \beta_{10} = 3$ and $\beta_8 = f^0_4 = 4$. After one backward iteration as stated in Algorithm 3.1, a better scheduling is generated in Fig. 1d, where one step is saved.

Taking the set of start times $\{s^{1/2}_i\}_{i=1}^{10}$ in Fig. 1d as the input, Fig. 1e updates the vertex ranks defined by formula (4). All local vertices in the 1st processor have the rank $+\infty$ since no local downstream cut arcs exist. $S_1 = \{(v_2, v_9), (v_4, v_8)\}$, $s^{1/2}_9 = 4$, $s^{1/2}_8 = 6$, so $\alpha_1 = 4$. Similarly, we have $\alpha_2 = 4$, $\alpha_3 = 3$, $\alpha_4 = 6$. After one forward iteration as stated in Algorithm 3.1, a better scheduling is generated again in Fig. 1f, where another step is saved.

At this point, one backward–forward iteration has finished, and an optimal scheduling with 5 steps has been obtained. In contrast, the ranking strategies presented in [21, 26] cannot improve the scheduling in Fig. 1b.
5 Convergence analysis for CAP-PFB
This section gives the convergence analysis of algorithm CAP-PFB for simple digraphs. Without loss of generality, we normalize the vertex weights to $q_1 = q_2 = \cdots = q_n = 1$.
so vertex $v_{\bar{k}}$ does not belong to the set $K$, and $(v_{\bar{k}}, v_k)$ is a cut arc whose head belongs to the list $J$. Moreover, $\bar{k} < C$ because $\alpha_{\bar{k}} > \alpha_k \ge \alpha_{j_C}$. Therefore,

$$
f^{h+1}_{\bar{k}} \le s^{h+1/2}_{\bar{k}}
\tag{15}
$$

from the induction assumption in Eq. (9). So we have $f^{h+1}_{j_C} \le s^{h+1/2}_{\bar{k}} + |K|$ from the above five equations. Let $j = \arg\max\{f^{h+1/2}_i : i \in K\}$; then $s^{h+1/2}_{\bar{k}} + |K| \le f^{h+1/2}_j$, so $f^{h+1}_{j_C} \le f^{h+1/2}_j$. Equation (10) is obtained.

From Eq. (10) and the inequality $\alpha_j \le \alpha_{j_C}$, we can further conclude that

$$
f^{h+1}_{j_C} \le f^{h+1/2}_j \le \alpha_j \le \alpha_{j_C} \le s^{h+1/2}_{\bar{k}}
\tag{16}
$$

Inequality (6) and Lemma 5.1 follow.
Lemma 5.2 If the distributed acyclic digraph is simple, the scheduling $f^{h+1}$ generated by the forward iteration CAP-F is non-increasing, i.e.,

$$
[f^{h+1}] \le [f^{h+1/2}]
\tag{17}
$$

Here, $[f]$ is the parallel execution time as defined in Eq. (2).
Proof Given a vertex $v_k$, assume one of its successors is the head of a cut arc $(v_{j_c}, v_t)$. Then we have

$$
f^{h+1}_k \le f^{h+1}_{j_c} \le s^{h+1/2}_t < f^{h+1/2}_t \le \Pi^{h+1/2}
\tag{18}
$$

by Lemma 5.1 and the sequence constraint. Otherwise, $v_k$ is a sink or a local predecessor of a sink. Let vertex $v_s$ be the sink; we can also conclude that $f^{h+1}_k \le f^{h+1}_s$ by the sequence constraint, and

$$
f^{h+1}_s \le f^{h+1/2}_j \le \Pi^{h+1/2}
\tag{19}
$$

by Eq. (10). Therefore, we always have

$$
[f^{h+1}] = \Pi^{h+1} - \Lambda^{h+1} \le \Pi^{h+1/2} - \Lambda^{h+1/2} = [f^{h+1/2}]
\tag{20}
$$

Lemma 5.2 is obtained.
Similarly, we have the following conclusion.
Lemma 5.3 If the distributed acyclic digraph is simple, the scheduling $f^{h+1/2}$ generated by the backward iteration CAP-B is non-increasing, i.e.,

$$
[f^{h+1/2}] \le [f^h]
\tag{21}
$$

Here, $[f]$ is the parallel execution time as defined in Eq. (2).
From Lemmas 5.2 and 5.3, we have the following conclusion.
Theorem 5.1 The sequence $\{[f^h]\}_{h=0}^{+\infty}$ generated by the CAP-PFB iterations of Algorithm 3.1 converges non-increasingly for simple acyclic digraphs, starting from any initial scheduling $f^0$.
6 Complexity analysis for CAP-PFB
The computational complexity of strategy CAP-PFB is the same as that of strategy FB apart from the vertex ranking strategies. Let $N$ be the total number of vertices and $P$ the number of processors, and let each sub-digraph $G$ have an equal number of vertices. Then step (1.1.1) and step (2.1.1) have complexity $O(N/P)$, and step (1.1.2) and step (2.1.2) have complexity $O((N/P)\log(N/P))$. The computational complexity of step (1.3) and step (2.3) is the same as that of step (1.1.1) and step (2.1.1). The global reduction in step (1.4) or step (2.4) has complexity $O(N/P) + O(\log P)$. So Algorithm 3.1 has the computational complexity

$$
T_{cal} \cong O\!\left(\frac{N}{P}\right) + O\!\left(\frac{N}{P}\log\frac{N}{P}\right) + O\!\left(\frac{N}{P}\right) + O(\log P) \cong O\!\left(\frac{N}{P}\log\frac{N}{P}\right) + O(\log P)
\tag{22}
$$
Assume the average degree of the vertices is $O(D)$ and the digraph is uniformly partitioned; then each processor has $O((N/P)^{1/2})$ cut arcs in the two-dimensional case. So the message passing complexity is about

$$
T_{mes} \cong O\!\left(D \times \left(\frac{N}{P}\right)^{1/2}\right)
\tag{23}
$$

Similarly, the message passing complexity in the three-dimensional case can be estimated as

$$
T_{mes} \cong O\!\left(D \times \left(\frac{N}{P}\right)^{2/3}\right)
\tag{24}
$$
It is difficult to evaluate the overheads of steps (1.3.2) and (1.3.7) because of the uncertainty of message passing. However, they can be estimated as

$$
T_{pas} \cong O(T_{mes} \times L)
\tag{25}
$$

Here, $L$ is the message passing latency. The same estimate applies to steps (2.3.2) and (2.3.7).
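For intuition, the cost model of Eqs. (22)–(25) can be evaluated directly. This is a toy calculator under the stated assumptions (uniform partitioning, average degree $D$), with all names our own and all constant factors dropped.

```python
import math

def pfb_cost_model(N, P, D, L, dim=3):
    """Toy evaluation of Eqs. (22)-(25): returns (T_cal, T_mes, T_pas) up to
    constant factors for N vertices, P processors, average degree D, and
    message latency L; dim selects the 2D or 3D cut-arc estimate."""
    n_local = N / P
    T_cal = n_local * math.log2(n_local) + math.log2(P)   # Eq. (22)
    exponent = 0.5 if dim == 2 else 2.0 / 3.0
    T_mes = D * n_local ** exponent                       # Eqs. (23)/(24)
    T_pas = T_mes * L                                     # Eq. (25)
    return T_cal, T_mes, T_pas
```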
The complexity of Algorithm 3.1 is the sum of $T_{cal}$ and $T_{pas}$. In the case of $N \gg P$, $T_{cal}$ dominates. If we fix $N/P$ and increase $P$, the global reduction dominates. If the digraph parallelism is small, $T_{pas}$, i.e., the message passing latency, dominates.

In fact, Algorithm 3.1 is a sequence of iterations, and each iteration costs about twice as much as one parallel sweep as introduced in the first section. So Algorithm 3.1 has the cost of $2M$ parallel sweeps provided that it converges within $M$ iterations. This implies the disadvantage that Algorithm 3.1 pays off only when the vertex priorities are reusable, either for the parallel sweeping of a series of digraphs or when the parallel sweep is repeated in real applications.
Fig. 2 A structured mesh generated by the software [10] in the scale of 120 × 49 across an airfoil
7 Theoretical benchmarks
This section validates Theorem 5.1 for Algorithm 3.1. Given a distributed digraph, we define the speedup $S_P$ of a feasible scheduling as the serial execution time divided by the parallel execution time on $P$ processors. Here, the execution time is measured in steps; for example, for a simple digraph, each vertex weight equals one step and each arc weight equals zero. The number of steps can be calculated exactly by a symbolic sweep. Obviously, a larger speedup means a shorter parallel execution time and implies superior vertex priorities.
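A symbolic sweep of a simple digraph can be simulated with a few lines of list scheduling: at each step, every processor executes the highest-priority ready vertex it owns. The sketch below counts the steps; with unit vertex weights the speedup is then $S_P = n/\text{steps}$. The function and argument names are illustrative, and we assume a smaller priority value means higher priority (e.g., the end times $f_i$ of Algorithm 3.1).

```python
def symbolic_sweep_steps(vertices, succ, R, priority):
    """Count execution steps of a simple digraph (unit vertex weights, zero
    arc weights) under given vertex priorities; smaller priority value wins."""
    indeg = {v: 0 for v in vertices}
    for u in vertices:
        for v in succ.get(u, []):
            indeg[v] += 1
    ready = {v for v in vertices if indeg[v] == 0}
    done, steps = 0, 0
    while done < len(vertices):
        steps += 1
        # Each processor picks its highest-priority ready vertex for this step.
        picks = {}
        for v in ready:
            p = R[v]
            if p not in picks or priority[v] < priority[picks[p]]:
                picks[p] = v
        # Execute the picks; freed successors become ready for the next step.
        for v in picks.values():
            ready.remove(v)
            done += 1
            for u in succ.get(v, []):
                indeg[u] -= 1
                if indeg[u] == 0:
                    ready.add(u)
    return steps
```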
7.1 Simple digraph benchmarks
The simple digraphs come from one flux sweep of the discrete ordinates solution of the neutron transport applications studied in [32] and [33]. The geometry is a two-dimensional structured mesh of size 120 × 49 around an airfoil, as shown in Fig. 2; the discrete ordinates include 24 angles arising from a non-overlapping partitioning of the unit spherical surface. The digraph includes about 140 K vertices and 280 K arcs. Each vertex represents the task $(c_i, a_m)$ in which the flux is swept across cell $c_i$ for angle $a_m$, and its weight equals one step. Each arc represents the data dependency between two tasks $(c_i, a_m)$ and $(c_j, a_m)$, where cell $c_i$ is the upstream neighbor of cell $c_j$ for angle $a_m$, and its weight equals zero. It is easily concluded that the digraph is acyclic if and only if each cell is convex [32]. This digraph has a maximal speedup of 666 provided that an unlimited number of processors is available.
The distributed digraph with $P$ sub-digraphs is generated by the non-overlapping mesh partitioning method of Inertial Kernighan–Lin (IKL) implemented in the tool Chaco [14]. For each partitioning, the tasks defined on a cell are all assigned to the same processor.
Figure 3 shows the speedup convergence history versus the number of iterations for 400 and 500 processors, respectively. Here, legend CAP-FB represents Algorithm 3.1 coupled with the vertex ranking strategy CAP, and legend FB represents the traditional forward–backward iterations [21] coupled with the vertex ranking strategy LST. On the horizontal axis, a half-integer denotes a backward iteration and an integer denotes a full backward–forward iteration.
Fig. 3 Speedup convergence history for forward–backward iterations

Fig. 4 Speedup comparison for six vertex priorities approaches

Fig. 5 Speedup convergence history for forward–backward iterations
Figure 3 shows the non-decreasing speedup convergence. These results coincide with Theorem 5.1. Moreover, the speedup almost converges after one or two iterations. Figure 3 also shows that the vertex ranking strategy CAP is superior to LST.

Figure 4 shows the speedup of six vertex priorities approaches using hundreds of processors. Legends LST, DFHDS, and SBP represent three traditional approaches introduced in the first section. Legends CAP-FB+LST, CAP-FB+DFHDS, and CAP-FB+SBP refer to CAP-PFB taking the output of LST, DFHDS, and SBP as its input, respectively. These curves show that CAP-PFB significantly improves the speedup in each case. In particular, CAP-PFB increases the speedup of SBP from 201 to 302 when 500 processors are used.
7.2 Non-simple digraphs
Assigning each cut arc a weight of one-third of a step, the above digraph is no longer simple. Figure 5 shows the speedup convergence history for Algorithm 3.1. Although convergence is achieved within 8 iterations, the non-increasing behavior is broken after the 7th iteration. Such breaks may challenge the stopping criterion of Algorithm 3.1. Nevertheless, similar to Fig. 3, Fig. 5 also shows that 2 iterations are enough for satisfactory vertex priorities.
8 Real applications
Embedding the vertex priorities approach given by Algorithm 3.1 into the heuristic sweeping framework [32] for a distributed digraph, we can apply it to solve the neutron transport applications studied in a series of publications such as [27, 29, 32, 33]. The traditional forward–backward approach FB [21] and the local approach SBP [32] are used for comparison.

The multi-group neutron transport equations [32] are discretized by the methods of discrete ordinates and discontinuous finite elements on a two-dimensional unstructured quadrilateral mesh with 57600 cells. 48 angles are used to evenly partition the unit spherical surface, and 24 groups are used to partition the energy distribution. A digraph is constructed from the tasks $(c_i, a_m)$ in which the 24 group fluxes are swept across cell $c_i$ for angle $a_m$. Similarly, each arc represents the data dependency between two tasks $(c_i, a_m)$ and $(c_j, a_m)$, where cell $c_i$ is the upstream neighbor of cell $c_j$ for angle $a_m$.
The parallel computer is TianHe-1A [31]. It is a distributed memory machine with 1024 nodes; each node includes two Intel Xeon EX5675 2.93 GHz CPUs, each CPU has 6 cores, and each core has a peak performance of 11.72 GFLOPS. The machine has a fat-tree crossbar interconnect whose bidirectional bandwidth is 160 Gbps, and the MPI [24] message passing latency is about 1.57 microseconds. Although multi-threaded parallelization is supported within each node, only MPI parallelization is considered.
Table 1 lists the elapsed time and the real speedup on TianHe-1A for the heuristic sweeping framework [32] using the three vertex priorities approaches SBP, FB, and CAP-PFB, respectively. The number of processors scales from 32 to 2048. Here, each processor means a CPU core, and four cores are used in each CPU. In total, 100 physical time steps are performed and 12 parallel sweeps are executed for each time step. Two forward–backward iterations are executed for the vertex priorities of CAP-PFB, and the vertex priorities are reused within each time step.

In the case of 2048 processors, CAP-PFB reduces the sweeping time by 24 % and by 14 % compared to SBP and FB, respectively. In the case of hundreds of processors, superlinear speedup appears. This phenomenon mainly benefits from
Table 1 The performance results for parallel sweeping using three strategies

P                             32     64    128    256    512   1024   2048
Elapsed time (s)   SBP      1781    908    413    193    103     58     41
                   FB       1552    762    362    167     93     51     36
                   CAP-PFB  1380    702    331    151     81     45     31
Speedup            SBP       1.0   1.87   4.31   9.22  17.24  30.70  43.41
                   FB        1.0   2.04   4.29   9.29  16.69  30.43  43.11
                   CAP-PFB   1.0   1.97   4.17   9.14  17.04  30.67  44.52
Ratio    CAP-PFB vs. SBP    22 %   23 %   17 %   22 %   22 %   21 %   24 %
         CAP-PFB vs. FB     11 %  8.0 %  8.5 %  9.6 %   13 %   12 %   14 %
References
8. Drexl A (1991) Scheduling of project networks by job assignment. Manag Sci 37(12):1590–1602
9. Gross JL, Yellen J (eds) (2003) Handbook of graph theory. Series: discrete mathematics and its applications, vol 25. CRC Press, Boca Raton
10. Gridgen (2012) User's manual for version 15. http://www.pointwise.com/gridgen
11. Hackbusch W, Probst T (1997) Downwind Gauss–Seidel smoothing for convection dominated problems. Numer Linear Algebra Appl 4:85–102
12. Hackbusch W, Wittum G (eds) (1993) Incomplete decompositions (ILU)—algorithms, theory and applications. Notes on numerical fluid mechanics, vol 41. Vieweg, Wiesbaden
13. Han H, Ilin VP, Kellogg RB, Yuan W (1992) Analysis of flow directed iterations. J Comput Math
10(1):57–76
14. Hendrickson B, Leland R (1994) The Chaco user's guide: version 2.0. Technical report SAND94-2692, Sandia National Laboratories, Albuquerque, NM
15. Kolisch R, Hartmann S (1999) Heuristic algorithms for solving the resource-constrained project scheduling problem: classification and computational analysis. In: Weglarz J (ed) Project scheduling—recent models, algorithms and applications. Kluwer Academic, Boston, pp 147–178
16. Kolisch R, Hartmann S (2006) Experimental investigation of heuristics for resource-constrained project scheduling: an update. Eur J Oper Res 174(1):23–37
17. Kolisch R (1995) Project scheduling under resource constraints—efficient heuristics for several problem classes. Physica-Verlag, Heidelberg
18. Bang-Jensen J, Gutin G (2001) Digraphs: theory, algorithms and applications. Springer, London
19. Koch KR, Baker RS, Alcouffe RE (1997) Parallel 3-D S_n performance for MPI on Cray T3D. In: Proc joint intl conference on mathematical methods and supercomputing for nuclear applications, vol 1, pp 377–393
20. Lewis EE, Miller WF (1984) Computational methods of neutron transport. Wiley, New York
21. Li KY, Willis RJ (1992) An iterative scheduling technique for resource-constrained project scheduling. Eur J Oper Res 56:370–379
22. Meng Q, Luitjens J, Berzins M (2010) Dynamic task scheduling for the Uintah framework. In: Proceedings of the 3rd IEEE workshop on many-task computing on grids and supercomputers (MTAGS10)
23. Meng Q, Berzins M, Schmidt J (2011) Using hybrid parallelism to improve memory use in the Uintah framework. In: TeraGrid'11, Salt Lake City, Utah, USA, 18–21 July
24. Gropp W, Lusk E, Skjellum A (1999) Using MPI: portable parallel programming with the message-
passing interface, 2nd edn. MIT Press, Cambridge
25. Notz PK, Pawlowski RP, Sutherland JC (2012) Graph-based software design for managing complexity
and enabling concurrency in multiphysics PDE software. ACM Trans Math Software 39(3):1
26. Ozdamar L, Ulusoy G (1996) An iterative local constraint based analysis for solving the resource
constrained project scheduling problem. J Oper Manag 14(3):193–208
27. Pautz SD (2002) An algorithm for parallel S_n sweeps on unstructured meshes. Nucl Sci Eng 140:111–136
28. Pautz SD, Pandya T, Adams ML (2011) Scalable parallel prefix solvers for discrete ordinates transport. Nucl Sci Eng 169:245–261
29. Plimpton S, Hendrickson B, Burns S, McLendon W (2000) Parallel algorithms for radiation transport on unstructured grids. In: Proceedings of SuperComputing 2000
30. Thomas P, Salhi S (1997) An investigation into the relationship of heuristic performance with
network-resource characteristics. J Oper Res Soc 48(1):34–43
31. Yang X, Liao X, Lu K, Hu Q, Song J, Su J (2011) The TianHe-1A supercomputer: its hardware and
software. J Comput Sci Technol 26(3):344–351
32. Mo Z, Zhang A, Wittum G (2009) Scalable heuristic algorithms for the parallel execution of data flow
acyclic digraphs. SIAM J Sci Comput 31(5):3626–3642
33. Mo Z, Fu L (2004) Parallel flux sweep algorithm for neutron transport on unstructured grid. J Supercomput 30(1):5–17