
Scheduling Second-Order Computational Load in Master-Slave Paradigm

S. SURESH, Senior Member, IEEE
Nanyang Technological University

CUI RUN
HYOUNG JOONG KIM, Member, IEEE
Korea University

THOMAS G. ROBERTAZZI, Fellow, IEEE
SUNY, Stony Brook

YOUNG-IL KIM
Korea Telecom

Scheduling divisible loads with nonlinear computational complexity is a challenging task because the recursive equations are nonlinear and it is difficult to find closed-form expressions for the processing time and load fractions. In this study we address a divisible load scheduling problem for computational loads having second-order computational complexity in a master-slave paradigm with a nonblocking mode of communication. First, we develop algebraic means of determining the optimal size of the load fractions assigned to the processors in the network, using a mild assumption on the communication-to-computation speed ratio. We use numerical simulation to verify the closeness of the proposed solution. As in earlier works that consider processing loads with first-order computational complexity, we study the conditions for optimal sequence and arrangement using the closed-form expression for the optimal processing time. Our findings reveal that the conditions for optimal sequence and arrangement for second-order computational loads are the same as those for linear computational loads. This scheduling algorithm can be used for aerospace applications such as line detection using the Hough transform in image processing and pattern recognition using hidden Markov models (HMMs).

Manuscript received April 21, 2010; revised September 3 and November 29, 2010; released for publication February 11, 2011.

IEEE Log No. T-AES/48/1/943648.

Refereeing of this contribution was handled by L. Kaplan.

The work of S. Suresh was supported by the NTU-SUG program of Nanyang Technological University. The work of H-J. Kim was supported by the IT R&D program (ITRC), the CTRC program of MCST/KOCCA, Korea University, and the 3DLife project of the National Research Foundation. The work of T. G. Robertazzi was supported by DOE Grant DE-SC0003361.

Authors' addresses: S. Suresh, School of Computer Engineering, Nanyang Technological University, #02b-67, Blk N4, Singapore 637820, Singapore, E-mail: ([email protected]); C. Run and H. J. Kim, CIST, Graduate School of Information Management and Security, Korea University, Seoul 136-701, Korea; T. G. Robertazzi, Department of Electrical and Computer Engineering, State University of New York at Stony Brook, Stony Brook, NY 11794-2350; Y-I. Kim, Korea Telecom, KT Central R&D Center, Seoul 137-792, Korea.

0018-9251/12/$26.00 © 2012 IEEE

I. INTRODUCTION

Researchers are producing huge amounts of data to solve complex and interdisciplinary problems. Efforts to solve such problems are hindered by time-consuming postprocessing on a single workstation. Data-driven computation is an active area of research that addresses the issue of handling huge data sets. The main objective in data-driven computation is to minimize the processing time of computing loads by using a distributed computing system. These computing loads are assumed to be arbitrarily divisible into small fractions that can be processed independently on the processors. This assumption on computing loads is suitable for many practical applications involving data parallelism, such as image processing, pattern recognition, bio-informatics, and data mining.

The main thrust in the parallel processing of divisible loads is to design efficient scheduling algorithms that minimize the total load processing time. The domain of scheduling divisible loads in a multiprocessor system is commonly referred to as divisible load theory (DLT) and is of interest to researchers in the field of scheduling loads in computer networks. The problem of scheduling divisible loads in intelligent sensor networks was first studied in 1988 by Cheng and Robertazzi [13]. There, an intelligent sensor network with master-slave architecture is considered, in which a master processor can measure, compute, and communicate with other intelligent sensors for collaborative computing.

The first mathematical model considered [13] is similar to a linear network of processors. The optimal load allocation strategy presented in [13] was extended to tree networks in [14] and bus networks in [11], [34]. An optimal load allocation for a linear network of processors is obtained under the premise that all processors stop computing at the same time instant [13]. In fact, this condition has been shown to be a necessary and sufficient condition for obtaining the optimal processing time in linear networks [33] by using the concept of processor equivalence. An analytical proof of this assumption for bus networks is presented in [35]. The assumption has also been examined rigorously, and it has been shown to be true only in a restricted sense [8]. The concepts of optimal sequencing and optimal arrangement were introduced in [4, 29], and computation and communication parameters are probed for adaptive distributed processing in [22].

Since 1988, research in the DLT framework [6-8, 11-14, 17-20, 22, 25, 29, 33-37, 41] has used algebraic means to determine the optimal fractions of a load distributed to the processors in a network such that the total load processing time is minimum. A number of scheduling policies have been investigated, including multi-installments [5],

multi-round scheduling [7, 42], multiple loads [17], limited memory [20, 38], simultaneous distribution [24, 32], simultaneous start [36], start-up delay [9, 39], detailed parameterizations and solution time optimization [1], and combinatorial schedule optimization [21]. Loads may be divisible in fact, or as an approximation, as in the case of a large number of relatively small independent tasks [3, 10]. Ten reasons to use the concept of divisible load scheduling theory have been presented in [34]. Results and open problems in divisible load scheduling on single-level tree networks are highlighted in [6]. A complete survey of results on divisible load scheduling algorithms can be found in [8], [34], [36]. The aforementioned research works in the domain of divisible load scheduling in distributed systems consider processing loads requiring linear computational power.

There is an increasing amount of research on real-time modeling and simulation of complex systems such as nuclear modeling, aircraft/spacecraft simulation, biological systems, bio-physical modeling, and genome search. It is well known that many algorithms have nonlinear computational complexity, i.e., the computational time for the given data/load is a nonlinear function of the load size N. A nonlinear cost function was first considered in the literature in [19, 25]. In [25] the computational loads require nonlinear processing time depending on the size of the load fractions. It has been noted that, because of the nonlinear dependency, the speed-up achieved by simultaneous start is superlinear [19, 25]. Finding an algebraic solution for nonlinear computational loads is a challenging issue. In this paper we present an approximate algebraic solution for second-order computational loads.

Image processing and pattern analysis tasks for aerospace applications whose computational complexity is O(N^2) include line detection using the Hough transform [15] and pattern recognition using the 2D hidden Markov model (HMM) [31]. The classical Hough transform was concerned with the identification of lines in an image, but the transform was later extended to identifying positions of arbitrary shapes, most commonly circles or ellipses. The computational complexity for N points is approximately proportional to N^2. When N is large, parallel or distributed processing is desired [23]. A separable 2D HMM for face recognition builds on an assumption of conditional independence in the relationship between adjacent blocks. This allows the state transition to be separated into vertical and horizontal state transitions. This separation of state transitions brings the complexity of the hidden layer of the model from the order of O(N^3 k) down to the order of O(N^2 k), where N is the number of states in the model and k is the total number of observation blocks in the image [23]. In addition, we can also find real-world problems, such as molecular dynamics simulation of macromolecular systems, learning vector quantization neural networks [27], and block tri-diagonalization of real symmetric matrices [2], that require second-order computational complexity.

In this paper we address the scheduling problem for second-order computational loads in a master-slave paradigm with the nonblocking mode of communication. Here a computational load with second-order time complexity arrives at the master processor, which distributes the load fractions one-by-one to the slave processors in the network using the nonblocking mode of communication. Using a mild assumption on the communication-to-computation speed ratio and the minimum granularity of any load fraction, we derive an algebraic solution for the optimal size of each load fraction and the total load processing time. Numerical solutions are compared with the algebraic solution to see whether they conform to each other. The results clearly indicate that the algebraic closed-form expression matches the numerical solution closely. Finally, we study the conditions for the optimal sequence and the optimal arrangement using the closed-form expression. Our findings reveal that the conditions for the optimal sequence/arrangement are the same as those for linear computational loads.

II. MATHEMATICAL FORMULATION

In this section we describe the master-slave model and formulate the problem. We consider a second-order computational load that is arbitrarily divisible. The user submits the computational load to the master processor (p_0). The master processor p_0 is connected to m slave processors (p_1, p_2, …, p_m) through the links (l_1, l_2, …, l_m), as shown in Fig. 1.

Fig. 1. Master-slave network.

The root processor (p_0) divides the processing load into m+1 fractions (α_0, α_1, …, α_m), keeps α_0 for itself, and distributes the remaining m fractions to the child processors (p_1, p_2, …, p_m) in the network. The processing time to compute a load fraction depends linearly on the computing speed of the processor and nonlinearly on the size of the load fraction. In this paper we use the nonblocking mode of communication [28, 40] to distribute the load fractions (α_1, …, α_m) to the slave processors (p_1, p_2, …, p_m). In the nonblocking mode of communication, a child processor starts the computation process while its front-end is still receiving its fraction of the load. The objective of this study is to find the optimal size of the load fractions assigned to the processors in the network such that the total processing time is minimum. The following notation is used in this paper.

α_0: Fraction of the load assigned to the root processor p_0.
α_i: Fraction of the load assigned to the child processor p_i.
A_i: Inverse computing speed of processor p_i.
G_i: Inverse link speed of link l_i.
T(m): Total time taken to process the complete load.
N: Total size of the processing load.
m: Number of slave processors.
n: Order of the processing load's computational complexity.
δ: Minimum granularity of any load fraction.

A. Optimal Load Scheduling

We derive the closed-form expressions for the load fractions and the processing time for a nonlinear processing load in the nonblocking mode of communication model. For the purpose of deriving the closed-form expression, we consider the load distribution sequence p_1, p_2, …, p_m, in that order. The problem is to find the optimal sizes of the load fractions assigned to the processors in the network such that the final processing time is minimal. The load distribution process carried out by the master processor p_0 is illustrated by means of the timing diagram shown in Fig. 2.

Fig. 2. Timing diagram describing the load distribution process in the master-slave network.

As in the case of linear computational loads [8], the processing time for nonlinear computational loads is minimum only when all processors stop computing at the same time. The detailed proof for second-order computational loads is given in the Appendix.

From the timing diagram, we can write the recursive load distribution equations as follows:

    (α_1 N)^n A_1 = (α_0 N)^n A_0    (1)

    (α_{i+1} N)^n A_{i+1} + (α_i N) G_i = (α_i N)^n A_i,  i = 1, 2, …, m−1.    (2)

The above equations reduce to

    (α_1 N)^n = (α_0 N)^n f_1    (3)

    (α_{i+1} N)^n = (α_i N)^n f_{i+1} − (α_i N) β_i f_{i+1},  i = 1, 2, …, m−1    (4)

where

    f_{i+1} = A_i / A_{i+1},  i = 0, 1, 2, …, m−1    (5)

    β_i = G_i / A_i,  i = 1, 2, …, m−1.    (6)

The normalization equation is

    Σ_{i=0}^{m} α_i = 1.    (7)

Equations (3) and (4) can be rewritten as

    α_1 N = α_0 N f_1^{1/n}    (8)

    α_{i+1} N = α_i N f_{i+1}^{1/n} [1 − β_i / (α_i N)^{n−1}]^{1/n},  i = 1, 2, …, m−1.    (9)

The sizes of the load fractions can be obtained by substituting (8) and (9) into (7) and solving the resulting system. Solving these equations exactly is difficult and computationally intensive. In this paper we derive a closed-form expression for the sizes of the load fractions and the processing time by approximating the term inside the root. Finding an approximate closed-form expression for higher powers is difficult; hence, in this paper we consider only the second power (n = 2). If we substitute n = 2 in (8) and (9), the equations reduce to

    α_1 N = α_0 N √f_1    (10)

    α_{i+1} N = α_i N √f_{i+1} √(1 − β_i / (α_i N)),  i = 1, 2, …, m−1.    (11)
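Although closed forms are the goal here, the exact system (10)-(11) together with (7) can also be solved numerically; this is the baseline used for comparison in the numerical examples below. The following is a minimal sketch (ours, not from the paper) that solves the underlying recursion (1)-(2) with n = 2 using scipy.optimize.fsolve as a stand-in for the nonlinear least-squares solver of [16]; the parameter values are those of Table I in Numerical Example 1 below.

    # A sketch only: exact fractions for n = 2 via a general-purpose root finder.
    import numpy as np
    from scipy.optimize import fsolve

    def residuals(alpha, N, A, G):
        """Residuals of (1)-(2) with n = 2, plus the normalization (7).
        alpha: candidate fractions (alpha_0 ... alpha_m)
        A:     inverse computing speeds (A_0 ... A_m)
        G:     inverse link speeds, G[i] = G_i (G[0] is an unused placeholder)
        """
        m = len(alpha) - 1
        res = [(alpha[1] * N) ** 2 * A[1] - (alpha[0] * N) ** 2 * A[0]]  # (1)
        for i in range(1, m):                                            # (2)
            res.append((alpha[i + 1] * N) ** 2 * A[i + 1]
                       + (alpha[i] * N) * G[i]
                       - (alpha[i] * N) ** 2 * A[i])
        res.append(sum(alpha) - 1.0)                                     # (7)
        return res

    N, A, G = 100.0, [900.0, 800.0, 120.0, 100.0], [0.0, 20.0, 1.0, 0.85]
    alpha = fsolve(residuals, np.full(4, 0.25), args=(N, A, G))
    print(alpha, (alpha[0] * N) ** 2 * A[0])  # fractions and T = (alpha_0 N)^2 A_0

With these parameters the solver should recover fractions near (0.1283, 0.1361, 0.3511, 0.3845) and a processing time near 148,170 units, the analytical values quoted in Numerical Example 1.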

Assumption: We assume that the ratio of communication time to computation time (β_i) is very small, as in most practical distributed systems, and that the size of the load fraction α_i N assigned to a child processor is larger than β_i.

Using the above assumption, we expand the term √(1 − β_i/(α_i N)) in (11) in a Taylor series as

    √(1 − β_i / (α_i N)) = 1 − β_i / (2 α_i N) + O( (β_i / (α_i N))^2 ).    (12)

Note that the communication-to-computation ratio (β_i) is less than 1 and the load fraction assigned to a child processor is greater than the minimum granularity of the processing load (α_i N > δ). Hence the higher-order terms in β_i/(α_i N) are small and are neglected.

In this paper we use the first-order approximation of the square root to derive the closed-form expression:

    √(1 − β_i / (α_i N)) ≈ 1 − β_i / (2 α_i N).    (13)

The approximation holds only when β_i/(α_i N) is much smaller than one; as β_i/(α_i N) moves closer to β_i/δ, the approximation becomes worse. By substituting this approximation of the square root, (11) simplifies to

    α_{i+1} N ≈ α_i N √f_{i+1} − β_i √f_{i+1} / 2,  i = 1, 2, …, m−1.    (14)
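As an illustrative check of (13) (an added remark, not in the original): for β_i/(α_i N) = 0.1 the exact value is √0.9 ≈ 0.94868 while the approximation gives 1 − 0.05 = 0.95, an error of about 0.14%; for β_i/(α_i N) = 0.5 the error grows to roughly 6%, which shows why the load fractions must stay well above β_i.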

By substituting (14) and (10) into the normalization equation (7), we can derive the closed-form expression for the load fraction α_0 assigned to the root processor p_0:

    α_0 = (N + x(m)) / (N y(m))    (15)

where

    x(m) = (1/2) Σ_{i=1}^{m−1} β_i [ Σ_{j=i+1}^{m} Π_{k=i+1}^{j} √f_k ]    (16)

    y(m) = 1 + Σ_{i=1}^{m} Π_{j=1}^{i} √f_j.    (17)

From (10) and (14), the load fraction α_i can be expressed in terms of the load fraction α_0 as

    α_i N ≈ α_0 N √(f_1 f_2 ⋯ f_i) − (1/2) Σ_{j=1}^{i−1} β_j Π_{k=j+1}^{i} √f_k,  i = 1, 2, …, m.    (18)

By substituting the closed-form expression for the load fraction α_0 into (18), one can easily calculate the size of the load fraction assigned to any processor in the network:

    α_i = (1/N) [ ((N + x(m)) / y(m)) √(f_1 f_2 ⋯ f_i) − (1/2) Σ_{j=1}^{i−1} β_j Π_{k=j+1}^{i} √f_k ],  i = 1, 2, …, m.    (19)

Now we derive the closed-form expression for the total load processing time. From the timing diagram shown in Fig. 2, the total load processing time T(m) is given by

    T(m) = (α_0 N)^2 A_0 = [ (N + x(m)) / y(m) ]^2 A_0.    (20)
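The closed form (15)-(20) is straightforward to evaluate. The following minimal Python sketch (ours, not from the paper) computes the load fractions and the processing time for given parameter lists; it uses the telescoping identities √(f_1 ⋯ f_i) = √(A_0/A_i) and √(f_{j+1} ⋯ f_i) = √(A_j/A_i).

    # A sketch only: closed-form fractions (19) and processing time (20), n = 2.
    import math

    def closed_form(N, A, G):
        """A: inverse computing speeds [A_0..A_m]; G: inverse link speeds [G_1..G_m]."""
        m = len(A) - 1
        k = [math.sqrt(A[0] / A[i]) for i in range(1, m + 1)]    # k_i = sqrt(f_1...f_i)
        beta = [G[i - 1] / A[i] for i in range(1, m)]            # beta_i = G_i / A_i, (6)
        # r_i = (1/2) sum_{j=1}^{i-1} beta_j sqrt(f_{j+1}...f_i); cf. (18)
        r = [0.5 * sum(beta[j - 1] * math.sqrt(A[j] / A[i])
                       for j in range(1, i)) for i in range(1, m + 1)]
        x = sum(r)                                               # x(m), (16)
        y = 1.0 + sum(k)                                         # y(m), (17)
        a0 = (N + x) / (N * y)                                   # (15)
        alpha = [a0] + [a0 * k[i] - r[i] / N for i in range(m)]  # (18)-(19)
        T = (a0 * N) ** 2 * A[0]                                 # (20)
        return alpha, T

For the parameters of Table I below (A = [900, 800, 120, 100], G = [20, 1, 0.85], N = 100) this sketch returns fractions of roughly (0.128, 0.136, 0.351, 0.385) and T ≈ 1.48 × 10^5, consistent with the values quoted in Numerical Example 1 up to rounding.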

One should remember that the above closed-form expression for the processing time is derived under the assumption that the communication time is less than the computation time. When the communication time is greater than the computation time (β_i > 1), simultaneous processing is not possible: the processors will have alternating work and wait periods. For this case, finding a closed-form expression is not straightforward; it can, however, be handled easily using the equivalent processor concept explained in [28], [40].

The advantage of the closed-form expression is that we can directly derive conditions for the optimal sequence of load distribution and the optimal arrangement of processors. Before analyzing the theoretical results, we present a numerical example to illustrate the characteristics of nonlinear DLT with the nonblocking mode of communication.

B. Numerical Example 1

Consider the task of finding ellipses in a 512 × 512 image. Let us assume that the ellipses are oriented along the principal axes; hence we need four parameters (k = 4), two for the center of the ellipse and two for the radii, to describe an ellipse. The computational complexity of identifying the ellipse is O(N^{k−2}), which is O(N^2). Here N is the image space (N = 262144). For simplicity we consider a small 10 × 10 region of interest (N = 100) in our example. The root processor divides the image into small fractions and distributes them to the child processors. Each child processor computes the Hough space for a given resolution and generates the accumulator array for its fraction of the image region. The size of the accumulator array depends on the resolution and does not depend on the image size. Finally, the root processor collects all the arrays and identifies the candidate points for ellipses. For simplicity we neglect the result collection time from each processor (the resolution is much smaller than the image size).

Consider a single-level tree network with three processors (m = 3). The time to compute the accumulator array for one pixel (processor parameter) and the time to communicate one pixel through the link (link parameter) are given in Table I. The total load size N is assumed to be 100 units.

TABLE I
Processor and Communication Link Parameters Used in Numerical Example 1

    Parameter   P0    P1    P2    P3
    A           900   800   120   100
    G           -     20    1     0.85

Using the closed-form expressions, the load fractions assigned to the processors are computed as α_0 = 0.12840, α_1 = 0.13619, α_2 = 0.35132, and α_3 = 0.38480. The corresponding total load processing time is 148,384 units of time. The total load processing time obtained by analytically solving the nonlinear recursive equations using a nonlinear least-squares solver [16] is 148,170 units of time. The load fractions obtained using the analytical solution are α_0 = 0.128309, α_1 = 0.13609, α_2 = 0.351068, and α_3 = 0.38453. From these results we can see that the closed-form expressions closely approximate the actual solution.

The processing times obtained using the closed-form expression and the actual (analytical) solution are given in Table II. From the table we can see that the processing time obtained using the approximate closed-form solution matches the analytical solution well. The difference between the solutions depends on the communication-to-computation ratio (β_i) and the size of the load fraction (α_i N). The error is small when β_i/(α_i N) is close to zero and becomes worse as β_i/(α_i N) moves closer to β_i/δ.

TABLE II
Total Load Processing Time Obtained Using the Analytical Solution of the Recursive Equations and the Approximate Closed-Form Expression

    # of Child Processors   Approximate Solution   Analytical Solution
    1                       2,119,482              2,119,482
    2                       391,247                390,995
    3                       148,384                148,170

The main objective of deriving the closed-form expression is to study the behavior of second-order load scheduling problems. In the following sections we show that the approximate closed-form solution can be used directly to find the conditions for the optimal arrangement and the optimal sequence of load distribution.

C. Homogeneous System

As a special case, for a homogeneous system (A_i = A and G_i = G) the load fraction assigned to the root processor (α_0) is obtained by substituting f_i = 1 and β_i = β in (15):

    α_0 = (4N + mβ(m−1)) / (4N(m+1)).    (21)

The load fraction assigned to any child processor p_i is

    α_i = (4N + mβ(m−1)) / (4N(m+1)) − (i−1)β / (2N),  i = 1, 2, …, m.    (22)

The total load processing time for the homogeneous system is

    T(m) = [ (4N + mβ(m−1)) / (4(m+1)) ]^2 A.    (23)

In the homogeneous case, if the communication-to-computation ratio tends to zero, the load fractions assigned to the processors converge to equal fractions, i.e.,

    α_0 = lim_{β→0} (4N + mβ(m−1)) / (4N(m+1)) = 1/(m+1)    (24)

    α_i = lim_{β→0} [ (4N + mβ(m−1)) / (4N(m+1)) − (i−1)β / (2N) ] = 1/(m+1),  i = 1, 2, …, m    (25)

and the total load processing time converges to

    T(m) = [ N/(m+1) ]^2 A.    (26)

From (26) we can see that the total processing time decreases superlinearly as the number of processors increases.
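As a one-line check of this superlinearity (an added remark, not in the original): processing the entire load on the root processor alone takes N^2 A units, so in the β → 0 limit the speedup obtained from m+1 processors is

    N^2 A / ( [N/(m+1)]^2 A ) = (m+1)^2,

which exceeds the linear bound of m+1; this is the superlinear speed-up for nonlinear loads noted in [19, 25].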

III. OPTIMAL SEQUENCE OF LOAD DISTRIBUTION

In the linear DLT, the closed-form expression is used to find the condition for the optimal sequence of load distribution. Similarly, one needs the closed-form expression to study the behavior of the nonlinear divisible load problem. In this section we present the condition for the optimal sequence of load distribution obtained from the approximate closed-form expression. First we present an example to understand the effect of changing the sequence of load distribution, and later we generalize the result. For this purpose we consider a three-processor (m = 3) network. From (20) we can see that the processing time is a function of the load fraction α_0 assigned to the processor p_0. Hence it is sufficient to analyze the behavior of α_0 instead of the processing time T(m).

Case A: The sequence of load distribution is (p_1, p_2, p_3), i.e., the root processor p_0 first sends the load fraction to processor p_1, next to processor p_2, and last to processor p_3. Using the closed-form expression, we can write α_0 as

    α_0 N = [ N + β_1 (√f_2 + √(f_2 f_3)) / 2 + β_2 √f_3 / 2 ] / [ 1 + √f_1 + √(f_1 f_2) + √(f_1 f_2 f_3) ].    (27)

The above equation can be expressed in terms of the system parameters (A_i, G_i) as

    α_0 N = [ 2N √(A_1 A_2 A_3) + G_1 (√A_2 + √A_3) + G_2 √A_1 ] / [ 2 ( √(A_1 A_2 A_3) + √(A_0 A_2 A_3) + √(A_0 A_1 A_3) + √(A_0 A_1 A_2) ) ].    (28)

Case B: Now we change the load distribution sequence to (p_1, p_3, p_2), i.e., the root processor p_0 first sends the load fraction to processor p_1, next to processor p_3, and finally to processor p_2. The load fraction (α_0′) can be obtained by interchanging (A_2, G_2) and (A_3, G_3) in the earlier expression:

    α_0′ N = [ 2N √(A_1 A_2 A_3) + G_1 (√A_2 + √A_3) + G_3 √A_1 ] / [ 2 ( √(A_1 A_2 A_3) + √(A_0 A_2 A_3) + √(A_0 A_1 A_3) + √(A_0 A_1 A_2) ) ].    (29)

Now we have to find the condition for α_0 ≤ α_0′. By subtracting (29) from (28), we get

    α_0 N − α_0′ N = √A_1 (G_2 − G_3) / [ 2 ( √(A_1 A_2 A_3) + √(A_0 A_2 A_3) + √(A_0 A_1 A_3) + √(A_0 A_1 A_2) ) ].    (30)

From the above equation we can say that the total load processing time is minimal for the load distribution sequence (p_1, p_2, p_3) if and only if G_2 is less than G_3. From the result obtained for the three-processor network, we can generalize as follows.

OPTIMAL SEQUENCING THEOREM: Given an (m+1)-processor single-level tree network with the nonblocking mode of communication, the optimal sequence of load distribution is obtained when the root processor distributes the load fractions in ascending order of the communication speed parameter G_i of the links.

PROOF: For m child processors, consider the case when the root processor p_0 distributes the load fractions to the child processors in the sequence (p_1, p_2, …, p_{i−1}, p_i, p_{i+1}, …, p_m). The load fraction α_0 assigned to the root processor for this sequence is

    α_0 = (N + x(m)) / (N y(m)).    (31)

Consider another sequence of load distribution in which the root processor distributes the load fractions to the child processors in the sequence (p_1, p_2, …, p_{i−1}, p_{i+1}, p_i, …, p_m). The load fraction assigned to the root processor in this sequence is

    α_0′ = (N + x′(m)) / (N y′(m)).    (32)

The load fraction for the new sequence can be obtained by exchanging (G_i, A_i) and (G_{i+1}, A_{i+1}) in (31). The interchange affects only the terms f_i, f_{i+1}, f_{i+2}, β_i, and β_{i+1}, and does not affect the other terms. Note that because of this interchange, y(m) and y′(m) do not change. Now we find the condition for α_0 ≤ α_0′, which is the same as x(m) ≤ x′(m). The terms x(m) and x′(m) are functions of the f and β:

    x(m) = (1/2) { β_1 [ √f_2 + √(f_2 f_3) + ⋯ + √(f_2 f_3 ⋯ f_m) ] + ⋯ + β_i [ √f_{i+1} + √(f_{i+1} f_{i+2}) + ⋯ + √(f_{i+1} f_{i+2} ⋯ f_m) ] + ⋯ + β_{m−1} √f_m }.    (33)

The difference x(m) − x′(m) is given by

    x(m) − x′(m) = (G_i − G_{i+1}) / (2 √(A_i A_{i+1})).    (34)

Then

    α_0 N − α_0′ N = (G_i − G_{i+1}) / (2 y(m) √(A_i A_{i+1})).    (35)

Here, note that α_0 N ≤ α_0′ N only when G_i ≤ G_{i+1}. By recursively applying this condition, we obtain the optimal load distribution sequence, which satisfies G_1 ≤ G_2 ≤ ⋯ ≤ G_m. This proves the theorem.

The result obtained from the optimal sequencing theorem is similar to the optimal sequence of load distribution presented for the linear case [8, 29].

A. Numerical Example 2

In this example we consider the same parameters used in Numerical Example 1. In the previous example we used the load distribution sequence (p_1, p_2, p_3), and the total load processing time was 148,384 units. By applying the optimal sequencing theorem, the optimal sequence of load distribution is (p_3, p_2, p_1). The load fractions assigned to the processors in the network are α_0 = 0.128236, α_1 = 0.136015, α_2 = 0.351175, and α_3 = 0.38465, and the total load processing time is 148,000 units. From this result we can see that the total processing time for the optimal sequence is less than that for the previous sequence.
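The sequencing theorem can also be checked by exhaustive search. The sketch below (ours, not from the paper) reuses the closed_form() helper sketched in Section II and evaluates every distribution order of the Table I children; the minimizer is the ascending-G order (p_3, p_2, p_1), in agreement with the theorem and with this example.

    # A sketch only: brute-force verification of the optimal sequencing theorem.
    from itertools import permutations

    def best_sequence(N, A0, children):
        """children: list of (A_i, G_i) pairs; each processor keeps its own link."""
        results = []
        for perm in permutations(children):
            A = [A0] + [a for a, g in perm]   # inverse speeds in distribution order
            G = [g for a, g in perm]          # matching link parameters
            _, T = closed_form(N, A, G)
            results.append((T, perm))
        return min(results)

    T, order = best_sequence(100.0, 900.0, [(800.0, 20.0), (120.0, 1.0), (100.0, 0.85)])
    print(order)  # expected: ((100.0, 0.85), (120.0, 1.0), (800.0, 20.0))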

IV. OPTIMAL ARRANGEMENT OF PROCESSORS

In this section we derive the condition for the optimal arrangement of processors in the nonlinear divisible load problem using our closed-form expressions. First we present an example to understand the effect of changing the processor arrangement, and later we generalize the result. For this purpose we consider a three-processor (m = 3) network. Here the sequence of load distribution is fixed as (p_1, p_2, p_3).

Case A: Processor p_1 is connected to link l_1, processor p_2 to link l_2, and processor p_3 to link l_3. Using our closed-form expression, we can write α_0 as in (28).

Case B: Now we change the arrangement of processors in the network: processor p_1 is connected to link l_2 and processor p_2 is connected to link l_1. The load fraction (α_0′) can be obtained by interchanging A_1 and A_2 in (28):

    α_0′ N = [ 2N √(A_1 A_2 A_3) + G_1 (√A_1 + √A_3) + G_2 √A_2 ] / [ 2 ( √(A_1 A_2 A_3) + √(A_0 A_2 A_3) + √(A_0 A_1 A_3) + √(A_0 A_1 A_2) ) ].    (36)

Now we have to find the condition for α_0 ≤ α_0′. By subtracting (36) from (28), we get

    α_0 N − α_0′ N = (√A_1 − √A_2)(G_2 − G_1) / [ 2 ( √(A_1 A_2 A_3) + √(A_0 A_2 A_3) + √(A_0 A_1 A_3) + √(A_0 A_1 A_2) ) ].    (37)

Recall from the previous section that the processing time is minimal only if the sequence of load distribution follows ascending order of the communication speed parameter, i.e., G_1 ≤ G_2. Hence, from the above equation, exchanging the processors lowers the processing time if and only if the processing speed parameter A_2 is less than A_1. We now generalize this result as follows.

OPTIMAL ARRANGEMENT THEOREM: Given an (m+1)-processor single-level tree network with the optimal sequence of load distribution, the total load processing time is minimum if the processors are connected to the links in ascending order of the processor speed parameter A_i.

PROOF: For m child processors, consider the case when the root processor p_0 distributes the load fractions to the child processors in the sequence (p_1, p_2, …, p_{i−1}, p_i, p_{i+1}, …, p_m). Here the network arrangement is (p_1, l_1), (p_2, l_2), …, (p_i, l_i), (p_{i+1}, l_{i+1}), …, (p_m, l_m). The load fraction α_0 assigned to the root processor in this arrangement is given by (31).

Consider another arrangement in which processor p_i is connected to link l_{i+1} and processor p_{i+1} is connected to link l_i, i.e., the network arrangement is (p_1, l_1), (p_2, l_2), …, (p_{i+1}, l_i), (p_i, l_{i+1}), …, (p_m, l_m). The load fraction assigned to the root processor in this arrangement is given by (32).

The load fraction for the new arrangement can be obtained by exchanging A_i and A_{i+1} in (31). The interchange affects only the terms f_i, f_{i+1}, f_{i+2}, β_i, and β_{i+1}, and does not affect the other terms. Note that because of this interchange, y(m) and y′(m) do not change. Now we find the condition for α_0 ≤ α_0′, which is the same as x(m) ≤ x′(m). The terms x(m) and x′(m) are functions of the f and β. The difference x(m) − x′(m) is given by

    x(m) − x′(m) = (G_{i+1} − G_i)(√A_i − √A_{i+1}) [ Σ_{j=i+2}^{m} Π_{k=i+2}^{j} √f_k ] / (2 √(A_i A_{i+1})).    (38)

Then

    α_0 N − α_0′ N = (G_{i+1} − G_i)(√A_i − √A_{i+1}) [ Σ_{j=i+2}^{m} Π_{k=i+2}^{j} √f_k ] / (2 y(m) √(A_i A_{i+1})).    (39)

Here, note that for the optimal sequence (G_i ≤ G_{i+1}), α_0 N ≤ α_0′ N only when A_i ≤ A_{i+1}. By recursively applying this condition, we obtain the optimal arrangement, which satisfies A_1 ≤ A_2 ≤ ⋯ ≤ A_m. This proves the theorem.

The above analysis does not cover the speed condition on the root processor, which we prove now.

Consider a two-processor network with the arrangement (p_1, l_1) and (p_2, l_2). The processing time for this arrangement is

    T = { [ 2N √(A_1 A_2) + G_1 ] / [ 2 ( √(A_1 A_2) + √(A_0 A_1) + √(A_0 A_2) ) ] }^2 A_0.    (40)

Now assume that processor p_1 distributes the load fractions instead of processor p_0. Then we have to consider another arrangement, (p_0, l_1) and (p_2, l_2). The total load processing time for this arrangement is

    T′ = { [ 2N √(A_0 A_2) + G_1 ] / [ 2 ( √(A_0 A_2) + √(A_1 A_0) + √(A_1 A_2) ) ] }^2 A_1.    (41)

The difference T − T′ is computed as

    T − T′ = G_1 [ 4N √(A_0 A_1 A_2) + G_1 (√A_0 + √A_1) ] (√A_0 − √A_1) / [ 4 ( √(A_0 A_1) + √(A_0 A_2) + √(A_1 A_2) )^2 ].    (42)

Hence T ≤ T′ only when A_0 ≤ A_1. From this we can say that the load-originating processor should be the fastest. Note that to find the speed condition of the root processor we have to use the processing time expression; for the speed condition of the child processors it is sufficient to consider the α_0 expression rather than the processing time expression.

A. Numerical Example 3

In this example we consider the same parameters used in Numerical Example 1, where we used the load distribution sequence (p_1, p_2, p_3) and the total load processing time was 148,384 units. By applying the optimal arrangement theorem, the optimal arrangement is (p_2, l_3), (p_1, l_2), (p_0, l_1), and the load-originating processor is now p_3. The total load processing time is 147,975 units. From this result we can see that the total processing time with the optimal sequence and arrangement is less than the total load processing time for the other sequences.
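The arrangement result, too, can be checked exhaustively. The sketch below (ours, not from the paper) again reuses closed_form(): it fixes the links in ascending order of G (per the sequencing theorem) and tries every choice of load-originating processor and every attachment of the remaining processors to the links.

    # A sketch only: brute-force verification of the optimal arrangement theorem.
    from itertools import permutations

    def best_arrangement(N, A_all, G_links):
        """A_all: inverse speeds of all m+1 processors; G_links: the m link speeds."""
        G = sorted(G_links)                       # serve links in ascending G
        best = None
        for perm in permutations(A_all):
            root, rest = perm[0], list(perm[1:])  # perm[0] originates the load
            _, T = closed_form(N, [root] + rest, G)
            if best is None or T < best[0]:
                best = (T, root, rest)
        return best

    T, root, rest = best_arrangement(100.0, [900.0, 800.0, 120.0, 100.0], [20.0, 1.0, 0.85])
    print(root, rest)  # expected: 100.0 [120.0, 800.0, 900.0], i.e., the fastest
                       # processor as root, the rest in ascending A on ascending-G links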

V. CONCLUSIONS

In this paper we have dealt with parallel processing of second-order computational loads in a single-level tree network with the nonblocking mode of communication. With a mild assumption on the communication-to-computation speed ratio, we have shown how to derive a closed-form expression for the optimal load partition such that the total load processing time is minimum. Numerical examples were presented to illustrate the closeness of the solution. The main advantage of the closed-form expression is in the study of the characteristics of the system. Using the closed-form expressions, we derived the conditions for optimal sequencing and arrangement of processors. These results can be used in intelligent scheduling of divisible second-order processing loads.

APPENDIX

For linear processing loads it has been proved that the processing time is minimum only when all processors stop computing at the same time [8]. In this Appendix we prove that this holds for nonlinear computational loads as well. First we present a motivational example; then we formally state the theorem and prove it.

A. Numerical Example A1

Let us consider a three-processor (m = 3) system with the following parameters: A_0 = 1, A_1 = 1.1, A_2 = 1.5, A_3 = 2, G_1 = 1, G_2 = 1.5, and G_3 = 2. The total size of the processing load is 100. First we assume that the processors participating in the computation stop computing at the same time. Using our closed-form expressions for the load fractions, we determine the sizes of the load fractions assigned to the processors: α_0 = 0.29096, α_1 = 0.27742, α_2 = 0.23365, and α_3 = 0.19797. The timing diagram describing the communication and computation time for each processor is shown in Fig. 3.

Fig. 3. Timing diagram for the load distribution process (m = 3).

From the timing diagram shown in Fig. 3, the finishing times for processors p_0, p_1, p_2, and p_3 are T_0 = 846.577, T_1 = 846.580, T_2 = 846.627, and T_3 = 846.631. The total load processing time is the maximum of T_0, T_1, T_2, and T_3, which is 846.631.

There is a small deviation in the finishing times due to the approximation in the derivation of the load fractions.

Since the child processor p_2 can compute faster than p_3, we now reassign some load from p_3 to p_2, so that the load fractions become α_0 = 0.29096, α_1 = 0.27742, α_2 = 0.24365, and α_3 = 0.18797. For this load distribution the timing diagram is shown in Fig. 4.

Fig. 4. Timing diagram for the load distribution process (m = 3) with a changed load fraction assigned to p_2.

From the figure, the finishing times for processors p_0, p_1, p_2, and p_3 are T_0 = 846.577, T_1 = 846.580, T_2 = 918.19, and T_3 = 770.919. From this result we can see that the child processor p_2 requires more time to complete its load processing, whereas the others finish their computation earlier. The total load processing time is the maximum of T_0, T_1, T_2, and T_3, which is 918.19. From this result we can say that the total processing time is minimum if all participating processors stop computing at the same time.
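The two timing diagrams can be reproduced with a few lines of arithmetic. The sketch below (ours, not from the paper) evaluates the finishing times T_i = Σ_{j<i} (α_j N) G_j + (α_i N)^2 A_i (formalized as (66) below) for both load distributions of this example.

    # A sketch only: finishing times for given fractions (cf. Fig. 3 and Fig. 4).
    def finish_times(alpha, N, A, G):
        """A: [A_0..A_m]; G: [G_1..G_m]; alpha: fractions summing to 1."""
        comm = 0.0                                  # accumulated communication delay
        times = [(alpha[0] * N) ** 2 * A[0]]        # root computes immediately
        for i in range(1, len(alpha)):
            times.append(comm + (alpha[i] * N) ** 2 * A[i])
            comm += alpha[i] * N * G[i - 1]         # delay seen by later processors
        return times

    A, G, N = [1.0, 1.1, 1.5, 2.0], [1.0, 1.5, 2.0], 100.0
    balanced = [0.29096, 0.27742, 0.23365, 0.19797]
    shifted  = [0.29096, 0.27742, 0.24365, 0.18797]   # 0.01 moved from p_3 to p_2
    print(finish_times(balanced, N, A, G))  # all four near 846.6
    print(finish_times(shifted,  N, A, G))  # T_2 rises to about 918; makespan worsens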

Now we formally state the theorem for the nonlinear case and prove it.

THEOREM I: If all nodes of the nonlinear computing model receiving nonzero load fractions stop computing at the same time, then the processing time T is minimum.

PROOF: Let α = {α_0, α_1, …, α_m} be the load fractions assigned to the processors p_0, p_1, …, p_m, respectively, and let T_0, T_1, …, T_m be the corresponding finishing times.

Case A: We consider the finishing times of processors p_0 and p_1. The remaining finishing times are assumed to be arbitrary, and the load fractions assigned to the other processors are assumed to be arbitrary constants:

    C_0 = Σ_{i=2}^{m} α_i.    (43)

Here C_0 is a constant. Then

    α_1 = 1 − α_0 − Σ_{i=2}^{m} α_i = (1 − C_0) − α_0,  0 ≤ α_0 ≤ 1 − C_0.    (44)

From the timing diagram given in Fig. 3, we can write the finishing times of processors p_0 and p_1 as

    T_0 = (α_0 N)^2 A_0,  T_1 = (α_1 N)^2 A_1.    (45)

By substituting α_1 into T_1, we get

    T_1 = (1 − C_0 − α_0)^2 N^2 A_1.    (46)

The optimal processing time is the one that minimizes max{T_0, T_1}. The variation of the finishing times T_0 and T_1 for different values of α_0 is shown in Fig. 5.

Fig. 5. Variation of finishing times for processors p_0 and p_1.

From Fig. 5 we can see that the processing time is minimum when the finishing times of processors p_0 and p_1 are the same, i.e., T_0 = T_1. At this point we can express α_1 as

    α_1 = α_0 √(A_0 / A_1) = k_1 α_0.    (47)

Case B: Now we examine the case with three processors (p_0, p_1, p_2) whose finishing times are T_0, T_1, and T_2, respectively. Here again we assume that the load fractions assigned to the other processors in the network are arbitrary constants:

    C_1 = Σ_{i=3}^{m} α_i.    (48)

Now the load fraction assigned to the child processor p_2 can be expressed in terms of the load fractions α_0 and α_1 as

    α_2 = 1 − (α_3 + α_4 + ⋯ + α_m) − α_0 − α_1.    (49)

Using (47) and (48), we can express α_2 in terms of α_0 as

    α_2 = 1 − C_1 − (1 + k_1) α_0,  0 ≤ α_0 ≤ (1 − C_1) / (1 + k_1)    (50)

where k_1 = √f_1. From the timing diagram given in Fig. 4, the finishing times T_0 and T_2 are expressed as

    T_0 = (α_0 N)^2 A_0    (51)

    T_2 = (α_1 N) G_1 + (α_2 N)^2 A_2.    (52)

The finishing time T_2 for processor p_2 can be expressed in terms of α_0 as

    T_2 = (k_1 α_0 N) G_1 + ([1 − C_1 − (1 + k_1) α_0] N)^2 A_2.    (53)

Now we plot the finishing times T_0 and T_2 with respect to the load fraction α_0, as shown in Fig. 6.

Fig. 6. Variation of finishing times with respect to load fraction α_0.

When the load fraction α_0 equals the value (1 − C_1)/(1 + k_1), the load fraction α_2 assigned to processor p_2 is zero, and hence the computation term of T_2 vanishes. From the figure we can observe that the finishing times meet at one point, which is the minimum processing time point. From the previous case, the finishing time T_1 is the same as T_0; hence at the minimum point T_2 = T_1 = T_0.

Using this condition, we can express the load fraction α_2 in terms of the load fraction α_0 as given by (18):

    α_2 N = α_0 N √(f_1 f_2) − β_1 √f_2 / 2 = k_2 α_0 N − r_2    (54)

where k_2 = √(f_1 f_2) and r_2 = β_1 √f_2 / 2.

Case C: Now we examine four processors (p_0, p_1, p_2, p_3) whose finishing times are T_0, T_1, T_2, and T_3, respectively. Here again we assume that the load fractions assigned to the other processors in the network are arbitrary constants:

    C_2 = Σ_{i=4}^{m} α_i.    (55)

Now the load fraction assigned to the child processor p_3 can be expressed in terms of the load fractions α_0, α_1, and α_2 as

    α_3 = 1 − (α_4 + α_5 + ⋯ + α_m) − α_0 − α_1 − α_2.    (56)

Using (55), (54), and (47), we can express α_3 in terms of α_0 as

    α_3 = 1 − C_2 + r_2/N − (1 + k_1 + k_2) α_0,  0 ≤ α_0 ≤ (1 − C_2 + r_2/N) / (1 + k_1 + k_2).    (57)

From the timing diagram given in Fig. 4, the finishing time T_3 is expressed as

    T_3 = (α_1 N) G_1 + (α_2 N) G_2 + (α_3 N)^2 A_3.    (58)

The finishing time T_3 for processor p_3 can be expressed in terms of α_0 as

    T_3 = (k_1 α_0 N) G_1 + (k_2 α_0 N − r_2) G_2 + ([1 − C_2 + r_2/N − (1 + k_1 + k_2) α_0] N)^2 A_3.    (59)

Now we plot the finishing times T_0 and T_3, as shown in Fig. 7.

Fig. 7. Variation of finishing times with respect to load fraction α_0.

When the load fraction α_0 equals the value (1 − C_2 + r_2/N)/(1 + k_1 + k_2), the load fraction α_3 assigned to processor p_3 is zero, and hence the computation term of T_3 vanishes. From the figure we can observe that the finishing times meet at one point, which is the minimum processing time point. From the previous cases, the finishing times T_1 and T_2 are the same as T_0; hence at the minimum point T_3 = T_2 = T_1 = T_0.

Using this condition, we can express the load fraction α_3 in terms of the load fraction α_0 as given by (18):

    α_3 N = α_0 N √(f_1 f_2 f_3) − β_1 √(f_2 f_3) / 2 − β_2 √f_3 / 2 = k_3 α_0 N − r_3    (60)

where k_3 = √(f_1 f_2 f_3) and r_3 = β_1 √(f_2 f_3)/2 + β_2 √f_3/2.

Case D: Based on the results of the previous cases, we can extend the proof to show that the minimum processing time is achieved when T_0 = T_1 = ⋯ = T_i for i+1 processors (p_0, p_1, …, p_i). Let

    C_i = Σ_{j=i+1}^{m} α_j.    (61)

Then

    α_i = 1 − C_i − Σ_{j=0}^{i−1} α_j.    (62)

From the results of the previous cases, we can express α_j in terms of α_0 as

    α_j N = k_j α_0 N − r_j,  j = 1, 2, …, i−1    (63)

where k_j = √(Π_{k=1}^{j} f_k) and r_j = Σ_{k=1}^{j−1} β_k √(Π_{l=k+1}^{j} f_l) / 2. Note that r_1 = 0.

Now we can express α_i in terms of α_0 as

    α_i = 1 − C_i + Σ_{k=1}^{i−1} r_k/N − (1 + k_1 + ⋯ + k_{i−1}) α_0.    (64)

From the above equation, the feasible values of α_0 are

    0 ≤ α_0 ≤ [ 1 − C_i + Σ_{k=1}^{i−1} r_k/N ] / (1 + k_1 + ⋯ + k_{i−1}) = C.    (65)

From the timing diagram given in Fig. 2, the finishing time T_i for processor p_i can be expressed as

    T_i = (α_1 N) G_1 + ⋯ + (α_{i−1} N) G_{i−1} + (α_i N)^2 A_i.    (66)

When α_0 = C, the load fraction α_i assigned to processor p_i is zero, and hence its computation term vanishes. Similarly, when α_0 = 0, the load fraction assigned to processor p_i is 1 − C_i, and the finishing time contains the computation term (1 − C_i)^2 N^2 A_i. From this we can conclude that there exists a minimum processing time at a crossover point where T_0 = T_1 = ⋯ = T_i. Using mathematical induction, one can generalize that the processing time is minimum if all participating processors stop computing at the same time, i.e., T_0 = T_1 = ⋯ = T_m.

REFERENCES

[1] Adler, M., et al. Optimal sharing of bags of tasks in heterogeneous clusters. In Proceedings of the Annual ACM Symposium on Parallel Algorithms and Architectures, San Diego, CA, 2003, 1-10.
[2] Bai, Y. and Ward, R. C. Parallel block tridiagonalization of real symmetric matrices. Journal of Parallel and Distributed Computing, 68 (2008), 703-715.
[3] Beaumont, O., et al. Bandwidth-centric allocation of independent tasks on heterogeneous platforms. In Proceedings of the International Parallel and Distributed Processing Symposium, Ft. Lauderdale, FL, 2002, 67-72.
[4] Bharadwaj, V., Ghose, D., and Mani, V. Optimal sequencing and arrangement in distributed single-level tree networks with communication delays. IEEE Transactions on Parallel and Distributed Systems, 5, 9 (1994), 968-976.
[5] Bharadwaj, V., Ghose, D., and Mani, V. Multi-installment load distribution in tree networks with delay. IEEE Transactions on Aerospace and Electronic Systems, 31 (1995), 555-567.
[6] Beaumont, O., et al. Scheduling divisible loads on star and tree networks: Results and open problems. IEEE Transactions on Parallel and Distributed Systems, 16 (2005), 207-218.
[7] Beaumont, O., Legrand, A., and Robert, Y. Scheduling divisible workloads on heterogeneous platforms. Parallel Computing, 29 (2003), 1121-1132.
[8] Bharadwaj, V., et al. Scheduling Divisible Loads in Parallel and Distributed Systems. Hoboken, NJ: Wiley, 1996.
[9] Bharadwaj, V., Li, X., and Ko, C. C. On the influence of start-up costs in scheduling divisible loads on bus networks. IEEE Transactions on Parallel and Distributed Systems, 11, 12 (2000), 1288-1305.
[10] Bharadwaj, V. and Viswanadham, N. Suboptimal solutions using integer approximation techniques for scheduling divisible loads on distributed bus networks. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 30 (2000), 680-691.
[11] Bataineh, S. and Robertazzi, T. G. Distributed computation for a bus network with communication delays. Proceedings of Information Science and Systems (1991), 709-714.
[12] Bataineh, S. and Robertazzi, T. G. Bus oriented load sharing for a network of sensor driven processors. IEEE Transactions on Systems, Man and Cybernetics, 21, 5 (1991), 1202-1205.
[13] Cheng, Y. C. and Robertazzi, T. G. Distributed computation with communication delays. IEEE Transactions on Aerospace and Electronic Systems, 24, 6 (1988), 700-712.
[14] Cheng, Y. C. and Robertazzi, T. G. Distributed computation for a tree network with communication delays. IEEE Transactions on Aerospace and Electronic Systems, 26, 3 (1990), 511-516.
[15] Duda, R. O. and Hart, P. E. Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM, 15 (1972), 11-15.
[16] Dennis, Jr., J. E. Nonlinear least-squares. In D. Jacobs (Ed.), State of the Art in Numerical Analysis, Burlington, MA: Academic Press, 1977, 269-312.
[17] Drozdowski, M., Lawenda, M., and Guinand, F. Scheduling multiple divisible loads. International Journal of High Performance Computing Applications, 20 (2006), 19-30.
[18] Drozdowski, M. and Lawenda, M. The combinatorics in divisible load scheduling. Foundations of Computing and Decision Sciences, 30 (2005), 297-308.
[19] Drozdowski, M. and Wolniewicz, P. Out-of-core divisible load processing. IEEE Transactions on Parallel and Distributed Systems, 14 (2003), 1048-1056.
[20] Drozdowski, M. and Wolniewicz, P. Optimum divisible load scheduling on heterogeneous stars with limited memory. European Journal of Operational Research, 172 (2006), 545-559.
[21] Dutot, P-F. Divisible load on heterogeneous linear array. In Proceedings of the International Parallel and Distributed Processing Symposium, Nice, France, 2003.
[22] Ghose, D., Kim, H. J., and Kim, T. H. Adaptive divisible load scheduling strategies for workstation clusters with unknown network resources. IEEE Transactions on Parallel and Distributed Systems, 16, 10 (2005), 897-907.
[23] Guil, N., Villalba, J., and Zapata, E. L. A fast Hough transform for segment detection. IEEE Transactions on Image Processing, 4, 11 (1995), 1541-1548.
[24] Hung, J. T., Kim, H. J., and Robertazzi, T. G. Scalable scheduling in parallel processors. In Proceedings of the Conference on Information Sciences and Systems, Princeton University, Princeton, NJ, 2002.
[25] Hung, J. T. and Robertazzi, T. G. Distributed scheduling of nonlinear computational loads. In Proceedings of the Conference on Information Sciences and Systems, The Johns Hopkins University, Baltimore, MD, Mar. 2003.
[26] Hung, J. T. and Robertazzi, T. G. Divisible load cut through switching in sequential tree networks. IEEE Transactions on Aerospace and Electronic Systems, 40 (2004), 968-982.
[27] Khalifa, K. B., et al. Learning vector quantization neural network implementation using parallel and serial arithmetic. International Journal of Computer Sciences and Engineering Systems, 2, 4 (2008), 251-256.
[28] Kim, H. J. A novel optimal load distribution algorithm for divisible loads. Cluster Computing, 6, 1 (2003), 41-46.
[29] Kim, H. J., Jee, G-I., and Lee, J. G. Optimal load distribution for tree network processors. IEEE Transactions on Aerospace and Electronic Systems, 32, 2 (1996), 607-612.
[30] Orr, R. S. The order of computation for finite discrete Gabor transforms. IEEE Transactions on Signal Processing, 41, 1 (1993), 122-130.
[31] Othman, H. and Aboulnasr, T. A separable low complexity 2D HMM with application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 10 (2003), 1229-1238.
[32] Piriyakumar, D. A. L. and Murthy, C. S. R. Distributed computation for a hypercube network of sensor-driven processors with communication delays including setup time. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 28 (1998), 245-251.
[33] Robertazzi, T. G. Processor equivalence for daisy chain load sharing processors. IEEE Transactions on Aerospace and Electronic Systems, 29, 4 (1993), 1216-1221.
[34] Robertazzi, T. G. Ten reasons to use divisible load theory. IEEE Computer, 36, 5 (2003), 63-68.
[35] Sohn, J. and Robertazzi, T. G. Optimal divisible job load sharing on bus networks. IEEE Transactions on Aerospace and Electronic Systems, 32, 1 (1996), 34-40.
[36] Ghose, D. and Robertazzi, T. G. Divisible load scheduling. Cluster Computing (special issue), 6, 1 (2003), 5-86.
[37] Suresh, S., et al. Scheduling nonlinear divisible loads in a single level tree network. Journal of Supercomputing (2011), 1-21. DOI 10.1007/s11227-011-0677-2.
[38] Suresh, S., et al. Divisible load scheduling in distributed system with buffer constraints: Genetic algorithm and linear programming approach. International Journal of Parallel, Emergent and Distributed Systems, 21, 5 (2006), 303-321.
[39] Suresh, S., Omkar, S. N., and Mani, V. The effect of start-up delays in scheduling divisible loads on bus networks: An alternate approach. Computers and Mathematics with Applications, 46, 10-11 (2003), 1545-1557.
[40] Suresh, S., et al. An equivalent network for divisible load scheduling in nonblocking mode of communication. Computers and Mathematics with Applications, 49, 9-10 (2005), 1413-1431.
[41] Suresh, S., et al. A new load distribution strategy for linear network with communication delays. Mathematics and Computers in Simulation, 79, 5 (2009), 1488-1501.
[42] Yang, Y. and Casanova, H. UMR: A multi-round algorithm for scheduling divisible workloads. In Proceedings of the International Parallel and Distributed Processing Symposium, Nice, France, 2003.

Sundaram Suresh (M'08-SM'10) received the B.E. degree in electrical and electronics engineering from Bharathiyar University in 1999, and the M.E. (2001) and Ph.D. (2005) degrees in aerospace engineering from the Indian Institute of Science, India. He was a post-doctoral researcher in the School of Electrical Engineering, Nanyang Technological University from 2005 to 2007. From 2007 to 2008 he was at INRIA-Sophia Antipolis, France, as an ERCIM research fellow. He was at Korea University for a short period as a visiting faculty member in industrial engineering. From January 2009 to December 2009 he was at the Indian Institute of Technology-Delhi as an assistant professor in the Department of Electrical Engineering. Since 2010 he has been an assistant professor at the School of Computer Engineering, Nanyang Technological University, Singapore. His research interests include flight control, unmanned aerial vehicle design, machine learning, optimization, and computer vision.

Hyoung Joong Kim (M'04) received his B.S., M.S., and Ph.D. degrees from Seoul National University, Korea, in 1978, 1986, and 1989, respectively. He joined the faculty of Kangwon National University, Korea, in 1989. He is currently a professor at Korea University, Korea. Dr. Kim has published numerous technical papers, including more than 40 peer-reviewed journal papers covering distributed computing and multimedia computing. He has served as guest editor of several journals, including IEEE Transactions on Circuits and Systems for Video Technology. He is a Vice Editor-in-Chief of the LNCS Transactions on Data Hiding and Multimedia Security. His main research interests include security engineering.

Cui Run received his B.S. from Harbin Institute of Technology in 2008. He is a research scholar in the Graduate School of Information Management and Security, Korea University, Korea. His research interests include database security, parallel and distributed computing, and data mining.

Thomas G. Robertazzi (S'75-M'77-SM'91-F'06) received the Ph.D. from Princeton University, Princeton, NJ, in 1981 and the B.E.E. from the Cooper Union, New York, NY, in 1977. Dr. Robertazzi is presently a professor in the Department of Electrical and Computer Engineering at Stony Brook University, Stony Brook, NY. In supervising a very active research group, he has published extensively in the areas of parallel processing and grid scheduling, ad hoc radio networks, telecommunications network planning, ATM switching, queueing, and Petri networks. He has also authored, coauthored, or edited five books in the areas of networking, performance evaluation, scheduling, and network planning. For eleven years he has been the Faculty Director of the Stony Brook Living Learning Center in Science and Engineering.

Young-Il Kim received his B.S. degree from Chonnam National University, Korea, in 1984, his M.S. degree from Hankuk University of Foreign Studies, Korea, in 1986, and his Ph.D. degree from Chungbuk National University, Korea, in 1999, all in computer science. Since 1986 Dr. Kim has been with Korea Telecom, where he is currently a vice president. He has served as a committee member of the National Broadcast & Communication Standard, Korea Communications Commission, since September 2005, and as an expert committee member of Edge Fusion Technologies, National Science & Technology Council, since February 2010. His current research interests include network planning, architecture, and systems for wired/wireless home networks, including ubiquitous sensor networks.