Robust Resource Allocation in Parallel and Distributed Computing Systems (tentative)

Ph.D. candidate V. Shestak

Colorado State University, Electrical and Computer Engineering Department

Fort Collins, Colorado, USA [email protected]

V. Shestak: Progress Toward Ph.D.

n start: August 2003

n research completed: 75% (3 parts out of 4)

n publications:

5 10 accepted (9 conferences, one journal)

5 one under review (journal)

5 one draft in preparation (journal)

n patents: one filed, two in process

n graduation: December 2007

Outline

n part 1: two-stage approach to resource allocation for periodic strings of applications

n part 2: resource allocation in IBM cluster-based printing system

n part 3: stochastic robustness metric and its use for static resource allocations

n part 4: robust resource allocation under random node failures and recoveries – in progress

PART 1: Shipboard Computing Environment

n computation resources

5 heterogeneous set of machines

5 multitasking enabled

n communication network

5 independent virtual point-to-point communication routes

5 fixed available bandwidth on each route

n resource mapper

5 centralized approach

5 initial static resource allocation

5 robust against increases in workload

PART 1: Workload

n periodic continuously running applications organized in strings

n string QoS constraints

5 throughput = 1/P (where P is time interval between input arrivals)

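5 end-to-end latency L

[Figure: a string of n applications with computation times $t^c[1], \ldots, t^c[n]$ and transfer times $t^t[1], \ldots, t^t[n-1]$; individual times are annotated with the bound $\le P$ and the end-to-end time with the bound $\le L$.]

n strings have priority factors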
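The two string QoS constraints can be checked mechanically. The sketch below assumes one particular reading of the slide: every computation and transfer time must fit within the period P, and their total must not exceed the latency bound L. The function name and the sample numbers are hypothetical.

```python
def string_meets_qos(comp_times, transfer_times, period, latency):
    """Check throughput and end-to-end latency constraints for one string.

    Assumed reading: each computation and transfer time must be <= period (P),
    and the sum of all of them must be <= latency (L).
    """
    stages = list(comp_times) + list(transfer_times)
    throughput_ok = all(t <= period for t in stages)
    latency_ok = sum(stages) <= latency
    return throughput_ok and latency_ok


# Hypothetical string of three applications and two transfers.
ok = string_meets_qos([2.0, 3.0, 1.5], [0.5, 1.0], period=4.0, latency=10.0)
# Same string, but the second application now takes longer than the period.
too_slow = string_meets_qos([2.0, 5.0, 1.5], [0.5, 1.0], period=4.0, latency=10.0)
```

Here `ok` is true (all times are at most 4.0 and the total is 8.0), while `too_slow` is false because one computation time exceeds the period.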

PART 1: Performance Goal for Initial Allocation

n primary objective: maximize the sum of priority factors of strings allocated in the system

n secondary objective: maximize system slackness

5 system slackness is the minimum unused utilization across all machines and communication routes in the system

5 system slackness quantitatively reflects the system’s potential to absorb unpredictable increases in workload
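As a minimal sketch of the slackness definition above, assuming utilizations are expressed as fractions in [0, 1] (the function name and the sample values are illustrative, not taken from the slides):

```python
def system_slackness(machine_util, route_util):
    """Return the minimum unused utilization over all machines and routes.

    Unused utilization of a resource is 1 minus its utilization fraction.
    """
    unused = [1.0 - u for u in list(machine_util) + list(route_util)]
    if not unused:
        raise ValueError("need at least one machine or route")
    return min(unused)


# Hypothetical utilizations for three machines and two communication routes.
machines = [0.60, 0.75, 0.40]
routes = [0.30, 0.55]
slack = system_slackness(machines, routes)  # limited by the 0.75 machine
```

The most heavily loaded resource determines the result, which is why the metric reflects how much extra workload the system can absorb before any single machine or route saturates.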

PART 1: Resource Utilization

PART 1: Two-Stage Solution Approach

n first stage: Genitor-based global search algorithm coupled with low-level greedy heuristic

5 global search algorithm operates in the permutation space

5 greedy heuristic maps chromosomes into the solution space

n second stage: Branch-and-Bound depth-first search algorithm

5 Integer Linear Programming (ILP) formulation

5 continuous lower bound tightening over time

•solution passed

PART 1: Results – 1 Trial

PART 1: Results – 50 Trials

PART 1: References

n V. Shestak, E. K. P. Chong, A. A. Maciejewski, H. J. Siegel, L. Benmohamed, I-J. Wang, and R. Daley, “Resource allocation for periodic applications in a shipboard environment,” 14th Heterogeneous Computing Workshop (HCW 2005), in proceedings of 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), Apr. 2005, pp. 122–127.

n V. Shestak, E. K. P. Chong, A. A. Maciejewski, H. J. Siegel, L. Benmohamed, I-J. Wang, and R. Daley, “A two-stage approach to resource allocation for periodic strings of applications in a shipboard environment,” submitted to Journal of Parallel and Distributed Computing (JPDC). Under review.

PART 2: IBM Printer System Layout

n processing must be done in distributed fashion

n printheads consume bitmaps in page order

PART 2: Goals for Cluster Controller Project

n algorithm for assigning sheetsides to blades

5 mathematical model of the environment

5 optimized sheetside workload distribution algorithm

n system performance simulation

5 evaluate algorithm’s efficiency

5 determine cost effective system configuration

g minimize number of blades

g minimize memory sizes

PART 2: IBM Cluster Controller Project: Results

[Figure: bitmap lifetime (sec.), i.e., how long a bitmap exists in the system, for three dispatch policies: min RIP completion time, round robin, and random.]
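To make the difference between two of the compared policies concrete, here is a rough sketch that dispatches sheetsides, in page order, either to the blade that would finish the sheetside earliest or in round-robin order. It assumes identical blades, known RIP times, and no memory limits, none of which is stated on the slide; all names and numbers are hypothetical and this is not the project's actual algorithm.

```python
def dispatch_min_completion(rip_times, num_blades):
    """Assign each sheetside to the blade that would complete it earliest."""
    ready_at = [0.0] * num_blades  # time at which each blade becomes free
    assignment = []
    for t in rip_times:
        blade = min(range(num_blades), key=lambda b: ready_at[b] + t)
        ready_at[blade] += t
        assignment.append(blade)
    return assignment, max(ready_at)


def dispatch_round_robin(rip_times, num_blades):
    """Assign sheetside i to blade i mod num_blades."""
    ready_at = [0.0] * num_blades
    assignment = []
    for i, t in enumerate(rip_times):
        blade = i % num_blades
        ready_at[blade] += t
        assignment.append(blade)
    return assignment, max(ready_at)


# Hypothetical RIP times (sec.) for six sheetsides in page order, two blades.
rip_times = [5.0, 1.0, 5.0, 1.0, 1.0, 1.0]
mct_assign, mct_finish = dispatch_min_completion(rip_times, 2)
rr_assign, rr_finish = dispatch_round_robin(rip_times, 2)
```

With these inputs, round robin places both long sheetsides on the same blade and finishes at 11.0, while the completion-time rule finishes at 7.0.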

PART 2: References

n J. Smith, V. Shestak, H. J. Siegel, S. Price, L. Teklits, and P. Sugavanam, “Resource allocation in cluster-based imaging systems,” 2007 International Conference on Parallel & Distributed Techniques and Applications (PDPTA’07). Accepted, to appear.

n patent: V. Shestak, S. Price, J. Smith, L. Teklits, H. J. Siegel, and P. Sugavanam, “Methods and Systems for Improved Printing System Sheet Side Dispatch in a Clustered Printer Controller,” filed as IBM Docket BLD 920060015US1, Sep. 1, 2006.

Outline



n part 3: stochastic robustness metric and its use for static resource allocations


PART 3: QoS-Constrained Resource Allocation

n establish system performance metric

n develop mathematical model that provides functional dependence between performance metric, input parameters, and uncertainties in the system

n integrate this model into an adapted or newly developed optimization technique

n evaluate quality of the resulting suboptimal solution(s)

PART 3: QoS-Constrained Example System

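[Figure: M compute nodes, where node 1 runs applications $a_{11}, \ldots, a_{n_1 1}$ and node M runs applications $a_{1M}, \ldots, a_{n_M M}$.]

n periodic data sets

n processing of each data set to be completed within $\Lambda$ time units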

PART 3: Stochastic Robustness Metric

for a given resource allocation

5 $S_j = \{a_{1j}, a_{2j}, \ldots, a_{n_j j}\}$: set of applications on compute node j

5 $T_{ij}$: (random variable) execution time of $a_{ij}$ on compute node j

5 $\psi$: (random variable) makespan

$$\psi = \max\left\{\sum_{i=1}^{n_1} T_{i1}, \ldots, \sum_{i=1}^{n_M} T_{iM}\right\}$$

5 $\beta^{\min}$ and $\beta^{\max}$ specify acceptable range for $\psi$

stochastic robustness metric is the probability that the performance characteristic $\psi$ is confined to the interval $[\beta^{\min}, \beta^{\max}]$:

$$P[\beta^{\min} \le \psi \le \beta^{\max}]$$
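One way to get a feel for the metric is to estimate it by sampling. The sketch below is a plain Monte Carlo estimate, not the computation used in the presentation; it assumes independent discrete execution-time distributions, and the allocation, function names, and numbers are all hypothetical.

```python
import random


def sample_makespan(allocation, rng):
    """Draw one makespan: the largest per-node sum of sampled execution times."""
    node_totals = []
    for apps in allocation:  # one list of applications per compute node
        total = 0.0
        for values, probs in apps:  # discrete distribution of one T_ij
            total += rng.choices(values, weights=probs, k=1)[0]
        node_totals.append(total)
    return max(node_totals)


def stochastic_robustness(allocation, beta_min, beta_max, trials=20000, seed=0):
    """Estimate P[beta_min <= makespan <= beta_max] by repeated sampling."""
    rng = random.Random(seed)
    hits = sum(
        beta_min <= sample_makespan(allocation, rng) <= beta_max
        for _ in range(trials)
    )
    return hits / trials


# Hypothetical allocation: two applications on node 1, one on node 2.
allocation = [
    [([1, 2], [0.5, 0.5]), ([2, 3], [0.5, 0.5])],  # node 1
    [([3, 5], [0.8, 0.2])],                        # node 2
]
estimate = stochastic_robustness(allocation, beta_min=0, beta_max=4)
```

For this toy allocation the exact value is 0.75 × 0.8 = 0.6, so the estimate should land close to 0.6.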

PART 3: Stochastic Resource Allocation

[Figure: probability density functions over time for the applications assigned to node 1 and node 2, with markers for the makespan constraint, the estimated makespan (mean), and the probability of exceeding makespan.]

PART 3: Independence

n among local performance characteristics $\psi_j = \sum_{i=1}^{n_j} T_{ij}$

allows stochastic robustness metric to be computed as

$$P[0 \le \psi \le \Lambda] = \prod_{j=1}^{M} P[0 \le \psi_j \le \Lambda]$$

n among random variables $T_{ij}$

allows convolution to be applied to find pdf of $\sum_{i=1}^{n_j} T_{ij}$

5 Fast Fourier Transform (FFT) method can be used

n if dependencies exist, apply bootstrap approximation method
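Under both independence assumptions the metric can be computed exactly for discrete distributions. The sketch below uses direct convolution of probability mass functions rather than the FFT method mentioned above, which would be the faster choice for long distributions. It assumes integer-valued execution times, with list index equal to time; names and data are hypothetical.

```python
def convolve(p, q):
    """pmf of X + Y for independent X and Y on integer supports starting at 0."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for k, qk in enumerate(q):
            out[i + k] += pi * qk
    return out


def node_cdf_at(pmfs, limit):
    """P[sum of one node's execution times <= limit]."""
    total = [1.0]  # pmf of the constant 0
    for pmf in pmfs:
        total = convolve(total, pmf)
    return sum(total[: limit + 1])


def robustness(allocation, limit):
    """Product over nodes of P[psi_j <= limit], valid when nodes are independent."""
    prob = 1.0
    for pmfs in allocation:
        prob *= node_cdf_at(pmfs, limit)
    return prob


# Hypothetical pmfs: index is the execution time in integer time units.
node1 = [[0, 0.5, 0.5], [0, 0, 0.5, 0.5]]  # T in {1, 2} and T in {2, 3}
node2 = [[0, 0, 0, 0.8, 0, 0.2]]           # T in {3, 5}
value = robustness([node1, node2], limit=4)
```

Node 1 finishes within 4 time units with probability 0.75 and node 2 with probability 0.8, so `value` is their product, 0.6.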

PART 3: Comparison Analysis

[Figure: stochastic robustness (%), on a 0 to 120 scale, plotted against makespan (sec.) based on mean values, from 300 to 700.]

1,000 randomly generated resource allocations

$T_{ij}$ discrete distributions constructed randomly in the same range

PART 3: Heuristics

heuristics

n two-phase greedy

5 basic, conflict resolution

n one-phase greedy

5 sorting, mean load balancing

n global search

5 steady-state genetic algorithm

5 ant colony optimization

5 simulated annealing

n allocate N independent applications across M nodes

n minimize period $\Lambda$ between data sets while maintaining the value of $P[\psi \le \Lambda]$
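The slide names the heuristics without describing them. As an illustration only, the sketch below shows one plausible form of a one-phase greedy that sorts applications and balances mean load: applications are taken in decreasing order of their best-case mean execution time, and each goes to the node whose mean load would be smallest afterwards. The ordering rule, the function name, and the data are assumptions, not the author's algorithm.

```python
def mean_load_balancing(mean_time):
    """mean_time[i][j] is the mean execution time of application i on node j."""
    num_nodes = len(mean_time[0])
    load = [0.0] * num_nodes
    mapping = {}
    # Largest applications first, ranked by their best-case mean time.
    order = sorted(range(len(mean_time)), key=lambda i: -min(mean_time[i]))
    for i in order:
        # Pick the node with the smallest resulting mean load.
        node = min(range(num_nodes), key=lambda j: load[j] + mean_time[i][j])
        load[node] += mean_time[i][node]
        mapping[i] = node
    return mapping, load


# Hypothetical mean execution times for four applications on two nodes.
mean_time = [[4.0, 6.0], [3.0, 2.0], [5.0, 5.0], [1.0, 2.0]]
mapping, load = mean_load_balancing(mean_time)
```

A mapping produced this way only balances means; its stochastic robustness would still have to be evaluated from the full execution-time distributions.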

PART 3: Greedy Heuristics: Results

n $P[\psi \le \Lambda]$ value was set to 0.9

n results are based on 50 experimental trials
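For a fixed allocation, the smallest period that keeps the probability at or above 0.9 can be found by scanning the attainable makespan values. This is a sketch under the independence assumptions from the earlier slide, with discrete distributions given as value-to-probability dictionaries; the names and data are hypothetical.

```python
def sum_pmf(pmfs):
    """pmf (value -> probability) of a sum of independent discrete variables."""
    total = {0: 1.0}
    for pmf in pmfs:
        nxt = {}
        for s, ps in total.items():
            for v, pv in pmf.items():
                nxt[s + v] = nxt.get(s + v, 0.0) + ps * pv
        total = nxt
    return total


def smallest_period(allocation, target=0.9):
    """Smallest attainable Lambda with P[makespan <= Lambda] >= target."""
    node_pmfs = [sum_pmf(pmfs) for pmfs in allocation]
    # The probability only changes at values some node total can take.
    candidates = sorted({v for pmf in node_pmfs for v in pmf})
    for lam in candidates:
        prob = 1.0
        for pmf in node_pmfs:
            prob *= sum(p for v, p in pmf.items() if v <= lam)
        if prob >= target:
            return lam, prob
    raise ValueError("target probability is not attainable")


# Hypothetical allocation: two applications on node 1, one on node 2.
allocation = [
    [{1: 0.5, 2: 0.5}, {2: 0.5, 3: 0.5}],
    [{3: 0.8, 5: 0.2}],
]
lam, prob = smallest_period(allocation, target=0.9)
```

In this example the probability is 0.2 at Λ = 3 and 0.6 at Λ = 4, so the first value that reaches the target is Λ = 5.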

PART 3: Global Search Heuristics: Results

n $P[\psi \le \Lambda]$ value was set to 0.9

n results are based on 50 experimental trials

PART 3: References

n V. Shestak, J. Smith, A. A. Maciejewski, and H. J. Siegel, “A stochastic approach to measuring the robustness of a resource allocation in distributed systems,” 2006 International Conference on Parallel Processing (ICPP’06), Aug. 2006, pp. 459–470.

n V. Shestak, J. Smith, R. Umland, J. Hale, P. Moranville, A. A. Maciejewski, and H. J. Siegel, “Greedy approaches to stochastic robust resource allocation in sensor driven distributed systems,” 2006 International Conference on Parallel & Distributed Techniques and Applications (PDPTA’06), June 2006, pp. 4–13.

n V. Shestak, J. Smith, A. A. Maciejewski, and H. J. Siegel, “Iterative algorithms for stochastically robust static resource allocation in periodic sensor driven clusters,” 8th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS 2006), Nov. 2006.

PART 4: System Prototype

[Figure: system prototype comprising a task pool, workload, cluster controller, heterogeneous cluster, and system log.]

PART 4: System Prototype

[Figure: heterogeneous cluster producing a log in each stage i along the time axis.]

PART 4: Known Parameters & Assumptions

n each task has an importance factor

n estimated time to compute each task is known

n node failure & recovery statistics are known

n total time to execute task batch is T

n no new arrivals during T

n stage length: λ time units (fixed)

n system log is received at the end of each stage

n mapping decision is generated per stage

n no credit is given for partial task execution

n if node recovers in stage i it will be used in stage i + 1
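The stage-based availability rule can be sketched with a small simulation. The failure model here is an assumption: each node independently fails or recovers within a stage with fixed probabilities, which is only one way to use the known failure and recovery statistics. Function and parameter names are hypothetical.

```python
import random


def simulate_availability(num_nodes, num_stages, p_fail, p_recover, seed=0):
    """Return usable[s][k]: whether node k can be used for mapping in stage s."""
    rng = random.Random(seed)
    up = [True] * num_nodes
    usable = []
    for _ in range(num_stages):
        # The mapping for this stage sees the node states at the stage start.
        usable.append(list(up))
        for k in range(num_nodes):
            if up[k]:
                if rng.random() < p_fail:
                    up[k] = False
            elif rng.random() < p_recover:
                up[k] = True  # recovered in this stage, usable from the next
    return usable


usable = simulate_availability(num_nodes=4, num_stages=10, p_fail=0.2, p_recover=0.5)
```

Because node states are recorded before the stage's failures and recoveries are drawn, a node that recovers in stage i first appears as usable in stage i + 1.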

PART 4: Goal for Cluster Controller

n maximize revenue, i.e., expected sum of importance factors of the tasks completed over T

n maximize sum of importance factors of the tasks completed in each stage of length λ

PART 4: Off-Line Policy Generation (Hypothetical Solution)


n off-line generated policy:

5 result – lookup table

5 optimal control selection at each stage

5 finite horizon DP

5 intractable even for medium-scale problems

[Figure: timeline of one stage from 0 to λ, with phases "produce mapping" and "execute tasks".]
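To show what an off-line lookup table looks like, here is a toy finite-horizon dynamic program solved by backward induction. The two-state model (a single node that is "up" or "down", with "light" or "heavy" loading) is entirely hypothetical and far smaller than the cluster problem, where the state space is what makes this approach intractable.

```python
def finite_horizon_dp(states, controls, transition, reward, num_stages):
    """Backward induction; returns (policy, value) tables indexed by stage."""
    value = [{x: 0.0 for x in states} for _ in range(num_stages + 1)]
    policy = [{} for _ in range(num_stages)]
    for k in range(num_stages - 1, -1, -1):
        for x in states:
            best_u, best_q = None, float("-inf")
            for u in controls:
                # Immediate reward plus expected value of the next state.
                q = reward[x][u] + sum(
                    p * value[k + 1][y] for y, p in transition[x][u].items()
                )
                if q > best_q:
                    best_u, best_q = u, q
            policy[k][x] = best_u
            value[k][x] = best_q
    return policy, value


states = ["up", "down"]
controls = ["light", "heavy"]
# Hypothetical importance earned per stage.
reward = {"up": {"light": 1.0, "heavy": 2.0}, "down": {"light": 0.0, "heavy": 0.0}}
# Hypothetical transition probabilities: heavy load risks a failure.
transition = {
    "up": {"light": {"up": 1.0}, "heavy": {"up": 0.4, "down": 0.6}},
    "down": {"light": {"up": 0.5, "down": 0.5}, "heavy": {"up": 0.5, "down": 0.5}},
}
policy, value = finite_horizon_dp(states, controls, transition, reward, num_stages=3)
```

The resulting `policy[k][x]` is the lookup table: in this toy model it selects "light" in the early stages and switches to "heavy" only in the last stage, when a failure no longer costs future revenue.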

PART 4: On-line Policy Generation


n on-line policy generation:

5 Monte Carlo simulation

5 limited horizon DP

n time to select control varies

[Figure: timeline of one stage from 0 to λ, with phases "produce mapping" and "execute tasks".]
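A common way to combine Monte Carlo simulation with a limited horizon is a rollout: each candidate control is scored by simulating a few stages ahead under a simple base policy. The sketch below illustrates that idea only; it is not the method under development, and the one-node simulator, the base policy, and all names are hypothetical.

```python
import random


def rollout_value(state, control, step, base_policy, horizon, trials, rng):
    """Average accumulated reward over `horizon` stages, starting with `control`."""
    total = 0.0
    for _ in range(trials):
        x, u = state, control
        for _ in range(horizon):
            r, x = step(x, u, rng)
            total += r
            u = base_policy(x)  # later stages follow the base policy
    return total / trials


def select_control(state, controls, step, base_policy, horizon=3, trials=2000, seed=0):
    """Score every control by simulation and return the best one with all scores."""
    rng = random.Random(seed)
    scores = {
        u: rollout_value(state, u, step, base_policy, horizon, trials, rng)
        for u in controls
    }
    return max(scores, key=scores.get), scores


def step(x, u, rng):
    """Hypothetical one-stage simulator: returns (importance earned, next state)."""
    if x == "down":
        return 0.0, ("up" if rng.random() < 0.5 else "down")
    if u == "heavy":
        return 2.0, ("up" if rng.random() < 0.7 else "down")
    return 1.0, "up"


best, scores = select_control("up", ["light", "heavy"], step, lambda x: "light")
```

The number of trials and the horizon trade decision quality against the time needed to select a control, which is the cost the slide points to.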

PART 4: Estimating Expected Revenue from Future States

$$\text{revenue} = \underbrace{E\big[imp[x_1, u(x_1)]\big] + E\big[imp[x_2, u(x_2)]\big] + E(\ldots)}_{N \text{ stages}}$$

$N$ ← total number of stages

$x_i$ ← MDP state

$u(x_i)$ ← control applied to $x_i$

$imp[x_i, u(x_i)]$ ← accumulated importance in stage $i$

[Figure labels: compute, compute, compute, estimate]

PART 4: Estimating Expected Revenue from Future States

certain number of stages

n input: current state, control, probabilities

n output: expected accumulated importance from future states

n method: machine learning (regression, neural networks, …)

Can we achieve the desired accuracy? For how many stages?

Outline


n part 2: resource allocation in IBM cluster-based printing system (done jointly with J. Smith, and will appear in his thesis)

n part 3: stochastic robustness metric and its use for static resource allocations (done jointly with J. Smith, and will appear in his thesis)

n part 4: robust resource allocation under random node failures and recoveries

Summary

n part 1: designed two-stage approach to static resource allocation for periodic strings of applications in QoS-constrained system

n part 2: designed workload distribution algorithm for IBM printer cluster controller

n part 3: presented a methodology for deriving stochastic robustness metric for resource allocation

5 illustrated methodology for example distributed system

n part 4: propose an idea for resource allocation in distributed systems with random node failures and recoveries
