
Performance Evaluation of Overload Control in Multi-Cluster Grids

Nezih Yigitbasi, Ozan Sonmez, Alexandru Iosup, and Dick Epema
Delft University of Technology

{m.n.yigitbasi,o.o.sonmez,a.iosup,d.h.j.epema}@tudelft.nl

Abstract—Multi-cluster grids are widely employed to execute workloads consisting of compute- and data-intensive applications in both research and production environments. Such workloads, especially when they are bursty, may stress shared system resources to the point where overload conditions occur. Overloads can severely degrade the system performance and responsiveness, potentially causing user dissatisfaction and perhaps even revenue loss. However, the characteristics of multi-cluster grids, such as their complexity and heterogeneity, raise numerous nontrivial issues when controlling overload in such systems. In this work we present an extensive performance evaluation of overload control in multi-cluster grids. We adapt a dynamic throttling mechanism that enforces a concurrency limit indicating the maximum number of tasks running concurrently for every application. Using diverse workloads we evaluate several throttling mechanisms, including our dynamic mechanism, in our DAS-3 multi-cluster grid. Our results show that throttling can be used for effective overload control in multi-cluster grids, and in particular, that our dynamic technique improves the application performance by as much as 50% while also improving the system responsiveness by up to 80%.

I. INTRODUCTION

Many scientists rely on the execution of applications on multi-cluster grids, that is, large-scale distributed systems comprised of heterogeneous clusters. Multi-cluster grids such as the DAS-3 in the Netherlands, the EGEE grid in Europe, and the Open Science Grid in the US provide efficient execution infrastructures for applications with a loosely coupled structure, such as bags-of-tasks (BoTs) and workflows. When executing such applications, the system may become overloaded, that is, the system resources shared by running applications may become bottlenecks—the disks of the cluster file systems may become saturated, the grid communication protocols may break down due to thousands of concurrent submissions, etc. Because overloads cause system performance degradation and can lead to system crashes, many overload control techniques have been designed [1], [2], [3], [4]; among them, throttling, that is, controlling the rate at which workloads are pushed through the system, is a relatively simple technique that can deliver good performance. However, few of these techniques have been adapted for and investigated in the context of multi-cluster grids. In this work we present a dynamic throttling technique along with an extensive performance evaluation of throttling-based overload control techniques for multi-cluster grids.

The typical effects of overload are increasing backlogs at shared resources and decreased performance and responsiveness, leading to unpredictable system behavior and user dissatisfaction. As a result, in production systems, overload can cause significant loss of revenue to service providers. For example, Amazon reported that even small (100 ms) delays in web page generation cause a significant (1%) drop in sales [5]. Similarly, Google reports that an extra 0.5 s in search time causes a traffic drop of 20% [5]. Overloads are common in multi-cluster grids, leading to task wait times often in excess of several hours [6].

There are two primary causes of overload in multi-cluster grids. First, grid workloads may be very bursty or even difficult to predict, at both short and long time scales [7]. To illustrate this, Figure 1 shows the number of tasks submitted to the DAS-3, the SHARCNET, and the GRID3 multi-cluster grids, and to a multi-thousand node production MapReduce cluster of an online social networking company. Second, the applications submitted to multi-cluster grids can be large relative to the system in terms of number of tasks, runtime, and I/O requirements [8].

The overload control problem has been studied extensively and several techniques for alleviating overloads, such as congestion and admission control [2], control theoretic approaches [9], scheduling [4], and overprovisioning [3], have been proposed. However, they have not been investigated in the context of multi-cluster grids, which differ significantly from these other systems in both structure and workload. Structurally, multi-cluster grids are comprised of heterogeneous clusters distributed over a wide-area network. The typical workload of a multi-cluster grid consists of scientific applications with BoT, workflow, and parallel HPC structure [6], [10]. Among these application types, BoTs are the dominant application type in grids, as they account for over 75% of all submitted tasks and are responsible for over 90% of the total CPU-time consumption [10].

The main contributions of this paper are:
1. We adapt a dynamic throttling technique to control overload in multi-cluster grids under bursty workloads (Section III).

2. We investigate the performance of three throttling techniques, including our technique, with extensive experiments using diverse workloads in our DAS-3 multi-cluster grid (Section IV).

Our performance evaluation leads to two main observations. First, we find that throttling can significantly improve both application performance and system responsiveness in multi-cluster grids, even under bursty workloads. Second, we find that, for multi-cluster grids, dynamic throttling-based overload control techniques can replace static (hand-tuned) ones. The latter result is particularly significant in multi-cluster grid settings, where hand-tuning is slow and costly due to the number of clusters, and difficult due to workload burstiness.

[Figure 1. Numbers of tasks submitted to three multi-cluster systems and a multi-thousand node production MapReduce cluster within five minute intervals: (a) DAS-3, (b) SHARCNET, (c) GRID3, (d) MapReduce cluster. All systems have periods of burst submissions.]


II. MULTI-CLUSTER GRID MODEL

In this study we focus on multi-cluster grids comprising heterogeneous clusters. Such systems usually include a head-node for each cluster, which is a central node to which users connect and which uses middleware to interact with the rest of the system. The middleware operates in each cluster and is responsible for managing the compute resources (worker nodes). Tasks that are submitted to the middleware are initially placed into the middleware queue until there are enough resources to execute them. After the submission, the middleware dispatches the tasks to the assigned nodes and manages the task execution. This model fits many production multi-cluster systems, including the world-wide LCG, Grid5000, TeraGrid, and our DAS-3 system. Our model also fits other multi-cluster systems, and in particular the numerous deployed systems using Globus, which is arguably the most used middleware [11], the Grid Engine, and PBS/Torque. Our model does not exactly fit systems based on loosely-integrated resources, such as those based on Condor; however, while other configurations are possible, in practice many Condor pools use a single Negotiator, which effectively plays the role of the cluster head-node in our model.

As an example, our DAS-3 system employs two primary components: a runner (application-level scheduler) deployed on a head-node, which is responsible for a single application submission, and an execution service deployed on each head-node, which is responsible for interacting with the middleware and performing protocol conversion between the middleware and the runners. These two components may communicate over the local area network or the wide area network. In each of the clusters, the head-node communicates with the worker nodes for task execution management and with the distributed file system for the file transfers.
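To make this submission path concrete, the Python sketch below models the flow from runner to execution service to middleware queue to worker nodes. It illustrates only the model described above, not the actual DAS-3 components; all class and method names are our own, hypothetical choices.

    from collections import deque

    class Middleware:
        """Per-cluster local resource manager (hypothetical sketch)."""
        def __init__(self, n_worker_nodes):
            self.free_nodes = n_worker_nodes
            self.queue = deque()      # tasks wait here until a worker node is free
            self.running = set()

        def submit(self, task):
            self.queue.append(task)

        def dispatch(self):
            # Dispatch queued tasks to free worker nodes.
            while self.queue and self.free_nodes > 0:
                self.running.add(self.queue.popleft())
                self.free_nodes -= 1

    class ExecutionService:
        """Runs on the cluster head-node; turns runner requests into middleware submissions."""
        def __init__(self, middleware):
            self.middleware = middleware

        def submit(self, task):
            self.middleware.submit(task)

    class Runner:
        """Application-level scheduler responsible for a single application (e.g., one BoT)."""
        def __init__(self, tasks, execution_services):
            self.pending = list(tasks)
            self.services = execution_services   # one execution service per cluster

        def submit_all(self):
            # Without throttling, the runner pushes every pending task immediately,
            # spreading tasks round-robin over the clusters.
            for i, task in enumerate(self.pending):
                self.services[i % len(self.services)].submit(task)
            self.pending.clear()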

III. OVERLOAD CONTROL TECHNIQUES

In this section we describe the throttling-based overload control techniques that we investigate in this work.

To detect overload we use the head-node CPU load and the disk utilization metrics, which we consider good indicators of overload based on our experience with multi-cluster grids and their workloads. For each metric we set a threshold and a maximum value. During workload execution, depending on the measured values of these metrics and the threshold and maximum values, a cluster may be in one of two states, overloaded or underloaded. An underloaded cluster transitions to the overloaded state when either the head-node's CPU load or the disk utilization exceeds its maximum value, or when both metrics exceed their threshold values. Similarly, an overloaded cluster transitions to the underloaded state when either of these metrics falls below its threshold value. After an overload is detected, the throttling technique reacts by enforcing a concurrency limit, that is, the maximum number of concurrently running tasks, for every application in the system. A minimal sketch of this detection rule is given below; we then describe our throttling-based overload control techniques in turn.
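The following sketch encodes the detection rule, assuming the threshold and maximum values of Table III; the function and constant names are ours, not from the paper's implementation.

    # Overload-detection rule sketched in Python; values follow Table III.
    CPU_THRESHOLD, CPU_MAX = 7, 10          # head-node CPU load
    DISK_THRESHOLD, DISK_MAX = 40.0, 60.0   # disk utilization [%]

    def next_state(state, cpu_load, disk_util):
        """Return the cluster state ('overloaded' or 'underloaded') for the next period."""
        if state == "underloaded":
            # Overloaded if either metric exceeds its maximum value,
            # or if both metrics exceed their threshold values.
            if cpu_load > CPU_MAX or disk_util > DISK_MAX:
                return "overloaded"
            if cpu_load > CPU_THRESHOLD and disk_util > DISK_THRESHOLD:
                return "overloaded"
            return "underloaded"
        # Currently overloaded: back to underloaded as soon as either metric
        # falls below its threshold value.
        if cpu_load < CPU_THRESHOLD or disk_util < DISK_THRESHOLD:
            return "underloaded"
        return "overloaded"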

1. Static throttling (Static): This technique uses a static concurrency limit for throttling. With Static it is possible to underutilize the system with a low concurrency limit, and to overload the system with a high concurrency limit. Thus, it is crucial to determine the best concurrency limit for a particular system and workload with Static. For our experiments we have manually tuned the concurrency limit to the value that gives the best performance over many experiments, so in our evaluation (Section V) Static provides the best performance for our system and workloads.

2. Bang Bang Control (BBC) [12]: With BBC, the execution service notifies the runner to stop submitting tasks when the head-node transitions to the overloaded state. When a head-node transitions back to the underloaded state, the execution service notifies the runner to resume its task submission. BBC lets the runner temporarily overload a cluster, as too many tasks may be submitted when that cluster recovers from overload and before the execution service can detect and react to the new overload. In heterogeneous multi-cluster grids BBC may perform poorly: it is possible that all but the fastest cluster become overloaded and only the fastest cluster remains underloaded. Such a situation causes the fastest cluster to receive all the tasks while the other clusters are recovering from their overloads, causing the queueing times at the local resource manager to increase noticeably. To solve this problem we adapt the original BBC algorithm by introducing a maximum concurrency limit (C_LIMIT_MAX) for each cluster, so that when the number of tasks that are running concurrently in a cluster exceeds C_LIMIT_MAX the cluster transitions to the overloaded state.
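A minimal sketch of this adapted bang-bang rule, reusing the detection state from the sketch above; the controller class and its interface are hypothetical.

    class BBCController:
        """Bang-bang control with the added per-cluster maximum concurrency limit."""
        def __init__(self, c_limit_max):
            self.c_limit_max = c_limit_max     # e.g., the number of nodes in the cluster
            self.submission_enabled = True

        def update(self, state, n_running):
            # Adaptation described above: exceeding C_LIMIT_MAX also counts as overload.
            if n_running > self.c_limit_max:
                state = "overloaded"
            # Bang-bang rule: stop submissions when overloaded, resume when underloaded.
            self.submission_enabled = (state != "overloaded")
            return self.submission_enabled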

3. Adaptive throttling (Adaptive): To address the inflexibility of Static and BBC's problem of temporarily overloading the clusters, we propose an Additive Increase Multiplicative Decrease (AIMD) based controller that dynamically adjusts the concurrency limit. It has been shown that AIMD-based control is a provably convergent control rule [13]; control theory can also be used to design a controller that gives additional guarantees [14]. Adaptive operates in each cluster independently. It uses the following constants as inputs: the number of nodes in the cluster (N_NODES), the threshold and maximum values for the CPU load and disk utilization, and three parameters that are explained in the following (α, β, and C_LIMIT_MAX). Adaptive tunes the concurrency limit (c_limit) as follows. Initially, c_limit is set to N_NODES. Periodically, Adaptive measures the head-node CPU load and the disk utilization, and it checks whether the cluster is overloaded using the corresponding threshold and maximum values. If the cluster is overloaded, c_limit is decreased by being set to α · c_limit, with 0 < α < 1. If the cluster is not overloaded, c_limit is increased by being set to c_limit + β · n_finished, where 0 < β ≤ 1 is used to increase c_limit gradually and avoid overshooting, and n_finished is the number of tasks that have finished since the last control period. To prevent clusters from being severely overloaded even temporarily, c_limit is not allowed to exceed the maximum concurrency limit C_LIMIT_MAX. We describe in Section IV-D how we set the values of these parameters in our experiments.
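A rough sketch of this AIMD update rule follows; the parameters (α, β, C_LIMIT_MAX, N_NODES) are those described above, while the class and its interface are hypothetical.

    class AdaptiveController:
        """AIMD-based adjustment of the per-cluster concurrency limit (illustrative sketch)."""
        def __init__(self, n_nodes, alpha=0.5, beta=0.5, c_limit_max=None):
            assert 0 < alpha < 1 and 0 < beta <= 1
            self.alpha = alpha
            self.beta = beta
            self.c_limit_max = n_nodes if c_limit_max is None else c_limit_max
            self.c_limit = n_nodes            # initially, c_limit = N_NODES

        def update(self, overloaded, n_finished):
            """Called once per control period (30 s in the experiments of Section IV-D)."""
            if overloaded:
                # Multiplicative decrease on overload.
                self.c_limit = self.alpha * self.c_limit
            else:
                # Additive increase proportional to the number of tasks finished
                # since the last control period, to avoid overshooting.
                self.c_limit = self.c_limit + self.beta * n_finished
            # Never exceed the per-cluster maximum concurrency limit.
            self.c_limit = min(self.c_limit, self.c_limit_max)
            return self.c_limit

Each control period, the overload state from the detection rule and the number of tasks finished since the last period drive the update; the resulting c_limit caps how many tasks the runner keeps running in that cluster.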

Table I. The processing capability of our multi-cluster grid.

Cluster   # of Nodes   Node CPU Speed [GHz]   # of Cores on the Head-Node
C1        22           2.6                    8
C2        29           2.2                    8
C3        60           2.4                    8

We have implemented the throttling techniques presented in this section as a part of the runner and the execution service presented in Section II. In Section V, we evaluate the performance of these throttling techniques in our multi-cluster grid, described in the next section.

IV. EXPERIMENTAL SETUP

In this section we first describe our multi-cluster grid DAS-3, in which we evaluate the performance of the throttling techniques presented in the previous section. Then we describe the workloads that we use and the performance metrics that we report as a result of our evaluation in Section V.

A. Multi-Cluster Testbed

We perform our experiments on three clusters of our DAS-3 testbed. Table I shows the processing capability of our testbed. Each cluster has a separate distributed file system, and on each cluster the Grid Engine (GE) middleware operates as the local resource manager. GE has been configured to run tasks on the nodes exclusively (in space-shared mode). We have deployed the execution service on each cluster's head-node, and the runner has been deployed on the head-node of the C3 cluster; the execution service and the runner are described in Section II.

Table II. Workloads used in our experiments.

Workload   Number of Tasks   Task Runtime [s]   Total I/O Per Task [MB]
W-Base     1,000             60                 100
W-Task     5,000             60                 100
W-Run      1,000             300                100
W-IO       1,000             60                 200


B. Workloads

We evaluate our throttling techniques (Section III) using BoTs, which are the dominant application type in multi-cluster grids (see Section I). We summarize in Table II the characteristics of the workloads used in our experiments. All tasks of a BoT are submitted to the system at the same time, so our workloads represent the worst-case overload scenario. The W-Base workload comprises 1,000 tasks, each with a runtime of 60 seconds and performing 100 MB of I/O. To understand the impact of the workload characteristics, we perform the evaluation across three dimensions: starting from W-Base we increase, in turn, the number of tasks (W-Task), the task runtimes (W-Run), and the task I/O requirements (W-IO) of the BoT. Although each workload is homogeneous, together they cover a wide range of scenarios, from compute-intensive to communication-intensive, and from small-scale to large-scale applications. Their tasks have similar runtimes and I/O requirements to the tasks observed in real multi-cluster grid workloads [6].

C. The Performance Metrics

In our evaluation we use several metrics that we categorize as system or user metrics. System metrics quantify the performance of the system components, while user metrics quantify the performance perceived by the user.

1. System Metrics:
• CPU Usage [%]: The fraction of time a process keeps the CPU busy, as reported by the Linux top utility. We use this metric to assess the overhead of our scheduler in Section V-A.
• CPU Load: The number of processes which are in the processor run queue or waiting for I/O. We report the average CPU load calculated over one minute intervals as reported by the kernel. When the CPU load is high, the head-nodes cannot respond to connection requests, so we use this metric to quantify the system responsiveness. It is better if this metric is close to the number of cores of a head-node.
• Disk Utilization [%]: The fraction of time the disk is busy, as reported by the Linux iostat utility. We report the average utilization calculated over five second intervals.
• Cluster Utilization [%]: The fraction of available nodes that are used.

2. User Metrics:
• I/O Service Time [ms]: The time it takes for the disk to serve an I/O request. We report the average service time calculated over five second intervals.
• Task Execution Time [s]: The time it takes for a task to complete its execution.
• Makespan [s] (of a BoT): The difference between the earliest time of submission of any of its tasks and the latest time of completion of any of its tasks.
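Written as a formula, with S_t and C_t denoting the submission and completion times of a task t in a BoT B (our notation, not the paper's), the makespan is

    makespan(B) = max_{t in B} C_t - min_{t in B} S_t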

D. Parameters for the Overload Control Techniques

Table III summarizes the parameters for the throttling techniques with the values that we use in our experiments. Since the best values for these parameters depend on a particular system and workload, we have performed several experiments to determine the best values for our system.

Table III. The parameters for the overload control techniques and their values used in our experiments.

Parameter                    Value(s)
Control Period               30 s
CPU Load Threshold           7
Max. CPU Load                10
Disk Utilization Threshold   40%
Max. Disk Utilization        60%
α                            0.5
β                            0.5 and 1.0
C_LIMIT_MAX                  Number of nodes (see Table I)

We use a control period of 30 s, which is smaller than the runtime of the shortest task in our workloads. Hence, the throttling techniques react fast enough to the changes in the monitored metrics. Since all cluster head-nodes are 8-core machines, we use a CPU load threshold of 7 (corresponding roughly to 90% utilization) and a maximum CPU load of 10 (allowing a head-node to be overloaded up to 125%). When our system is empty, the average disk utilization is less than 20%. Therefore, for this metric we use 40% as the threshold and 60% as the maximum value.

For Adaptive, the value of the α parameter should be set to provide a balance between the throughput and the speed of overload recovery. We use α = 0.5 in our experiments. Small values of α may degrade the throughput, while larger α values may cause the system to recover from overload slowly. Experiments with larger values, such as 0.7 and 0.8, did not lead to substantial differences in the observed performance. For the β parameter we use a value of either 0.5 or 1.0. Unless otherwise specified, we use β = 0.5 in our experiments. For small values of β the throughput may degrade, while for larger β values the runner may temporarily overload a cluster as the concurrency limit will be increased quickly.

Finally, for the maximum concurrency limit parameter (C_LIMIT_MAX), with Static we use 30 tasks, which we found through several experiments to perform well for our system; for BBC and Adaptive we set this parameter to the number of available nodes on each cluster to prevent tasks from getting queued in the local resource managers.

V. EXPERIMENTAL RESULTS

In this section we assess the performance of the throttling techniques described in Section III and of the system without throttling (No Throttling). We first validate the assumption that our system's scheduling middleware is not a bottleneck (Section V-A). Then, we perform two sets of experiments, one in a single cluster and the other on three heterogeneous clusters.

[Figure 2. Single-Cluster Experiments [W-Base]: The CPU usage [%] of the runner (a) and the execution service (b).]

[Figure 3. Single-Cluster Experiments [W-Base]: Application performance. (a) The makespan. (b) The distribution of the task execution times (CDF denotes cumulative distribution function).]


A. Scheduling Overhead

We assess the overhead of the runner and the execution service after tuning our system, to make sure that these components do not contribute to system overload. To this end, we run the W-Base workload on a single cluster (C3) without using throttling. Figure 2 shows the CPU usage of the runner and the execution service during this experiment. The maximum CPU usage of both components is well below 100%: the runner has a maximum CPU usage of 10% and the execution service of 1%. Similarly, we have observed low memory consumption (KBs) and low I/O usage (not shown). This confirms that these components have relatively low overhead, and therefore they do not contribute to the system overload in the experiments.

B. Results for Single-Cluster Experiments

In this part of our work we investigate the performance of the throttling techniques presented in Section III with experiments on the C3 cluster using the W-Base workload.

We analyze the application performance and show the results in Figure 3. We observe that throttling improves the makespan over the system without throttling; the improvement is 40% with Static, 20% with BBC, and 18% with Adaptive (Figure 3(a)). The reason for the makespan improvements is that, without throttling, the cluster becomes fully utilized during the workload execution (see the values for No Throttling in Figure 4(a)). So, the tasks running in parallel congest the shared distributed file system and the intra-cluster network, which leads to an increase in the individual task runtimes and further to an increase in the makespan.

[Figure 4. Single-Cluster Experiments [W-Base]: System load in the C3 cluster. (a) The basic statistical properties (quartiles, median, mean) of the C3 utilization. (b) The basic statistical properties of the CPU load of the C3 head-node.]

With throttling, fewer tasks run in parallel as the runner delays the task submissions taking into account the concurrency limits, but the resulting delay is smaller than the overheads of running many tasks simultaneously. Throttling also helps individual tasks: the median task execution time is reduced by 70% with Static, 65% with Adaptive, and 40% with BBC over No Throttling (Figure 3(b)). Furthermore, with throttling the task execution time distribution has a shorter tail than that of No Throttling; at the 95th percentile we observe significant improvements: 75% with Static, 25% with BBC, and 63% with Adaptive (see Table IV). Although throttling introduces additional delay for individual tasks, the resulting makespan is much better than without throttling. Makespan-wise, Static performs the best, with BBC and Adaptive having similar performance. Moreover, with Static and Adaptive, the resulting task execution performance is more consistent (has a shorter distribution tail) than without throttling.

We analyze the performance of the system and show the basic statistical properties of the C3 cluster utilization in Figure 4(a). We observe that Static and Adaptive reduce the median cluster utilization by 50% versus the system without throttling. However, similarly to No Throttling, for BBC the median and maximum cluster utilization are 100%, significantly higher than for Static and Adaptive; for the latter two, the lower utilization is due to the fewer tasks running concurrently in the system. Although the cluster has low utilization with throttling, which may not be desired by system administrators, the resulting application performance is significantly better (Figure 3).

We assess the basic statistical properties of the CPU load of the C3 head-node and show the results in Figure 4(b) (as the C3 cluster's head-node has an 8-core CPU, it is better if the CPU load is close to 8). Throttling improves the median CPU load, and hence the system responsiveness, substantially: by 70% with Static, 20% with BBC, and 68% with Adaptive. Without throttling, the CPU load is constantly high, with a median load of 35, causing the system to be unresponsive to user requests. With Static, the CPU load is constantly low, with a median load of 10. Among the techniques, BBC performs the worst in terms of CPU load, as the runner overloads the cluster temporarily several times during the workload execution.

[Figure 5. Single-Cluster Experiments [W-Base]: I/O performance. (a) The basic statistical properties of the I/O service times; the maximum values observed are 517, 402, 486, and 346 ms. (b) The basic statistical properties of the disk utilization.]

Table IV. Single-Cluster Experiments [W-Base]: The 95th and the 99th percentiles for the task execution time, CPU load, and I/O service time metrics.

                Task Execution Time [s]   CPU Load        I/O Service Time [ms]
                95th        99th          95th    99th    95th    99th
No Throttling   346         443           37      38      178     318
Static          89          109           14      17      124     262
BBC             262         309           35      36      214     354
Adaptive        127         187           22      31      147     256
Ideal Case      60                        8               6

Nevertheless, BBC still performs better than No Throttling, with an improvement of 20% in median CPU load. Adaptive performs similarly to Static, and it performs significantly better than BBC. With Static and Adaptive throttling, the CPU load is much lower compared to No Throttling: throttling also improves the system responsiveness substantially.

We investigate the I/O performance and show the basic statistical properties of the I/O service time and the disk utilization in Figure 5. All techniques improve the median I/O service time over No Throttling: Static by 80%, Adaptive by 93%, and BBC by 63% (Figure 5(a)). Since BBC lets the runner temporarily overload the cluster, the maximum I/O service time with BBC is close to that of No Throttling. Finally, the disk has a lower utilization with throttling than with No Throttling; the median disk utilization decreases by up to 70% with Adaptive (Figure 5(b)). With No Throttling and BBC, the I/O service time is highly variable, while with Static and Adaptive the I/O service time has lower variability. We conclude that, in addition to significant improvements in task execution performance, throttling also improves the I/O performance substantially.

The quality of the service offered by a system to its users (Service Level Agreement, SLA) is often quantified by the service performed on a large fraction of the work requests, such as the 95th or the 99th percentile of the task execution time; we call this quantifier the extreme performance of the system. We compare in Table IV the 95th and the 99th percentiles of three performance metrics, the task execution time, the CPU load, and the I/O service time, with and without throttling; the row "Ideal Case" additionally presents the metric values for the system without overload. As expected, the overloaded system has much lower extreme performance than the ideal case. However, the use of any of the Static, BBC, and Adaptive techniques leads to significant improvements in one or more of the metrics, especially the task execution time and the I/O service time. Thus, throttling is to be preferred to No Throttling when extreme performance guarantees are part of the SLA. Furthermore, BBC delivers consistently worse extreme performance than the other techniques; the differences between Static and Adaptive illustrate the time-performance trade-offs offered by manual and automatic-and-dynamic system tuning, respectively.

[Figure 6. Multi-Cluster Experiments [W-Base]: Application performance. (a) The makespan. (b) The distribution of the task execution time.]


C. Results for Multi-Cluster Experiments

We now evaluate the performance of the throttling techniques in a multi-cluster setting. The three clusters we use (Section IV) are heterogeneous in terms of size and network bandwidth. We first perform experiments with our baseline workload, W-Base, using all the techniques. Then, we use, in turn, a workload with an increased number of tasks (W-Task), increased task runtimes (W-Run), and increased task I/O requirements (W-IO); with these workloads we assess the performance of the BBC and Adaptive techniques.

We analyze the application performance and show the results in Figure 6. As more resources are used during this experiment, the makespan here is lower than for the single-cluster experiments (compare Figure 3(a) with Figure 6(a)). Similarly to the results obtained for the single-cluster experiments, throttling noticeably improves the application performance (Figure 6(a)). Static and BBC improve the makespan by 13% and 8% over No Throttling, respectively. Adaptive with an average adaptation rate (β = 0.5, see Sections III and IV-D) provides roughly the same makespan as No Throttling. However, Adaptive with β = 1.0 provides a makespan of 1,600 s (similar to BBC) and improves the application performance by 10% over No Throttling. All techniques significantly improve the application performance and the extreme performance of the task execution (shorter distribution tail in Figure 6(b)).

We investigate the performance of the system and show the basic statistical properties of the cluster utilization of the C3 cluster and the CPU load of the C3 head-node in Figure 7. All techniques reduce the CPU load, leading to better system responsiveness: Static by 33%, BBC by 40%, and Adaptive by 80% (Figure 7(a)). This improvement adds to the improvements observed for the application performance (Figure 6). Compared with No Throttling, Adaptive with β = 0.5 preserves the application performance (Figure 6(a)) while using fewer resources (Figure 7(b)) and improving the system responsiveness (Figure 7(a)). Moreover, with β = 1.0, although Adaptive yields better performance (Figure 6(a)), it leads to a 50% higher CPU load than with β = 0.5 (Figure 7(a)). This is because with β = 1.0, Adaptive increases the concurrency limit faster than with β = 0.5, letting the runner overload the head-nodes. Our results show that a trade-off between the application performance and the system responsiveness exists. As a consequence, this trade-off should be taken into account when determining the values of the parameters of the throttling techniques.

[Figure 7. Multi-Cluster Experiments [W-Base]: System load for the C3 cluster. (a) The basic statistical properties of the CPU load of the head-node. (b) The basic statistical properties of the cluster utilization.]

[Figure 8. Multi-Cluster Experiments [W-Base]: I/O performance for the C3 cluster. (a) The basic statistical properties of the I/O service time; the maximum values observed are 433, 214, 288, and 276 ms. (b) The basic statistical properties of the disk utilization.]


We investigate the I/O performance and show the basic statistical properties of the I/O service time and the disk utilization for the C3 cluster in Figure 8. Throttling also helps in reducing the I/O service times: the median I/O service time is reduced by 62% with Static, 65% with BBC, and 81% with Adaptive over No Throttling (Figure 8(a)). Finally, in terms of the disk utilization, BBC performs similarly to No Throttling, while Adaptive performs slightly better, decreasing the disk utilization by 66%. Due to the heterogeneity of our testbed, Static has a higher disk utilization than No Throttling (Figure 8(b)), yet it significantly improves the CPU load (Figure 7(a)) and the I/O service time (Figure 8(a)) over No Throttling.


[Figure 9. Multi-Cluster Experiments [W-Task]: Makespan (a), the distribution of the task execution time (b), and the basic statistical properties of the CPU load of the C3 head-node (c).]

[Figure 10. Multi-Cluster Experiments [W-Run]: Makespan (a), the distribution of the task execution time (b), and the basic statistical properties of the CPU load of the C3 head-node (c).]

We now assess the performance of the BBC and Adaptive techniques with the W-Task workload and show the results in Figure 9. As W-Task contains more tasks than W-Base, the overloads in all clusters are more severe. As a result, throttling drastically improves the application performance: Adaptive and BBC improve the makespan by 50% (Figure 9(a)) while improving the median CPU load by 22% and 78% (Figure 9(c)), respectively. The reasons for such a difference are the increased number of parallel I/O operations and the increased number of simultaneous inter-cluster file transfers, which put more load on the shared resources. Compared to the other experiments, with the W-Task workload the improvements in the application performance and the CPU load are higher, resulting in a more responsive system. Moreover, throttling also improves the extreme performance of the task execution time (Figure 9(b)), leading to better performance consistency than without throttling.

We evaluate the performance of the BBC and Adaptive techniques with the W-Run workload and show the results in Figure 10. Unlike the results for the W-Task workload, with Adaptive the makespan is roughly the same as the makespan without throttling (Figure 10(a)). Although Adaptive and BBC have similar task execution performance (Figure 10(b)), the makespan is smaller with BBC, with a 30% improvement over No Throttling, as tasks have higher wait times (throttling delay plus queuing delay) with Adaptive than with BBC, leading to a higher makespan. Similarly to the results for the W-Task workload, Adaptive improves the CPU load by 60% over No Throttling, leading to better system responsiveness than BBC (Figure 10(c)). With the W-Run workload, Adaptive thus leads to a makespan similar to that of No Throttling, while BBC results in a better makespan but with a higher CPU load.

[Figure 11. Multi-Cluster Experiments [W-IO]: Makespan (a), the distribution of the task execution time (b), and the basic statistical properties of the CPU load of the C3 head-node (c).]


Using the W-IO workload, we investigate the performance of the BBC and Adaptive techniques and present the results in Figure 11. Both techniques lead to a similar makespan, with an improvement of 15% over No Throttling (Figure 11(a)). Since the workload is I/O intensive, the less powerful cluster (C3) gets overloaded quickly, causing a large number of tasks to be submitted to the faster clusters with both techniques, which yields the similar makespans. Similarly to the results of the W-Task and W-Run workloads, both techniques improve the extreme performance of the task execution time (Figure 11(b)). Finally, both techniques result in a similar CPU load due to the large file transfers with this workload (Figure 11(c)). Nevertheless, the system responsiveness is improved substantially, as both techniques reduce the CPU load by 55%.

VI. RELATED WORK

In this section we survey prior research exploring the following overload control techniques: congestion control, admission control, scheduling, overprovisioning, and throttling.

Congestion Control is a well-researched technique for network traffic engineering; we refer to [15] for a survey of a wide range of TCP congestion control mechanisms.

Admission Control is a technique under which the amount of work accepted into a system is policy controlled. Admission control has been used in web servers to mitigate flash crowds [9], and in e-commerce systems [16] and multi-tier distributed systems [17] for overload control. Although effective, admission control can only help stave off degrading response times under overload; it cannot prevent them completely.

Scheduling has also been investigated as a solution to the overload control problem. In [4], the authors address the transient overload problem of web servers by using the shortest remaining processing time scheduling policy. A similar study shows significant response time improvements by favoring short connections [18]. Although these studies showed improvements to the response times under transient overload, they do not evaluate the policies under permanent overloads. Previous studies [16], [19] also show that scheduling can prevent overload only to a certain extent.

Overprovisioning is a technique for handling workload fluctuations that may cause temporary overloads at bottleneck resources. Overprovisioning can solve the overload problem only to a certain extent [20], even when using overflow pools to handle transient overload [21] or dynamic overprovisioning [3]. Overprovisioning is difficult to employ for highly variable workloads: at one extreme, a system that is overprovisioned for the peak load incurs high costs; at the other, a system overprovisioned for the mean load cannot handle severe overloads.

Throttling is a technique under which the rate at which workloads are pushed through the system is controlled depending on the system load. Throttling has been used to control overload in diverse computer systems, including distributed file systems [22], resource management systems such as Grid Engine and Condor [23], networks of SIP servers [12], and cycle stealing systems, where it efficiently enforces resource limits on I/O subsystems [24].

Closest to our work are the studies in networks of SIP servers [12], in cycle stealing systems [24], and in Condor DAGMan [23]. Our study is different from [12] since the workload characteristics of multi-cluster grids are significantly different from multimedia workloads [10]. In contrast to [23], we perform a more extensive evaluation: we investigate both static and adaptive throttling techniques, whereas they focus only on static throttling; moreover, we evaluate these techniques in a real multi-cluster grid.

VII. CONCLUSION

Due to highly demanding and bursty workloads, overloads are inevitable in multi-cluster grids, leading to decreased system performance and responsiveness. Further motivated by our DAS multi-cluster grid, where running hundreds of tasks concurrently leads to severe overloads, in this study we have investigated the performance of throttling-based overload control techniques in multi-cluster grids.

Our results show strong evidence that throttling can be used for effective overload control in multi-cluster grids. In general, we have shown that throttling leads to a decrease (in most cases) or at least to a preservation of the makespan of bursty workloads, while significantly improving the extreme performance (95th and 99th percentiles) for application tasks, leading to more consistent performance and reducing the overload of cluster head-nodes. In particular, our adaptive technique improves the application performance by as much as 50% while also improving the system responsiveness by up to 80%, when compared with the tuned multi-cluster system without throttling. Our results further indicate that our adaptive throttling technique performs similarly to static throttling, which is based on the manual tuning of our system that provides the best observed performance, and better overall than the other adaptive throttling technique investigated in this work.

REFERENCES

[1] R. Iyer, V. Tewari, and K. Kant, "Overload control mechanisms for web servers," in Workshop on Perf. and QoS of Next Gen. Netw., 2000, pp. 225–244.
[2] L. Cherkasova and P. Phaal, "Session based admission control: A mechanism for improving the performance of an overloaded web server," HP, Tech. Rep. HPL-98-119, 1998.
[3] B. Urgaonkar and P. Shenoy, "Cataclysm: policing extreme overloads in internet applications," in WWW, 2005, pp. 740–749.
[4] B. Schroeder and M. Harchol-Balter, "Web servers under overload: How scheduling can help," ACM Trans. Internet Technol., vol. 6, no. 1, pp. 20–52, 2006.
[5] G. Linden, "Make data useful," 2006, http://home.blarg.net/~glinden/StanfordDataMining.2006-11-29.ppt.
[6] A. Iosup, C. Dumitrescu, D. Epema, H. Li, and L. Wolters, "How are real grids used? The analysis of four grid traces and its implications," in GRID, 2006, pp. 262–269.
[7] O. Sonmez, N. Yigitbasi, A. Iosup, and D. Epema, "Trace-based evaluation of job runtime and queue wait time predictions in grids," in HPDC, 2009, pp. 111–120.
[8] S. Callaghan et al., "Scaling up workflow-based applications," J. Comput. Syst. Sci., vol. 76, no. 6, pp. 428–446, 2010.
[9] M. Welsh and D. Culler, "Adaptive overload control for busy internet servers," in USITS, 2003.
[10] A. Iosup and D. Epema, "Grid computing workloads: Bags of tasks, workflows, pilots, and others," IEEE Internet Computing, vol. 15, pp. 19–26, 2011.
[11] "The metrics project, Globus metrics," Globus, Tech. Rep. v 1.4, 2007, http://incubator.globus.org/metrics/reports/2007-02.pdf.
[12] V. Hilt and I. Widjaja, "Controlling overload in networks of SIP servers," in ICNP, 2008, pp. 83–93.
[13] D.-M. Chiu and R. Jain, "Analysis of the increase and decrease algorithms for congestion avoidance in computer networks," Comput. Netw. ISDN Syst., vol. 17, no. 1, pp. 1–14, 1989.
[14] J. L. Hellerstein, Y. Diao, S. Parekh, and D. M. Tilbury, Feedback Control of Computing Systems. John Wiley & Sons, 2004.
[15] J. Widmer, R. Denda, and M. Mauve, "A survey on TCP-friendly congestion control," IEEE Network, vol. 15, no. 3, pp. 28–37, May 2001.
[16] S. Elnikety, E. Nahum, J. Tracey, and W. Zwaenepoel, "A method for transparent admission control and request scheduling in e-commerce web sites," in WWW, 2004, pp. 276–286.
[17] N. Mi, G. Casale, A. Riska, Q. Zhang, and E. Smirni, "Autocorrelation-driven load control in distributed systems," in MASCOTS, 2009.
[18] M. Crovella, R. Frangioso, and M. Harchol-Balter, "Connection scheduling in web servers," in USITS, 1999.
[19] O. Sonmez, N. Yigitbasi, S. Abrishami, A. Iosup, and D. Epema, "Performance analysis of dynamic workflow scheduling in multicluster grids," in HPDC, 2010, pp. 49–60.
[20] S. Kleban and S. Clearwater, "Quelling queue storms," in HPDC, 2003, p. 162.
[21] A. Fox, S. Gribble, Y. Chawathe, E. Brewer, and P. Gauthier, "Cluster-based scalable network services," SIGOPS Oper. Syst. Rev., vol. 31, no. 5, pp. 78–91, 1997.
[22] A. Adya, W. Bolosky, R. Chaiken, J. Douceur, J. Howell, and J. Lorch, "Load management in a large-scale decentralized file system," Microsoft Research, Tech. Rep. MSR-TR-2004-60, 2004.
[23] P. Couvares, T. Kosar, A. Roy, J. Weber, and K. Wenger, "Workflow management in Condor," in Workflows for e-Science. Springer, 2007, pp. 357–375.
[24] K. Ryu, J. Hollingsworth, and P. Keleher, "Efficient network and I/O throttling for fine-grain cycle stealing," in SC, 2001, pp. 3–3.

