Quality-of-Service-Aware Scheduling in Heterogeneous Datacenters with Paragon
Paragon, an online, scalable datacenter scheduler, enables better cluster utilization and per-application quality-of-service guarantees by leveraging data mining techniques that find similarities between known and new applications. For a 2,500-workload scenario, Paragon preserves performance constraints for 91 percent of applications, while significantly improving utilization. In comparison, a baseline least-loaded scheduler provides similar guarantees for only 3 percent of workloads.
Efficiency is a first-class requirement and the main source of scalability concerns both for small and large systems.1,2 Achieving high efficiency is not only a matter of sensible design, but also a function of how the system is managed, which becomes essential as the hardware grows progressively heterogeneous and parallel and applications get dynamic and diverse. Architecture has traditionally been about efficient system design. As efficiency increases in importance, architecture should be about both design and management for systems of any scale.

In this article, we focus on improving efficiency while guaranteeing high performance in large-scale systems. Although an increasing amount of computing now happens in public and private clouds, such as Amazon Elastic Compute Cloud (EC2; see http://aws.amazon.com/ec2) or vSphere (www.vmware.com/products/vsphere), datacenters continue to operate at utilizations in the single digits.1,3 This lessens the two main advantages of cloud computing (flexibility and cost efficiency, both for cloud operators and end users), because not only are the machines underutilized, they are also operating in a non-energy-proportional region.1,4
There can be several reasons why machines are underutilized. Two of the most prominent obstacles are interference between coscheduled applications and heterogeneity in server platforms. For more information, see the "Interference and Heterogeneity" sidebar.
In our paper presented at the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2013),5 we introduced Paragon, an online and scalable
Christina Delimitrou
Christos Kozyrakis
Stanford University
Published by the IEEE Computer Society. 0272-1732/14/$31.00 © 2014 IEEE
Interference and Heterogeneity

Interference occurs as coscheduled applications contend in shared resources. Coscheduled applications may interfere negatively even if they run on different processor cores because they share caches, memory channels, storage, and networking devices.1,2 If unmanaged, interference can result in performance degradations of integer factors,2 especially when the application must meet tail latency guarantees apart from average performance.3 Figure A shows that an interference-oblivious scheduler will slow workloads down by 34 percent on average, with some running more than two times slower. This is undesirable for both users and operators.

Heterogeneity is the natural result of the infrastructure's evolution, as servers are gradually provisioned and replaced over the typical 15-year lifetime of a datacenter.4-7 At any point in time, a datacenter may host three to five server generations with a few hardware configurations per generation, in terms of the processor speed, memory, storage, and networking subsystems. Managing the different hardware incorrectly not only causes significant performance degradations to applications sensitive to server configuration, but also wastes resources as workloads occupy servers for significantly longer, and gives a low-quality signal to hardware vendors for the design of future platforms. Figure A shows that a heterogeneity-oblivious scheduler will slow applications down by 22 percent on average, with some running nearly 2 times slower (see the "Methodology" section in the main article).

Finally, a baseline scheduler that is oblivious to both interference and heterogeneity and which schedules applications to least-loaded servers is even worse (48 percent average slowdown), causing some workloads to crash due to resource exhaustion on the server. Unless interference and heterogeneity are managed in a coordinated fashion, the system loses both its efficiency and predictability guarantees. Previous research has identified the issues of heterogeneity6 and interference,2 but while most cloud management systems, such as Mesos8 or vSphere (www.vmware.com/products/vsphere), have some notion of contention or interference awareness, they either use empirical rules for interference management or assume long-running workloads (for example, online services), whose repeated behavior can be progressively modeled. In this article, we target both heterogeneity and interference and assume no a priori analysis of the application. Instead, we leverage information the system already has about the large number of applications it has previously seen.
References
1. S. Govindan et al., "Cuanta: Quantifying Effects of Shared On-Chip Resource Interference for Consolidated Virtual Machines," Proc. 2nd ACM Symp. Cloud Computing, 2011, article no. 22.
2. J. Mars et al., "Bubble-Up: Increasing Utilization in Modern Warehouse Scale Computers via Sensible Co-locations," Proc. 44th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2011, pp. 248-259.
3. D. Meisner et al., "Power Management of Online Data-Intensive Services," Proc. 38th Ann. Int'l Symp. Computer Architecture (ISCA 11), 2011, pp. 319-330.
4. L.A. Barroso and U. Holzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan and Claypool Publishers, 2009.
5. C. Kozyrakis et al., "Server Engineering Insights for Large-Scale Online Services," IEEE Micro, vol. 30, no. 4, 2010, pp. 8-19.
6. J. Mars, L. Tang, and R. Hundt, "Heterogeneity in 'Homogeneous' Warehouse-Scale Computers: A Performance Opportunity," IEEE Computer Architecture Letters, vol. 10, no. 2, 2011, pp. 29-32.
7. R. Nathuji, C. Isci, and E. Gorbatov, "Exploiting Platform Heterogeneity for Power Efficient Data Centers," Proc. 4th Int'l Conf. Autonomic Computing (ICAC 07), 2007, doi:10.1109/ICAC.2007.16.
8. B. Hindman et al., "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center," Proc. 8th USENIX Conf. Networked Systems Design and Implementation, 2011, article no. 22.
[Figure A. Performance degradation for 5,000 applications on 1,000 Amazon Elastic Compute Cloud (EC2) servers with heterogeneity-oblivious, interference-oblivious, and baseline least-loaded schedulers compared to ideal scheduling (application runs alone on the best platform). The y-axis shows speedup over running alone on the best platform; results are ordered from worst- to best-performing workload.]
MAY/JUNE 2014
datacenter scheduler that accounts for heterogeneity and interference. The key feature of Paragon is its ability to quickly and accurately classify an unknown application with respect to heterogeneity (which server configurations it will perform best on) and interference (how much interference it will cause to coscheduled applications and how much interference it can tolerate itself in multiple shared resources). Unlike previous techniques that require detailed profiling of each incoming application, Paragon's classification engine exploits existing data from previously scheduled workloads and requires only a minimal signal about a new workload. Specifically, it is organized as a low-overhead recommendation system similar to the one deployed for the Netflix Challenge,6 but instead of discovering similarities in users' movie preferences, it finds similarities in applications' preferences with respect to heterogeneity and interference. It uses singular value decomposition (SVD) to perform collaborative filtering and identify similarities between incoming and previously scheduled workloads.
Once an incoming application is classified, a greedy scheduler assigns it to the server that is the best possible match in terms of platform and minimum negative interference between all coscheduled workloads. Even though the final step is greedy, the high accuracy of classification leads to schedules that achieve both fast execution time and efficient resource usage. Paragon scales to systems with tens of thousands of servers and tens of configurations, running large numbers of previously unknown workloads. We implemented Paragon and showed that it significantly improves cluster utilization, while preserving per-application quality-of-service (QoS) guarantees both for small- and large-scale systems. For more information on related work, see the "Research Related to Paragon" sidebar.
Fast and accurate classification

The key requirement for heterogeneity- and interference-aware scheduling is to quickly and accurately classify incoming applications. First, we need to know how fast an application will run on each of the tens of server configurations (SCs) available. Second, we need to know how much interference it can tolerate from other workloads in each of several shared resources without significant performance loss and how much interference it will generate itself. Our goal is to perform online scheduling for large-scale systems without any a priori knowledge about incoming applications. Most previous schemes address this issue with detailed but offline application characterization or long-term monitoring and modeling.7-9 Paragon takes a different approach. Its core idea is that, instead of learning each new workload in detail, the system leverages information it already has about applications it has seen to express the new workload as a combination of known applications. For this purpose, we use collaborative filtering techniques that combine a minimal profiling signal about the new application with the large amount of data available from previously scheduled workloads. The result is fast and accurate classification of incoming applications with respect to heterogeneity and interference. Within a minute of its arrival, an incoming workload is scheduled on a large-scale cluster.
Background on collaborative filtering

Collaborative filtering techniques are frequently used in recommendation systems. We use one of their most publicized applications, the Netflix Challenge,6 to provide a quick overview of the two analytical methods we rely on, SVD and PQ reconstruction.10 In this case, the goal is to provide valid movie recommendations for Netflix users given the ratings they have provided for various other movies.

The input to the analytical framework is a sparse matrix A, the utility matrix, with one row per user and one column per movie. The elements of A are the ratings that users have assigned to movies. Each user has rated only a small subset of movies; this is especially true for new users, who might only have a handful of ratings, or even none. Although techniques exist that address the cold-start problem (that is, providing recommendations to a completely fresh user with no ratings), we focus here on users for whom the system has some minimal input. If we can estimate the values of the missing ratings in the sparse matrix A,
TOP PICKS
we can make movie recommendations; that is, we can suggest that users watch the movies for which the recommendation system estimates they will give high ratings to with high confidence.

The first step is to apply SVD, a matrix factorization method used for dimensionality reduction and similarity identification. Factoring A produces the decomposition to the following matrices of left (U) and right (V)
Research Related to Paragon

We discuss work relevant to Paragon in the areas of datacenter scheduling, virtual machine (VM) management, workload rightsizing, and scheduling for heterogeneous multicore chips.

Datacenter scheduling

Recent work on datacenter scheduling has highlighted the importance of platform heterogeneity and workload interference. Mars et al. showed that the performance of Google workloads can vary by up to 40 percent because of heterogeneity, even when considering only two server configurations, and by up to 2 times because of interference, even when considering only two colocated applications.1,2 Govindan et al. also present a scheme to quantify the effects of cache interference between consolidated workloads.3 In Paragon, we extend the concepts of heterogeneity- and interference-aware scheduling by providing an online, scalable, and low-overhead methodology that accurately classifies applications for both heterogeneity and interference across multiple resources.

VM management

Systems such as vSphere (http://www.vmware.com/products/vsphere) or the VM platforms on public cloud providers can schedule diverse workloads submitted by users on the available servers. In general, these platforms account for application resource requirements that they expect the user to express or they learn over time by monitoring workload execution. Paragon can complement such systems by making scheduling decisions on the basis of heterogeneity and interference and detecting when an application should be considered for rescheduling.

Resource management and rightsizing

There has been significant work on resource allocation in virtualized and nonvirtualized large-scale datacenters. Mesos performs resource allocation between distributed computing frameworks such as Hadoop or Spark.4 Rightscale (http://www.rightscale.com) automatically scales out three-tier applications to react to changes in the load in Amazon's cloud service. DejaVu serves a similar goal by identifying a few workload classes and, based on them, reusing previous resource allocations to minimize reallocation overheads.5 In general, Paragon is complementary to rightsizing systems. Once such a system determines the amount of resources needed by an application, Paragon can classify and schedule it on the proper hardware platform in a way that minimizes interference.

Scheduling for heterogeneous multicore chips

Scheduling in heterogeneous CMPs shares some concepts and challenges with scheduling in heterogeneous datacenters; thus, some of the ideas in Paragon can be applied in heterogeneous CMP scheduling as well. Shelepov et al. present a scheduler for heterogeneous CMPs that is simple and scalable,6 whereas Craeynest et al. use performance statistics to estimate which workload-to-core mapping is likely to provide the best performance.7 Given the increasing number of cores per chip and coscheduled tasks, techniques similar to the ones used in Paragon can be applicable when deciding how to schedule applications in heterogeneous CMPs as well.
References
1. J. Mars, L. Tang, and R. Hundt, "Heterogeneity in 'Homogeneous' Warehouse-Scale Computers: A Performance Opportunity," IEEE Computer Architecture Letters, vol. 10, no. 2, 2011, pp. 29-32.
2. J. Mars et al., "Bubble-Up: Increasing Utilization in Modern Warehouse Scale Computers via Sensible Co-locations," Proc. 44th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2011, pp. 248-259.
3. S. Govindan et al., "Cuanta: Quantifying Effects of Shared On-Chip Resource Interference for Consolidated Virtual Machines," Proc. 2nd ACM Symp. Cloud Computing, 2011, article no. 22.
4. B. Hindman et al., "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center," Proc. 8th USENIX Conf. Networked Systems Design and Implementation, 2011, article no. 22.
5. N. Vasic et al., "DejaVu: Accelerating Resource Allocation in Virtualized Environments," Proc. 17th Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 2012, pp. 423-436.
6. D. Shelepov et al., "HASS: A Scheduler for Heterogeneous Multicore Systems," ACM SIGOPS Operating Systems Rev., vol. 43, no. 2, 2009, pp. 66-75.
7. K. Craeynest et al., "Scheduling Heterogeneous Multi-Cores through Performance Impact Estimation (PIE)," Proc. 39th Ann. Int'l Symp. Computer Architecture (ISCA 12), 2012, pp. 213-224.
singular vectors and the diagonal matrix of singular values (Σ):

\[
A_{m \times n} =
\begin{pmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\
a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m,1} & a_{m,2} & \cdots & a_{m,n}
\end{pmatrix}
= U \cdot \Sigma \cdot V^T
\]

where

\[
U_{m \times r} =
\begin{pmatrix}
u_{1,1} & \cdots & u_{1,r} \\
\vdots & \ddots & \vdots \\
u_{m,1} & \cdots & u_{m,r}
\end{pmatrix},
\quad
V_{n \times r} =
\begin{pmatrix}
v_{1,1} & \cdots & v_{1,r} \\
\vdots & \ddots & \vdots \\
v_{n,1} & \cdots & v_{n,r}
\end{pmatrix},
\quad
\Sigma_{r \times r} =
\begin{pmatrix}
\sigma_1 & \cdots & 0 \\
\vdots & \ddots & \vdots \\
0 & \cdots & \sigma_r
\end{pmatrix}
\]
Dimension r is the rank of matrix A, and it represents the number of similarity concepts identified by SVD. For instance, one similarity concept might be that certain movies belong to the drama category, while another might be that most users who liked the movie The Lord of the Rings: The Fellowship of the Ring also liked The Lord of the Rings: The Two Towers. Similarity concepts are represented by singular values (σ_i) in matrix Σ, and the confidence in a similarity concept by the magnitude of the corresponding singular value. Singular values in Σ are ordered by decreasing magnitude. Matrix U captures the strength of the correlation between a row of A and a similarity concept. In other words, it expresses how users relate to similarity concepts such as the one about liking drama movies. Matrix V captures the strength of the correlation of a column of A to a similarity concept. In other words, to what extent does a movie fall in the drama category? The complexity of performing SVD on an m × n matrix is O(min(n²m, m²n)). SVD is robust to missing entries and imposes relaxed sparsity constraints to provide accuracy guarantees.
Before we can make accurate score estimations using SVD, we need the full utility matrix A. To recover the missing entries in A, we use PQ reconstruction. Building from the decomposition of the initial sparse matrix A, we have Q_{m×r} = U and P^T_{r×n} = Σ · V^T. The product of Q and P^T gives matrix R, which is an approximation of A with the missing entries. To improve R, we use stochastic gradient descent (SGD), a scalable and lightweight latent-factor model that iteratively recreates A:

\[
\forall r_{ui} \text{, where } r_{ui} \text{ is an element of the reconstructed matrix } R:
\]
\[
\epsilon_{ui} = r_{ui} - q_i \cdot p_u^T
\]
\[
q_i \leftarrow q_i + \eta\,(\epsilon_{ui}\, p_u - \lambda\, q_i), \qquad
p_u \leftarrow p_u + \eta\,(\epsilon_{ui}\, q_i - \lambda\, p_u)
\]
\[
\text{until } \|\epsilon\|_{L2} = \sqrt{\textstyle\sum_{u,i} |\epsilon_{ui}|^2} \text{ becomes marginal.}
\]

In this process, η is the learning rate and λ is the regularization factor. The complexity of PQ reconstruction is linear in the number of known entries r_{ui}, and in practice it takes up to a few milliseconds for matrices whose m and n equal about 1,000. Once the dense utility matrix R is recovered, we can make movie recommendations. This involves applying SVD to R to identify which of the reconstructed entries reflect strong similarities that enable making accurate recommendations with high confidence.
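The pipeline above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the toy utility matrix, rank, learning rate, and regularization factor are all made-up values, and missing ratings are marked by zeros for simplicity.

```python
# Sketch of PQ reconstruction via SGD, followed by SVD on the dense result.
import numpy as np

def pq_reconstruct(A, mask, r=2, eta=0.01, lam=0.1, iters=2000):
    """Fill the missing entries of the sparse utility matrix A.
    mask[u, i] is True where a rating r_ui is known."""
    m, n = A.shape
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(m, r))   # per-user latent factors (p_u)
    Q = rng.normal(scale=0.1, size=(n, r))   # per-item latent factors (q_i)
    for _ in range(iters):
        for u, i in zip(*np.nonzero(mask)):
            err = A[u, i] - P[u] @ Q[i]      # e_ui = r_ui - q_i . p_u^T
            P[u] += eta * (err * Q[i] - lam * P[u])
            Q[i] += eta * (err * P[u] - lam * Q[i])
    return P @ Q.T                           # dense approximation R

# Toy utility matrix: rows are users (or applications), columns are
# movies (or server configurations); zeros mark missing ratings.
A = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0],
              [0.0, 1.0, 5.0, 4.0]])
mask = A > 0
R = pq_reconstruct(A, mask)

# SVD on the dense reconstruction exposes the similarity concepts;
# numpy returns the singular values in decreasing order of magnitude.
U, sigma, Vt = np.linalg.svd(R, full_matrices=False)
print(np.round(R, 1))
```

The known entries of A are reproduced closely, while the former zeros now hold predicted ratings that can drive recommendations.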
Classification for heterogeneity

We use collaborative filtering to identify how well a previously unknown workload will run on different hardware platforms. The rows in matrix A represent applications, the columns represent server configurations (SCs), and the ratings represent normalized application performance on each SC. As part of an offline step, we select a small number of applications and profile them on all the different SCs. This provides some initial information to the classification engine to address the cold-start problem that would otherwise occur. It only needs to happen once in the system.

During regular operation, when an application arrives, we profile it for 1 minute on any two SCs, insert it as a new row in matrix A, and use the process described previously to derive the missing ratings for the other server configurations. In this case, Σ represents similarity concepts such as the fact that applications that benefit from SC1 will also benefit
from SC3. U captures how an application correlates to the different similarity concepts, and V shows how an SC correlates to them. Collaborative filtering identifies similarities between new and known applications. Two applications can be similar in one characteristic (for instance, they both benefit from high clock frequency) but different in others (for example, only one benefits from a large L3 cache). This is especially common when scaling to large application spaces and hardware configurations. SVD addresses this issue by uncovering hidden similarities and filtering out the ones less likely to have an impact on the application's behavior.
As incoming applications are added in A, the density of the matrix increases and the recommendation accuracy improves. Note that online training is performed only on two SCs. This reduces the training overhead and the number of servers needed for it compared to exhaustive search. In contrast, if we attempted an exhaustive application profiling, the number of profiling runs would equal the number of SCs. For a cloud service with high workload arrival rates, this would be infeasible to support. On a production-class Xeon server, classification takes 10 to 30 milliseconds for thousands of applications and tens of SCs. We can perform classification for one application at a time or for small groups of incoming applications (batching) if the arrival rate is high, without impacting accuracy or speed.
Performance scores. We use the following performance metrics according to the application type:

- Single-threaded workloads: We use instructions committed per second (IPS) as the initial performance metric. Using execution time would require running applications to completion during profiling, increasing overheads. We have verified that IPS leads to similar classification accuracy as using time to completion. For multiprogrammed workloads, we use aggregate IPS.
- Multithreaded workloads: In the presence of spinlocks or other synchronization schemes, IPS can be deceptive. We address this by detecting active waiting and weighting such execution segments out of the IPS computation. We verified that using this "useful" IPS leads to similar classification accuracy as using the full execution time.

The choice of IPS is influenced by our current evaluation, which focuses on single-node CPU-, memory-, and I/O-intensive programs. The same methodology can be extended to higher-level metrics, such as queries per second (QPS), which cover complex multitier workloads as well.
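One plausible reading of the "useful IPS" idea above can be sketched as follows. The sample format, the spin-detection flag, and the choice to keep wall-clock time in the denominator are our assumptions for illustration; the paper does not specify this level of detail.

```python
# Hypothetical sketch: weight instructions retired during detected
# active-waiting (spinning) segments out of the IPS computation.

def useful_ips(segments):
    """segments: list of (instructions, seconds, is_spinning) samples.
    Spinning instructions are excluded; wall-clock time still counts."""
    instr = sum(i for i, s, spin in segments if not spin)
    time = sum(s for _, s, _ in segments)
    return instr / time if time > 0 else 0.0

# Two seconds of real work plus one second of spinlock busy-waiting:
samples = [(3e9, 1.0, False), (2.5e9, 1.0, False), (4e9, 1.0, True)]
print(useful_ips(samples))
```

The 4 billion instructions spent spinning inflate raw IPS but contribute nothing to the useful-IPS score.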
Validation. We evaluate the accuracy of heterogeneity classification on a 40-server cluster with 10 SCs with a large set of diverse applications. The offline training set includes 20 randomly selected applications. Using the classification output for scheduling improves performance by 24 percent for single-threaded workloads, 20 percent for multithreaded workloads, 38 percent for multiprogrammed workloads, and 40 percent for I/O workloads, on average, while some applications have a 2× performance difference. Table 1 summarizes key statistics on the validation study. It is important to note that the accuracy does not depend on the SCs selected for training, which matched the top-performing configuration for only 20 percent of workloads. We also compare performance predicted by the recommendation system to performance obtained through experimentation. The deviation is 3.8 percent on average.
Classification for interference

We are interested in two types of interference: that which an application can tolerate from preexisting load on a server, and that which the application will cause on that load. We detect interference due to contention and assign a score to the sensitivity of an application to a type of interference. To derive sensitivity scores, we develop several microbenchmarks (sources of interference, or SoIs), each stressing a specific shared resource with tunable intensity.11 SoIs span the core, memory, and cache hierarchy and network and storage bandwidth. We run an application concurrently with a microbenchmark
and progressively tune up its intensity until the application violates its QoS. Applications with high tolerance to interference (for example, a sensitivity score over 60 percent) are easier to coschedule than applications with low tolerance. Similarly, we detect the sensitivity of a microbenchmark to the interference the application causes by tuning up its intensity and recording when the microbenchmark's performance degrades by 5 percent compared to its performance in isolation. In this case, high sensitivity scores correspond to applications that cause a lot of interference in the specific shared resource.
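The tune-up-until-violation loop above can be sketched as follows. Here run_colocated is a stand-in for actually coscheduling the application with the SoI microbenchmark and measuring its performance; the analytic model, step size, and 95 percent QoS threshold are illustrative assumptions.

```python
# Sketch of deriving a sensitivity score against one tunable
# source of interference (SoI).

def sensitivity_score(perf_alone, run_colocated, qos=0.95, step=5):
    """Raise SoI intensity (0 to 100 percent) until the application
    violates its QoS; the score is the highest intensity tolerated."""
    tolerated = 0
    for intensity in range(0, 101, step):
        if run_colocated(intensity) < qos * perf_alone:
            break
        tolerated = intensity
    return tolerated

# Hypothetical app whose performance degrades linearly past 40% intensity.
model = lambda inten: 100.0 * (1.0 - max(0, inten - 40) * 0.004)
print(sensitivity_score(100.0, model))  # prints 50
```

A score over 60 would mark the application as easy to coschedule on this resource; this one tolerates only moderate cache or bandwidth pressure.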
Collaborative filtering for interference. We classify applications for interference tolerated and caused, using twice the process described earlier. The two utility matrices have applications as rows and SoIs as columns. The elements of the matrices are the sensitivity scores of an application to the corresponding microbenchmark. Similarly to classification for heterogeneity, we profile a few applications offline against all SoIs and insert them as dense rows in the utility matrices. In the online mode, each new application is profiled against two randomly chosen microbenchmarks for one minute, and its sensitivity scores are added in a new row in each of the matrices. Then, we use SVD and PQ reconstruction to derive the missing entries and the confidence in each similarity concept.
Validation. We evaluated the accuracy of interference classification using the same workloads and systems as before. Table 2 summarizes key statistics on the classification quality. The average error in estimating both tolerated and caused interference across SoIs is 5.3 percent. For high values of sensitivity (that is, applications that tolerate and cause a lot of interference), the error is even lower (3.4 percent).
Putting it all together

Overall, Paragon requires two short runs (approximately 1 minute each) on two SCs to classify incoming applications for heterogeneity. Another two short runs against two microbenchmarks on a high-end SC are needed for interference classification. Running for 1 minute provides some signal on the new workload without introducing significant
Table 1. Validation of heterogeneity classification.

Metric                                    | Single-threaded (%) | Multithreaded (%) | Multiprogrammed (%) | I/O-bound (%)
Selected best platform                    | 86                  | 86                | 83                  | 89
Selected platform within 5% of best       | 91                  | 90                | 89                  | 92
Correct platform ranking (best to worst)  | 67                  | 62                | 59                  | 43
90% correct platform ranking              | 78                  | 71                | 63                  | 58
Training and best selected platform match | 28                  | 24                | 18                  | 22
Table 2. Validation of interference classification.

Metric                                                              | Percentage (%)
Average estimation error of sensitivity across all examined resources | 5.3
Average estimation error for sensitivities > 60%                    | 3.4
Applications with < 5% estimation error                             | 59.0
Resource with highest estimation error: L1 instruction cache        | 15.8
Frequency L1 instruction cache used for training                    | 14.6
Resource with lowest estimation error: storage bandwidth            | 0.9
profiling overheads. In our full paper,5 we discuss the issue of workload phases (that is, transient effects that do not appear in the 1-minute profiling period). Next, we use collaborative filtering to classify the application in terms of heterogeneity and interference. This requires a few milliseconds even when considering thousands of applications and several tens of SCs or SoIs. Classification for heterogeneity and interference is performed in parallel. For the applications we considered, the overall profiling and classification overheads are 1.2 and 0.09 percent on average.
Using analytical methods for classification has two benefits. First, we have strong analytical guarantees on the quality of the information used for scheduling, instead of relying mainly on empirical observation. The analytical framework provides low and tight error bounds on the accuracy of classification, statistical guarantees on the quality of colocation candidates, and detailed characterization of system behavior. Moreover, the scheduler design is workload independent, which means that the properties the scheme provides hold for any workload. Second, these methods are computationally efficient, scale well with the number of applications and SCs, and do not introduce significant scheduling overheads.
Paragon

Once an incoming application is classified with respect to heterogeneity and interference, Paragon schedules it on one of the available servers. The scheduler attempts to assign each workload to the server of the best SC and colocate it with applications so that interference is minimized for workloads running on the same server.
Scheduler design

Figure 1 presents an overview of Paragon's components and operation. The scheduler maintains per-application and per-server state. The per-application state includes the classification information; for a datacenter with 10 SCs and 10 SoIs, it is 64 bytes per application. The per-server state records the IDs of applications running on a server and the cumulative sensitivity to interference (roughly 64 bytes per server). The per-server state is updated as applications are scheduled and, later on, completed. Overall, state overheads are marginal and scale logarithmically or linearly with the number of applications (N) and servers (M). In our experiments with thousands of applications and servers, a single server could handle all processing and storage requirements of scheduling, although additional servers can be used for fault tolerance.
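The per-application and per-server state described above might look like the following sketch. The field layout and update helper are ours; only the rough contents (per-SC and per-SoI scores, app IDs, cumulative sensitivities) and the aggregation rules stated later in the article (sum for caused, minimum for tolerated interference) come from the text.

```python
# Illustrative sketch of Paragon's scheduler state (layout assumed).
from dataclasses import dataclass, field

@dataclass
class AppState:
    sc_scores: list      # normalized performance per server configuration
    tolerated: list      # interference tolerated per SoI (0-100)
    caused: list         # interference caused per SoI (0-100)

@dataclass
class ServerState:
    sc: int              # server configuration ID
    app_ids: list = field(default_factory=list)
    tolerated: list = field(default_factory=lambda: [100] * 10)
    caused: list = field(default_factory=lambda: [0] * 10)

def place(server, app_id, app):
    """Update per-server state when an application is scheduled on it."""
    server.app_ids.append(app_id)
    # Caused interference accumulates as a sum across coscheduled apps;
    # tolerated interference is the minimum across coscheduled apps.
    server.caused = [c + a for c, a in zip(server.caused, app.caused)]
    server.tolerated = [min(t, a) for t, a in zip(server.tolerated, app.tolerated)]

srv = ServerState(sc=3)
app = AppState(sc_scores=[0.9] * 10, tolerated=[60] * 10, caused=[20] * 10)
place(srv, 1, app)
print(srv.tolerated[0], srv.caused[0])
```

State is updated again (in the reverse direction) when an application completes, keeping the per-server vectors consistent with the current load.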
Greedy server selection

In examining candidates, the scheduler considers two factors: first, which assignments minimize negative interference between the new application and existing load, and second, which servers have the best SC for this workload.
[Figure 1. The components of Paragon and the state maintained by each component: step 1, application classification for heterogeneity and interference (SVD + PQ reconstruction); step 2, server selection over the datacenter servers. Overall, the state requirements are marginal and scale linearly or logarithmically with the number of applications (N), servers (M), and configurations. (PQ: PQ reconstruction; SVD: singular value decomposition; DC: datacenter.)]
.................................................................
MAY/JUNE 2014 9
The scheduler evaluates two metrics, D1 = t_server - c_newapp and D2 = t_newapp - c_server, where t is the sensitivity score for tolerated interference and c the score for caused interference, for a specific SoI. The cumulative sensitivity of a server to caused interference is the sum of the sensitivities of the individual applications running on it, whereas the sensitivity to tolerated interference is the minimum of these values. The optimal candidate is a server for which D1 and D2 are exactly zero for all SoIs, which implies no negative impact from interference and perfect resource usage. In practice, a good selection is one where D1 and D2 are positive and small for all SoIs. Large, positive values for D1 and D2 indicate suboptimal resource utilization. Negative values for D1 or D2 imply a violation of QoS.

We examine candidate servers for an application in the following way. The process is explained for interference tolerated by the server and caused by the new workload (D1) and is exactly the same for D2. We start from the resource the new application is most sensitive to. We select the server set for which D1 is non-negative for this SoI. Next, we examine the second SoI in order of decreasing sensitivity scores, filtering out any servers for which D1 is negative, until all SoIs have been examined. Then, we take the intersection of the server sets for D1 and D2 and select the machine with the best SC and with the minimum L1 norm ||D1 + D2||.
As we filter out servers, at some point the set of candidate servers might become empty. This implies that there is no single server for which D1 and D2 are non-negative for some SoI. Although unlikely, we support this event with backtracking and QoS relaxation. Given M servers, the worst-case complexity is O(M x SoI^2), because, theoretically, backtracking might extend all the way to the first SoI. In practice, however, we observe that for a 1,000-server system, 89 percent of applications were scheduled without any backtracking. For 8 percent of the remaining applications, backtracking led to negative D1 or D2 for a single SoI (and for 3 percent, for multiple SoIs). Additionally, we bound the runtime of the greedy search using a timeout mechanism, after which the best server among those already examined is selected.
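The greedy filtering described above can be sketched as follows. This is a simplified illustration: the field names (`tolerated_min`, `caused_sum`, `sc_score`), the dictionary-based server representation, and the tie-breaking order are assumptions, and backtracking, QoS relaxation, and the timeout are omitted.

```python
def greedy_select(app, servers):
    """Greedy candidate filtering (sketch). For each SoI, in decreasing order of
    the application's sensitivity, drop servers where D1 or D2 goes negative;
    among the survivors, pick the best SC with the smallest L1 norm of D1 + D2."""
    n_sois = len(app["caused"])

    def deltas(srv):
        # D1: interference the server tolerates minus what the new app causes.
        d1 = [srv["tolerated_min"][i] - app["caused"][i] for i in range(n_sois)]
        # D2: interference the new app tolerates minus what the server causes.
        d2 = [app["tolerated"][i] - srv["caused_sum"][i] for i in range(n_sois)]
        return d1, d2

    # Examine SoIs starting from the resource the application is most sensitive to.
    order = sorted(range(n_sois), key=lambda i: app["caused"][i], reverse=True)

    candidates = list(servers)
    for i in order:
        kept = [s for s in candidates
                if deltas(s)[0][i] >= 0 and deltas(s)[1][i] >= 0]
        if not kept:  # the real scheduler backtracks and relaxes QoS here
            break
        candidates = kept

    # Prefer the best SC score, then the smallest combined L1 slack.
    return min(candidates,
               key=lambda s: (-s["sc_score"],
                              sum(abs(a + b) for a, b in zip(*deltas(s)))))
```

In this sketch a server that tolerates little interference, or causes more than the new application tolerates, is filtered out at the SoI where the violation first appears, which matches the one-SoI-at-a-time narrowing described in the text.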
Our full paper includes a discussion of workload phases and of applicability to multitier latency-critical applications.5
Evaluation methodology
In the following paragraphs, we describe the server systems, alternative schedulers, applications, and workload scenarios used in our evaluation.
We evaluated Paragon on a 1,000-server cluster on Amazon EC2 with 14 instance types, from small to extra large.12 All instances were exclusive (reserved); that is, no other users had access to the servers. There were no external scheduling decisions or actions, such as auto-scaling or workload migration, during the course of the experiments.
We compared Paragon to three schedulers. The first is a baseline scheduler that assigns applications to the least-loaded (LL) machines, accounting for their core and memory requirements but ignoring their heterogeneity and interference profiles. The second is a heterogeneity-oblivious (NH) scheme that uses Paragon's interference classification to assign applications to servers, without visibility into their SCs. The third is an interference-oblivious (NI) scheme that uses the heterogeneity classification but has no insight into workload interference.
We used 400 single-threaded (ST), multithreaded (MT), and multiprogrammed (MP) applications from SPEC CPU2006, several multithreaded benchmark suites,5 and SPECjbb. For the multiprogrammed workloads, we created 350 mixes of four SPEC applications. We also used 26 I/O-bound workloads in Hadoop and Matlab, each running on a single node. Workload durations range from minutes to hours. For workload scenarios with more than 426 applications, we replicated these workloads with equal likelihoods (1/4 ST, 1/4 MT, 1/4 MP, and 1/4 I/O) and randomized their interleaving.
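The replication scheme above (equal likelihoods across the four workload classes with randomized interleaving) can be sketched as a simple generator. This is a hypothetical illustration: the function name, the 1-second submission interval parameter, and the class labels are ours, not the actual experimental harness.

```python
import random

def make_workload_stream(n_apps, interval_sec=1.0, seed=None):
    """Generate (arrival_time, workload_class) pairs: classes appear with equal
    likelihood (1/4 ST, 1/4 MT, 1/4 MP, 1/4 I/O), the interleaving is
    randomized, and one workload is submitted every `interval_sec` seconds."""
    rng = random.Random(seed)
    classes = ["ST", "MT", "MP", "IO"]
    # Replicate the base workloads in equal proportions, then shuffle the order.
    stream = [classes[i % 4] for i in range(n_apps)]
    rng.shuffle(stream)
    return [(i * interval_sec, c) for i, c in enumerate(stream)]

# Roughly matches the paper's low-load scenario: 2,500 randomly chosen
# applications submitted at 1-second intervals.
low_load = make_workload_stream(2500, seed=42)
```

The high-load and oversubscribed scenarios would be generated the same way, with the oversubscribed case additionally injecting a burst of arrivals at sub-0.1-second spacing partway through the stream.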
We used the applications listed in this section to examine the following scenarios: a low-load scenario with 2,500 randomly chosen applications submitted at 1-second intervals, a high-load scenario with 5,000 applications submitted at 1-second intervals, and an oversubscribed scenario in which 7,500 workloads are submitted at 1-second intervals and an additional 1,000 applications arrive in a burst (at less than 0.1-second intervals) after the first 3,750 workloads.

TOP PICKS
10 IEEE MICRO
Evaluation
We evaluated the Paragon scheduler against the LL, NH, and NI schedulers with respect to performance, decision quality, resource allocation, and cluster utilization.
Performance impact
Figure 2 shows the performance for the three workload scenarios on the 1,000-server EC2 cluster. The low-load scenario, in general, does not create significant performance challenges. Nevertheless, Paragon outperforms the other three schemes; it preserves QoS for 91 percent of workloads and achieves, on average, 96 percent of the performance of a workload running in isolation in the best SC. When moving to the high-load scenario, the difference between schedulers becomes more obvious. Whereas the heterogeneity-oblivious and interference-oblivious schemes degrade performance by an average of 22 and 34 percent and violate QoS for 96 and 97 percent of workloads, respectively, Paragon degrades performance by only 4 percent and guarantees QoS for 61 percent of workloads. The least-loaded scheduler degrades performance by 48 percent on average, with some applications not terminating successfully. The differences in performance are larger for workloads submitted when the system is heavily loaded.

Finally, for the oversubscribed case, NH, NI, and LL dramatically degrade performance for most workloads, while the number of applications that do not terminate successfully increases to 10.4 percent for LL. Paragon, on the other hand, preserves QoS guarantees for 52 percent of workloads, while the other schedulers provide similar guarantees for only 5, 1, and 0.09 percent of workloads, respectively. Additionally, Paragon limits degradation to less than 10 percent for an additional 33 percent of applications and maintains moderate performance degradation overall (no performance cliffs similar to NH's for applications 1 through 1,000).

[Figure 2 plots speedup over running alone on the best platform, with workloads ordered from worst to best performing, for the low-load (a), high-load (b), and oversubscribed (c) scenarios; the curves correspond to alone on best platform, NH, NI, LL, and Paragon (P).]
Figure 2. Performance comparison between the four schedulers for three workload scenarios on 1,000 Amazon Elastic Compute Cloud (EC2) servers. Performance is normalized to optimal performance in isolation, and applications are ordered from worst to best performing.
Decision quality
Figure 3 shows a breakdown of the decision quality of the different schedulers for heterogeneity (left) and interference (right) across the three scenarios. LL induces more than 20 percent performance degradation for most applications, due both to heterogeneity and to interference. NH has low decision quality in terms of platform selection, whereas NI causes performance degradation by colocating unsuitable applications. The errors increase as we move to scenarios of higher load. Paragon decides optimally for 65 percent of applications for heterogeneity and 75 percent for interference, on average, significantly more than the other schedulers. It also constrains decisions that lead to greater than 20 percent degradation to less than 8 percent of workloads.
[Figure 3 shows, for each scheduler (LL, NH, NI, P) and each scenario (low load, high load, oversubscribed), the percentage of applications experiencing no degradation, less than 10 percent, less than 20 percent, and more than 20 percent degradation.]
Figure 3. Breakdown of decision quality for the four schedulers across the three EC2 scenarios. Different colors correspond to different impacts on application performance in terms of heterogeneity (left) and interference (right).

Resource allocation
Figure 4 shows why this deviation exists. The solid black line in each graph represents the required core count based on the applications running at a snapshot of the system, while the other lines show the cores allocated by each of the schedulers. Because Paragon optimizes for increased utilization within QoS constraints, it follows the application requirements closely. It deviates only when the required core count exceeds the resources available in the system (the oversubscribed case). NH has mediocre accuracy, whereas NI and LL either significantly overprovision the number of allocated cores or oversubscribe certain servers. There are two important points in these graphs. First, as the load increases, the deviation of execution time from optimal increases for NH, NI, and LL, whereas Paragon approximates it closely. Second, for high loads, the errors in core allocation increase dramatically for the other three schedulers, whereas for Paragon the average deviation remains approximately constant, excluding the part where the system is oversubscribed.
[Figure 4 plots the allocated core count over time (in minutes) for the low-load (a), high-load (b), and oversubscribed (c) scenarios, comparing the required cores against the allocations of NH, NI, LL, and Paragon.]
Figure 4. Resource allocation for the three workload scenarios. Each line corresponds to the number of allocated computing cores at each point during the execution of the scenario. Although the heterogeneity-oblivious (NH), interference-oblivious (NI), and least-loaded (LL) schedulers under- or overestimate the required resources, Paragon closely follows the application resource requirements.

Cluster utilization
Figure 5 shows the cluster utilization in the high-load scenario for LL and Paragon in the form of heat maps. Utilization is shown for each individual server throughout the duration of the experiment and is averaged across the server's cores every 5 seconds. Whereas with LL utilization does not exceed 20 percent for the majority of the time, Paragon achieves an average utilization of 52 percent. Additionally, because workloads run closer to their QoS requirements, the scenario completes in 19 percent less time.

[Figure 5 shows per-server CPU utilization heat maps over time (in minutes) for the least-loaded scheduler (a) and Paragon (b) across the 1,000 servers.]
Figure 5. CPU utilization heat maps for the high-load scenario for the least-loaded system and Paragon. Utilization is averaged across the cores of a server and is sampled every 5 seconds. Darker colors correspond to higher CPU utilization.
The Paragon scheduler moves away from the traditional empirical design approach in computer architecture and systems and adopts a more data-driven approach. In the past few years, we have entered an era in which data has become so vast and rich that it can provide much better (and faster) insight into design decisions than the traditional trial-and-error approach can. Applying such techniques to datacenter scheduling with significant gains is proof of the value of using data to drive system design and management decisions. There are other highly dimensional problems where similar techniques can prove effective, such as large design-space explorations for processors13 or memory systems, or the more general cluster-management problem in cloud providers. The latter becomes increasingly challenging because many cloud applications are multitier workloads with complex dependencies that must satisfy strict tail-latency guarantees. Additionally, issues like heterogeneity and interference are not relevant only to datacenters. Systems of all scales, from low-power mobile devices to traditional CMPs and large-scale cloud computing facilities, face similar challenges, which makes employing techniques that work online, are fast, and can handle huge spaces a pressing need.

Determining which data can offer valuable insights for system decisions, and designing efficient techniques to collect and mine that data in a way that leverages its nature and characteristics, is a significant challenge moving forward.
Acknowledgments
We sincerely thank John Ousterhout, Mendel Rosenblum, Byung-Gon Chun, Daniel Sanchez, Jacob Leverich, David Lo, and the anonymous reviewers for their feedback on earlier versions of this manuscript. This work was partially supported by a Google-directed research grant on energy-proportional computing. Christina Delimitrou was supported by a Stanford Graduate Fellowship.
References
1. L.A. Barroso and U. Holzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan and Claypool, 2009.
2. J. Rabaey et al., "Beyond the Horizon: The Next 10x Reduction in Power—Challenges and Solutions," Proc. IEEE Int'l Solid-State Circuits Conf., 2011, doi:10.1109/ISSCC.2011.5746206.
3. L. Barroso, "Warehouse-Scale Computing: Entering the Teenage Decade," Proc. 38th Ann. Int'l Symp. Computer Architecture (ISCA 11), 2011.
4. D. Meisner et al., "Power Management of Online Data-Intensive Services," Proc. 38th Ann. Int'l Symp. Computer Architecture (ISCA 11), 2011, pp. 319-330.
5. C. Delimitrou and C. Kozyrakis, "Paragon: QoS-Aware Scheduling in Heterogeneous Datacenters," Proc. 18th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 13), 2013, pp. 77-88.
6. R.M. Bell, Y. Koren, and C. Volinsky, The BellKor 2008 Solution to the Netflix Prize, tech. report, AT&T Labs, Oct. 2007.
7. J. Mars et al., "Bubble-Up: Increasing Utilization in Modern Warehouse Scale Computers via Sensible Co-locations," Proc. 44th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2011, pp. 248-259.
8. R. Nathuji, C. Isci, and E. Gorbatov, "Exploiting Platform Heterogeneity for Power Efficient Data Centers," Proc. 4th Int'l Conf. Autonomic Computing (ICAC 07), 2007, doi:10.1109/ICAC.2007.16.
9. N. Vasic et al., "DejaVu: Accelerating Resource Allocation in Virtualized Environments," Proc. 17th Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 2012, pp. 423-436.
10. A. Rajaraman and J.D. Ullman, Mining of Massive Datasets, Cambridge Univ. Press, 2011.
11. C. Delimitrou and C. Kozyrakis, "iBench: Quantifying Interference for Datacenter Workloads," Proc. IEEE Int'l Symp. Workload Characterization, 2013, pp. 23-33.
12. C. Delimitrou and C. Kozyrakis, "QoS-Aware Scheduling in Heterogeneous Datacenters with Paragon," ACM Trans. Computer Systems, vol. 31, no. 4, 2013, article no. 12.
13. O. Azizi et al., "Energy Performance Tradeoffs in Processor Architecture and Circuit Design: A Marginal Cost Analysis," Proc. 37th Ann. Int'l Symp. Computer Architecture (ISCA 10), 2010, pp. 26-36.
Christina Delimitrou is a PhD student in the Department of Electrical Engineering at Stanford University. Her research focuses on large-scale datacenters, specifically on scheduling and resource-allocation techniques with quality-of-service guarantees, practical cluster-management systems that improve resource efficiency, and datacenter application analysis and modeling. Delimitrou has an MS in electrical engineering from Stanford University. She is a student member of IEEE and the ACM.
Christos Kozyrakis is an associate professor in the Departments of Electrical Engineering and Computer Science at Stanford University, where he investigates hardware architectures, system software, and programming models for systems ranging from cell phones to warehouse-scale datacenters. His research focuses on resource-efficient cloud computing, energy-efficient multicore systems, and architectural support for security. Kozyrakis has a PhD in computer science from the University of California, Berkeley. He is a senior member of IEEE and the ACM.
Direct questions and comments about this article to Christina Delimitrou, Gates Hall, 353 Serra Mall, Room 316, Stanford, CA 94305; [email protected].