
Whare-Map: Heterogeneity in “Homogeneous” Warehouse-Scale Computers

Jason Mars∗
Univ. of California, San Diego
[email protected]

Lingjia Tang
Univ. of California, San Diego
[email protected]

ABSTRACT

Modern “warehouse scale computers” (WSCs) continue to be embraced as homogeneous computing platforms. However, due to frequent machine replacements and upgrades, modern WSCs are in fact composed of diverse commodity microarchitectures and machine configurations. Yet, current WSCs are architected with the assumption of homogeneity, leaving a potentially significant performance opportunity unexplored.

In this paper, we expose and quantify the performance impact of the “homogeneity assumption” for modern production WSCs using industry-strength large-scale web-service workloads. In addition, we argue for, and evaluate the benefits of, a heterogeneity-aware WSC using commercial web-service production workloads including Google’s websearch. We also identify key factors impacting the available performance opportunity when exploiting heterogeneity and introduce a new metric, opportunity factor, to quantify an application’s sensitivity to the heterogeneity in a given WSC. To exploit heterogeneity in “homogeneous” WSCs, we propose “Whare-Map,” the WSC Heterogeneity Aware Mapper that leverages already in-place continuous profiling subsystems found in production environments. When employing Whare-Map, we observe a cluster-wide performance improvement of 15% on average over heterogeneity-oblivious job placement and up to an 80% improvement for web-service applications that are particularly sensitive to heterogeneity.

1. INTRODUCTION

Warehouse-scale computers (WSCs) [7, 18] are the class of datacenters that are designed, built, and optimized to run a number of large data-intensive web-service applications. Internet service companies such as Google, Microsoft, Amazon, Yahoo, and Apple spend hundreds of millions of dollars to construct and operate WSCs that provide web-services such as search, mail, maps, docs, and video [1, 6, 11, 25]. This large cost stems from the machines themselves, power distribution and cooling, the power itself, networking equipment, and other infrastructure [14, 15].

∗This work was in part completed while interning at Google.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ISCA’13, Tel-Aviv, Israel. Copyright 2013 ACM 978-1-4503-2079-5/13/06 ...$15.00.

Table 1: # of Machine Configs. in Google WSCs

D0  D1  D2  D3  D4  D5  D6  D7  D8  D9
 4   3   2   3   2   3   2   5   2   2

[Figure 1: The Homogeneous Assumption - The Job Manager's View of a WSC. Jobs (J1, J2, J3, J4, ...) are mapped onto what the job manager treats as a homogeneous pool of cores, while the actual machines are heterogeneous.]

Improving the overall performance of jobs running in WSCs has been identified as one of the top priorities of web-service companies, as it improves the overall cost efficiency of operating a WSC.

WSCs have been embraced as homogeneous computing environments [6–8]. However, this is not the case in practice. These WSCs are typically composed of cheap and replaceable commodity components. As machines are replaced in these WSCs, new generations of hardware are deployed while older generations continue to operate. This leads to a WSC that is composed of a mix of machine platforms, i.e., a heterogeneous WSC. Table 1 shows the amount of platform diversity found in 10 randomly selected anonymized Google WSCs in operation. As shown in the table, these 10 datacenters house as few as two and as many as five different microarchitectural configurations, including both Intel and AMD servers from several consecutive generations. Yet, the assumption of homogeneity has been a core design philosophy behind the job management subsystems and system software stack of modern WSCs [7]. As Figure 1 shows, the job manager views a WSC as a collection of tens to hundreds of thousands of cores with the assumption of homogeneity. Available machine resources are assigned to jobs according to their core and memory requirements. The diversity across the underlying microarchitectures in a WSC is not explicitly considered by the job management subsystem. However, as we show in this work, ignoring this heterogeneity is costly, leading to inefficient execution of applications in WSCs.


While prior work [9, 13, 35, 36] has acknowledged the heterogeneity in various types of datacenter systems, the homogeneous assumption is still widely adopted in modern WSCs due to 1) the limited understanding of the performance cost of this assumption in real commercial systems and 2) the lack of a practical system design to exploit the heterogeneity in production.

There are two key insights to consider in understanding the heterogeneity present in emerging WSC system architectures:

1. The heterogeneity in WSCs differs from that found in a heterogeneous multicore chip, or the heterogeneity across processors in a single machine. In a WSC, it is the diversity in execution environments that must be considered. Broadly, we define an application's execution environment as the set of factors that can influence the execution of the application. In the scope of this paper, we focus this definition on the heterogeneity of the underlying processor microarchitecture coupled with the diverse possibilities of simultaneous co-running jobs on a given machine.

2. In the production WSC environments of large web-service companies such as Google, the system's hardware, software, and application stacks are co-designed for efficiency, and there is a set of key applications that run continually in these WSCs (such as websearch, maps, etc.). This observation leads to an important insight. The performance opportunity presented by the heterogeneity in machines is defined by the mix of applications that will run on these machines, and in turn, the performance opportunity presented by the diversity in applications is defined by the particular mix of underlying machine configurations. As we vary either, the amount of performance opportunity changes significantly. As we show in this work, this can be formally quantified using a metric we call an application's opportunity factor.

These insights are prescriptive as to how we design a system to exploit this heterogeneity. Modern production WSC systems deploy continuous profiling runtimes such as the Google Wide Profiler (GWP) [27] that run in production throughout the lifetime of the WSC. Currently, these infrastructures are primarily used for retrospective analysis and offline tuning. However, as we argue in this work, these systems can and should be used to drive continuous online learning to steer heterogeneity analysis and adaptation in production. By harnessing these already in-place continuous monitoring capabilities, we demonstrate the efficacy of this approach with the design of the WSC Heterogeneity-aware Mapper, Whare-Map. Using Whare-Map, a novel extension to the core architecture of the WSC system design, we demonstrate how this heterogeneity can be exploited by leveraging in-place WSC monitoring subsystems.

Specifically, the contributions of this paper include:

• Web-service Sensitivity to Heterogeneity: We investigate the performance variability for large-scale web-service applications caused by the heterogeneity in “homogeneous” WSCs as it relates to microarchitectural configurations and application co-locations in production environments. We also introduce a novel metric, the opportunity factor, which quantifies how sensitive an application is to the heterogeneity. This metric characterizes the performance improvement potential for a given application when mapped intelligently in a given WSC.

• Whare-Map: We present Whare-Map, an extension to the current WSC architecture to exploit the emergent heterogeneity in WSCs. Whare-Map intelligently maps jobs to machines to improve the overall performance of a WSC. A required component of such an approach is the ability to score and rank job-to-machine maps. We provide four map scoring policies that take advantage of the live monitoring services in modern WSCs, discuss the key tradeoffs between them, and perform a thorough evaluation in an experimental cluster environment.

• Heterogeneity in Production: We investigate the amount of heterogeneity present in a number of production WSCs running live Google Internet services including websearch, each housing thousands of nodes. We demonstrate the potential of Whare-Map by using it to quantify the performance opportunity from exploiting the heterogeneity in these production WSCs.

• Factors Impacting Heterogeneity: The rationale behind the homogeneity assumption stems from a lack of understanding of how the gradual introduction of diverse microarchitectural configurations and application types to a WSC impacts the performance variability. In this work, we also present a careful study of how varying the diversity in applications and machine types in a WSC affects how “homogeneous” or “heterogeneous” a WSC becomes. We find that even a slight amount of diversity in these factors can present a significant performance opportunity. Based on our findings, we then discuss the tradeoffs for server purchase decisions and show that embracing heterogeneity can indeed improve the cost-efficiency of the WSC infrastructure.

A prototype of Whare-Map is evaluated both on a Google testbed composed of 9 large-scale production web-service applications and 3 types of production machines, and on an experimental testbed composed of benchmark applications to provide repeatable experimentation.

Results Summary: This paper shows that there is a significant performance opportunity when taking advantage of emergent heterogeneity in modern WSCs. At the scale of modern cloud infrastructures such as those used by companies like Google, Apple, and Microsoft, gaining just 1% of performance improvement for a single application translates to millions of dollars saved. In this work, we show that large-scale web-service applications that are sensitive to emergent heterogeneity improve by more than 80% when employing Whare-Map over heterogeneity-oblivious mapping. When evaluating Whare-Map using our testbed composed of key Google applications running on three types of production machines commonly found co-existing in the same WSC, we improve the overall performance of an entire WSC by 18%. We also find a similar improvement of 15% in our benchmark testbed and in our analysis of production data from WSCs hosting live services.

Next, in Section 2 we discuss the background of our work. We then present a study of the heterogeneity in the WSC in Section 3. Section 4 presents our Whare-Map approach for exploiting heterogeneity in the WSC. We present an in-depth evaluation of the performance benefit of Whare-Map, including a study in Google's production WSCs, in Section 5. We present related work in Section 6, and finally, we conclude in Section 7.

2. BACKGROUND

In this section, we describe the job placement and online monitoring components that are core to the system architecture of modern WSCs.


workload           description
bigtable           A distributed storage system for managing petabytes of structured data
ads-servlet        Ads server responsible for selecting and placing targeted ads on syndication partners' sites
maps-detect-face   Face detection for streetview automatic face blurring
search-render      Websearch frontend server; collects results from many backends and assembles HTML for the user
search-scoring     Websearch scoring and retrieval
protobuf           Protocol Buffer, a mechanism for describing extensible communication protocols and on-disk structures. One of the most commonly-used programming abstractions at Google.
docs-analyzer      Unsupervised Bayesian clustering tool to take keywords or text documents and “explain” them with meaningful clusters
saw-countw         Sawzall scripting language interpreter benchmark
youtube-x264yt     x264yt video encoding

Table 2: Production WSC Applications

2.1 Job Placement in WSCs

A WSC provides a view of the underlying system as a single machine with hundreds of thousands of cores and petabytes of memory. A job in a WSC is an application process that is typically long running, responsible for a particular sub-task of a major service, and can generally be run on any machine within the WSC. Example jobs in a WSC include a result scorer for websearch, an image stitcher for maps, a compression service for video, etc. Job placement in the WSC is managed by a central job manager [7, 21, 30]. The job manager is a cluster-wide runtime that is tasked with mapping the job mix to the set of machine resources in a WSC, and it operates independently of the OS. Each job has an associated configuration file that specifies the number of cores and the amount of memory required to execute the job. Based on the resource requirement, the job manager uses a bin-packing algorithm to place the job on a machine with the required resources available [21], after which a machine-level manager (in the form of a daemon running in user mode) uses resource containers [5] to allocate and manage the resources belonging to the task. The currently deployed job manager is unaware of machine heterogeneity and the potential benefits of intelligent job placement. We integrate our Whare-Map technique, described in Section 4, with the job manager to conduct heterogeneity-aware mapping.
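To make the baseline concrete, the following is a minimal, illustrative sketch of the requirement-driven, heterogeneity-oblivious placement described above; the data structures and the first-fit policy are assumptions made for illustration, not the production job manager's implementation.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Machine:
    name: str
    free_cores: int
    free_mem_gb: float

@dataclass
class Job:
    name: str
    cores: int       # core requirement from the job's configuration file
    mem_gb: float    # memory requirement from the job's configuration file

def place_job(job: Job, machines: List[Machine]) -> Optional[Machine]:
    # Heterogeneity-oblivious placement: pick the first machine whose remaining
    # core and memory capacity satisfies the job's requirements. The machine's
    # microarchitecture and current co-runners are never consulted.
    for m in machines:
        if m.free_cores >= job.cores and m.free_mem_gb >= job.mem_gb:
            m.free_cores -= job.cores
            m.free_mem_gb -= job.mem_gb
            return m
    return None  # no machine currently fits; the job waits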

2.2 Live Monitoring in WSCs

For our WSC design to effectively exploit heterogeneity in job placement decisions, the job manager requires continuous feedback from prior placement decisions. At a minimum, a sampling of a service's performance on an assortment of platforms and co-runners is necessary. Fortunately, this rudimentary continuous cluster-wide monitoring capability is already deployed and available in state-of-the-art WSC platforms. For example, the Google Wide Profiler (GWP) [27] continuously profiles jobs and machines as they run in production and is deployed as a standard service in Google's entire production fleet. GWP provides a database and associated API that can be leveraged by software systems, such as the job manager, to query information about application placement, performance, and co-running jobs on a machine. As described later, our heterogeneity-aware technique uses the monitoring information provided by services such as GWP to conduct intelligent job-to-machine mappings.

3. HETEROGENEITY IN MODERN WSCS

The potential benefit of heterogeneity-awareness in WSCs hinges on the performance variability of applications across diverse microarchitectural configurations and co-runners. In this section, we investigate such performance variability for large-scale Internet service production applications.

CPU                     GHz   Cores  L2/L3  Name
Clovertown Xeon E5345   2.33  6      8 MB   Clover
Istanbul Opteron 8431   2.4   6      6 MB   Istan
Westmere Xeon X5660     2.8   6      12 MB  West

Table 3: Production Microarchitecture Mix

CPU            GHz   Cores  L2/L3  Memory
Core i7 920    2.67  4      8 MB   4 GB
Core 2 Q8300   2.5   4      4 MB   3 GB
Phenom X4 910  2.6   4      6 MB   4 GB

Table 4: Experimental Microarchitecture Mix

We focus on the performance sensitivity of applications to emergent heterogeneity and, equally importantly, the variance of this sensitivity itself across applications.

In addition to the production applications, we also present results using a testbed composed of benchmark applications. Finally, we introduce a metric, the opportunity factor, that, given the application mix and machine mix in a WSC, quantifies an application's sensitivity to heterogeneity within a closed eco-system of machine configurations and diverse applications.

3.1 Characterization Methodology

[Google Testbed] We first conducted our experiments using commercial applications (shown in Table 2) across three production platform types commonly found in Google's WSCs (shown in Table 3). The applications shown in the table cover nine large industry-strength workloads that are responsible for a significant portion of the cycles consumed in arguably the largest web-service WSC infrastructure in the world. Table 2 also presents a description of each application. Each application corresponds to an actual binary that is run in the WSC. These applications are part of a test infrastructure developed internally at Google, composed of a host of Google workloads and machine clusters that have been laboriously configured by a team of engineers for performance analysis and optimization testing across Google. Each application shown in the table operates on a repeatable log of thousands of queries of user activity from production. We use this test infrastructure throughout the remainder of this work. The number of cores used by each application is configured to three for both solo and co-location runs.

[Benchmark Testbed] To investigate how our findings using Google's infrastructure generalize to other application sets and to provide experimental results that are repeatable, we replicate our study in an experimental benchmark testbed.


[Figure 2: Performance comparison of key Google applications across three microarchitectures (Clover, Istan, West). The y-axis shows normalized performance from 1x to 3.5x (higher is better).]

[Figure 3: Google application performance when co-located with bigtable (BT), search-scoring (SS), and protobuf (PB) on each platform. The y-axis shows performance impact from -30% to 5% (negative indicates slowdown).]

In our experimental infrastructure we use a spectrum of 22 SPEC CPU2006 benchmarks on their ref input as our application types and three state-of-the-art microarchitectures as our machine types running Linux 2.6.29. The underlying microarchitectures of these three machine types are presented in Table 4. All application types are compiled with GCC 4.5 with -O3 optimization.

3.2 Investigating Heterogeneity

[Microarchitectural Heterogeneity] We first characterize the performance variability due to microarchitectural heterogeneity in WSCs. In addition to quantifying the magnitude of the performance variability, our study also aims to investigate, firstly, whether one microarchitectural configuration consistently outperforms others for all applications; and secondly, the variance of sensitivities across applications. As we discuss later in this section, the variance in performance sensitivity across a given application mix is indicative of the performance potential of adopting a heterogeneity-aware WSC design.

Figure 2 presents the experimental results for our Google testbed with 9 key Google applications (Table 2) running on 3 types of production machines (Table 3). The y-axis shows the performance (average instructions per second) of each application on the three types of machines, normalized by the worst performance among the three for each application. Docs-analyzer's data on Istanbul is missing because it is not configured for that particular platform.

Figure 2 shows that even among three architectures that are from competing generations, there is significant performance variability for Google applications. More interestingly, no platform is consistently better than the others in this experiment. Although the Westmere Xeon outperforms the other platforms for most applications, maps-detect-face running on the Istanbul Opteron outperforms the Westmere Xeon by around 25%. On the other hand, the Clovertown Xeon and Istanbul Opteron compete much more closely. It is also important to note that even though the Westmere Xeon platform is almost always better than the other two, the performance sensitivity to platform type varies significantly across applications, ranging from only a 10% speedup for protobuf when switching from the worst platform (Opteron) to the best (Westmere Xeon), to as large as a 3.5x speedup for docs-analyzer. This heterogeneity in performance sensitivity (varying speedup ratios) impacts how job placement decisions should be made to maximize the overall performance, and the amount of potential performance improvement achievable by intelligent mapping. To maximize the overall performance for a WSC composed of a limited number of each microarchitecture, a smart job manager should prioritize mapping the applications with higher speedup ratios to the faster machines. For example, to achieve the best overall performance, docs-analyzer or bigtable should be prioritized to use the Westmere Xeon over protobuf. In Section 5.4 we delve into more detail as to what causes the performance variability by varying the workload mix.

[Co-Runner Heterogeneity] Figure 3 illustrates the performance variability due to co-location interference for Google applications. This figure shows the performance interference for each of the 9 Google applications when co-located with bigtable (BT), search-scoring (SS), and protobuf (PB). The y-axis shows the performance degradation of each benchmark when co-located on each platform. This degradation is calculated using the application's execution rate when co-located, normalized to the execution rate when it is running alone on that platform. The lower the bar, the more severe the performance penalty. We observe that the same co-runner causes varying performance penalties to different applications, with performance degradations ranging from close to no penalty, 2% or less in some cases, to almost 30%.
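Written out in our notation (an illustrative formalization of the description above, not a formula from the original text), the per-application co-location impact plotted in Figure 3 is

\text{Impact}(A_i, M_j, C_k) = \frac{IPS_{A_i,M_j,C_k}}{IPS_{A_i,M_j,\mathrm{solo}}} - 1,

so a value of -0.30, for example, means the application retains only 70% of its solo execution rate on that platform.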

More interestingly, the heterogeneity in co-location penalty is not an isolated factor and is complicated by the heterogeneity in microarchitecture. As shown in the figure, for each application, the same co-running application may cause varying performance penalties on different microarchitectures. Also, microarchitectural heterogeneity on average has a more significant performance impact than co-location heterogeneity. While there is generally less than 30% performance degradation due to co-location, the performance variability due to microarchitectural heterogeneity is up to 3.5x. However, the relative impact of the two depends on the application. For some applications (e.g., protobuf), co-runner heterogeneity has a greater impact than machine heterogeneity. The above observations imply that when exploiting the heterogeneity in WSCs to improve performance via better job-to-machine mapping, there may be a compounding benefit to considering both machine and co-runner heterogeneity simultaneously.

In addition to Google production applications, we also conducted similar experiments on our benchmark testbed. The results are presented in Figures 4 and 5.


[Figure 4: Performance comparison of benchmark workloads across three microarchitectures (Core i7, Core 2, Phenom X4). The y-axis shows normalized performance from 1x to 2.4x.]

[Figure 5: Benchmark slowdown when co-located with lbm, on the Core i7, Core 2, and Phenom X4 platforms. The y-axis shows performance impact from -35% to 5%.]

In summary, we observe that the amount of variability present from microarchitectural and co-runner diversity is significant for both Google applications and benchmark applications. In addition, applications present various levels of performance sensitivity to such heterogeneity. This indicates that the homogeneity assumption may leave a large performance opportunity untapped.

3.3 OF: An Opportunity Metric

An important concept arises from the previous section. Depending on how “immune” or “sensitive” an application is to microarchitectural and co-runner variation, each application would benefit differently from a job mapping policy that takes advantage of heterogeneity. We introduce a metric, the opportunity factor, that approximates a given application's potential performance improvement opportunity relative to all other applications, given a particular mix of applications and machine types. The higher the opportunity factor, the more sensitive an application is to diversity in the WSC. Note that this opportunity factor can be calculated only when the application mix and the machine mix are known.

For a given WSC, we can denote the application of type i as A_i, and the microarchitecture of type j as M_j. We define the speedup factor for A_i as:

SF_{A_i} = \frac{\max_{j,k}\{IPS_{A_i,M_j,C_k}\} - \min_{j,k}\{IPS_{A_i,M_j,C_k}\}}{\min_{j,k}\{IPS_{A_i,M_j,C_k}\}},    (1)

where IPS_{A_i,M_j,C_k} is application A_i's IPS (instructions per second) when it is running on machine M_j with a set of co-runners C_k. SF_{A_i} is essentially the amount of performance variability of A_i over all possible configurations of the execution environment, composed of the cross product of all machine options and co-runner options. Using SF_{A_i}, we can define the Opportunity Factor (OF) for A_i as:

OF_{A_i} = \frac{SF_{A_i}}{\sum_j SF_{A_j}}    (2)

OF_{A_i} represents the sensitivity of each application type to the overall heterogeneity of a given application mix, relative to all other applications. This metric allows WSC designers, operators, and reliability engineers to reason about the performance improvement potential of various applications in the WSC and to identify applications that are most likely to benefit from heterogeneity-aware job mapping. We present and discuss OF results in Section 5.2.
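As a concrete illustration of Equations (1) and (2), the following sketch computes speedup factors and opportunity factors from a table of profiled IPS values; the dictionary layout and the example numbers are assumptions made purely for illustration.

from typing import Dict, Tuple, FrozenSet

# ips[(app, machine_type, co_runner_set)] = profiled instructions per second
IPSTable = Dict[Tuple[str, str, FrozenSet[str]], float]

def speedup_factor(app: str, ips: IPSTable) -> float:
    # SF_Ai = (max IPS - min IPS) / min IPS over all machine/co-runner combinations.
    samples = [v for (a, _, _), v in ips.items() if a == app]
    return (max(samples) - min(samples)) / min(samples)

def opportunity_factors(ips: IPSTable) -> Dict[str, float]:
    # OF_Ai = SF_Ai / sum_j SF_Aj for the application mix present in the table.
    apps = {a for (a, _, _) in ips}
    sf = {a: speedup_factor(a, ips) for a in apps}
    total = sum(sf.values())
    return {a: sf[a] / total for a in apps}

# Example with made-up numbers: two applications profiled solo on two machine types.
ips = {
    ("docs-analyzer", "Westmere", frozenset()): 3.5e9,
    ("docs-analyzer", "Istanbul", frozenset()): 1.0e9,
    ("protobuf", "Westmere", frozenset()): 1.1e9,
    ("protobuf", "Istanbul", frozenset()): 1.0e9,
}
print(opportunity_factors(ips))  # docs-analyzer dominates the opportunity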

4. WHARE-MAP

In this section, we present an approach to exploit heterogeneity that is particularly well-suited for production WSCs, as it leverages already in-place subsystems found in state-of-the-art WSCs to perform job-to-machine mapping.

[Figure 6: The Overview of Whare-Map. Within the datacenter, the Job Manager incorporates Whare-Map, whose Map Scorer draws on GWP profiles of applications and machines and whose Optimization Solver compares candidate mappings of applications to machines.]

4.1 Overview

Whare-Map harnesses the continuous profiling information provided by in-production monitoring services such as the Google Wide Profiler (GWP) [27] to intelligently map jobs to machines. Figure 6 illustrates how Whare-Map is integrated into the core system architecture of WSCs, enabling them to exploit heterogeneity. We formulate the problem of mapping jobs to machines as a combinatorial optimization problem, and thus an integral component of Whare-Map is the optimization solver (discussed in Section 4.2).

Another key requirement for Whare-Map is the continuous online analysis and comparison of mapping decisions. As illustrated in Figure 6, the Map Scorer utilizes GWP to perform such analysis. GWP is a system-wide service that continuously monitors and profiles jobs as they run in production WSCs and archives the profiling information in a database. As shown in the figure, the Map Scorer extracts information from the GWP database to build and continuously refine an internal representation of profiles, analyzing the performance of each application in various execution environments. Using the profiles and scoring provided by the Map Scorer, the Optimization Solver compares various mapping decisions and their relative performance. It is important to note that during the optimization process, instead of the costly approach of actually mapping jobs to various execution environments to identify the best mapping, the Optimization Solver relies on the Map Scorer to utilize the historical profiling data from GWP of how well a job performs in each given environment.


Whare-C (complexity |A|^n): Co-location Score. This score is based only on co-location penalty and only requires profiling the co-location penalty on any type of machine. Once a co-location profile is collected it is then used to score that co-location regardless of the underlying microarchitecture.

Whare-Cs (complexity |A|^n × |M|): Co-location Score (Smart). This score is based on co-location penalty with microarchitecture-specific information. Information about co-location penalty must be collected for all platforms of interest.

Whare-M (complexity |A| × |M|): Microarchitectural Affinity Score. This score is based on microarchitectural affinity and captures only the speedup of running each application on one microarchitecture over another.

Whare-MCs (complexity |A|^(n+1) × |M|): Microarchitectural Affinity and Co-location Score. This scoring method includes both microarchitectural affinity and microarchitecture-specific co-location penalty. This scoring technique has the heaviest profiling requirements.

Table 5: Scoring Policies for Mapping

A continuous profiling service such as GWP is a key component that makes Whare-Map feasible in live production systems. It is also important to make the distinction between the cost associated with populating the GWP database (referred to later as profiling complexity) and the cost of utilizing the information in GWP's database to search for the optimal mapping. The former is incurred continuously throughout the lifetime of operation of the WSC, while the latter is often on the order of minutes for a typical scale of thousands of machines and dozens of application types.

4.2 Whare-Map: An Optimization Problem

As mentioned earlier, we formulate the problem of mapping jobs of different types and characteristics to a set of heterogeneous machine resources as a combinatorial optimization problem. The optimization objective is to maximize the overall performance of the entire WSC, i.e., the aggregate instructions-per-second (IPS) of all jobs. This formulation as an optimization problem is especially suitable for modern WSCs for a number of reasons: 1) the set of important applications is known and fairly stable; 2) the main web-service jobs are often long running; and 3) migration of jobs rarely happens because of its high cost. Whare-Map is then essentially a solver for this optimization problem.

Algorithm 1: Optimization Algorithm in Whare-Map

Input: set of free machines and available jobs
Output: an optimized mapping

while free machines and available jobs do
    map random job to random machine;
end
set last_score to the score of the current map;
while optimization timer not exceeded do
    foreach machine do
        foreach job on that machine do
            swap job with random job on random machine;
            set cur_score to the score of the current map;
            if cur_score is better than last_score then
                set last_score to cur_score;
            else
                swap jobs back to original placements;
            end
        end
    end
end

The core algorithm (Algorithm 1) we use to solve the optimization problem is inspired by well-established traditional iterative optimization techniques [26, 28, 32]. We use a stochastic hill climbing numerical optimization approach in Whare-Map as it is well suited to the type and scale of the problem of mapping jobs to machines and converges rather quickly for our problem formulation (typically less than 1 minute of search). It is important to note that other numerical optimization approaches such as simulated annealing and genetic algorithms can also be used to perform the search. However, we observe that for the nature of our problem, the stochastic hill climbing approach produces mappings that match the quality of these alternatives and converges quickly.
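For concreteness, the following is a minimal sketch of the stochastic hill-climbing search in the spirit of Algorithm 1, simplified to pairwise job swaps; the data structures, the score() callback, and the time budget are illustrative assumptions rather than the production implementation.

import random
import time
from typing import Callable, Dict, List

def whare_map_search(jobs: List[str], machines: List[str],
                     score: Callable[[Dict[str, str]], float],
                     budget_seconds: float = 60.0) -> Dict[str, str]:
    # Start from a random placement of every job (mapping: job id -> machine id).
    mapping = {job: random.choice(machines) for job in jobs}
    last_score = score(mapping)
    deadline = time.time() + budget_seconds
    while time.time() < deadline:
        # Propose swapping the machine assignments of two randomly chosen jobs.
        a, b = random.sample(jobs, 2)
        mapping[a], mapping[b] = mapping[b], mapping[a]
        cur_score = score(mapping)
        if cur_score > last_score:
            last_score = cur_score  # keep the improving move
        else:
            mapping[a], mapping[b] = mapping[b], mapping[a]  # revert
    return mapping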

4.3 Map Scoring

A score of a particular placement of a job on a machine is used to measure the quality of that job placement. We use the sum of all individual placement scores to score an entire map of jobs to machines. The higher the score, the better the map. The scoring policy is an essential part of Whare-Map: it is used in each optimization iteration to compare mappings. In this work, we present and evaluate a number of scoring policies that vary in the profiling required to generate the score. Table 5 shows the descriptions and profiling complexities of our map scoring policies, where |A| is the number of application types, |M| is the number of machine types, and n is the number of co-runners allowed on each machine. The profiling complexity indicates the amount of profiling the Map Scorer needs from GWP. For example, among all policies, Whare-M requires the smallest amount of profiling, |A| × |M|, indicating that the scorer only needs performance profiles of each application type on each machine type from GWP, without needing to know the application's co-runners when the profiling was conducted. In a practical WSC setting, |A| is on the order of tens, |M| is often less than 10, and n is often only 1 or 2. Typically, only one or two major web-service jobs, in addition to several low-overhead background processes such as logsaver, are co-located on a given machine in a WSC.
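For a sense of scale, using illustrative values (not measurements from the paper) of |A| = 20 application types, |M| = 5 machine types, and n = 2 co-runners per machine, the profiling requirements in Table 5 work out to:

Whare-M:   |A| × |M| = 20 × 5 = 100 profiles
Whare-C:   |A|^n = 20^2 = 400 profiles
Whare-Cs:  |A|^n × |M| = 400 × 5 = 2,000 profiles
Whare-MCs: |A|^(n+1) × |M| = 20^3 × 5 = 40,000 profiles

which illustrates why Whare-MCs has the heaviest profiling requirement and why Whare-M is the cheapest to populate from GWP.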

The accuracy of the scoring policy determines the mapping quality of Whare-Map. Whare-MCs has the complete information and should provide the best result. Meanwhile, Whare-M, Whare-C, and Whare-Cs require less time for GWP to collect all needed information. However, they are also less accurate, and thus may lead to suboptimal results. The trade-off is between the amount of available profiling information and the achievable performance gain. In addition, the landscape of diversity present in the WSC has a significant impact on the usefulness of some profiling information. In Section 5.4, we further investigate the factors that impact this diversity and discuss the selection of the appropriate map scoring policy.

The complexity of Whare-Map is determined by both the profiling complexity of the scoring policies and the computational complexity of the optimization solver. However, as mentioned before, GWP continuously profiles in the background throughout the lifetime of a WSC, and its cost is thus hidden from Whare-Map.


[Figure 7: The normalized performance (aggregated IPS) of Whare-Map, compared to the heterogeneity-oblivious mapper and the worst case, for the 2-Jobs-Google, 1-Job-SPEC, and 2-Jobs-SPEC scenarios. Bars: Worst, Oblivious, Whare-C, Whare-Cs, Whare-M, Whare-MCs; y-axis from 0.6 to 1 (higher is better).]

[Figure 8: The normalized latency of a given WSC when using Whare-Map, compared to the heterogeneity-oblivious mapper and worst case, for the same three scenarios. The y-axis ranges from 1x to 3.5x (lower is better).]

Whare-Map simply utilizes the profiling information available at any given time and continually refines its performance profiles based on the newly accumulated profiling data collected by GWP, continuously improving its scoring and thus its mapping decisions. On the other hand, the complexity of using our optimization solver, based on the map scores, to search for the optimal mapping is relatively low, typically on the order of seconds.

5. EVALUATING WHARE-MAP

In this section, we measure the performance improvement when using our heterogeneity-aware approach, Whare-Map, over current heterogeneity-oblivious mapping. We also evaluate the performance of the four scoring policies discussed in Section 4.3. In addition to the overall performance of an entire WSC, we present the application-level performance achieved by Whare-Map and compare it with the estimation provided by the opportunity factor. Lastly, we delve into the factors affecting emergent heterogeneity and how this heterogeneity impacts the cost efficiency of server purchase decisions.

5.1 Experimental Methodology

We conduct a thorough investigation and evaluation in three domains. We evaluate Whare-Map using the Google and benchmark testbeds (Section 5.2). In addition, we analyze its potential in production WSCs running live web-services (Section 5.3). For experimentation using the Google and benchmark testbeds, we use the platform types previously presented in Tables 3 and 4 along with the 9 key Google applications and 22 SPEC CPU2006 benchmarks, respectively.

For our testbed evaluation, we construct an oracle based on comprehensive runs on real machines. Given a map of jobs to a set of machines, this oracle reports the performance of that mapping. To construct this oracle, we run all combinations of co-locations on all machine platforms and collect performance information. This performance information is in the form of instructions per second (IPS) for each application in every execution environment. Using this information we construct a knowledge bank that is used as a reference for the performance of a particular event in a given WSC. We use the same Google workload suite used to evaluate machine configurations internally. These workloads are composed of Google's core commercial services and have been tuned to exercise mainly the processor and the memory subsystems, and they have minimal run-to-run performance variance (~1% on average). The input set is composed of large traces of real-world production queries. This setup allows us to focus our study on the emergent heterogeneity in microarchitectural configuration and the microarchitectural interaction between co-runners. The knowledge bank serves two purposes for the evaluation conducted on the testbeds. Firstly, given a job-to-machine mapping, we use the knowledge bank as the oracle to calculate the aggregate performance of the entire WSC composed of various machine types. Secondly, the knowledge bank is used to model GWP where, depending on the scoring policy, partial information (such as only machine heterogeneity or co-location heterogeneity) is used for various levels of profiling complexity. In our production analysis, live GWP information is used.

5.2 Google and Benchmark Testbeds

[Overall IPS] In Figure 7, we compare Whare-Map, the heterogeneity-oblivious mapper, and the worst-case mapper in terms of the overall performance of a WSC. The heterogeneity-oblivious mapper randomly maps a job to a machine based only on the job's resource requirements, irrespective of the heterogeneous machine types and co-running jobs on that machine. We first use the aggregated instructions per second (IPS) of all machines as our performance metric for the entire WSC. The experiments shown in this figure are conducted on the Google testbed (1st cluster of bars) and the benchmark testbed (2nd and 3rd clusters of bars). The y-axis shows the normalized overall performance (IPS) of a WSC when using the various job-to-machine mapping policies. To calculate the normalized IPS performance of an entire WSC for a given job-to-machine mapping, we aggregate the average IPS of all jobs. The normalization baseline for each cluster of bars is the sum of the average IPS of each job when it is run alone on its best performing machine type. Higher is better.
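To make this metric concrete, the following is a small sketch of how an aggregate normalized IPS of this kind can be computed from a knowledge-bank-style lookup; the data layout and the "solo" key are assumptions made for the example, not the actual knowledge bank format.

from typing import Dict, List, Tuple

# Knowledge bank entry: (app, machine_type, co_runner) -> measured average IPS.
# The co_runner slot holds "solo" for runs without a co-located job.
Bank = Dict[Tuple[str, str, str], float]

def wsc_normalized_ips(placements: List[Tuple[str, str, str]],
                       bank: Bank,
                       machine_types: List[str]) -> float:
    # placements: one (app, machine_type, co_runner) entry per job in the mapping.
    actual = sum(bank[(app, mtype, co)] for app, mtype, co in placements)
    # Baseline: every job running alone on its best-performing machine type.
    best_solo = sum(max(bank[(app, mtype, "solo")] for mtype in machine_types)
                    for app, _, _ in placements)
    return actual / best_solo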

The first cluster of bars presents results for the Google testbed. In this experiment, the testbed WSC is composed of 500 machines with 1000 jobs running; two jobs are co-located on each machine. We choose the 2-Jobs scenario because, given typical core and memory requirements, one or two major web-service jobs are co-located on a given machine in our production WSCs. The machine composition and workloads of the WSC are randomly selected from the three machine types shown in Table 3 and the 9 key Google applications shown in Table 2. Each bar in the cluster presents the performance of the worst mapping, heterogeneity-oblivious mapping, and Whare-Map using the four scoring policies discussed in Section 4.3. Similarly, the second and third clusters present results for the benchmark testbed.

For the 1-Job scenario, there are 500 jobs running in a WSC composed of 500 machines with one job on each machine, while the 2-Jobs scenario has 1000 jobs running on 500 machines. The machine composition and workloads of the WSC are also randomly generated using the three machine types from our benchmark testbed (Table 4) and the SPEC CPU2006 suite.

In Figure 7, we observe a significant benefit from using Whare-Map for the Google testbed experiment.


[Figure 9: Speedup at the application level over heterogeneity-oblivious mapping for each Google application, under Whare-C, Whare-Cs, Whare-M, and Whare-MCs. The y-axis shows speedup (IPS) from 0.6x to 2x.]

[Figure 10: Opportunity factor of each application (y-axis from 0 to 0.3).]

Among the four scoring policies of Whare-Map, we achieve the best performance when considering both machine and co-location heterogeneity (Whare-MCs), which improves the overall normalized IPS of the entire WSC by 18% over heterogeneity-oblivious mapping (from 0.72x to 0.85x) and by 37% over the worst-case mapping. Also, in this experiment, Whare-M performs comparably well to Whare-MCs. This indicates that there is a significant performance benefit from considering machine heterogeneity. Meanwhile, when only co-location effects are considered to score maps (Whare-C and Whare-Cs), we observe smaller overall performance gains, within 1-2% of the heterogeneity-oblivious mapping result, as Whare-C and Whare-Cs focus only on the performance impact of resource contention between co-located applications on the same machine. Note that the heterogeneity-oblivious mapping already greatly improves the IPS over the worst case, by around 17%. When the workload is a fairly balanced mix of contentious (memory-intensive) applications and non-contentious (CPU-intensive) applications, randomizing the mapping can effectively decrease the chance of co-locating two contentious applications and in turn improve over the worst case by avoiding a significant amount of co-location penalty. These results indicate that for our Google workload and production machine mix in our testbed, exploiting machine heterogeneity may have a bigger impact than considering co-location heterogeneity alone. However, the relative importance of machine and co-location heterogeneity depends on the machine/workload mix. We explore these impacting factors in greater detail in Section 5.4.1.

The results for the benchmark testbed, shown as the second and third clusters of bars, are in general consistent with the Google testbed results. For the 1-Job scenario in the benchmark testbed, as we expect, Whare-C and Whare-Cs do not improve performance over heterogeneity-oblivious mapping, while Whare-M and Whare-MCs perform equally well. This is because there is no co-location in a 1-Job scenario. The performance improvement of Whare-Map using Whare-MCs is 26% over the worst-case mapping and close to 14% over heterogeneity-oblivious mapping. For the 2-Jobs scenario (the 3rd cluster), we observe that the scoring policies that only consider co-location heterogeneity (Whare-C, Whare-Cs) are quite effective, generating up to an 8% improvement over heterogeneity-oblivious mapping. This is better than the performance of Whare-C in the 2-Jobs scenario for Google applications, demonstrating that the effectiveness of Whare-C depends on the machine/workload mix. Considering only microarchitectural heterogeneity without co-location (Whare-M) produces a 12% performance benefit over heterogeneity-oblivious mapping, higher than Whare-C. When Whare-Map combines both machine heterogeneity and co-location penalty heterogeneity (Whare-MCs), the performance improvement increases to about 16%.

[Latency] In addition to the aggregated IPS, we also compare the latency of all jobs in a WSC, defined as the execution time of the longest-running job under a given job-to-machine mapping. Figure 8 shows the latency of the various mapping policies, normalized to the latency when all jobs run alone on their best performing machine type. Interestingly, although the heterogeneity-oblivious mapping can improve the average IPS performance, it performs as poorly as the worst mapping in terms of latency. In this experiment, Whare-Map improves the placement of the slowest job, resulting in lower overall latency. Again, Whare-MCs performs the best, and Whare-M performs comparably well.

[Opportunity Factor] We further examine the performance improvement of each application and compare it with the estimate given by the opportunity factor (Section 3.3). Figure 9 presents the performance improvement at the application level for the 2-Jobs scenario with 500 machines and 1000 applications on the Google testbed, as in Figures 7 and 8. The y-axis shows each application type's performance using Whare-Map, normalized to that type's average performance under heterogeneity-oblivious mapping. This figure demonstrates that application types derive varying amounts of performance benefit from Whare-Map. For example, while there is a 16%-19% performance improvement overall, docs-analyzer, which is sensitive to both microarchitectural and co-location heterogeneity, achieves an 80% performance improvement over heterogeneity-oblivious mapping.

There are also applications that suffer performance degradation. However, as shown in the figure, the performance improvements greatly outweigh these degradations. In the cases where machine heterogeneity is considered (Whare-M and Whare-MCs), these degradations are negligible. Figure 10 presents the opportunity factor (OF) of each application, calculated using Equations (1) and (2) in Section 3.3. As the corresponding Figures 9 and 10 show, OF is an effective metric for correctly identifying how sensitive various applications are to heterogeneity. However, not all of the application-level opportunity is realized. Remember that mapping to exploit WSC heterogeneity is a constrained optimization problem. As a result, not all applications can be mapped to their individually optimal situations to achieve the maximum performance improvement. For example, docs-analyzer has a slightly better OF than bigtable and they both prefer the Westmere platform, so as the “preferred” Westmeres in a WSC are consumed by docs-analyzer, bigtable's mapping options are reduced.

Keep in mind that Whare-Map and OF serve different purposes.


[Figure 11: Performance improvement from Whare-Map over the currently deployed mapper in production, for datacenters D0 through D9. The y-axis shows speedup (IPS) from 0.95x to 1.2x.]

While Whare-Map exploits heterogeneity, OF is a predictor of how applications will be affected by heterogeneity. OF is a much-needed metric for WSC operators for understanding a key property of the applications within the eco-system of the WSC.

5.3 Whare-Map in the Wild

Lastly, we employ Whare-Map on production performance profiles to study the potential performance improvement from exploiting heterogeneity in the wild, for live production WSCs. We conducted our evaluation in the same 10 randomly selected WSCs shown in Table 1. These WSCs present various levels of machine heterogeneity. We have collected detailed Google-Wide Profiling (GWP) [27] profiles of around 100 job types running across these WSCs, which consist of numerous machines in the wild. These jobs span most of Google's main products, including websearch. Using the GWP profiles, we conducted a postmortem Whare-Map analysis to re-map jobs to machines and calculate the expected performance improvement. First, instructions-per-cycle (IPC) samples are derived from the GWP profiles. We use cycle and instruction samples collected by GWP using hardware performance counters over a fixed period of time, aggregated per job and per machine type. IPS (instructions per second) for each application on each machine type is then computed by scaling the IPC by the platform's clock rate. These IPS samples are used for map scoring. Here we use the Whare-M policy, considering only microarchitectural heterogeneity.
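The derivation above can be sketched as follows; the sample layout and field names are illustrative assumptions, not GWP's actual schema.

from collections import defaultdict
from typing import Dict, List, Tuple

# Each sample: (job, machine_type, cycles, instructions), aggregated over a fixed window.
Sample = Tuple[str, str, int, int]

def ips_per_platform(samples: List[Sample],
                     clock_hz: Dict[str, float]) -> Dict[Tuple[str, str], float]:
    # Derive IPS per (job, machine_type): IPC = instructions / cycles, IPS = IPC * clock rate.
    cycles = defaultdict(int)
    instrs = defaultdict(int)
    for job, mtype, cyc, ins in samples:
        cycles[(job, mtype)] += cyc
        instrs[(job, mtype)] += ins
    return {key: (instrs[key] / cycles[key]) * clock_hz[key[1]] for key in cycles}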

Using our Whare-Map approach, we produce an intelligent job-to-machine mapping in a matter of seconds at the scale of thousands of machines of multiple types, over a hundred job types, and performance profiles spanning a month of operation. Figure 11 shows the calculated performance improvement when using Whare-Map over the currently deployed mapping in 10 of Google's active WSCs. Even though some major applications are already mapped to their best platforms through manual assignment, we have measured a significant potential improvement of up to 15% when intelligently placing the remaining jobs. The performance opportunity calculation based on this paper is now an integral part of Google's WSC monitoring infrastructure. Each day, the number of 'wasted cycles' due to inefficiently mapping jobs to the WSC is calculated and reported across each of Google's WSCs worldwide.

5.4 Factors Impacting Heterogeneity in WSCs

The rationale behind the homogeneity assumption stems from a lack of understanding of how the gradual introduction of diversity in a WSC impacts performance variability. In this section, we perform a study of how varying the diversity in a WSC affects the performance opportunity from the heterogeneity available in the WSC along two dimensions: application mix and machine platform mix. We then present insights into how these two factors affect server purchase options as well as the selection of the appropriate map scoring policy.

5.4.1 Impact of Workload Mix on Heterogeneity

In this section, we evaluate a variety of workload mixes to investigate how the workload mix impacts the performance improvement obtained by exploiting heterogeneity. We partitioned our 9 Google applications into two types: memory intensive (and thus likely to be contentious) and CPU intensive. We also selected the top 8 memory intensive applications and the top 8 CPU intensive applications from SPEC CPU2006. As shown in Table 6, we constructed 7 types of workloads using our classification. We then conducted various job mapping experiments on these workloads to investigate the performance benefit of using Whare-Map over heterogeneity-oblivious mapping. All experiments on the Google testbed use a WSC of 500 machines evenly distributed across the 3 machine types listed in Table 3 (166 Clovertown Xeon, 166 Istanbul Opteron, 168 Westmere Xeon). Similarly, the benchmark testbed experiments use 400 machines composed of the 3 microarchitectures listed in Table 4 (133 Core i7s, 133 Core 2s, 134 Phenom X4s).

Figures 12 and 13 present our experimental results for the Google and benchmark testbeds, respectively. We conducted a 2-jobs-per-machine experiment on the Google testbed and both 1-Job and 2-Jobs scenarios on the benchmark testbed. In each figure, the x-axis shows each experiment's configuration. For example, in Figure 13, the notation 1J-MostlyCPU indicates the 1-job-per-machine scenario with a workload composed of 3/4 CPU intensive benchmarks and 1/4 memory intensive benchmarks. The y-axis shows Whare-Map's performance improvement compared to the heterogeneity-oblivious mapping. The performance metric is the overall aggregated IPS of all machines. As the figures show, the performance benefit of using Whare-Map to take advantage of heterogeneity varies with the workload mix. Specifically, we have the following observations and insights.

1) The performance benefit potential is smaller for CPU intensive workloads than for memory intensive or mixed workloads. Figure 12 shows that for the Google experiments, the workload of mostly CPU intensive applications achieves just over a 10% improvement over the heterogeneity-oblivious mapping, as opposed to close to 15% for memory intensive workloads. In Figure 13, both the 1J-CPU and 2J-CPU experiments have relatively low performance improvement (less than 5%). This indicates that for CPU intensive benchmarks, the microarchitectural heterogeneity is smaller. For our workloads and the 2 sets of microarchitectures (Tables 3 and 4), much of the performance variability and opportunity lies in the memory subsystem heterogeneity.

2) In general, more diverse workloads, such as workloads composed of both CPU and memory intensive benchmarks, have higher performance improvement potential under Whare-Map than workloads composed purely of CPU or purely of memory intensive benchmarks. For example, in Figure 13, for the 1-Job scenarios (left half of the figure), Whare-MCs achieves a larger improvement over the heterogeneity-oblivious mapping for 1J-Mix (15%) than for 1J-CPU (3%) or 1J-Memory (10%). Similarly, for the 2-Jobs scenarios (right half of Figure 13), when the workload is composed of only CPU intensive benchmarks (2J-CPU), the improvement is much smaller (4%) than for 2J-Mix (14%), which has a more heterogeneous workload.

3) Considering machine heterogeneity only (Whare-M) is fairly competitive with considering both machine and co-location heterogeneity (Whare-MCs) in most scenarios.


Table 6: Workload Mixes

Workload                      | Application Types
Google Mostly Mem             | bigtable, ads-servlet, search-render, docs-analyzer
Google Mostly CPU             | maps-detect-face, search-scoring, protobuf, saw-countw, youtube-x264yt
Memory                        | lbm, libquantum, mcf, milc, omnetpp, soplex, sphinx, xalancbmk
CPU                           | hmmer, namd, povray, h264ref, gobmk, dealII, sjeng, perlbench
Mix (1/2 Mem, 1/2 CPU)        | lbm, libquantum, mcf, milc, hmmer, namd, povray, h264ref
Mostly Mem (3/4 Mem, 1/4 CPU) | lbm, libquantum, mcf, milc, omnetpp, soplex, hmmer, namd
Mostly CPU (3/4 CPU, 1/4 Mem) | hmmer, namd, povray, h264ref, gobmk, dealII, lbm, libquantum

Figure 12: Impact of varying workload mix on available heterogeneity for the Google testbed (bars: Whare-C, Whare-Cs, Whare-M, Whare-MCs; y-axis: speedup over the heterogeneity-oblivious mapping, higher is better).

Figure 13: Impact of varying workload mix on available heterogeneity for the SPEC benchmark testbed (1-job and 2-job scenarios; bars: Whare-C, Whare-Cs, Whare-M, Whare-MCs; y-axis: speedup over the heterogeneity-oblivious mapping).

On the other hand, considering co-location only (Whare-Cs) does not outperform considering machine heterogeneity only (Whare-M) in any scenario. One reason is that for both our Google and benchmark testbeds, the performance variability due to microarchitectural heterogeneity is as high as 3.5x and 2x, respectively, while the performance variability due to the penalty of co-locating two jobs is only around 30% (Figures 2 and 3). In the next section, we further investigate the performance difference between Whare-MCs and Whare-M as the amount of microarchitectural heterogeneity changes.

5.4.2 Impact of Machine Mix on Heterogeneity

In addition to the workload mix, the microarchitecture mix also has a significant impact on the amount of heterogeneity in a WSC. In this section we study the impact of varying the microarchitecture mix on the performance improvement of Whare-Map. We conducted experiments using 6 types of machine mixes for the Google testbed: an entire WSC composed of all Clovertown Xeon, all Istanbul Opteron, all Westmere Xeon, 1/2 Clovertown + 1/2 Istanbul, 1/2 Istanbul + 1/2 Westmere, and 1/2 Clovertown + 1/2 Westmere. The workload used for the Google testbed is composed of all 9 key Google applications (Table 2). We also conducted similar experiments on the benchmark testbed, using a workload composed of mostly memory intensive applications (Table 6).

Figures 14 and 15 present the results for the Google and benchmark testbeds, respectively. As in Section 5.4.1, in each figure the y-axis shows the performance improvement of Whare-Map under the four scoring policies over the heterogeneity-oblivious mapping, for each machine mix.

The first observation from these two figures is that even mixes of machines from a similar generation present a significant performance opportunity for exploiting heterogeneity. In Figure 14, even for machine mixes composed of only 2 types of machines, Whare-Map generates a significant performance improvement over the heterogeneity-oblivious mapping. For Clovertown and Istanbul, which have similar average performance (Figure 2), the improvement for their mix is still significant (more than 10%). Similar observations can be made for the benchmark testbed, as shown in Figure 15.

It is also important to note that for some machine mixes, the benefit of using Whare-MCs over Whare-M is significant. For example, in the 2J-Core 2+Phenom X4 scenario shown in Figure 15 (the last cluster of bars), Whare-MCs's improvement over the heterogeneity-oblivious mapping is 14%, significantly higher than Whare-M's 8% improvement. This differs from the observation in Section 5.4.1 that Whare-M often performs similarly to Whare-MCs. The reason is that these experiments contain less microarchitectural heterogeneity (only 2 machine types in the mix) than those in Section 5.4.1, so the co-location heterogeneity becomes more important. This demonstrates that although microarchitectural heterogeneity is generally the dominant factor, the additional performance benefit from considering co-location is largely determined by the workload mix and the machine mix. We discuss this further in Section 5.6.

5.5 Which Servers to Purchase?

Important questions arise when making server purchasing decisions: Is heterogeneity in a WSC desirable or not? Should it be increased or decreased?

The heterogeneity study in this paper indicates that when making such heterogeneous vs. homogeneous decisions, simply comparing servers' average performance on a workload suite is insufficient and may be misleading. Instead, we advocate using Whare-Map to estimate the performance of WSCs with various machine mixes. In fact, Whare-Map makes a heterogeneous WSC a potentially more cost effective (better performance per dollar) option than a purely homogeneous WSC.


Figure 14: Impact of varying machine mix on heterogeneity for the Google testbed (machine mixes: Clover, Istan, West, Clover+Istan, Istan+West, West+Clover; bars: Whare-C, Whare-Cs, Whare-M, Whare-MCs; y-axis: speedup over the heterogeneity-oblivious mapping, higher is better).

Figure 15: Impact of varying machine mix on heterogeneity for the SPEC benchmark testbed (1-job and 2-job scenarios on Core i7, Core 2, Phenom X4 and their pairwise mixes; bars: Whare-C, Whare-Cs, Whare-M, Whare-MCs; y-axis: speedup over the heterogeneity-oblivious mapping).

Figure 16: Normalized performance of various options of WSC machine composition (machine mixes: Clover, Istan, West, Clover+Istan, Istan+West, West+Clover; bars: Worst, Oblivious, Whare-C, Whare-Cs, Whare-M, Whare-MCs; y-axis: normalized IPS, higher is better).

To illustrate this, Figure 16 shows the performance of several WSCs composed of various machine mixes. The experiments use the same workload as in Figure 14, composed of all 9 Google applications. In contrast to Figure 14, the performance of all experiments here is normalized to a single baseline: the aggregate IPS of all applications, each running alone on its best performing platform. This baseline is thus an upper bound on performance, which facilitates comparing relative performance across WSCs. The key observation in this graph is highlighted when comparing the all-Istanbul cluster with the half-Istanbul, half-Clovertown cluster. The all-Istanbul cluster is more expensive, as it is composed entirely of the newer generation. However, when using Whare-Map to place jobs where they run best, the cheaper Istanbul+Clovertown cluster performs just as well (and indeed slightly better, as some jobs actually prefer the Clovertown machines).
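As a rough illustration of this normalization, the sketch below computes a WSC's aggregate IPS relative to an upper bound in which every job is counted at its best solo-run IPS across machine types. It is a simplified reading of the baseline described above and reuses the hypothetical `ips` table from the earlier sketches.

```python
# Minimal sketch of the Figure 16 style normalization: aggregate IPS of an
# actual assignment divided by an upper bound where each job is counted at
# its best solo IPS across all machine types. Reuses the hypothetical `ips`
# table ((job, machine_type) -> IPS) from the earlier sketches.
def upper_bound_ips(jobs, ips):
    """Sum of each job's best-case (best platform, running alone) IPS."""
    types = {m for (_, m) in ips}
    return sum(max(ips[(job, m)] for m in types) for job in jobs)

def normalized_ips(assignment, ips):
    """Aggregate IPS of a job-to-machine-type assignment, normalized to the
    best-platform upper bound (at most 1.0, ignoring co-location effects)."""
    achieved = sum(ips[(job, m)] for job, m in assignment.items())
    return achieved / upper_bound_ips(list(assignment), ips)
```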

5.6 Revisiting Map Scoring

In addition to the findings discussed in the sections above, this study also leads to a number of insights on how to select the scoring policy:

1) No free lunch. Among all four scoring policies, Whare-MCs always delivers the best performance improvement. However, it also requires the most profiling to be effective.

2) Whare-M: big bang for your buck. As this section shows, in most settings Whare-M generates a significant performance improvement over the heterogeneity-oblivious mapping with a very small amount of profiling. The profiling complexity is only |A| x |M|, as shown in Table 5. This indicates that Whare-M can be adopted as an easy and effective first step for Whare-Map and can be triggered as soon as GWP finishes profiling the basic machine heterogeneity information.

3) Whare-MCs: gradually improve over Whare-M. As Section 5.4.2 shows, depending on the workload and machine mixes, Whare-MCs may improve significantly over Whare-M, delivering extra performance benefit, especially when there is substantial co-location penalty variability. Therefore, Whare-MCs can be used to gradually improve over the Whare-M mapping as GWP accumulates more co-location information.

4) Continuous knowledge refinement. Although the profiling complexity of Whare-MCs appears high (Table 5), GWP runs continuously throughout the lifetime of the WSC, probing each machine once every minute. As the scale of the WSC increases to thousands of machines, the rate at which the profiling information becomes robust also increases.

6. RELATED WORK

Perhaps the most closely related works are those focused on heterogeneity in datacenters, which have appeared in both the systems and architecture communities [2, 9, 12, 13, 36]. Our work is complementary to these works in that their underlying assumptions apply to systems providing utility computing and/or they do not leverage in-production continuous profiling subsystems such as GWP. In contrast to our work, datacenters providing utility computing cannot assume that the applications and services running in them are known a priori. Also, in contrast to the prior work on MapReduce, our work extends the core architecture of WSCs at the abstraction layer closest to the underlying hardware, is general across various programming paradigms, and leverages continuous profiling subsystems as opposed to performing trials.

There is much related research on datacenters focused on improving energy efficiency [1, 3, 6, 16, 19, 20, 22, 24, 25, 34]. There has also been work on scheduling in datacenters [17], enabling QoS-aware control in datacenters [23, 29, 31], and programming datacenters [10]. A recent work that shares some similarities with ours presents PROPHET, a goal-oriented provisioning infrastructure that tunes the WSC to satisfy the needs of particular end users [33]. The research closest to ours discusses a scheduling policy that uses a linear program maximizing system capacity to map an application across a desktop grid [4]. That work focuses on distributed desktop computers and does not consider the interaction between microarchitectural and co-location heterogeneity. There has also been a significant amount of work in the domain of heterogeneous multicore that is related to this work, such as the work by Winter et al. [32], which investigated scheduling for "unpredictably" heterogeneous multicore processors due to process variation.

7. CONCLUSION

In this work, we examine the WSC as a heterogeneous system and show that emergent heterogeneity must be considered when mapping jobs to machines. We investigate microarchitectural heterogeneity in the WSC and find that even when considering platforms from competing generations, there is significant and idiosyncratic variability across applications, and that application co-location is particularly important when considering the heterogeneous WSC. We also demonstrate how WSC heterogeneity can be exploited and investigate how varying the application mix as well as the machine mix impacts performance. For applications that are sensitive to variations in microarchitecture and co-runners, we observe a performance improvement of up to 80% when employing our approach over current random scheduling techniques. Even in a WSC composed entirely of state-of-the-art machines, we can improve overall performance by 18%. We also present a case study from a live WSC confirming this result, demonstrating up to 15% performance improvement.

8. REFERENCES

[1] D. Abts, M. Marty, P. Wells, P. Klausler, and H. Liu. Energy proportional datacenter networks. ISCA '10, Jun 2010.
[2] F. Ahmad, S. T. Chakradhar, A. Raghunathan, and T. N. Vijaykumar. Tarazu: optimizing MapReduce on heterogeneous clusters. ASPLOS '12, pages 61-74, New York, NY, USA, 2012. ACM.
[3] F. Ahmad and T. Vijaykumar. Joint optimization of idle and cooling power in data centers while maintaining response time. ASPLOS '10, Mar 2010.
[4] I. Al-Azzoni and D. Down. Dynamic scheduling for heterogeneous desktop grids. GRID '08, Sep 2008.
[5] G. Banga, P. Druschel, and J. C. Mogul. Resource containers: a new facility for resource management in server systems. In OSDI '99, Berkeley, CA, USA, 1999. USENIX Association.
[6] L. A. Barroso, J. Dean, and U. Holzle. Web search for a planet: The Google cluster architecture. IEEE Micro, 23(2):22-28, 2003.
[7] L. A. Barroso and U. Holzle. The datacenter as a computer: An introduction to the design of warehouse-scale machines. Synthesis Lectures on Computer Architecture, pages 1-120, Sep 2009.
[8] L. A. Barroso and P. Ranganathan. Guest editors' introduction: Datacenter-scale computing. IEEE Micro, 30:6-7, 2010.
[9] J. Burge, P. Ranganathan, and J. Wiener. Cost-aware scheduling for heterogeneous enterprise machines (CASH'EM). CLUSTER '07, Sep 2007.
[10] S. Bykov, A. Geller, G. Kliot, J. Larus, R. Pandya, and J. Thelin. Orleans: A framework for cloud computing. Technical Report MSR-TR-2010-159, Microsoft Research, November 2010.
[11] F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber. Bigtable: a distributed storage system for structured data. OSDI '06, Nov 2006.
[12] B.-G. Chun, G. Iannaccone, G. Iannaccone, R. Katz, G. Lee, and L. Niccolini. An energy case for hybrid datacenters. SIGOPS Oper. Syst. Rev., 44(1):76-80, Mar. 2010.
[13] C. Delimitrou and C. Kozyrakis. Paragon: QoS-aware scheduling for heterogeneous datacenters. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2013.
[14] EPA. EPA report to Congress on server and data center energy efficiency. Technical report, U.S. Environmental Protection Agency, 2007.
[15] J. Hamilton. Internet-scale service infrastructure efficiency. SIGARCH Comput. Archit. News, 37(3):232-232, 2009.
[16] T. Heath, B. Diniz, E. V. Carrera, W. Meira, Jr., and R. Bianchini. Energy conservation in heterogeneous server clusters. In Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '05, pages 186-195, New York, NY, USA, 2005. ACM.
[17] R. Huang, H. Casanova, and A. Chien. Automatic resource specification generation for resource selection. SC '07, Nov 2007.
[18] K. Lim, P. Ranganathan, J. Chang, C. Patel, T. Mudge, and S. Reinhardt. Understanding and designing new server architectures for emerging warehouse-computing environments. In Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA '08, pages 315-326, Washington, DC, USA, 2008. IEEE Computer Society.
[19] J. Mars, L. Tang, and R. Hundt. Heterogeneity in "homogeneous" warehouse-scale computers: A performance opportunity. IEEE Computer Architecture Letters, 2011.
[20] J. Mars, L. Tang, R. Hundt, K. Skadron, and M. L. Soffa. Bubble-Up: Increasing utilization in modern warehouse scale computers via sensible co-locations. In MICRO '11: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, New York, NY, USA, 2011. ACM.
[21] A. K. Mishra, J. L. Hellerstein, W. Cirne, and C. R. Das. Towards characterizing cloud backend workloads: insights from Google compute clusters. SIGMETRICS Perform. Eval. Rev., 37:34-41, March 2010.
[22] R. Nathuji, C. Isci, and E. Gorbatov. Exploiting platform heterogeneity for power efficient data centers. ICAC '07: Proceedings of the Fourth International Conference on Autonomic Computing, Jun 2007.
[23] R. Nathuji, A. Kansal, and A. Ghaffarkhah. Q-Clouds: managing performance interference effects for QoS-aware clouds. EuroSys '10, Apr 2010.
[24] S. Pelley, D. Meisner, P. Zandevakili, T. Wenisch, and J. Underwood. Power routing: dynamic power provisioning in the data center. ASPLOS '10, Mar 2010.
[25] V. Reddi, B. Lee, T. Chilimbi, and K. Vaid. Web search using mobile cores: quantifying and mitigating the price of efficiency. ISCA '10, Jun 2010.
[26] C. R. Reeves, editor. Modern Heuristic Techniques for Combinatorial Problems. John Wiley & Sons, Inc., New York, NY, USA, 1993.
[27] G. Ren, T. Moseley, E. Tune, S. Rus, and R. Hundt. Google-wide profiling: A continuous profiling infrastructure for datacenters. IEEE Micro, 2010.
[28] S. M. Sait and H. Youssef. Iterative Computer Algorithms with Applications in Engineering: Solving Combinatorial Optimization Problems. IEEE Computer Society Press, Los Alamitos, CA, USA, 1999.
[29] L. Tang, J. Mars, and M. L. Soffa. Compiling for niceness: Mitigating contention for QoS in warehouse scale computers. In CGO '12: Proceedings of the 2012 International Symposium on Code Generation and Optimization, New York, NY, USA, 2012. ACM.
[30] L. Tang, J. Mars, N. Vachharajani, R. Hundt, and M. L. Soffa. The impact of memory subsystem resource sharing on datacenter applications. In Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA '11, pages 283-294, New York, NY, USA, 2011. ACM.
[31] L. Tang, J. Mars, W. Wang, T. Dey, and M. L. Soffa. ReQoS: Reactive static/dynamic compilation for QoS in warehouse scale computers. In ASPLOS '13: Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2013.
[32] J. A. Winter and D. H. Albonesi. Scheduling algorithms for unpredictably heterogeneous CMP architectures. DSN 2008, pages 42-51, 2008.
[33] D. Woo and H.-H. Lee. PROPHET: goal-oriented provisioning for highly tunable multicore processors in cloud computing. SIGOPS Operating Systems Review, 43(2), Apr 2009.
[34] H. Yang, A. Breslow, J. Mars, and L. Tang. Bubble-Flux: Precise online QoS management for increased utilization in warehouse scale computers. In ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture. IEEE/ACM, 2013.
[35] S. Yeo and H.-H. Lee. Using mathematical modeling in provisioning a heterogeneous cloud computing environment. Computer, 44(8):55-62, Aug. 2011.
[36] M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica. Improving MapReduce performance in heterogeneous environments. OSDI '08, Dec 2008.

