VMware Technical Journal - Summer 2013


VOL. 2, NO. 1, JUNE 2013

    VMWARE TECHNICAL JOURNAL

    Editors: Curt Kolovson, Steve Muir, Rita Tavilla

    TABLE OF CONTENTS

    1 Introduction

    Steve Muir, Director, VMware Academic Program

    2 Memory Overcommitment in the ESX Server

    Ishan Banerjee, Fei Guo, Kiran Tati, Rajesh Venkatasubramanian

13 Redefining ESXi IO Multipathing in the Flash Era

    Fei Meng, Li Zhou, Sandeep Uttamchandani, Xiaosong Ma

19 Methodology for Performance Analysis of VMware vSphere under Tier-1 Applications

Jeffrey Buell, Daniel Hecht, Jin Heo, Kalyan Saladi, H. Reza Taheri

    29 vATM: VMware vSphere Adaptive Task Management

    Chirag Bhatt, Aalap Desai, Rajit Kambo, Zhichao Li, Erez Zadok

35 An Anomaly Event Correlation Engine: Identifying Root Causes, Bottlenecks,

    and Black Swans in IT Environments

    Mazda A. Marvasti, Arnak V. Poghosyan, Ashot N. Harutyunyan, Naira M. Grigoryan

46 Simplifying Virtualization Management with Graph Databases

    Vijayaraghavan Soundararajan, Lawrence Spracklen

54 Autonomous Resource Sharing for Multi-Threaded Workloads in Virtualized Servers

Can Hankendi, Ayse K. Coskun


    VMware Academic Program (VMAP)

The VMware Academic Program (VMAP) supports a number of academic research projects across a range of technical areas. We initiate an annual Request for Proposals (RFP), and also support a small number of additional projects that address particular areas of interest to VMware.

The 2013 Spring RFP, focused on Storage in support of Software Defined Datacenters (SDDC), is currently conducting final reviews of a shortlist of proposals and will announce the recipients of funding at the VMware Research Symposium in July. Please contact Rita Tavilla ([email protected]) if you wish to receive notification of future RFP solicitations and other opportunities for collaboration with VMware.

The 2012 RFP, Security for Virtualized and Cloud Platforms, awarded funding to three projects:

Timing Side-Channels in Modern Cloud Environments
Prof. Michael Reiter, University of North Carolina at Chapel Hill

Random Number Generation in Virtualized Environments
Prof. Yevgeniy Dodis, New York University

VULCAN: Automatically Generating Tools for Virtual Machine Introspection using Legacy Binary Code
Prof. Zhiqiang Lin, University of Texas, Dallas

Visit http://labs.vmware.com/academic/rfp to find out more about the RFPs.


    Introduction

Happy first birthday to the VMware Technical Journal! We are very pleased to have seen such a positive reception to the first couple of issues of the journal, and hope you will find this one equally interesting and informative. We will publish the journal twice per year going forward, with a Spring edition that highlights ongoing R&D initiatives at VMware and the Fall edition providing a showcase for our interns and collaborators.

VMware's market leadership in infrastructure for the software-defined datacenter (SDDC) is built upon the strength of our core virtualization technology combined with innovation in automation and management. At the heart of the vSphere product is our hypervisor, and two papers highlight ongoing enhancements in memory management and IO multi-pathing, the latter being based upon work done by Fei Meng, one of our fantastic PhD student interns.

A fundamental factor in the success of vSphere is the high performance of the Tier-1 workloads most important to our customers. Hence we undertake in-depth performance analysis and comparison to native deployments, some key results of which are presented here. We also develop the necessary features to automatically manage those applications, such as the adaptive task management scheme described in another paper.

However, the SDDC is much more than just a large number of servers running virtualized workloads; it requires sophisticated analytics and automation tools if it is to be managed efficiently at scale. vCenter Operations, VMware's automated operations management suite, has proven to be extremely popular with customers, using correlation between anomalous events to identify performance issues and root causes of failure. Recent developments in the use of graph algorithms to identify relationships between entities have received a great deal of attention for their application to social networks, but we believe they can also provide insight into the fundamental structure of the datacenter.

The final paper in the journal addresses another key topic in the datacenter: the management of energy consumption. An ongoing collaboration with Boston University, led by Professor Ayse Coskun, has demonstrated the importance of automatic application characterization and its use in guiding scheduling decisions to increase performance and reduce energy consumption.

The journal is brought to you by the VMware Academic Program team. We lead VMware's efforts to create collaborative research programs and support VMware R&D in connecting with the research community. We are always interested to hear your feedback on our programs, so please contact us electronically or look out for us at various research conferences throughout the year.

Steve Muir, Director, VMware Academic Program


Memory Overcommitment in the ESX Server

Ishan Banerjee, VMware, Inc., ishan@vmware.com
Fei Guo, VMware, Inc., fguo@vmware.com
Kiran Tati, VMware, Inc., ktati@vmware.com
Rajesh Venkatasubramanian, VMware, Inc., vrajeshi@vmware.com

Abstract

Virtualization of computer hardware continues to reduce the cost of operation in datacenters. It enables users to consolidate virtual hardware on less physical hardware, thereby efficiently using hardware resources. The consolidation ratio is a measure of the virtual hardware that has been placed on physical hardware. A higher consolidation ratio typically indicates greater efficiency.

VMware's ESX Server is a hypervisor that enables competitive memory and CPU consolidation ratios. ESX allows users to power on virtual machines (VMs) with a total configured memory that exceeds the memory available on the physical machine. This is called memory overcommitment. Memory overcommitment raises the consolidation ratio, increases operational efficiency, and lowers the total cost of operating virtual machines. Memory overcommitment in ESX is reliable; it does not cause VMs to be suspended or terminated under any conditions.

This article describes memory overcommitment in ESX, analyzes the cost of various memory reclamation techniques, and empirically demonstrates that memory overcommitment induces an acceptable performance penalty in workloads. Finally, best practices for implementing memory overcommitment are provided.

General Terms: memory management, memory overcommitment, memory reclamation

Keywords: ESX Server, memory, resource management

1 Introduction

VMware's ESX Server offers competitive operational efficiency of virtual machines (VMs) in the datacenter. It enables users to consolidate VMs on a physical machine while reducing the cost of operation. The consolidation ratio is a measure of the number of VMs placed on a physical machine. A higher consolidation ratio indicates a lower cost of operation. ESX enables users to operate VMs with a high consolidation ratio. ESX Server's overcommitment technology is an enabling technology, allowing users to achieve a higher consolidation ratio. Overcommitment is the ability to allocate more virtual resources than available physical resources. ESX Server offers users the ability to overcommit memory and CPU resources on a physical machine.

ESX is said to be CPU-overcommitted when the total configured virtual CPU resources of all powered-on VMs exceed the physical CPU resources on ESX. When ESX is CPU-overcommitted, it distributes physical CPU resources amongst powered-on VMs in a fair and efficient manner. Similarly, ESX is said to be memory-overcommitted when the total configured guest memory size of all powered-on VMs exceeds the physical memory on ESX. When ESX is memory-overcommitted, it distributes physical memory fairly and efficiently amongst powered-on VMs. Both CPU and memory scheduling are done so as to give resources to those VMs which need them most, while reclaiming resources from those VMs which are not actively using them.

Memory overcommitment in ESX is very similar to that in traditional operating systems (OS) such as Linux and Windows. In a traditional OS, a user may execute applications whose total mapped memory exceeds the amount of memory available to the OS. This is memory overcommitment. If the applications consume memory exceeding the available physical memory, then the OS reclaims memory from some of the applications and swaps it to a swap space. It then distributes the available free memory between applications.

Similar to traditional OSes, ESX allows VMs to power on with a total configured memory size that may exceed the memory available to ESX. For the purpose of discussion in this article, the memory installed in an ESX Server is called ESX memory. If VMs consume all the ESX memory, then ESX will reclaim memory from VMs. It will then distribute the ESX memory in an efficient and fair manner to all VMs such that the memory resource is best utilized. A simple example of memory overcommitment is two VMs powered on in an ESX Server whose combined configured memory size exceeds the installed ESX memory.

With a continuing fall in the cost of physical memory, it can be argued that ESX does not need to support memory overcommitment. However, in addition to the traditional use case of improving the consolidation ratio, memory overcommitment can also be used in times of disaster recovery, high availability (HA), and distributed power management (DPM) to provide good performance. This technology will provide ESX with a leading edge over contemporary hypervisors.

Memory overcommitment does not necessarily lead to performance loss in a guest OS or its applications. Experimental results presented in this paper with two real-life workloads show gradual performance degradation when ESX is progressively overcommitted.

This article describes memory overcommitment in ESX. It provides guidance for best practices and talks about potential pitfalls. The remainder of this article is organized as follows: Section 2 provides background information on memory overcommitment, Section 3 describes memory overcommitment, Section 4 provides a quantitative understanding of the performance characteristics of memory overcommitment, and Section 5 concludes the article.


2 Background and Related Work

Memory overcommitment enables a higher consolidation ratio in a hypervisor. Using memory overcommitment, users can consolidate VMs on a physical machine such that physical resources are utilized in an optimal manner while delivering good performance. For example, in a virtual desktop infrastructure (VDI) deployment, a user may operate many Windows VMs, each containing a word processing application. It is possible to overcommit a hypervisor with such VDI VMs. Since the VMs contain similar OSes and applications, many of their memory pages may contain similar content. The hypervisor will find and consolidate memory pages with identical content from these VMs, thus saving memory. This enables better utilization of memory and enables a higher consolidation ratio.

Related work: Contemporary hypervisors such as Hyper-V (1), KVM (2), and Xen (3) implement different memory overcommitment, reclamation, and optimization strategies. Table 1 summarizes the memory reclamation technologies implemented in existing hypervisors.

METHOD               ESX   HYPER-V   KVM   XEN
share                 X               X
balloon               X       X       X     X
compress              X
hypervisor swap       X       X       X
memory hot-add                X
transcendent memory                         X

Table 1. Comparing memory overcommitment technologies in existing hypervisors. ESX implements reliable overcommitment.

Hyper-V uses dynamic memory for supporting memory overcommitment. With dynamic memory, each VM is configured with a small initial RAM when powered on. When the guest applications require more memory, a certain amount of memory will be hot-added to the VM and the guest OS. When a host lacks free memory, a balloon driver will reclaim memory from other VMs and make memory available for hot-adding to the demanding VM. In rare and restricted scenarios, Hyper-V will swap VM memory to a host swap space. Dynamic memory works for the latest versions of the Windows OS, and it uses only ballooning and hypervisor-level swapping to reclaim memory. ESX, on the other hand, works for all guest OSes. It also uses content-based page sharing and memory compression. This approach improves VM performance as compared to the use of only ballooning and hypervisor-level swapping.

VMs are regular processes in KVM, and therefore standard memory management techniques like swapping apply. For Linux guests, a balloon driver is installed, and it is controlled by the host via the balloon monitor command. Some hosts also support Kernel Samepage Merging (KSM) [1], which works similarly to ESX page sharing. Although KVM presents different memory reclamation techniques, it requires certain hosts and guests to support memory overcommitment. In addition, the management and policies for the interactions among the memory reclamation techniques are missing in KVM.

XenServer uses a mechanism called Dynamic Memory Control (DMC) to implement memory reclamation. It works by proportionally adjusting memory among running VMs based on predefined minimum and maximum memory. VMs generally run with maximum memory, and the memory can be reclaimed via a balloon driver when memory contention in the host occurs. However, Xen does not provide a way to overcommit the host physical memory; hence its consolidation ratio is largely limited. Unlike other hypervisors, Xen provides a transcendent memory management mechanism to manage all host idle memory and guest idle memory. The idle memory is collected into a pool and distributed based on the demand of running VMs. This approach requires the guest OS to be paravirtualized and only works well for guests with non-concurrent memory pressure.

When compared to existing hypervisors, ESX allows for reliable memory overcommitment to achieve a high consolidation ratio with no requirements for modifications to running guests. It implements various memory reclamation techniques to enable overcommitment and manages them in an efficient manner to mitigate possible performance penalties to VMs.

Work on optimal use of host memory in a hypervisor has also been demonstrated by the research community. An optimization of the KSM technique has been attempted in KVM with Singleton [6]. Sub-page-level page sharing using patching has been demonstrated in Xen with Difference Engine [2]. Paravirtualized guests and sharing of pages read from storage devices have been shown using Xen in Satori [4].

Memory ballooning has been demonstrated for the z/VM hypervisor using CMM [5]. Ginkgo [3] implements a hypervisor-independent overcommitment framework allowing Java applications to optimize their memory footprint. All these works target specific aspects of memory overcommitment challenges in virtualization environments. They are valuable references for future optimizations in ESX memory management.

Background: Memory overcommitment in ESX is reliable (see Table 1). This implies that VMs will not be prematurely terminated or suspended owing to memory overcommitment. Memory overcommitment in ESX is theoretically limited by the overhead memory of ESX. ESX guarantees reliability of operation under all levels of overcommitment.

1 www.microsoft.com/hyper-v-server/
2 www.linux-kvm.org/
3 www.xen.org/
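The idea of collapsing identical pages, described above for VDI deployments, can be sketched in a few lines of C. This is an illustration of content-based sharing in general, not ESX's page-sharing implementation; hash_page(), pages_sharable(), and PAGE_SIZE are assumptions made for the example.

/* Minimal sketch of content-based page sharing: hash each guest page's
 * contents; pages whose hashes collide and whose contents compare equal
 * can be backed by a single read-only physical page. Illustrative only. */
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

static uint64_t hash_page(const uint8_t *p)
{
    uint64_t h = 1469598103934665603ULL;      /* FNV-1a over the page */
    for (size_t i = 0; i < PAGE_SIZE; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Returns 1 if the two pages are byte-identical and may be shared. */
static int pages_sharable(const uint8_t *a, const uint8_t *b)
{
    return hash_page(a) == hash_page(b) &&    /* cheap filter            */
           memcmp(a, b, PAGE_SIZE) == 0;      /* full compare on a match */
}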


When ESX is memory overcommitted, it allocates memory to those powered-on VMs that need it most and will perform better with more memory. At the same time, ESX reclaims memory from those VMs which are not actively using it. Memory reclamation is therefore an integral component of memory overcommitment. ESX uses different memory reclamation techniques to reclaim memory from VMs. The memory reclamation techniques are transparent page sharing, memory ballooning, memory compression, and memory swapping.

ESX has an associated memory state, which is determined by the amount of free ESX memory at a given time. The states are high, soft, hard, and low. Table 2 shows the ESX state thresholds. Each threshold is internally split into two sub-thresholds to avoid oscillation of the ESX memory state near the threshold. At each memory state, ESX utilizes a combination of memory reclamation techniques to reclaim memory. This is shown in Table 3.

Table 2. Free-memory state transition thresholds in ESX. (a) User visible. (b) Internal thresholds to avoid oscillation.

STATE   SHARE   BALLOON   COMPRESS   SWAP   BLOCK
high      X
soft      X        X
hard      X                   X        X
low       X                   X        X      X

Table 3. Actions performed by ESX in different memory states.

Transparent page sharing (page sharing) is a passive and opportunistic memory reclamation technique. It operates on a powered-on VM at all memory states throughout its lifetime, looking for opportunities to collapse different memory pages with identical content into one memory page. This method greatly reduces the memory footprint of VMs with common memory content. For example, if an ESX has many VMs executing word processing applications, then ESX may transparently apply page sharing to those VMs and collapse the text and data content of these applications, thereby reducing the footprint of all those VMs. The collapsed memory is freed by ESX and made available for powering on more VMs. This raises the consolidation ratio of ESX and enables higher overcommitment. In addition, if the shared pages are not subsequently written into by the VMs, then they remain shared for a prolonged time, maintaining the reduced footprint of the VM.

Memory ballooning is an active method for reclaiming idle memory from VMs. It is used when ESX is in the soft state. If a VM has consumed memory pages but is not subsequently using them in an active manner, ESX attempts to reclaim them from the VM using ballooning. In this method, an OS-specific balloon driver inside the VM allocates memory from the OS kernel. It then hands the memory to ESX, which is then free to re-allocate it to another VM which might be actively requiring memory. The balloon driver effectively utilizes the memory management policy of the guest OS to reclaim idle memory pages. The guest OS typically reclaims idle memory inside the guest and, if required, swaps it to its own swap space.

When ESX enters the hard state, it actively and aggressively reclaims memory from VMs by swapping out memory to a swap space. During this step, if ESX determines that a memory page is sharable or compressible, then that page is shared or compressed instead. The reclamation done by ESX using swapping is different from that done by the guest OS inside the VM. The guest OS may swap out guest memory pages to its own swap space (for example, a swap partition or page file); ESX uses hypervisor-level swapping to reclaim memory from a VM into its own swap space.

The low state is similar to the hard state. In addition to compressing and swapping memory pages, ESX may block certain VMs from allocating memory in this state. It aggressively reclaims memory from VMs until ESX moves back into the hard state.

Page sharing is a passive memory reclamation technique that operates continuously on a powered-on VM. The remaining techniques are active ones that operate when free memory in ESX is low. Also, page sharing, ballooning, and compression are opportunistic techniques: they do not guarantee memory reclamation from VMs. For example, a VM may not have sharable content, the balloon driver may not be installed, or its memory pages may not yield good compression. Reclamation by swapping is a guaranteed method for reclaiming memory from VMs.

In summary, ESX allows for reliable memory overcommitment to achieve a higher consolidation ratio. It implements various memory reclamation techniques to enable overcommitment while improving efficiency and lowering the cost of operation of VMs. The next section describes memory overcommitment.
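Table 3's mapping from free-memory state to reclamation actions can be summarized in a small sketch. The enum, struct, and dispatch below are illustrative assumptions for exposition, not ESX internals.

/* Sketch of choosing reclamation actions per free-memory state,
 * mirroring Table 3.  Names and the dispatch are illustrative only. */
typedef enum { MEM_HIGH, MEM_SOFT, MEM_HARD, MEM_LOW } mem_state_t;

struct actions { int share, balloon, compress, swap, block; };

static struct actions actions_for_state(mem_state_t s)
{
    switch (s) {
    case MEM_HIGH: return (struct actions){ 1, 0, 0, 0, 0 };
    case MEM_SOFT: return (struct actions){ 1, 1, 0, 0, 0 };
    case MEM_HARD: return (struct actions){ 1, 0, 1, 1, 0 };
    case MEM_LOW:  return (struct actions){ 1, 0, 1, 1, 1 };
    }
    return (struct actions){ 1, 0, 0, 0, 0 };   /* page sharing always runs */
}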


3 Memory Overcommitment

Memory overcommitment enables a higher consolidation ratio of VMs on an ESX Server. It reduces the cost of operation while utilizing compute resources efficiently. This section describes memory overcommitment and its performance characteristics.

3.1 Definitions

ESX is said to be memory overcommitted when VMs are powered on such that their total configured memory size is greater than ESX memory. Figure 1 shows an example of memory overcommitment on ESX.

Figure 1(a) shows the schematic diagram of a memory-undercommitted ESX Server. In this example, two VMs are powered on, and their total configured memory size is less than ESX memory. Hence ESX is considered to be memory undercommitted.

Figure 1(b) shows the schematic diagram of a memory-overcommitted ESX Server. In this example, three VMs are powered on, and their total configured memory size is more than ESX memory. Hence ESX is considered to be memory overcommitted.

The scenarios described above omit the memory overhead consumed by ESX. ESX consumes a fixed amount of memory for its own text and data structures. In addition, it consumes overhead memory for each powered-on VM. ESX also reserves a small amount of memory called minfree. This amount is a buffer against rapid allocations by memory consumers. ESX is in the high state as long as there is at least this amount of memory free. If the free memory dips below this value, then ESX is no longer in the high state and it begins to actively reclaim memory.

Figure 1(c) shows the schematic diagram representing an undercommitted ESX when overhead memory is taken into account. In this diagram, the overhead memory consumed by ESX for itself and for each powered-on VM is shown. Figure 1(d) shows the schematic diagram representing an overcommitted ESX when overhead memory is taken into account.

Figure 1(e) shows the theoretical limit of memory overcommitment in ESX. In this case, all of ESX memory is consumed by ESX overhead and per-VM overhead. VMs will be able to power on and boot; however, execution of the VMs will be extremely slow.

Figure 1. Memory overcommitment shown with and without overhead memory. (a) Undercommitted ESX. (b) Overcommitted ESX; this model is typically followed when describing overcommitment. (c) Undercommitted ESX shown with overhead memory. (d) Overcommitted ESX shown with overhead memory. (e) Limit of memory overcommitment.

For simplicity of discussion and calculation, the definition of overcommitment from Figure 1(b) is followed. Overhead memory is ignored for defining memory overcommitment. From this figure,

    overcommit = ( Σ_v memsize(v) ) / ESX memory    (1)

where
    overcommit   memory overcommitment factor
    v            a powered-on VM in ESX
    memsize(v)   configured memory size of v
    ESX memory   total installed ESX memory

The representation from this figure is used in the remainder of this article.

To understand memory overcommitment and its effect on VMs and applications, mapped, consumed, and working set memory are described.

3.2 Mapped Memory

The definition of memory overcommitment does not consider the memory consumption or memory access characteristics of the powered-on VMs. Immediately after a VM is powered on, it does not have any memory pages allocated to it. Subsequently, as the guest OS boots, the VM accesses pages in its memory address space by reading or writing into them. ESX allocates physical memory pages to back the virtual address space of the VM during this access. Gradually, as the guest OS completes booting and applications are launched inside the VM, more pages in the virtual address space are backed by physical memory pages. During the lifetime of the VM, the VM may or may not access all pages in its virtual address space.

Windows, for example, writes the zero pattern (4) to the complete memory address space of the VM. This causes ESX to allocate memory pages for the complete address space by the time Windows has completed booting. On the other hand, Linux does not access the complete address space of the VM when it boots. It accesses only the memory pages required to load the OS.

4 A memory page with all 0x00 content is a zero page.
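Equation (1) above is easy to express as a helper routine. This is a minimal sketch; the function name and units (MB) are assumptions, and the worked numbers in the comment are hypothetical.

/* Tiny helper mirroring Equation (1): the overcommitment factor is the
 * sum of the configured memory sizes of all powered-on VMs divided by
 * ESX memory (overhead memory ignored).  Illustrative sketch only. */
static double overcommit_factor(const unsigned long *memsize_mb,
                                int num_vms, unsigned long esx_memory_mb)
{
    unsigned long total = 0;
    for (int i = 0; i < num_vms; i++)
        total += memsize_mb[i];               /* sum of VM memsizes */
    return (double)total / (double)esx_memory_mb;
}
/* Example (hypothetical sizes): three VMs of 1024, 2048 and 2048 MB on a
 * host with 4096 MB of ESX memory give 5120/4096 = 1.25, i.e. the host
 * is overcommitted because the factor exceeds 1.0. */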


All memory pages of a VM which are ever accessed by the VM are considered mapped by ESX. A mapped memory page is backed by a physical memory page by ESX during the very first access by the VM. A mapped page may subsequently be reclaimed by ESX; it is considered as mapped throughout the lifetime of the VM.

When ESX is overcommitted, the total pages mapped by all VMs may or may not exceed ESX memory. Hence it is possible for ESX to be memory overcommitted but, at the same time, owing to the nature of the guest OS and its applications, for the total mapped memory to remain within ESX memory. In such a scenario, ESX does not actively reclaim memory from VMs, and VM performance is not affected by memory reclamation.

Mapped memory is illustrated in Figure 2. This figure uses the representation from Figure 1(b), where ESX is memory overcommitted. In Figure 2(a), the memory mapped by each VM is shown. The total mapped memory of all VMs is less than ESX memory. In this case ESX is overcommitted, but the total mapped memory is less than ESX memory, so ESX will not actively reclaim memory. In Figure 2(b), the memory mapped by each VM is shown. In this case ESX is overcommitted and, in addition, the total mapped memory in ESX exceeds ESX memory. ESX may or may not actively reclaim memory from the VMs; this depends on the current consumed memory of each VM.
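The distinction drawn above, that "mapped" is sticky while the backing of a page changes as reclamation acts on it, can be captured by a small bookkeeping sketch. The names below are illustrative, not ESX data structures.

/* Sketch of per-page bookkeeping implied by the text: "mapped" is set on
 * the first guest access and stays set for the VM's lifetime, while the
 * backing of the page may later change as memory is reclaimed. */
typedef enum {
    PAGE_UNBACKED,     /* never touched: no physical page yet          */
    PAGE_REGULAR,      /* backed one-to-one by a physical page         */
    PAGE_SHARED,       /* backed by a physical page shared with others */
    PAGE_COMPRESSED,   /* stored compressed                            */
    PAGE_BALLOONED,    /* handed back via the balloon driver           */
    PAGE_SWAPPED       /* written out to hypervisor swap space         */
} backing_t;

struct vm_page {
    int       mapped;   /* sticky: set on first access, never cleared  */
    backing_t backing;  /* changes as reclamation techniques act       */
};

static void on_first_touch(struct vm_page *pg)
{
    pg->mapped  = 1;            /* counted as mapped from now on       */
    pg->backing = PAGE_REGULAR; /* ESX backs it with a physical page   */
}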

3.3 Consumed Memory

A VM is considered to be consuming a physical memory page when a physical memory page is used to back a memory page in its address space. A memory page of the VM may exist in different states with ESX. They are as follows:

Regular: A regular memory page in the address space of a VM is one which is backed by one physical page in ESX.

Shared: A memory page marked as shared in a VM may be shared with many other memory pages. If a memory page is being shared with n other memory pages, then each memory page is considered to be consuming 1/n of a whole physical page.

Compressed: VM pages which are compressed typically consume 1/2 or 1/4 of a physical memory page.

Ballooned: Ballooned memory pages are not backed by any physical page.

Swapped: Swapped memory pages are not backed by any physical page.
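Assuming the fractions just listed, consumed memory can be tallied with a small helper. This is a rough accounting sketch; the function, its parameters, and the compression-ratio argument are assumptions for illustration.

/* Rough accounting of "consumed" memory per the states above: regular
 * pages consume a whole physical page, a page shared n ways consumes
 * 1/n of a page, compressed pages consume a fraction of a page, and
 * ballooned or swapped pages consume none. */
static double consumed_pages(unsigned regular,
                             unsigned shared, unsigned share_degree,
                             unsigned compressed, double compress_ratio,
                             unsigned ballooned, unsigned swapped)
{
    (void)ballooned; (void)swapped;              /* consume no physical page */
    double c = (double)regular;
    if (share_degree > 0)
        c += (double)shared / (double)share_degree;  /* 1/n per shared page */
    c += (double)compressed * compress_ratio;        /* e.g. 0.5 or 0.25    */
    return c;
}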

Consumed memory is illustrated in Figure 3. This figure uses the representation from Figure 1(b), where ESX is memory overcommitted. In Figure 3(a), the mapped and consumed memory of the powered-on VMs and of ESX is shown. ESX is overcommitted; however, the total mapped memory is less than ESX memory. In addition, the total consumed memory is less than the total mapped memory. The consumption is lower than the mapped memory since some memory pages may have been shared, ballooned, compressed, or swapped. In this state, ESX will not actively reclaim memory from VMs. However, the VMs may possess ballooned, compressed, or swapped memory, because ESX may earlier have reclaimed memory from these VMs owing to the memory state at that time.

The total consumption of all VMs taken together cannot exceed ESX memory. This is shown in Figure 3(b). In this figure, ESX is overcommitted and the total mapped memory is greater than ESX memory. However, whenever VMs attempt to consume more memory than ESX memory, ESX will reclaim memory from the VMs and redistribute the ESX memory amongst all VMs. This prevents ESX from running out of memory. In this state, ESX is likely going to actively reclaim memory from VMs to prevent memory exhaustion.


Figure 2. Mapped memory for an overcommitted ESX. The mapped region is shaded. The mapped region in ESX is the sum of the mapped regions in the VMs. (a) Total mapped memory is less than ESX memory. (b) Total mapped memory is more than ESX memory.

Figure 3. Consumed memory for an overcommitted ESX. The consumed and mapped regions are shaded. The consumed (and mapped) region in ESX is the sum of the consumed (mapped) regions in the VMs. (a) Total consumed and mapped memory is less than ESX memory. (b) Total mapped memory is more than ESX memory; total consumed memory is equal to ESX memory.


3.6 Cost of Memory Overcommitment

Memory overcommitment incurs certain costs in terms of compute resources as well as VM performance. This section provides a qualitative understanding of the different sources of cost and their magnitude.

When ESX is memory overcommitted and powered-on VMs attempt to consume more memory than ESX memory, then ESX will begin to actively reclaim memory from VMs. Hence memory reclamation is an integral component of memory overcommitment. Table 4 shows the memory reclamation techniques and their associated costs. Each technique incurs a cost of reclamation when a page is reclaimed, and a cost of page-fault when a VM accesses a reclaimed page. The number of * qualitatively indicates the magnitude of the cost.

                  SHARE   BALLOON   COMPRESS   SWAP (SSD)   SWAP (DISK)
Reclaim
  Memory            *
  CPU               *        *         **
  Storage                    *(1)                    *             *
  Wait
Page-fault
  Memory
  CPU               *        *         **
  Storage                    *(1)                    *             *
  Wait              *       ***(1)      *            **           ****

(1) indicates that this cost may be incurred under certain conditions.

Table 4. Cost of reclaiming memory from VMs and cost of page-fault when a VM accesses a reclaimed page. The number of * qualitatively indicates the magnitude of cost; the values are not mathematically proportional.

A Memory cost indicates that the technique consumes memory for meta-data overhead. A CPU cost indicates that the technique consumes non-trivial CPU resources. A Storage cost indicates that the technique consumes storage space or bandwidth. A Wait cost indicates that the VM incurs a hypervisor-level page-fault cost as a result of the technique; this may lead the guest application to stall and hence lead to a drop in its performance. The reclamation and page-fault costs are described as follows.

a) Page sharing

Reclamation cost: Page sharing is a continuous process which opportunistically shares VM memory pages.

Memory: This technique incurs an ESX-wide memory cost for storing page-sharing meta-data, which is a part of ESX overhead memory.

CPU: Page sharing incurs a nominal CPU cost on a per-VM basis. This cost is typically very small and does not impact the performance of VM applications or benchmarks.

Page-fault cost: When a shared page is read by a VM, it is accessed by the VM in a read-only manner. Hence that shared page does not need to be page-faulted, and there is no page-fault cost. A write access to a shared page incurs a cost.

CPU: When a shared page is written to by a VM, ESX must allocate a new page and replicate the shared content before allowing the write access from the VM. This allocation incurs a CPU cost. Typically this cost is very low and does not significantly affect VM applications and benchmarks.

Wait: The copy-on-write operation when a VM accesses a shared page with write access is fairly fast. VM applications accessing the page do not incur a noticeable temporal cost.

b) Ballooning

Reclamation cost: Memory is ballooned from a VM using a balloon driver residing inside the guest OS. When the balloon driver expands, it may induce the guest OS to reclaim memory from guest applications.

CPU: Ballooning incurs a CPU cost on a per-VM basis, since it induces memory allocation and reclamation inside the VM.

Storage: The guest OS may swap out memory pages to the guest swap space. This incurs storage space and storage bandwidth cost.

Page-fault cost:

CPU: A ballooned page acquired by the balloon driver may subsequently be released by it. The guest OS or application may then allocate and access it. This incurs a page-fault in the guest OS as well as in ESX. The page-fault incurs a low CPU cost, since a memory page simply needs to be allocated.

Storage: During reclamation by ballooning, application pages may have been swapped out by the guest OS. When the application attempts to access such a page, the guest OS needs to swap it in. This incurs a storage bandwidth cost.

Wait: A temporal wait cost may be incurred by an application if its pages were swapped out by the guest OS. The wait cost of swapping in a memory page by the guest OS is smaller overall than that of a hypervisor-level swap-in. This is because, during a page fault in the guest OS by one thread, the guest OS may schedule another thread. However, if ESX is swapping in a page, then it may deschedule the entire VM, because ESX cannot reschedule guest OS threads.

c) Compression

Reclamation cost: Memory is compressed in a VM by compressing a full guest memory page such that it consumes 1/2 or 1/4 of a physical memory page. There is effectively no memory cost, since every successful compression releases memory.

CPU: A CPU cost is incurred for every attempted compression. The CPU cost is typically low and is charged to the VM whose memory is being compressed. The CPU cost is, however, more than that of page sharing. It may lead to noticeably reduced VM performance and may affect benchmarks.


Page-fault cost: When a compressed memory page is accessed by the VM, ESX must allocate a new page and de-compress the page before allowing access to the VM.

CPU: De-compression incurs a CPU cost.

Wait: The VM also waits until the page is de-compressed. This cost is typically not very high, since the de-compression takes place in-memory.

d) Swap

Reclamation cost: ESX swaps out memory pages from a VM to avoid memory exhaustion. The swap-out process takes place asynchronously to the execution of the VM and its applications. Hence the VM and its applications do not incur a temporal wait cost.

Storage: Swapping out memory pages from a VM incurs storage space and storage bandwidth cost.

Page-fault cost: When a VM page-faults a swapped page, ESX must read the page from the swap space synchronously before allowing the VM to access the page.

Storage: This incurs a storage bandwidth cost.

Wait: The VM also incurs a temporal wait cost while the swapped page is synchronously read from the storage device. The temporal wait cost is highest for a swap space located on spinning disk. A swap space located on SSD incurs a lower temporal wait cost.

The costs of reclamation and page-fault vary significantly between the different reclamation techniques. Reclamation by ballooning may incur a storage cost only in certain cases, while reclamation by swapping always incurs a storage cost. Similarly, a page-fault owing to reclamation incurs a temporal wait cost in all cases, and a page-fault on a swapped page incurs the highest temporal wait cost. Table 4 shows the cost of reclamation as well as the cost of page-fault incurred by a VM on a reclaimed memory page.

The reclamation itself does not impact VM and application performance significantly, since the temporal wait cost of reclamation is zero for all techniques. Some techniques have a low CPU cost (6); this cost may be charged to the VM, leading to slightly reduced performance.

However, during a page-fault, a temporal wait cost exists for all techniques. This affects VM and application performance. Write access to a shared page and access to compressed or ballooned pages incur the least performance cost. Access to a swapped page, especially pages swapped to spinning disk, incurs the highest performance cost.

Section 3 described memory overcommitment in ESX. It defined various terms (mapped, consumed, and working set memory) and showed their relation to memory overcommitment. It also showed that memory overcommitment does not necessarily impact VM and application performance, and it provided a qualitative analysis of how memory overcommitment may impact VM and application performance depending on the VM's memory content and access characteristics. The next section provides a quantitative description of overcommitment.

6 The CPU cost of reclamation may affect performance only if ESX is operating at 100% CPU load.
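The three page-fault paths discussed in this section differ mainly in what must happen before the VM can proceed. The skeleton below is an illustrative dispatch, not ESX code; the enum and handler are assumptions for exposition.

/* Sketch of the page-fault paths the cost discussion walks through: a
 * write to a shared page triggers copy-on-write, a compressed page is
 * decompressed in memory, and a hypervisor-swapped page needs a
 * synchronous read from the swap device (the expensive case). */
enum fault_kind { FAULT_SHARED_WRITE, FAULT_COMPRESSED, FAULT_SWAPPED };

static void handle_fault(enum fault_kind k)
{
    switch (k) {
    case FAULT_SHARED_WRITE:
        /* allocate a private copy of the shared page: small CPU cost   */
        break;
    case FAULT_COMPRESSED:
        /* decompress in memory: CPU cost plus a short wait             */
        break;
    case FAULT_SWAPPED:
        /* synchronous read from SSD or disk: storage bandwidth plus
         * the highest temporal wait cost in Table 4                    */
        break;
    }
}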

4 Performance

This section provides a quantitative demonstration of memory overcommitment in ESX. Simple applications are shown to work using memory overcommitment. This section does not provide a comparative analysis using software benchmarks.

4.1 Microbenchmark

This set of experiments was conducted on a development build of ESX. The ESX Server has a quad-core AMD CPU. The VMs used in the experiments all have the same vCPU and memory configuration and contain an RHEL guest OS. The ESX Server consumes some memory for itself (the ESX memory overhead, per-VM memory overhead, and reserved minfree); the remainder is available for allocating to VMs. For these experiments, ESX will actively reclaim memory from VMs when the VMs consume more than the available ESX memory of 8GB. RHEL consumes some memory when booted and idle; this contributes to the mapped and consumed memory of each VM.

These experiments demonstrate the performance of VMs when the total working set varies relative to the available ESX memory. Figure 5 shows the results from three experiments. For the purpose of demonstration and simplicity, three reclamation techniques (page sharing, ballooning, and compression) are disabled. Hypervisor-level memory swapping is the only active memory reclamation technique in these experiments. This reclamation will start when VMs consume more than the available ESX memory.

In these experiments, the presence of hypervisor-level page-faults is an indicator of performance loss. The figures show cumulative swap-in (page-fault) values. When this value rises, the VM experiences a temporal wait cost. When it does not rise, there is no temporal wait cost and hence no performance loss.

Figure 5. Effect of working set on memory reclamation and page-fault. Available ESX memory = 8GB. (a) Mapped memory and working set < available ESX memory. (b) Mapped memory > available ESX memory, working set < available ESX memory. (c) Mapped memory and working set > available ESX memory.
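The hand-crafted memory-stress workloads used in these microbenchmarks follow a simple shape: allocate a buffer, write a random pattern into it, then read it back round-robin. A minimal sketch is shown below; the buffer size and iteration count are placeholders, not the values used in the experiments.

/* Allocate a buffer, write a random pattern into all of it (mapping every
 * page), then read it back round-robin, touching one byte per 4KB page. */
#include <stdlib.h>
#include <stdint.h>

static uint64_t touch_round_robin(size_t bytes, int iterations)
{
    uint8_t *buf = malloc(bytes);
    if (!buf)
        return 0;
    for (size_t i = 0; i < bytes; i++)
        buf[i] = (uint8_t)rand();             /* map every page by writing */
    uint64_t sum = 0;
    for (int it = 0; it < iterations; it++)   /* steady round-robin reads  */
        for (size_t i = 0; i < bytes; i += 4096)
            sum += buf[i];                    /* one touch per 4KB page    */
    free(buf);
    return sum;                               /* defeat dead-code removal  */
}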

Experiment a: Figure 5(a) shows an experiment in which the total mapped memory and the total working set memory always remain less than the available ESX memory.


Two VMs are powered on whose total configured memory size exceeds ESX memory, so ESX is overcommitted. Each VM executes an identical hand-crafted workload. The workload is a memory-stress program: it allocates a buffer, writes a random pattern into all of the allocated memory, and then continuously reads that memory in a round-robin manner. This workload is executed in both VMs, and the total memory mapped and actively used by the VMs remains less than the available ESX memory. The workload is executed for a fixed duration.

The figure shows the mapped memory of each VM and the cumulative swap-in for each VM. The X-axis shows time in seconds; the Y-axis (left) shows mapped memory in MB, and the Y-axis (right) shows cumulative swap-in. The memory mapped by each VM is the workload buffer plus the memory used by the booted RHEL guest; the total is less than the available ESX memory. It can be seen from this figure that there is no swap-in (page-fault) activity at any time, and hence there is no performance loss. The memory mapped by the VMs rises as the workload allocates memory. Subsequently, as the workload accesses the memory in a round-robin manner, all of the memory is backed by physical memory pages by ESX.

This experiment demonstrates that although memory is overcommitted, VMs mapping and actively using less than the available ESX memory will not be subjected to active memory reclamation and hence will not experience any performance loss.

Experiment b: Figure 5(b) shows an experiment in which the total mapped memory exceeds the available ESX memory while the total working set memory is less than the available ESX memory.

Two VMs are powered on, again resulting in memory overcommitment. The workload in each VM allocates a large buffer and writes a random pattern into it. It then reads a fixed block of that buffer in a round-robin manner for a number of iterations; thereafter the same fixed block is read in a round-robin manner for a fixed duration. This workload is executed in both VMs. The total memory mapped by the two VMs (the workload buffers plus the booted RHEL guests) exceeds the available ESX memory, while the working set (the fixed blocks being re-read) is smaller than the available ESX memory.

The figure shows the mapped memory of each VM and the cumulative swap-in. The X-axis shows time in seconds; the Y-axis (left) shows mapped memory in MB, and the Y-axis (right) shows cumulative swap-in. It can be seen from this figure that as the workload maps memory, the mapped memory rises. Thereafter, there is an initial rise in swap-in activity as the working set is page-faulted. After the working set is page-faulted, the cumulative swap-in is steady.

This experiment demonstrates that although memory is overcommitted and the VMs have mapped more memory than the available ESX memory, VMs will perform better when their working set is smaller than the available ESX memory.

Experiment c: Figure 5(c) shows an experiment in which the total mapped memory as well as the total working set exceed the available ESX memory.

Two VMs are powered on, and ESX is again overcommitted. Each VM runs a workload that allocates a buffer, writes a random pattern into it, and then reads all of that memory continuously in a round-robin manner for a number of iterations. The total memory mapped by the VMs (the workload buffers plus the booted RHEL guests) and the total working set both exceed the available ESX memory.

The figure shows the mapped memory of each VM and the cumulative swap-in. The X-axis shows time in seconds; the Y-axis (left) shows mapped memory in MB, and the Y-axis (right) shows cumulative swap-in. It can be seen from this figure that a steady page-fault rate is maintained once the workloads have mapped the target memory. This indicates that as the workloads are accessing memory, they are experiencing page-faults. ESX is continuously reclaiming memory from each VM as the VM page-faults its working set.

This experiment demonstrates that workloads will experience steady page-faults, and hence performance loss, when the working set exceeds the available ESX memory. Note that if the working set read-accesses shared pages and page sharing is enabled (the default ESX behavior), then a page-fault is avoided.

In this section, experiments were designed to highlight the basic working of overcommitment. In these experiments, page sharing, ballooning, and compression were disabled for simplicity. These reclamation techniques reclaim memory effectively before reclamation by swapping can take place. Since page faults to shared, ballooned, and compressed memory have a lower temporal wait cost, workload performance is better with them enabled. This will be demonstrated in the next section.

4.2 Real Workloads

In this section, the experiments are conducted using vSphere. The ESX Server has a quad-core AMD CPU. The workloads used to evaluate memory overcommitment performance are DVD Store (7) and a VDI workload. Experiments were conducted with default memory management configurations for page sharing, ballooning, and compression.

Experiment d: The DVD Store workload simulates online database operations for ordering DVDs. In this experiment, five DVD Store VMs are used, each configured with multiple vCPUs and several GB of memory and containing a Windows Server OS and SQL Server. The performance metric is the total operations per minute of all VMs.

The total configured memory size of all powered-on VMs was less than the installed RAM of the ESX Server, so the server was effectively undercommitted. Hence memory overcommitment was simulated with the use of a memory hog VM. The memory hog VM had full memory reservation, and its configured memory size was progressively increased in each run. This effectively reduced the ESX memory available to the DVD Store VMs.

7 http://en.community.dell.com/techcenter/extras/w/wiki/dvd-store.aspx
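The simulated overcommitment factor in these memory-hog runs follows directly from Equation (1), with the hog VM's reservation subtracted from the installed memory. A tiny sketch, with hypothetical numbers in the comment:

/* Simulating overcommitment with a fully reserved "memory hog" VM: the
 * hog's reservation is subtracted from installed RAM, and the
 * overcommitment factor is recomputed against the remaining ESX memory. */
static double simulated_overcommit(unsigned long total_vm_mb,
                                   unsigned long installed_mb,
                                   unsigned long hog_reservation_mb)
{
    unsigned long effective = installed_mb - hog_reservation_mb;
    return (double)total_vm_mb / (double)effective;
}
/* Example (hypothetical sizes): 65536 MB of workload VMs on a host with
 * 98304 MB installed and a 65536 MB hog gives 65536.0/32768 = 2.0. */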


Figure 6 shows the performance of the DVD Store VMs for decreasing values of ESX memory. At these data points, the memory hog VM has a correspondingly larger configured memory size. Owing to the reduced simulated ESX memory, the simulated overcommitment factor, computed using Equation (1) with respect to the total VM memory size, rises from run to run.

The Y-axis (right) shows the total amount of ballooned and swapped memory in MB from all VMs. It can be seen that the amount of ballooned memory increases gradually, without significant hypervisor-level swapping, as ESX memory is reduced. The Y-axis (left) shows the operations per minute (OPM) executed by the workload. The rising ballooned memory contributes to reduced application performance as ESX memory is reduced. This is owing to the wait cost to the application as some of its memory is swapped out by the guest OS. Across the intermediate data points, the OPM decreases moderately owing to ballooning. However, when ESX memory is reduced to its smallest value, ballooning by itself is insufficient to reclaim enough memory, and non-trivial amounts of hypervisor-level swapping take place. Owing to the higher wait cost of hypervisor-level page-faults, there is a larger performance loss and the OPM drops further.

This experiment was repeated after attaching an SSD to the ESX Server and using it as the swap space. The OPM in the presence of the SSD is shown in the same figure. It can be seen that the performance degradation is smaller than without the SSD.

Figure 6. DVD Store: operations per minute (OPM) of five DVD Store VMs when reducing ESX memory.

Experiment e: The VDI workload is a set of interactive office user-level applications, such as the Microsoft Office suite and Internet Explorer. The workload is a custom-made set of scripts which simulate user actions on the VDI applications. The scripts trigger mouse clicks and keyboard inputs to the applications, and the time to complete each operation is recorded. In one complete run of the workload, a fixed set of user actions is performed. The measured performance metric is the 95th percentile of the latency values of all operations; a smaller latency indicates better performance.

In this experiment, fifteen VDI VMs are powered on in the ESX Server. Each VM is configured with one vCPU and a few GB of memory, and contains a Windows OS. Similar to experiment d, a memory hog VM with full memory reservation is used to simulate overcommitment by effectively reducing the ESX memory on the ESX Server.

Figure 7 shows the 95th-percentile operation latency in the VDI VMs for decreasing values of ESX memory. At these data points, the memory hog VM has a correspondingly larger configured memory size, and the simulated overcommitment factor (with respect to the total VM memory size) increases.

The Y-axis (right) shows the total amount of ballooned and swapped memory (MB) from all VMs. The Y-axis (left) shows the average latency of operations in the VDI workload. At the largest ESX memory value, ballooning and swapping are not observed because 1) page sharing helps reclaim memory, and 2) the applications' total mapped memory and working set fit into the available ESX memory. When ESX memory is reduced, a few GB of memory in total are ballooned, which causes a slight increase in the average latency. This indicates that the VMs' active working set is being reclaimed at this point. When ESX memory decreases further, the amount of ballooned memory increases dramatically and the swapped memory also rises. As a result, the average latency increases several times over. This is owing to the significantly higher page-fault wait cost of ballooning and hypervisor-level swapping.

This experiment was repeated after attaching an SSD to the ESX Server. ESX is designed to utilize an SSD as a device for swapping out VM memory pages. The average latency in the presence of the SSD is shown in the same figure. It can be seen that the performance degradation is significantly lower. This experiment shows that certain workloads maintain performance in the presence of an SSD even when the ESX Server is severely overcommitted.

Figure 7. VDI: average 95th-percentile operation latency of fifteen VDI VMs when reducing ESX memory.

The experiments in this section show that memory overcommitment does not necessarily indicate performance loss. As long as the application's active working set size is smaller than the available


ESX memory, the performance degradation may be tolerable. In many situations where memory is slightly or moderately overcommitted, page sharing and ballooning are able to reclaim memory gracefully without a significant performance penalty. However, under high memory overcommitment, hypervisor-level swapping may occur, leading to significant performance degradation.

5 Conclusion

Reliable memory overcommitment is a unique capability of ESX not present in any contemporary hypervisor. Using memory overcommitment, ESX can power on VMs such that the total configured memory of all powered-on VMs exceeds ESX memory. ESX distributes memory between all VMs in a fair and efficient manner so as to maximize utilization of the ESX Server. At the same time, memory overcommitment is reliable. This means that VMs will not be prematurely terminated or suspended owing to memory overcommitment. The memory reclamation techniques of ESX guarantee safe operation of VMs in a memory-overcommitted environment.

6 Acknowledgments

Memory overcommitment in ESX was designed and implemented by Carl Waldspurger [7].

References

1 A. Arcangeli, I. Eidus, and C. Wright. Increasing memory density by using KSM. In Proceedings of the Linux Symposium, pages 313-328, 2009.

2 D. Gupta, S. Lee, M. Vrable, S. Savage, A. C. Snoeren, G. Varghese, G. M. Voelker, and A. Vahdat. Difference engine: harnessing memory redundancy in virtual machines. Commun. ACM, 53(10):85-93, Oct. 2010.

3 M. Hines, A. Gordon, M. Silva, D. Da Silva, K. D. Ryu, and M. Ben-Yehuda. Applications Know Best: Performance-Driven Memory Overcommit with Ginkgo. In Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference on, pages 130-137, 2011.

4 G. Milos, D. Murray, S. Hand, and M. Fetterman. Satori: Enlightened page sharing. In Proceedings of the 2009 conference on USENIX Annual Technical Conference. USENIX Association, 2009.

5 M. Schwidefsky, H. Franke, R. Mansell, H. Raj, D. Osisek, and J. Choi. Collaborative Memory Management in Hosted Linux Environments. In Proceedings of the Linux Symposium, pages 313-328, 2006.

6 P. Sharma and P. Kulkarni. Singleton: system-wide page deduplication in virtual environments. In Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing, HPDC '12, pages 15-26, New York, NY, USA, 2012. ACM.

7 C. A. Waldspurger. Memory resource management in VMware ESX server. SIGOPS Oper. Syst. Rev., 36(SI):181-194, Dec. 2002.


Redefining ESXi IO Multipathing in the Flash Era

Fei Meng, North Carolina State University, fmeng@ncsu.edu
Li Zhou*, Facebook, Inc., lzhou@fb.com
Sandeep Uttamchandani, VMware, Inc., sandeepu@vmware.com
Xiaosong Ma, North Carolina State University & Oak Ridge National Laboratory, ma@csc.ncsu.edu

* Li Zhou was a VMware employee when working on this project.

Abstract

At the advent of virtualization, primary storage equated to spinning disks. Today the enterprise storage landscape is rapidly changing, with low-latency all-flash storage arrays, specialized flash-based IO appliances, and hybrid arrays with built-in flash. Also, with the adoption of host-side flash cache solutions (similar to vFlash), the read-write mix of operations emanating from the server is more write-dominated (since reads are increasingly served locally from cache). Is the original ESXi IO multipathing logic, which was developed for disk-based arrays, still applicable in this new flash storage era? Are there optimizations we can develop as a differentiator in the vSphere platform for supporting this core functionality?

This paper argues that the existing IO multipathing in ESXi is not optimal for flash-based arrays. In our evaluation, the maximum IO throughput is not bound by a hardware resource bottleneck, but rather by the Pluggable Storage Architecture (PSA) module that implements the multipathing logic. The root cause is the affinity maintained by the PSA module between the host traffic and a subset of the ports on the storage array (referred to as Active Optimized paths). Today the Active Un-optimized paths are used only during hardware failover events, since un-optimized paths exhibit a higher service time than optimized paths. Thus, even though the Host Bus Adaptor (HBA) hardware is not completely saturated, we are artificially constrained in software by limiting ourselves to the Active Optimized paths only.

We implemented a new multipathing approach, called PSP_Adaptive, as a Path Selection Plug-in in the PSA. This approach detects IO path saturation (leveraging existing SIOC techniques) and spreads the write operations across all the available paths (optimized and un-optimized), while reads continue to maintain their affinity paths. The key observation was that the higher service times on the un-optimized paths are still lower than the wait times on the optimized paths. Further, read affinity is important to maintain, given the session-based prefetching and caching semantics used by the storage arrays. During periods of non-saturation, our approach switches back to the traditional affinity model for both reads and writes. In our experiments, we observed significant improvements in throughput for some workload scenarios. We are currently in the process of working with a wide range of storage partners to validate this model for various Asymmetric Logical Unit Access (ALUA) storage implementations and even MetroClusters.

1 Introduction

Traditionally, storage arrays were built of spinning disks with a few gigabytes of battery-backed NVRAM as local cache. The typical IO response time was multiple milliseconds, and the maximum supported IOPS were a few thousand. Today, in the flash era, arrays are advertising IO latencies of under a millisecond and IOPS on the order of millions. XtremIO (now EMC), Violin Memory, WhipTail, Nimbus, SolidFire, Pure Storage, Nimble, GridIron (now Violin), CacheIQ (now NetApp), and Avere Systems are some of the emerging startups developing storage solutions that leverage flash. Additionally, established players (namely EMC, IBM, HP, Dell, and NetApp) are also actively developing solutions. Flash is also being adopted within servers as a flash cache to accelerate IOs by serving them locally. Given the current trends, it is expected that all-flash and hybrid arrays will completely replace traditional disk-based arrays by the end of this decade. To summarize, the IO saturation bottleneck is now shifting: administrators are no longer worried about how many requests the array can service, but rather about how fast the server can be configured to send these IO requests and utilize the bandwidth.

Multipathing is a mechanism for a server to connect to a storage array using multiple available fabric ports. ESXi's multipathing logic is implemented as a Path Selection Plug-in (PSP) within the PSA (Pluggable Storage Architecture) layer. The ESXi product today ships with three different multipathing algorithms in the NMP (Native Multipathing Plug-in) framework: PSP_FIXED, PSP_MRU, and PSP_RR. Both PSP_FIXED and PSP_MRU utilize only one fixed path for IO requests and do not perform any load balancing, while PSP_RR does a simple round-robin load balancing among all Active Optimized paths. There are also commercial solutions available (as described in the related work section) that essentially differ in how they distribute load across the Active Optimized paths.

In this paper, we explore a novel idea of using both Active Optimized and Un-optimized paths concurrently. Active Un-optimized paths have traditionally been used only for failover scenarios, since these paths are known to exhibit a higher service time compared to Active Optimized paths. The hypothesis of our approach was that the service times were high because the contention bottleneck is the array bandwidth, limited by the disk IOPS. In the new flash era, the array is far from being a hardware bottleneck. We discovered that our hypothesis is half true, and we designed a plug-in solution around it, called PSP_Adaptive.
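A minimal sketch of the policy outlined in the abstract, assuming a simplified path list rather than the real PSA/PSP interfaces: reads keep their affinity to Active Optimized paths, while writes are spread over all paths once saturation is detected.

/* Illustrative path-selection policy: reads stay on Active Optimized
 * paths, writes round-robin over all paths when the optimized paths are
 * saturated.  Types and names are assumptions made for this sketch. */
enum io_kind { IO_READ, IO_WRITE };

struct paths {
    int num_optimized;      /* Active Optimized paths   */
    int num_total;          /* Optimized + Un-optimized */
    int rr_next;            /* round-robin cursor       */
};

static int select_path(struct paths *p, enum io_kind kind, int saturated)
{
    if (kind == IO_READ || !saturated)
        return p->rr_next++ % p->num_optimized;   /* affinity model    */
    return p->rr_next++ % p->num_total;           /* spread the writes */
}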


Multipathing in ESXi Today

Figure 1 shows the IO multipathing architecture of vSphere. In a typical SAN configuration, each host has multiple Host Bus Adapter (HBA) ports connected to both of the storage array's controllers. The host has multiple paths to the storage array and performs load balancing among all paths to achieve better performance. In vSphere, this is done by the Path Selection Plug-ins (PSPs) at the PSA (Pluggable Storage Architecture) layer. The PSA framework collapses multiple paths to the same datastore and presents one logical device to the upper layers, such as the file system. Internally, the NMP (Native Multipathing Plug-in) framework allows different path-selection policies by supporting different PSPs. The PSPs decide which path to route an IO request to. vSphere provides three different PSPs: PSP_FIXED, PSP_MRU, and PSP_RR. Both PSP_FIXED and PSP_MRU utilize only one path for IO requests and do not do any load balancing, while PSP_RR does a simple round-robin load balancing among all active paths for active-active arrays.
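The round-robin idea behind PSP_RR can be sketched as follows. This illustrates the concept only; it is not the plug-in's actual interface.

/* Minimal sketch of round-robin path selection over the Active Optimized
 * paths only (Un-optimized paths are kept for failover). */
struct rr_state { int next; };

static int psp_rr_pick(struct rr_state *s, int num_active_optimized)
{
    int path = s->next % num_active_optimized;      /* cycle through paths */
    s->next = (s->next + 1) % num_active_optimized;
    return path;                                     /* index of chosen path */
}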

In summary, none of the existing path load-balancing implementations concurrently utilize the Active Optimized and Un-optimized paths of ALUA storage arrays for IOs. VMware can provide significant differentiated value by supporting PSP_Adaptive as a native option for flash-based arrays.

The key contributions of this paper are:

- A novel approach for I/O multipathing in ESXi, specifically optimized for flash-enabled arrays. Within this context, we designed an algorithm that adaptively switches between traditional multipathing and spreading writes across Active Optimized and Active Un-optimized paths.

- An implementation of the PSP_Adaptive plug-in within the PSA module, and an experimental evaluation.

The paper is organized as follows: the previous section covered multipathing in ESXi today, Section 2 describes related work, Section 3 covers the design details, and the remaining sections present the evaluation, followed by conclusions.

2 Related Work

There are three different types of storage arrays with respect to the dual-controller implementation: active/active, active/standby, and Asymmetric Logical Unit Access (ALUA). Defined in the SCSI standard, ALUA provides a standard way for hosts to discover and manage multiple paths to the target. Unlike active/active systems, ALUA affiliates one of the controllers as optimized, and the current VMware ESXi path selection plug-in serves IOs from this controller. Unlike active/standby arrays, which cannot serve IO through the standby controller, an ALUA storage array is able to serve IO requests from both the optimized and un-optimized controllers. ALUA storage arrays have become very popular; today most mainstream storage arrays (e.g., the most popular arrays made by EMC and NetApp) support ALUA.

Multipathing for Storage Area Networks (SAN) was designed as a fault-tolerance technique to avoid single points of failure, as well as to provide performance enhancement via load balancing. Multipathing has been implemented in all major operating systems, such as Linux (at different storage stack layers), Solaris, FreeBSD, and Windows. Multipathing has also been offered in third-party products such as Symantec Veritas and EMC PowerPath.

ESXi has in-box multipathing support in the Native Multipathing Plug-in (NMP), which has several different flavors of path selection algorithm (such as Round Robin, MRU, and Fixed) for different devices. ESXi also supports third-party multipathing plug-ins in the forms of PSP (Path Selection Plug-in), SATP (Storage Array Type Plug-in), and MPP (Multi-Pathing Plug-in), all under the ESXi PSA framework. EMC, NetApp, Dell EqualLogic, and others have developed their solutions on ESXi PSA. Most of the implementations do simple round robin among active paths, based on the number of completed IOs or transferred bytes for each path. Some third-party solutions, such as EMC's PowerPath, adopt complicated load-balancing algorithms but performance-wise are only at par with, or even worse than, VMware's NMP. Kiyoshi et al. proposed a dynamic load-balancing, request-based device-mapper multipath for Linux, but did not implement this feature.

    Figure 1.High-levelarchitectureoIOMultipathinginvSphere


3 Design and Implementation

Consider a highway and a local road that both lead to the same destination. When the traffic is bounded by a toll plaza at the destination, there is no point in routing traffic to the local road. However, if the toll plaza is removed, it starts to make sense to route a part of the traffic to the local road during rush hours, because the contention point has shifted. The same reasoning applies to the load-balancing strategy for ALUA storage arrays. When the array is the contention point, there is no point in routing IO requests to the non-optimized paths: their latency is higher, and host-side IO bandwidth is not the bound. However, when an array with flash is able to serve millions of IOPS, it is no longer the contention point, and the host-side IO pipes can become the contention point under heavy IO load. It then starts to make sense to route a part of the IO traffic to the un-optimized paths. Although the latency on the un-optimized paths is higher, when the optimized paths are saturated, using the un-optimized paths can still boost aggregate system IO performance, with increased IO throughput and IOPS. This should only be done during "rush hours," when the IO load is heavy and the optimized paths are saturated. A new plug-in, PSP_Adaptive, is implemented using this strategy.

3.1 Utilize Active/Non-optimized Paths for ALUA Systems

Figure 1 shows the high-level overview of multipathing in vSphere. NMP collapses all paths and presents only one logical device to the upper layers, which can be used to store virtual disks for VMs. When an IO is issued to the device, NMP queries the path selection plug-in, PSP_RR, to select a path on which to issue the IO. Internally, PSP_RR uses a simple round-robin algorithm to select the path for ALUA systems. Figure 2(a) shows the default path-selection algorithm: IOs are dispatched to all Active/Optimized paths alternately. Active/Un-optimized paths are not used even if the Active/Optimized paths are saturated. This approach wastes resources when the optimized paths are saturated.

To improve performance when the Active/Optimized paths are saturated, we spread WRITE IOs to the un-optimized paths. Even though the latency will be higher compared to IOs using the optimized paths, the aggregate system throughput and IOPS will be improved. Figure 2(b) illustrates the optimized path dispatching.

Figure 2. PSP Round Robin Policy
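The dispatch rule of Figure 2(b) can be made concrete with the following sketch, which extends the earlier round-robin fragment. As before, the types and names are illustrative rather than the actual PSP_Adaptive code, and the write_spread_on flag is driven by the triggers described in Section 3.3:

    /* Sketch of adaptive dispatching: reads stay on Active/Optimized paths,
     * writes are spread across all active paths while write spread is enabled.
     * Types and names are illustrative, not the actual PSA interfaces. */
    typedef enum { IO_READ, IO_WRITE } IoType;

    typedef struct {
        int id;
        int active_optimized;        /* 1 = Active/Optimized, 0 = Active/Un-optimized */
    } Path;

    typedef struct {
        Path *paths;
        int   num_paths;
        int   rr_opt;                /* round-robin cursor over optimized paths  */
        int   rr_all;                /* round-robin cursor over all active paths */
        int   write_spread_on;       /* toggled by the triggers in Section 3.3   */
    } Device;

    static Path *next_path(Device *dev, int *cursor, int optimized_only)
    {
        for (int i = 0; i < dev->num_paths; i++) {
            int idx = (*cursor + i) % dev->num_paths;
            if (!optimized_only || dev->paths[idx].active_optimized) {
                *cursor = (idx + 1) % dev->num_paths;
                return &dev->paths[idx];
            }
        }
        return 0;                    /* no usable path */
    }

    /* Reads always use optimized paths; writes use every active path during contention. */
    static Path *select_path_adaptive(Device *dev, IoType type)
    {
        if (type == IO_WRITE && dev->write_spread_on)
            return next_path(dev, &dev->rr_all, 0);
        return next_path(dev, &dev->rr_opt, 1);
    }

Section 3.2 explains why only writes are spread, and Section 3.3 describes when the write_spread_on flag is turned on and off.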

3.2 Write Spread Only

For any active/active (including ALUA) dual-controller array, each controller has its own cache for data blocks, for both reads and writes. On write requests, the controllers need to synchronize their caches with each other to guarantee data integrity. For reads, it is not necessary to synchronize the caches. As a result, reads have affinity to a particular controller while writes do not. Therefore, we assume that issuing writes on either controller is symmetric, while it is better to issue reads on the same controller so that the cache hit rate is higher.

Most workloads have many more reads than writes. However, with the increasing adoption of host-side flash caching, the actual IOs hitting the storage controllers are expected to have a much higher write/read ratio: a large portion of the reads will be served from the host-side flash cache, while all writes still hit the array. In such cases, spreading writes to the un-optimized paths helps lower the load on the optimized paths and thereby boosts system performance. Thus, in our optimized plug-in, only writes are spread to the un-optimized paths.

3.3 Spread Start and Stop Triggers

Because of the asymmetric performance between optimized and un-optimized paths, we should only spread IO to un-optimized paths when the optimized paths are saturated (i.e., when there is IO contention). Therefore, accurate IO contention detection is key. Another factor we need to consider is that the ALUA specification does not specify implementation details. Different ALUA arrays from different vendors can therefore have different ALUA implementations, and hence different behaviors when serving IO issued on the un-optimized paths. In our experiments, we found that at least one ALUA array shows unacceptable performance for IOs issued on the un-optimized paths. We need to take this into account and design PSP_Adaptive so that it can detect such behavior and stop routing WRITEs to the un-optimized paths if no IO performance improvement is observed. The following sections describe the implementation details.

3.3.1 I/O Contention Detection

We apply the same technique that SIOC (Storage IO Control) uses today to PSP_Adaptive for IO contention detection: IO latency thresholds. To avoid thrashing, two latency thresholds, t_a and t_b (t_a > t_b), are used to trigger the start and stop of write spread to non-optimized paths. PSP_Adaptive keeps monitoring the IO latency


for the optimized paths (t_o). If t_o exceeds t_a, PSP_Adaptive starts to trigger write spread; if t_o falls below t_b, PSP_Adaptive stops write spread. As with SIOC, the actual values of t_a and t_b are set by the user and can differ for different storage arrays.
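A minimal sketch of this start/stop hysteresis is shown below; the structure and field names are illustrative, and the thresholds are user-configured values rather than defaults taken from the actual implementation:

    /* Hysteresis on optimized-path latency: start spreading above t_a,
     * stop spreading below t_b (t_a > t_b), to avoid thrashing.
     * Illustrative sketch only, not the actual PSP_Adaptive code. */
    typedef struct {
        double t_a_us;               /* start threshold (microseconds), user-set */
        double t_b_us;               /* stop threshold, t_b < t_a, user-set      */
        int    write_spread_on;
    } SpreadTrigger;

    /* Called periodically with the observed latency t_o on the optimized paths. */
    static void update_spread_trigger(SpreadTrigger *s, double t_o_us)
    {
        if (!s->write_spread_on && t_o_us > s->t_a_us)
            s->write_spread_on = 1;  /* contention detected: start write spread */
        else if (s->write_spread_on && t_o_us < s->t_b_us)
            s->write_spread_on = 0;  /* load has dropped: stop write spread     */
    }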

3.3.2 Max I/O Latency Threshold

As described earlier, different storage vendors' ALUA implementations vary, and the IO performance on the un-optimized paths of some ALUA arrays can be very poor. For such arrays, we should not spread IO to the un-optimized paths. To handle such cases, we introduce a third threshold, the max IO latency t_c (t_c > t_a); latency higher than this value is unacceptable to the user. PSP_Adaptive monitors the IO latency on the un-optimized paths (t_uo) when write spread is turned on. If PSP_Adaptive detects t_uo exceeding t_c, it concludes that the un-optimized paths should not be used and stops write spread.

A simple on/off switch is also added as a configurable knob for administrators. If a user does not want to use un-optimized paths at all, an administrator can simply turn the feature off through the esxcli command. In such cases PSP_Adaptive behaves the same as PSP_RR, without spreading IO to un-optimized paths.
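Continuing the sketch above, the per-array latency ceiling and the administrator switch can be layered on top of the basic trigger. Again, the names are illustrative, and the real on/off knob is exposed through esxcli rather than a struct field:

    /* Additional guards from Section 3.3.2: a maximum tolerable latency t_c on the
     * un-optimized paths, and an administrator on/off switch for the whole feature.
     * Illustrative sketch only. */
    typedef struct {
        double t_c_us;               /* max tolerable un-optimized latency, t_c > t_a */
        int    admin_enabled;        /* 0 = behave exactly like PSP_RR                */
        int    write_spread_on;
    } SpreadGuard;

    /* Called with the observed latency t_uo on the un-optimized paths while spreading. */
    static void check_unoptimized_latency(SpreadGuard *g, double t_uo_us)
    {
        if (!g->admin_enabled) {
            g->write_spread_on = 0;  /* feature switched off by the administrator */
            return;
        }
        if (g->write_spread_on && t_uo_us > g->t_c_us)
            g->write_spread_on = 0;  /* this array serves un-optimized IO too poorly */
    }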

3.3.3 I/O Performance Improvement Detection

We want to spread IO to the un-optimized paths only if it improves aggregate system IOPS and/or throughput, so PSP_Adaptive continuously monitors the aggregate IOPS and throughput on all paths to the specific target. Detecting IO performance improvements is more complicated, however, since system load and IO patterns (e.g., block size) can change; for this reason, IO latency numbers alone cannot be used to decide whether system performance has improved. To handle this, we monitor and compare both IOPS and throughput. When the IO latency on the optimized paths exceeds the threshold t_a, PSP_Adaptive saves the current IOPS and throughput as reference values before it turns on write spread to the un-optimized paths. It then periodically checks whether the aggregate IOPS and/or throughput have improved by comparing them against the reference values. If they have not improved, it stops write spread; otherwise, no action is taken. To filter out noise, a minimum improvement in either IOPS or throughput is required before concluding that performance has improved.

Overall system performance is considered improved even if only one of the two measures (IOPS and throughput) improves, because an IO pattern change should not decrease both values simultaneously. For example, if IO block sizes go down, aggregate throughput may go down, but IOPS should go up. If system load goes up, both IOPS and throughput should go up with write spread. If both aggregate IOPS and throughput go down, PSP_Adaptive concludes that it is because system load is going down.

If system load goes down, the aggregate IOPS and throughput can go down as well and cause PSP_Adaptive to stop write spread. This is fine, because less system load means IO latency will improve. Unless the IO latency on the optimized paths exceeds t_a again, write spread will not be turned on again.
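The improvement check can be summarized by the following sketch; the minimum-gain margin is shown as a tunable parameter because the concrete value used by PSP_Adaptive is not given here:

    /* Improvement detection from Section 3.3.3: reference IOPS and throughput are
     * saved when write spread is turned on; spreading continues only if at least
     * one of the two has improved by a minimum margin. Illustrative sketch only. */
    typedef struct {
        double ref_iops;             /* saved just before write spread was enabled       */
        double ref_mbps;
        double min_gain;             /* required relative gain, e.g. 0.05 (illustrative) */
    } ImprovementCheck;

    /* Returns 1 if write spread should continue, 0 if it should be stopped. */
    static int spreading_still_helps(const ImprovementCheck *c,
                                     double cur_iops, double cur_mbps)
    {
        int iops_up = cur_iops > c->ref_iops * (1.0 + c->min_gain);
        int mbps_up = cur_mbps > c->ref_mbps * (1.0 + c->min_gain);
        /* An IO-pattern change (e.g., smaller blocks) should not push both
         * measures down at once, so one improving measure is enough. */
        return iops_up || mbps_up;
    }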

3.3.4 Impact on Other Hosts in the Same Cluster

Normally, one host utilizing the un-optimized paths could negatively affect other hosts that are connected to the same ALUA storage array. However, as explained in the earlier sections, the greatly boosted IO performance of new flash-based storage means the storage array is much less likely to become the contention point, even if multiple hosts are pumping heavy IO load to the array simultaneously. The negative impact is therefore negligible, and our performance benchmarks confirm this.

4 Evaluation and Analysis

All performance numbers are collected on one ESXi server; the physical machine configuration is listed in Table 1. Iometer running inside Windows VMs is used to generate the workload. Since vFlash and VFC were not available at the time the prototype was built, we used an Iometer load mixing random reads and random writes to simulate the effect of host-side caching, which changes the IO READ/WRITE ratio hitting the array.

CPU: Intel Xeon, cores, logical CPUs
Memory: GB DRAM
HBA: Dual-port Gbps HBA
Storage Array: ALUA-enabled array with LUNs on SSD, each LUN MB
FC Switch: Gbps FC switch

Table 1. Testbed Configuration

Figure 3 and Figure 4 compare the performance of PSP_RR and PSP_Adaptive when there is IO contention on the HBA port. By spreading WRITEs to the un-optimized paths during IO contention, PSP_Adaptive is able to increase aggregate system IOPS and throughput at the cost of slightly higher average WRITE latency. The aggregate system throughput improvement also increases with increasing IO block size.

Overall, the performance evaluation results show that PSP_Adaptive can increase aggregate system throughput and IOPS during IO contention, and that it is self-adaptive to workload changes.

Figure 3. Throughput: PSP_RR vs. PSP_Adaptive


5 Conclusion and Future Work

With the rapid adoption of flash, it is important to revisit some of the fundamental building blocks of the vSphere stack. IO multipathing is critical for scale and performance, and any improvements translate into an improved end-user experience as well as higher VM density on ESXi. In this paper, we explored an approach that challenges the old wisdom of multipathing that active un-optimized paths should not be used for load balancing. Our implementation showed that spreading writes across all paths has advantages during contention, but should be avoided during normal load scenarios. PSP_Adaptive is a PSP plug-in that we developed to adaptively switch load-balancing strategies based on system load.

Moving forward, we are working to further enhance the adaptive logic by introducing a path scoring attribute that ranks different paths based on IO latency, bandwidth, and other factors. The score is used to decide whether a specific path should be used under different system IO load conditions. Further, we want to decide the percentage of IO requests that should be dispatched to a certain path. We could also combine the path score with IO priorities by introducing priority queuing within PSA.

Another important storage trend is the emergence of active/active storage across metro distances; EMC's VPLEX [2] is the leading solution in this space. Similar to ALUA, such active/active storage arrays expose asymmetry in service times even across active optimized paths, due to unpredictable network latencies and the number of intermediate hops. An adaptive multipath strategy could be useful for overall performance there as well.

Figure 4. IOPS: PSP_RR vs. PSP_Adaptive

References

1. EMC PowerPath. http://www.emc.com/storage/powerpath/powerpath.htm
2. EMC VPLEX. http://www.emc.com/storage/vplex/vplex.htm
3. Multipath I/O. http://en.wikipedia.org/wiki/Multipath_I/O
4. Implementing vSphere Metro Storage Cluster using EMC VPLEX. http://kb.vmware.com/kb/2007545
5. Solaris SAN Configuration and Multipathing Guide, 2000. http://docs.oracle.com/cd/E19253-01/820-1931/820-1931.pdf
6. FreeBSD disk multipath control. http://www.freebsd.org/cgi/man.cgi?query=gmultipath&apropos=0&sektion=0&manpath=FreeBSD+7.0-RELEASE&format=html
7. Nimbus Data Unveils High-Performance Gemini Flash Arrays, 2012. http://www.crn.com/news/storage/240005857/nimbus-data-unveils-high-performance-gemini-flash-arrays-with-10-year-warranty.htm
8. Proximal Data's Caching Solutions Increase Virtual Machine Density in Virtualized Environments, 2012. http://www.businesswire.com/news/home/20120821005923/en/Proximal-Data%E2%80%99s-Caching-Solutions-Increase-Virtual-Machine
9. DM Multipath, 2012. https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html-single/DM_Multipath/index.html
10. EMC VFCache, Server Flash Cache, 2012. http://www.emc.com/about/news/press/2012/20120206-01.htm
11. Flash and Virtualization Take Center Stage at SNW for Avere, 2012. http://www.averesystems.com/
12. Multipathing policies in ESXi 4.x and ESXi 5.x, 2012. http://kb.vmware.com/kb/1011340
13. No-Compromise Storage for the Modern Datacenter, 2012. http://info.nimblestorage.com/rs/nimblestorage/images/Nimble_Storage_Overview_White_Paper.pdf
14. Pure Storage: 100% flash storage array: Less than the cost of spinning disk, 2012. http://www.purestorage.com/flash-array/
15. Symantec Veritas Dynamic Multi-Pathing, 2012. http://www.symantec.com/docs/DOC5811
16. Violin Memory, 2012. http://www.violin-memory.com/
17. VMware vFlash framework, 2012. http://blogs.vmware.com/vsphere/2012/12/virtual-flash-vflash-tech-preview.html


18. Who is WHIPTAIL?, 2012. http://whiptail.com/papers/who-is-whiptail/
19. Windows Multipath I/O, 2012. http://technet.microsoft.com/en-us/library/cc725907.aspx
20. XtremIO, 2012. http://www.xtremio.com/
21. NetApp Quietly Absorbs CacheIQ, Nov. 2012. http://www.networkcomputing.com/storage-networkingmanagement/netapp-quietly-absorbs-cacheiq/240142457
22. SolidFire Reveals New Arrays for White Hot Flash Market, Nov. 2012. http://siliconangle.com/blog/2012/11/13/solidfire-reveals-new-arrays-for-white-hot-flash-market//
23. Gridiron Intros New Flash Storage Appliance, Hybrid Flash Array, Oct. 2012. http://www.crn.com/news/storage/240008223/gridiron-introsnew-flash-storage-appliance-hybrid-flash-array.htm
24. Byan, S., Lentini, J., Madan, A., and Pabon, L. Mercury: Host-side Flash Caching for the Data Center. In MSST (2012), IEEE, pp. 1–12.
25. EMC ALUA System. http://www.emc.com/collateral/hardware/white-papers/h2890-emc-clariion-asymm-active-wp.pdf
26. Kiyoshi Ueda, Junichi Nomura, M. C. Request-based Device-mapper Multipath and Dynamic Load Balancing. In Proceedings of the Linux Symposium (2007).
27. Luo, J., Shu, J.-W., and Xue, W. Design and Implementation of an Efficient Multipath for a SAN Environment. In Proceedings of the 2005 International Conference on Parallel and Distributed Processing and Applications (Berlin, Heidelberg, 2005), ISPA'05, Springer-Verlag, pp. 101–110.
28. Michael Anderson, P. M. SCSI Mid-Level Multipath. In Proceedings of the Linux Symposium (2003).
29. Ueda, K. Request-based dm-multipath, 2008. http://lwn.net/Articles/274292/


Methodology for Performance Analysis of VMware vSphere under Tier-1 Applications

Jeffrey Buell, VMware Performance, jbuell@vmware.com
Daniel Hecht, VMware Hypervisor Group, dhecht@vmware.com
Jin Heo, VMware Performance, heoj@vmware.com
Kalyan Saladi, VMware Performance, ksaladi@vmware.com
H. Reza Taheri, VMware Performance, rtaheri@vmware.com

Abstract

With the recent advances in virtualization technology, both in the hypervisor software and in the processor architectures, one can state that VMware vSphere runs virtualized applications at near-native performance levels. This is certainly true against a baseline of the early days of virtualization, when reaching even half the performance of native systems was a distant goal. However, this near-native performance has made the task of root-causing any remaining performance differences more difficult. In this paper, we present a methodology for a more detailed examination of the effects of virtualization on performance, and present sample results. We look at the performance on vSphere of an OLTP workload, a typical Hadoop application, and low-latency applications. We analyze the performance and highlight the areas that cause an application to run more slowly on a virtualized server. The pioneers of the early days of virtualization invented a battery of tools to study and optimize performance. We show that, as the gap between virtual and native performance has closed, these traditional tools are no longer adequate for detailed investigations. One of our novel contributions is combining the traditional tools with hardware monitoring facilities to see how the processor execution profile changes on a virtualized server.

We show that the increase in translation lookaside buffer (TLB) miss handling costs due to the hardware-assisted memory management unit (MMU) is the largest contributor to the performance gap between native and virtual servers. The TLB miss ratio also rises on virtual servers, further increasing the miss processing costs. Many of the other performance differences relative to native (e.g., additional data cache misses) are also due to the heavier TLB miss processing behaviour of virtual servers. Depending on the application, part of the performance difference is also due to the time spent in the hypervisor kernel. This is expected, as all networking and storage IO has to get processed twice: once in the virtual device in the guest, and once in the hypervisor kernel. The hypervisor's virtual machine monitor (VMM) is responsible for only a small share of the overall time. In other words, the VMM, which was responsible for much of the virtualization overhead of the early hypervisors, is now a small contributor to virtualization overhead. The results point to new areas, such as TLBs and address translation, to work on in order to further close the gap between virtual and native performance.

Categories and Subject Descriptors: C.4 [Computer Systems Organization]: Performance of Systems—Design studies, Measurement techniques, Performance attributes

General Terms: Measurement, Performance, Design, Experimentation

Keywords: Tier-1 applications, database performance, hardware counters

1 Introduction

Over the past years, the performance of applications running on vSphere has gone from a small fraction of native in the very early days of ESX¹ to what is commonly called near-native performance, a term that has come to imply a very small overhead. Many tools and methods were invented along the way to measure and tune performance in virtual environments []. But with overheads at such low levels, drilling down into the source of performance differences between native and virtual has become very difficult.

The genesis of this work was a project to study the performance of vSphere running Tier-1 applications, in particular a relational database running an Online Transaction Processing (OLTP) workload, which is commonly considered a very heavy workload. Along the way, we discovered that existing tools could not account for the difference between native and virtual performance. At this point, we turned to hardware event monitoring tools and combined them with software profiling tools to drill down into the performance overheads. Later measurements with Hadoop and latency-sensitive applications showed that the same methodology and tools can be used for investigating the performance of those applications, and furthermore, that the sources of virtualization performance overhead are similar for all these workloads. This paper describes the tools and the methodology for such measurements.

1 When VMware introduced its server-class, type 1 hypervisor in 2001, it was called ESX. The name changed to vSphere with release 4.1 in 2009.

vSphere hypervisor

VMware vSphere contains various components, including the virtual machine monitor (VMM) and the VMkernel. The VMM implements the virtual hardware that runs a virtual machine. The virtual hardware comprises the virtual CPUs, timers, and devices. A virtual CPU includes a Memory Management Unit (MMU), which


2.2.2 Hadoop Workload Configuration

• Cluster of HP DL servers
• Two 3.6GHz 4-core Intel Xeon X5687 (Westmere-EP) processors with hyper-threading enabled
• 72GB 1333MHz memory, 68GB used for the VM
• 16 local 15K RPM SAS drives, 12 used for Hadoop data
• Broadcom 10GbE NICs connected together in a flat topology through an Arista switch
• Software: vSphere 5.1, RHEL 6.1, CDH4.1.1 with MRv1
• vSphere network driver bnx2x, upgraded for best performance

2.2.3 Latency-Sensitive Workload Configuration

The testbed consists of one server machine for serving RR (request-response) workload requests and one client machine that generates the RR requests.

• Both the server machine and the client machine are configured with dual-socket, quad-core Intel Xeon processors.
• Both machines are equipped with a 10GbE Broadcom NIC. Hyper-threading is not used.
• Two different configurations are used. A native configuration runs native RHEL Linux on both the client and server machines, while a VM configuration runs vSphere on the server machine and native RHEL on the client machine. The vSphere host runs a VM with the same version of RHEL Linux. For both the native and VM configurations, only one CPU (one VCPU for the VM configuration) is used, to remove the impact of parallel processing and discard any multi-CPU-related overhead.
• The client machine is used to generate RR workload requests, and the server machine is used to serve the requests and send replies back to the client in both configurations. VMXNET3 is used for the virtual NICs in the VM configuration. A 10 Gigabit Ethernet switch is used to interconnect the two machines.

2.3 Tools

Obviously, tools are critical in a study like this.

• We collected data from the usual Linux performance tools, such as mpstat, iostat, sar, numastat, and vmstat.
• With the later RHEL 6.1 experiments, we used the Linux perf() [] facility to profile the native/guest application.
• Naturally, in a study of virtualization overhead, vSphere tools are the most widely used. Esxtop is commonly used to monitor ESXi performance.
• We relied heavily on Oracle statspack (also known as AWR) stats for the OLTP workload.
• We collected and analyzed Unisphere performance logs from the arrays to study any possible bottlenecks in the storage subsystem for the OLTP workload.
• Hadoop maintains extensive statistics on its own execution. These are used to analyze how well balanced the cluster is, and the execution times of various tasks and phases.

2.3.1 Hardware Counters

The above tools are commonly used by vSphere performance analysts inside and outside VMware to study the performance of vSphere applications, and we made extensive use of them. But one of the key take-aways of this study is that, to really understand the sources of virtualization overhead, these tools are not enough. We needed to go one level deeper and augment the traditional tools with tools that reveal the hardware execution profile.

Recent processors have expanded the categories of hardware events that software can monitor []. But using these counters requires:

1. A tool to collect the data
2. A methodology for choosing from among the thousands of available events, and for combining the event counts to derive statistics for events that are not directly monitored
3. Meaningful interpretation of the results

All of the above typically require a close working relationship with the microprocessor vendors, which we relied on heavily to collect and analyze the data in this study.

Processor architecture has become increasingly complex, especially with the advent of multiple cores residing in a NUMA node, typically with a shared last-level cache. Analyzing application execution on these processors is a demanding task and necessitates examining how the application interacts with the hardware. Processors come equipped with hardware performance monitoring units (PMUs) that enable efficient monitoring of hardware-level activity without significant performance overheads imposed on the application/OS. While the implementation details of a PMU can vary from processor to processor, two common types of PMU are Core and Uncore. Each processor core has its own PMU, and in the case of hyper-threading, each logical core appears to have a dedicated core PMU all by itself.

A typical Core PMU consists of one or more counters that are capable of counting events occurring in different hardware components, such as the CPU, cache, or memory. The performance counters can be controlled individually or as a group; the controls offered are usually enable, disable, and detect overflow (via generation of an interrupt), to name a few, and the counters have different modes (user/kernel). Each performance counter can monitor one event at a time, and some processors restrict or limit the type of events that can be monitored from a given counter. The counters and their control registers can be accessed through special registers (MSRs) and/or through the PCI bus. We will look into the details of the PMUs supported by the two main x86 processor vendors, AMD and Intel.
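On Linux, these core-PMU counters are exposed to user space through the perf_event_open(2) system call, the same facility that the perf tool mentioned above is built on. The following self-contained C sketch is purely illustrative (it is not the vendor-specific tooling used in this study); it counts data-TLB load misses around a region of interest, the kind of event that matters for the TLB analysis discussed in this paper:

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>
    #include <stdint.h>

    /* glibc provides no wrapper, so call the syscall directly. */
    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HW_CACHE;          /* count data-TLB load misses */
        attr.config = PERF_COUNT_HW_CACHE_DTLB |
                      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
        attr.disabled = 1;                       /* start stopped, enable explicitly */
        attr.exclude_kernel = 1;                 /* user-mode only for this example  */

        int fd = (int)perf_event_open(&attr, 0 /* this process */, -1 /* any CPU */,
                                      -1 /* no group */, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* Region of interest: touch memory page by page to generate TLB activity. */
        static char buf[64 * 1024 * 1024];
        for (size_t i = 0; i < sizeof(buf); i += 4096)
            buf[i] = (char)i;

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t misses = 0;
        if (read(fd, &misses, sizeof(misses)) == sizeof(misses))
            printf("dTLB load misses: %llu\n", (unsigned long long)misses);
        close(fd);
        return 0;
    }

The measurements in this paper rely on much richer, vendor-specific event sets than this single counter, but the control flow is the same: program a counter, run the region of interest, then read back the count.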

