Post on 22-Apr-2020
Networks and Operating Systems
Chapter 21: Virtual Machine Monitors
(252-0062-00)
Donald Kossmann & Torsten Hoefler, Spring Semester 2013
© Systems Group | Department of Computer Science | ETH Zürich
Last time: I/O
• Network stack implementation
• Network devices and network I/O
• Memory management in the I/O subsystem
• Performance issues
  – Buffering
  – Multiple queues and receive-side scaling
This time: Virtual Machine Monitors
• Basic definitions
• Why would you want one?
• Structure
• How does it work?
  – CPU
  – MMU
  – Memory
  – Devices
  – Network
• Acknowledgement: Thanks to Steve Hand for some of the slides!
What is a Virtual Machine Monitor?
• Virtualizes an entire (hardware) machine
  – Contrast with OS processes
  – Interface provided is “illusion of real hardware”
  – Applications are therefore complete Operating Systems themselves
  – Terminology: Guest Operating Systems
• Old idea: IBM VM/CMS (1960s)
  – Recently revived: VMware, Xen, Hyper-V, kvm, etc.
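VMMs and Hypervisors
[Diagram: real hardware at the bottom; a hypervisor above it; two VMMs on the hypervisor, each hosting a guest operating system that runs two apps. The VMM layer is labelled “Creates illusion of hardware”.]
Some folks distinguish the Virtual Machine Monitor from the Hypervisor (we won’t)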
Why would you want one?
• Server consolidation (program assumes own machine)
• Performance isolation
• Backward compatibility
• Cloud computing (unit of selling cycles)
• Something under the OS: replay, auditing, trusted computing, rootkits
Running multiple OSes on one machine
• Application compatibility
  – I use Ubuntu for almost everything, but I edit slides in PowerPoint
  – Some people compile Barrelfish in a Debian VM over Windows 7 with Hyper-V
• Backward compatibility
  – Nothing beats a Windows 98 virtual machine for playing old computer games
[Diagram: real hardware, a hypervisor above it, and six apps running on top.]
Server consolidation
• Many applications assume they have the machine to themselves
• Each machine is mostly idle
⇒ Consolidate servers onto a single physical machine
[Diagram: three applications on one hypervisor on real hardware.]
Resource isolation
• Surprisingly, modern OSes do not have an abstraction for a single application
• Performance isolation can be critical in some enterprises
• Use virtual machines as resource containers
[Diagram: three applications on one hypervisor on real hardware.]
Cloud computing
• Selling computing capacity on demand
  – E.g. Amazon EC2, GoGrid, etc.
• Hypervisors decouple allocation of resources (VMs) from provisioning of infrastructure (physical machines)
[Diagram: six physical machines, each running a hypervisor with two applications.]
Operating System development
• Building and testing a new OS without needing to reboot real hardware
• VMM often gives you more information about faults than real hardware anyway
[Diagram: compiler, editor, and Visual Studio running on a hypervisor on real hardware.]
Other cool applications…
• Tracing
• Debugging
• Execution replay
• Lock-step execution
• Live migration
• Rollback
• Speculation
• Etc.…
[Diagram: a tracer and two applications running on a hypervisor on real hardware.]
How does it all work?
• Note: a hypervisor is basically an OS
  – With an “unusual API”
• Many functions quite similar:
  – Multiplexing resources
  – Scheduling, virtual memory, device drivers
• Different:
  – Creating the illusion of hardware to “applications”
  – Guest OSes are less flexible in resource requirements
Hosted VMMs
[Diagram: real hardware; a host operating system above it; native applications run on the host OS alongside a VMM, which hosts a guest operating system running two apps.]
Examples:
• VMware Workstation
• Linux KVM
• Microsoft Hyper-V
Hypervisor-based VMMs
[Diagram: real hardware; a hypervisor above it; three VMMs on the hypervisor host a console (management) operating system running the management console, and two guest operating systems running two apps each.]
Examples:
• VMware ESX
• IBM VM/CMS
• Xen
How to virtualize…
• The CPU(s)?
• The MMU?
• Physical memory?
• Devices (disks, etc.)?
• The network?
Virtualizing the CPU
• A CPU architecture is strictly virtualizable if it can be perfectly emulated over itself, with all non-privileged instructions executed natively
• Privileged instructions ⇒ trap
  – Kernel mode (i.e. the VMM) emulates the instruction
  – Guest’s kernel mode is actually user mode
    • Or another, extra privilege level (such as ring 1)
• Examples: IBM S/390, Alpha, PowerPC
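Trap-and-emulate can be sketched in a few lines. This is a toy model, not any real VMM: the instruction names (`ADD`, `CLI`, `STI`, `LOAD_PT_BASE`) and the `VirtualCPU` fields are assumptions made up for illustration. Unprivileged instructions run directly; privileged ones raise a trap, and the VMM applies their effect to per-guest virtual state rather than to the real machine.

```python
from dataclasses import dataclass

# Hypothetical toy instruction set; these names are illustrative, not real opcodes.
PRIVILEGED = {"CLI", "STI", "LOAD_PT_BASE"}


class PrivilegeTrap(Exception):
    """Raised by the toy CPU when de-privileged code issues a privileged op."""

    def __init__(self, op, arg):
        super().__init__(op)
        self.op = op
        self.arg = arg


@dataclass
class VirtualCPU:
    """Privileged state the VMM keeps on behalf of one guest."""
    interrupts_enabled: bool = True
    pt_base: int = 0
    acc: int = 0  # stands in for ordinary unprivileged register state


def execute_natively(vcpu, op, arg):
    """Toy 'hardware': runs unprivileged ops directly, traps on privileged ones."""
    if op in PRIVILEGED:
        raise PrivilegeTrap(op, arg)
    if op == "ADD":
        vcpu.acc += arg
    else:
        raise ValueError(f"unknown instruction {op!r}")


def vmm_emulate(vcpu, trap):
    """VMM trap handler: apply the instruction's effect to the virtual CPU only."""
    if trap.op == "CLI":
        vcpu.interrupts_enabled = False
    elif trap.op == "STI":
        vcpu.interrupts_enabled = True
    elif trap.op == "LOAD_PT_BASE":
        vcpu.pt_base = trap.arg


def run_guest(vcpu, program):
    """Run a guest instruction stream; return how many traps the VMM handled."""
    traps = 0
    for op, arg in program:
        try:
            execute_natively(vcpu, op, arg)
        except PrivilegeTrap as trap:
            vmm_emulate(vcpu, trap)
            traps += 1
    return traps


guest_kernel = [("ADD", 2), ("CLI", None), ("ADD", 3),
                ("LOAD_PT_BASE", 0x1000), ("STI", None)]
vcpu = VirtualCPU()
trap_count = run_guest(vcpu, guest_kernel)
print(trap_count, vcpu)
```

The guest kernel never notices that it is running de-privileged: every privileged instruction still has its expected effect, just on the virtual CPU.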
Virtualizing the CPU
• A strictly virtualizable processor can execute a complete native Guest OS
  – Guest applications run in user mode as before
  – Guest kernel works exactly as before
• Problem: x86 architecture is not virtualizable
  – About 20 instructions are sensitive but not privileged
  – Mostly segment loads and processor flag manipulation
Non-virtualizable x86: example
• PUSHF/POPF instructions
  – Push/pop condition code register
  – Includes interrupt enable flag (IF)
• Unprivileged instructions: fine in user space!
  – IF is ignored by POPF in user mode, not in kernel mode
⇒ VMM can’t determine if Guest OS wants interrupts disabled!
  – Can’t cause a trap on a (privileged) POPF
  – Prevents correct functioning of the Guest OS
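The POPF problem can be made concrete with a small model. This is a simplified sketch: only IF (bit 9 of EFLAGS) is treated as sensitive, and the function name `popf` and its return shape are assumptions for illustration. The point is that the user-mode case neither changes IF nor traps, so a de-privileged guest kernel silently gets the wrong result and the VMM is never told.

```python
IF_FLAG = 1 << 9  # interrupt-enable flag: bit 9 of EFLAGS
CF_FLAG = 1 << 0  # carry flag, standing in for the ordinary condition codes


def popf(current_flags, popped_value, kernel_mode):
    """Toy POPF. Returns (new_flags, trapped).

    Simplification: only IF is treated as sensitive; every other bit is
    loaded from the popped value unconditionally.
    """
    if kernel_mode:
        return popped_value, False
    # User mode: IF keeps its old value and, crucially, no trap is raised.
    new_flags = (popped_value & ~IF_FLAG) | (current_flags & IF_FLAG)
    return new_flags, False


flags_before = IF_FLAG  # interrupts currently enabled
wanted = CF_FLAG        # the kernel pops a value with IF cleared

native_flags, native_trap = popf(flags_before, wanted, kernel_mode=True)
guest_flags, guest_trap = popf(flags_before, wanted, kernel_mode=False)

print("native kernel: IF set?", bool(native_flags & IF_FLAG))
print("guest kernel in user mode: IF set?", bool(guest_flags & IF_FLAG),
      "trapped?", guest_trap)
```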
Solutions
1. Emulation: emulate all kernel-mode code in software
  – Very slow, particularly for I/O-intensive workloads
  – Used by, e.g., SoftPC
2. Paravirtualization: modify Guest OS kernel
  – Replace sensitive instructions with an explicit trap instruction to the VMM
  – Also called a “HyperCall” (used for all kinds of things)
  – Used by, e.g., Xen
3. Binary rewriting:
  – Protect kernel instruction pages, trap to VMM on first IFetch
  – Scan page for POPF instructions and replace
  – Restart instruction in Guest OS and continue
  – Used by, e.g., VMware
4. Hardware support: Intel VT-x, AMD-V
  – Extra processor mode causes POPF to trap
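The binary rewriting steps (option 3) might look roughly like this. It is a deliberately simplified sketch: a real rewriter works on raw machine code and must decode instruction boundaries, whereas here a page is assumed to be a list of already-decoded mnemonics. `GuestCodePage`, `TRAP_POPF`, and the `stats` dictionary are hypothetical names.

```python
class GuestCodePage:
    """A page of guest kernel code, already decoded into mnemonics for simplicity."""

    def __init__(self, instructions):
        self.instructions = list(instructions)
        self.scanned = False  # False means the VMM still has the page protected


def rewrite_page(page):
    """Replace every POPF with a stand-in for an explicit trap into the VMM."""
    patched = 0
    for i, insn in enumerate(page.instructions):
        if insn == "POPF":
            page.instructions[i] = "TRAP_POPF"  # hypothetical replacement name
            patched += 1
    page.scanned = True  # unprotect: later fetches run without VMM involvement
    return patched


def fetch(page, index, stats):
    """Instruction fetch: the first fetch from a page traps so the VMM can scan it."""
    if not page.scanned:
        stats["first_fetch_traps"] += 1
        stats["patched"] += rewrite_page(page)
    return page.instructions[index]  # restart the fetch on the rewritten page


page = GuestCodePage(["PUSHF", "ADD", "MOV", "POPF", "RET"])
stats = {"first_fetch_traps": 0, "patched": 0}
executed = [fetch(page, i, stats) for i in range(len(page.instructions))]
print(executed, stats)
```

The cost is paid once per page: only the first fetch traps, and every later fetch sees the patched code directly.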
Virtualizing the MMU
• Hypervisor allocates memory to VMs
  – Guest assumes control over all physical memory
  – VMM can’t let the Guest OS install mappings
• Definitions needed:
  – Virtual address: a virtual address in the guest
  – Physical address: as seen by the guest
  – Machine address: real physical address
    • As seen by the Hypervisor
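Virtual/Physical/Machine
[Diagram: for Guest 1 and Guest 2, pages in the Guest Virtual address space map to pages in the Guest Physical address space, which in turn map to frames in Machine Memory. Page numbers shown: 5, 5, 9, 2, 6, 17.]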
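The three address kinds compose as two lookups. A minimal sketch, assuming 4 KiB pages and flat dictionaries in place of real multi-level page tables; the page numbers used below are made up for the example and are not read off the diagram.

```python
PAGE_SIZE = 4096  # assumed page size


def translate(addr, table):
    """Translate one address through a flat page-number -> page-number table."""
    page, offset = divmod(addr, PAGE_SIZE)
    if page not in table:
        raise KeyError(f"page {page} is not mapped")
    return table[page] * PAGE_SIZE + offset


def guest_virtual_to_machine(va, guest_pt, p2m):
    """Guest virtual -> guest physical (guest's table) -> machine (VMM's table)."""
    pa = translate(va, guest_pt)
    return translate(pa, p2m)


guest_pt = {5: 2}  # hypothetical: guest virtual page 5 -> guest physical page 2
p2m = {2: 17}      # hypothetical: guest physical page 2 -> machine frame 17

va = 5 * PAGE_SIZE + 0x123
ma = guest_virtual_to_machine(va, guest_pt, p2m)
print(hex(va), "->", hex(ma))
```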
MMU Virtualization
• Critical for performance, challenging to make fast, especially SMP
  – Hot-unplug unnecessary virtual CPUs
  – Use multicast TLB flush paravirtualizations etc.
• Xen supports 3 MMU virtualization modes
  1. Direct (“Writable”) page tables
  2. Shadow page tables
  3. Hardware Assisted Paging
• OS paravirtualization compulsory for #1, optional (and very beneficial) for #2 & 3
Paravirtualization approach
• Guest OS creates page tables the hardware uses
  – VMM must validate all updates to page tables
  – Requires modifications to Guest OS
  – Not quite enough…
• VMM must check all writes to PTEs
  – Write-protect all PTEs to the Guest kernel
  – Add a HyperCall to update PTEs
  – Batch updates to avoid trap overhead
  – OS is now aware of machine addresses
  – Significant overhead!
Para-Virtualizing the MMU
• Guest OSes allocate and manage own PTs
  – Hypercall to change PT base
• VMM must validate PT updates before use
  – Allows incremental updates, avoids revalidation
• Validation rules applied to each PTE:
  – 1. Guest may only map pages it owns*
  – 2. Page table pages may only be mapped RO
• VMM traps PTE updates and emulates, or ‘unhooks’ PTE page for bulk updates
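The two validation rules above can be sketched as a batched hypercall. Everything here is hypothetical: `PagingVMM`, `hypercall_update_ptes`, and the flat per-guest tables are illustrative names and structures, not Xen's actual interface. The batch is validated as a whole before anything is applied, so a rejected batch leaves the page table untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PTE:
    frame: int      # machine frame number the entry points at
    writable: bool


class ValidationError(Exception):
    pass


class PagingVMM:
    def __init__(self, owner_of, page_table_frames):
        self.owner_of = dict(owner_of)                   # machine frame -> owning guest
        self.page_table_frames = set(page_table_frames)  # frames holding page tables
        self.page_tables = {}                            # guest -> {virtual page: PTE}

    def validate(self, guest, pte):
        if self.owner_of.get(pte.frame) != guest:
            raise ValidationError(f"rule 1: frame {pte.frame} not owned by {guest}")
        if pte.frame in self.page_table_frames and pte.writable:
            raise ValidationError(f"rule 2: frame {pte.frame} is a page table page")

    def hypercall_update_ptes(self, guest, updates):
        """One trap for a whole batch: validate every entry, then apply them all."""
        for _, pte in updates:
            self.validate(guest, pte)
        table = self.page_tables.setdefault(guest, {})
        for vpage, pte in updates:
            table[vpage] = pte
        return len(updates)


def try_update(vmm, guest, updates):
    try:
        vmm.hypercall_update_ptes(guest, updates)
        return "ok"
    except ValidationError as err:
        return str(err)


vmm = PagingVMM(owner_of={10: "g1", 11: "g1", 20: "g2"}, page_table_frames={11})
good = try_update(vmm, "g1", [(0, PTE(10, True)), (1, PTE(11, False))])
foreign = try_update(vmm, "g1", [(2, PTE(20, True))])
writable_pt = try_update(vmm, "g1", [(3, PTE(10, True)), (4, PTE(11, True))])
print(good, "|", foreign, "|", writable_pt)
```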
Writeable Page Tables: 1 – Write fault
[Diagram: layers Guest OS, VMM, Hardware (MMU). Guest reads go directly to the Virtual → Machine page table; the first guest write causes a page fault into the VMM.]
Writeable Page Tables: 2 – Emulate?
[Diagram: on the first guest write, the VMM decides whether to emulate the write (“emulate?” → “yes”); guest reads still go directly to the Virtual → Machine page table.]
Writeable Page Tables: 3 – Unhook
[Diagram: the page-table page is unhooked (X) from the Virtual → Machine page table used by the MMU; guest reads and guest writes now go to it directly.]
Writeable Page Tables: 4 – First Use
[Diagram: with the page still unhooked (X), its first use causes a page fault into the VMM; guest reads and guest writes continue to go to it directly.]
Writeable Page Tables: 5 – Re-hook
[Diagram: the VMM validates the page and re-hooks it into the Virtual → Machine page table used by the MMU.]
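Writeable page tables require paravirtualization
[Diagram: Guest 1 and Guest 2 map their Guest Virtual address spaces straight onto Machine Memory, with no Guest Physical layer in between. Page numbers shown: 5, 5, 9, 2, 6, 17.]
Guests directly share Machine Memory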
Shadow Page Tables
• Guest OS sets up its own page tables
  – Not used by the hardware!
• VMM maintains shadow page tables
  – Map directly from Guest VAs to Machine Addresses
  – Hardware switched whenever Guest reloads PTPR
• VMM must keep V→M table consistent with Guest V→P table and its own P→M table
  – VMM write-protects all guest page tables
  – Write ⇒ trap: apply write to shadow table as well
  – Significant overhead!
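Keeping the shadow table consistent amounts to composing V→P with P→M on every trapped guest write. A minimal sketch, with flat dictionaries standing in for real page tables; `ShadowPager` and its method names are hypothetical.

```python
class ShadowPager:
    def __init__(self, p2m):
        self.p2m = dict(p2m)   # guest-physical page -> machine frame (VMM's own table)
        self.guest_pt = {}     # virtual page -> guest-physical page (guest's view)
        self.shadow_pt = {}    # virtual page -> machine frame (what the MMU walks)
        self.write_traps = 0

    def guest_writes_pte(self, vpage, ppage):
        """Trap handler for a guest write to its write-protected page table."""
        self.write_traps += 1
        if ppage is None:  # guest cleared the entry
            self.guest_pt.pop(vpage, None)
            self.shadow_pt.pop(vpage, None)
            return
        if ppage not in self.p2m:
            raise KeyError(f"guest-physical page {ppage} has no machine frame")
        self.guest_pt[vpage] = ppage              # apply the guest's write...
        self.shadow_pt[vpage] = self.p2m[ppage]   # ...and mirror it into the shadow

    def consistent(self):
        """Invariant: shadow == (P->M) composed with (V->P)."""
        return self.shadow_pt == {v: self.p2m[p] for v, p in self.guest_pt.items()}


pager = ShadowPager({2: 17, 6: 9})  # hypothetical P->M mappings
pager.guest_writes_pte(5, 2)
pager.guest_writes_pte(7, 6)
after_map = dict(pager.shadow_pt)
pager.guest_writes_pte(5, None)
print(after_map, pager.shadow_pt, pager.write_traps)
```

Every page-table write costs a trap in this scheme, which is where the overhead noted above comes from.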
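Shadow Page Tables
[Diagram: for Guest 1 and Guest 2, the Guest Virtual address space maps through the Guest Physical address space to Machine Memory; the shadow page table mappings go directly from Guest Virtual to Machine Memory. Page numbers shown: 5, 5, 9, 2, 6, 17.]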
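Shadow page tables
[Diagram: layers Guest OS, VMM, Hardware (MMU). Guest reads and guest writes go to the guest’s Virtual → Guest-Physical table; the VMM propagates updates into the Virtual → Machine table used by the MMU, and accessed and dirty bits flow back.]
• Guest changes optional, but help with batching, knowing when to unshadow
• Latest algorithms work remarkably well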
Hardware support
• “Nested page tables”
  – Relatively new in AMD (NPT) and Intel (EPT) hardware
• Two-level translation of addresses in the MMU
  – Hardware knows about:
    • V→P tables (in the Guest)
    • P→M tables (in the Hypervisor)
  – Tagged TLBs to avoid expensive flush on a VM entry/exit
• Very nice and easy to code to
  – One reason kvm is so small
• Significant performance overhead…
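One source of that overhead is the cost of a TLB miss. A rough worst-case count, assuming no TLB or page-walk-cache hits: each step of the guest's V→P walk reads a guest-physical address that must itself be walked through the P→M table, and so must the final guest-physical address. The function names are illustrative.

```python
def native_walk_refs(levels):
    """Memory references for one page walk without virtualization."""
    return levels


def nested_walk_refs(guest_levels, host_levels):
    """Worst-case memory references for one two-dimensional (nested) page walk."""
    # Each guest level: translate the table's guest-physical address (host_levels
    # references), then read the guest page-table entry itself (1 reference).
    per_guest_level = host_levels + 1
    # Finally, translate the resulting guest-physical data address.
    return guest_levels * per_guest_level + host_levels


native = native_walk_refs(4)
nested = nested_walk_refs(4, 4)
print(native, nested)
```

With four-level tables on both sides this gives 24 references instead of 4, which is why large TLBs and page-walk caches matter so much for nested paging.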
Memory allocation
• Guest OS is not expecting physical memory to change in size!
• Two problems:
  – Hypervisor wants to overcommit RAM
  – How to reallocate (machine) memory between VMs
• Phenomenon: Double Paging
  – Hypervisor pages out memory
  – Guest OS decides to page out physical frame
  – (Unwittingly) faults it in via the Hypervisor, only to write it out again
Ballooning
• Technique to reclaim memory from a Guest
• Install a “balloon driver” in Guest kernel
  – Can allocate and free kernel physical memory
    • Just like any other part of the kernel
  – Uses HyperCalls to return frames to the Hypervisor, and have them returned
    • Guest OS is unaware, simply allocates physical memory
Ballooning: taking RAM away from a VM
1. VMM asks balloon driver for memory
2. Balloon driver asks Guest OS kernel for more frames
  – “inflates the balloon”
3. Balloon driver sends physical frame numbers to VMM
4. VMM translates into machine addresses and claims the frames
[Diagram: the guest physical address space, with the balloon driver and the physical memory it has claimed (the balloon).]
Returning RAM to a VM
1. VMM converts machine address into a physical address previously allocated by the balloon driver
2. VMM hands PFN to balloon driver
3. Balloon driver frees physical frame back to Guest OS kernel
  – “deflates the balloon”
[Diagram: the guest physical address space with the balloon driver and a smaller balloon.]
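Both directions of ballooning can be sketched together. This is a toy model under stated assumptions: `GuestKernel`, `BalloonVMM`, and `BalloonDriver` are hypothetical classes, plain method calls stand in for hypercalls, and the guest's allocator is just a set of free physical frame numbers (PFNs).

```python
class GuestKernel:
    """Guest's physical frame allocator (unaware of the VMM)."""

    def __init__(self, pfns):
        self.free_pfns = set(pfns)

    def alloc_frame(self):
        return self.free_pfns.pop()

    def free_frame(self, pfn):
        self.free_pfns.add(pfn)


class BalloonVMM:
    def __init__(self, p2m):
        self.p2m = dict(p2m)       # guest PFN -> machine frame
        self.spare_mfns = []       # machine frames reclaimed from this guest
        self.ballooned_pfns = []   # guest PFNs currently inside the balloon

    def claim(self, pfns):
        """Translate each PFN to its machine frame and take the frame away."""
        for pfn in pfns:
            self.spare_mfns.append(self.p2m.pop(pfn))
            self.ballooned_pfns.append(pfn)

    def release_one(self):
        """Back a ballooned PFN with a machine frame again; return the PFN."""
        pfn = self.ballooned_pfns.pop()
        self.p2m[pfn] = self.spare_mfns.pop()
        return pfn


class BalloonDriver:
    def __init__(self, kernel, vmm):
        self.kernel = kernel
        self.vmm = vmm

    def inflate(self, count):
        """Allocate frames from the guest kernel and hand their PFNs to the VMM."""
        pfns = [self.kernel.alloc_frame() for _ in range(count)]
        self.vmm.claim(pfns)
        return pfns

    def deflate(self, pfn):
        """Free a frame the VMM handed back so the guest kernel can reuse it."""
        self.kernel.free_frame(pfn)


kernel = GuestKernel(range(8))
vmm = BalloonVMM({pfn: 100 + pfn for pfn in range(8)})
driver = BalloonDriver(kernel, vmm)

taken = driver.inflate(3)
free_after_inflate = len(kernel.free_pfns)
returned = vmm.release_one()
driver.deflate(returned)
print(free_after_inflate, len(kernel.free_pfns), len(vmm.spare_mfns))
```

The guest kernel only ever sees ordinary allocations and frees from one of its own drivers, which is why no other guest changes are needed.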
Virtualizing Devices
• Familiar by now: trap-and-emulate
  – I/O space traps
  – Protect memory and trap
  – “Device model”: software model of device in VMM
• Interrupts → upcalls to Guest OS
  – Emulate interrupt controller (APIC) in Guest
  – Emulate DMA with copy into Guest PAS
• Significant performance overhead!
Paravirtualized devices
• “Fake” device drivers which communicate efficiently with VMM via hypercalls
  – Used for block devices like disk controllers
  – Network interfaces
  – “VMware tools” is mostly about these
• Dramatically better performance!
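The performance gap between an emulated device and a paravirtualized one comes largely from the number of transitions into the VMM. A toy comparison, assuming a made-up disk with `SECTOR` and `DATA` registers; both classes and the `hypercall_submit` name are hypothetical.

```python
class EmulatedDisk:
    """Device model: every guest register write traps into the VMM."""

    def __init__(self):
        self.traps = 0
        self.blocks = {}
        self._sector = None

    def write_register(self, reg, value):
        self.traps += 1  # one trap per register access
        if reg == "SECTOR":
            self._sector = value
        elif reg == "DATA":
            self.blocks[self._sector] = value
        else:
            raise ValueError(f"unknown register {reg!r}")


class ParavirtDisk:
    """Back end of a 'fake' driver: one hypercall carries a whole batch."""

    def __init__(self):
        self.hypercalls = 0
        self.blocks = {}

    def hypercall_submit(self, requests):
        self.hypercalls += 1
        for sector, data in requests:
            self.blocks[sector] = data


writes = [(n, f"block-{n}") for n in range(16)]

emulated = EmulatedDisk()
for sector, data in writes:
    emulated.write_register("SECTOR", sector)
    emulated.write_register("DATA", data)

paravirt = ParavirtDisk()
paravirt.hypercall_submit(writes)

print(emulated.traps, "traps vs", paravirt.hypercalls, "hypercall")
```

Both paths leave the same data on the virtual disk; the paravirtualized driver simply crosses into the VMM far less often.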
Networking
• Virtual network device in the Guest VM
• Hypervisor implements a “soft switch”
  – Entire virtual IP/Ethernet network on a machine
• Many different addressing options
  – Separate IP addresses
  – Separate MAC addresses
  – NAT
• Etc.
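A soft switch for the separate-MAC-address case can be sketched as a learning switch. This is a minimal illustration, not any hypervisor's implementation: `SoftSwitch`, the port names, and the short MAC strings are assumptions. Frames to unknown or broadcast destinations are flooded to every other port.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


class SoftSwitch:
    def __init__(self):
        self.ports = {}      # port name -> frames delivered to that VM's virtual NIC
        self.mac_table = {}  # learned: source MAC -> port it was seen on

    def attach(self, port):
        self.ports[port] = []

    def send(self, in_port, src, dst, payload):
        """Forward one frame; return the list of ports it was delivered to."""
        self.mac_table[src] = in_port  # learn where the sender lives
        out_port = self.mac_table.get(dst)
        if dst == BROADCAST or out_port is None:
            targets = [p for p in self.ports if p != in_port]  # flood
        elif out_port == in_port:
            targets = []  # destination is on the sending port: nothing to do
        else:
            targets = [out_port]
        for port in targets:
            self.ports[port].append((src, dst, payload))
        return targets


switch = SoftSwitch()
for vm in ("vm1", "vm2", "vm3"):
    switch.attach(vm)

first = switch.send("vm1", "aa:01", "aa:02", "hello")      # aa:02 unknown: flood
reply = switch.send("vm2", "aa:02", "aa:01", "hi")         # aa:01 learned: unicast
second = switch.send("vm1", "aa:01", "aa:02", "again")     # aa:02 learned: unicast
everyone = switch.send("vm3", "aa:03", BROADCAST, "who?")  # broadcast: flood
print(first, reply, second, everyone)
```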
Where are the real drivers?
1. In the Hypervisor
  – E.g. VMware ESX
  – Problem: need to rewrite device drivers (new OS)
2. In the console OS
  – Export virtual devices to other VMs
3. In “driver domains”
  – Map hardware directly into a “trusted” VM
    • Device Passthrough
  – Run your favorite OS just for the device driver
  – Use IOMMU hardware to protect other memory from driver VM
4. Use “self-virtualizing devices”
Xen 3.x Architecture
[Diagram, bottom to top:]
• Hardware (SMP, MMU, physical memory, Ethernet, SCSI/IDE)
• Xen Virtual Machine Monitor: Control IF, Safe HW IF, Event Channel, Virtual CPU, Virtual MMU
• VM0: Guest OS (XenLinux), Device Manager & Control s/w, Native Device Drivers
• VM1: Guest OS (XenLinux), Unmodified User Software, Front-End Device Drivers
• VM2: SMP Guest OS (XenLinux), Unmodified User Software, Front-End Device Drivers
• VM3: Unmodified Guest OS (WinXP), Unmodified User Software, Front-End Device Drivers
• Also labelled: Virtual switch
Thanks to Steve Hand for some of these diagrams
Remember this card?
SR-IOV
• Single-Root I/O Virtualization
• Key idea: dynamically create new “PCIe devices”
  – Physical Function (PF): original device, full functionality
  – Virtual Function (VF): extra “device”, limited functionality
  – VFs created/destroyed via PF registers
• For networking:
  – Partitions a network card’s resources
  – With direct assignment can implement passthrough
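SR-IOV in action
[Diagram: an SR-IOV NIC attached to the LAN contains a virtual Ethernet bridge/switch and packet classifier, and exposes one physical function and three virtual functions over PCIe. Above PCIe sit the IOMMU and the VMM. Three VMs each run a VF driver; one VM runs the PF driver and a VSwitch; one VM runs a VNIC driver.]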
Self-virtualizing devices
• Can dynamically create up to 2048 distinct PCI devices on demand!
  – Hypervisor can create a virtual NIC for each VM
  – Soft switch driver programs “master” NIC to demux packets to each virtual NIC
  – PCI bus is virtualized in each VM
  – Each Guest OS appears to have “real” NIC, talks direct to the real hardware
Next Week
Reliable storage
OS Research/Future™