Copyright © 2007 VMware, Inc. All rights reserved.
MIT IAP Course
Lecture #1: Virtualization 101
Carl Waldspurger (SB SM ’89 PhD ’95)
VMware R&D
January 16, 2007
What is Virtualization?
Virtual systems
• Abstract physical components using logical objects
• Dynamically bind logical objects to physical configurations
Examples
• Network – Virtual LAN (VLAN), Virtual Private Network (VPN)
• Storage – Storage Area Network (SAN), LUN
• Computer – Virtual Machine (VM), simulator
vir•tu•al (adj): existing in essence or effect, though not in actual fact
Overview
Virtual Machines
Virtualization Approaches
Processor Virtualization
Additional Topics
Starting Point: A Physical Machine
Physical Hardware
• Processors, memory, chipset, I/O bus and devices, etc.
• Physical resources often underutilized
Software
• Tightly coupled to hardware
• Single active OS image
• OS controls hardware
What is a Virtual Machine?
Hardware-Level Abstraction
• Virtual hardware: processors, memory, chipset, I/O devices, etc.
• Encapsulates all OS and application state
Virtualization Software
• Extra level of indirection decouples hardware and OS
• Multiplexes physical hardware across multiple “guest” VMs
• Strong isolation between VMs
• Manages physical resources, improves utilization
VM Isolation
Secure Multiplexing
• Run multiple VMs on single physical host
• Processor hardware isolates VMs, e.g. MMU
Strong Guarantees
• Software bugs, crashes, viruses within one VM cannot affect other VMs
Performance Isolation
• Partition system resources
• Example: VMware controls for reservation, limit, shares
VM Encapsulation
Entire VM is a File
• OS, applications, data
• Memory and device state
Snapshots and Clones
• Capture VM state on the fly and restore to point-in-time
• Rapid system provisioning, backup, remote mirroring
Easy Content Distribution
• Pre-configured apps, demos
• Virtual appliances
VM Compatibility
Hardware-Independent
• Physical hardware hidden by virtualization layer
• Standard virtual hardware exposed to VM
Create Once, Run Anywhere
• No configuration issues
• Migrate VMs between hosts
Legacy VMs
• Run ancient OS on new platform
• E.g. DOS VM drives virtual IDE and vLance devices, mapped to modern SAN and GigE hardware
Common Virtualization Uses Today
Server Consolidation and Containment – Eliminate server sprawl by deploying systems into virtual machines that can run safely and move transparently across shared hardware
Test and Development – Rapidly provision test and development servers; store libraries of pre-configured test machines
Enterprise Desktop – Secure unmanaged PCs without compromising end-user autonomy by layering a security policy in software around desktop virtual machines
Business Continuity – Reduce cost and complexity by encapsulating entire systems into single files that can be replicated and restored onto any target server
Overview
Virtual Machines
Virtualization Approaches
• Virtual machine monitors (VMMs)
• Virtualization platform types
• Alternative system virtualizations
Processor Virtualization
Additional Topics
What is a Virtual Machine Monitor?
VMM Characteristics
• Fidelity
• Performance
• Isolation / Safety
An Old Concept
• Classic definition from Popek & Goldberg ’74
• IBM mainframes since ’60s
VMM Technology
So this is just like Java, right?
• No, a Java VM is very different from the physical machine that runs it
• A hardware-level VM reflects underlying processor architecture
Like a simulator or emulator that can run old Nintendo games?
• No, they emulate the behavior of different hardware architectures
• Simulators generally have very high overhead
• A hardware-level VM utilizes the underlying physical processor directly
VMMs Past
An Old Idea
• Hardware-level VMs since ’60s
• IBM S/360, IBM VM/370 mainframe systems
• Timeshare multiple single-user OS instances on expensive hardware
Classical VMM
• Run VM directly on hardware
• “Trap and emulate” model for privileged instructions
• Vendors had vertical control over proprietary hardware, operating systems, VMM
From IBM VM/370 product announcement, ca. 1972
VMMs Present
Renewed Interest
• Academic research since ’90s
• VMs for commodity systems
• Server consolidation
VMM for x86
• Industry-standard hardware, from laptops to datacenter
• Run unmodified commodity guest operating systems
• Significant challenges, e.g. “non-virtualizable” instructions
• Pioneered by VMware in ’98
VMware Fusion for Mac OS X running WinXP, 2006
VMM Platform Types
Hosted Architecture
• Install as application on existing x86 “host” OS, e.g. Windows, Linux, OS X
• Small context-switching driver
• Leverage host I/O stack and resource management
• Examples: VMware Player/Workstation/Server, Microsoft Virtual PC/Server, Parallels Desktop
Bare-Metal Architecture
• “Hypervisor” installs directly on hardware
• Acknowledged as preferred architecture for high-end servers
• Examples: VMware ESX Server, Xen, Microsoft Viridian (2008)
System Virtualization Alternatives
Virtual machines can be abstracted using a layer at different places
• Language Level
• OS Level
• Hardware Level
System Virtualization Taxonomy
System Virtualization
High-Level Language
• Java
• Microsoft .NET / Mono
• Smalltalk
Hardware Level
Bare-Metal/Hypervisor
• HP Integrity VM
• IBM zSeries z/VM
• VMware ESX Server
• Xen
Hosted
• Microsoft Virtual Server
• Microsoft Virtual PC
• Parallels Desktop
• VMware Player
• VMware Workstation
• VMware Server
Para-virtualization
• Virtual Iron
• VMware VMI
• Xen
OS Level
• FreeBSD Jail
• HP Secure Resource Partitions
• Sun Solaris Zones
• SWsoft Virtuozzo
• User-Mode Linux
Emulators
• Bochs
• Microsoft VPC for Mac
• QEMU
• Virtutech Simics
Overview
Virtual Machines
Virtualization Approaches
Processor Virtualization
• Classical techniques
• Software x86 VMM
• Hardware-assisted x86 VMM
• Para-virtualization
Additional Topics
Classical Instruction Virtualization
Trap and Emulate
• Run guest operating system deprivileged
• All privileged instructions trap into VMM
• VMM emulates instructions against virtual state, e.g. disable virtual interrupts, not physical interrupts
• Resume direct execution from next guest instruction
Implementation Technique
• This is just one technique
• Popek and Goldberg criteria permit others
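The trap-and-emulate steps above can be sketched as a toy interpreter loop. This is a minimal illustration, not how a real VMM is built: the three-instruction "ISA" (`inc`, `cli`, `sti`), the `VirtualCPU` fields, and every function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VirtualCPU:
    """Per-VM virtual state that the VMM emulates against (hypothetical fields)."""
    pc: int = 0
    interrupts_enabled: bool = True   # virtual interrupt flag, not the physical one
    acc: int = 0


# Hypothetical toy instruction set: "cli"/"sti" are privileged, "inc" is not.
PRIVILEGED = {"cli", "sti"}


class PrivilegeTrap(Exception):
    """Raised when deprivileged guest code issues a privileged instruction."""


def direct_execute(vcpu, program):
    """Run guest instructions directly until one of them traps."""
    while vcpu.pc < len(program):
        op = program[vcpu.pc]
        if op in PRIVILEGED:
            raise PrivilegeTrap(op)
        if op == "inc":
            vcpu.acc += 1
        vcpu.pc += 1


def emulate(vcpu, op):
    """VMM handler: apply the instruction's effect to virtual state only."""
    if op == "cli":
        vcpu.interrupts_enabled = False
    elif op == "sti":
        vcpu.interrupts_enabled = True
    vcpu.pc += 1   # resume direct execution from the next guest instruction


def run_vm(program):
    """Alternate between direct execution and emulation; count the traps."""
    vcpu, traps = VirtualCPU(), 0
    while vcpu.pc < len(program):
        try:
            direct_execute(vcpu, program)
        except PrivilegeTrap as trap:
            traps += 1
            emulate(vcpu, trap.args[0])
    return vcpu, traps
```

Running `run_vm(["inc", "cli", "inc"])` takes one trap for `cli` and changes only the virtual interrupt flag; the two `inc` instructions never enter the VMM.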
Classical Memory Virtualization
Traditional VMM Approach
Extra Level of Indirection
• Virtual → “Physical”: guest maps VPN to PPN using primary page tables
• “Physical” → Machine: VMM maps PPN to MPN
Shadow Page Table
• Composite of two mappings
• For ordinary memory references, hardware maps VPN to MPN
• Cached by physical TLB
[Diagram: the guest maps VPN → PPN and the VMM maps PPN → MPN; the shadow page table maps VPN → MPN and is cached by the hardware TLB]
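The "composite of two mappings" can be shown with two dictionaries. This is a sketch under simplifying assumptions: page tables are flat dicts rather than multi-level structures, and `build_shadow` and its parameter names are hypothetical.

```python
def build_shadow(guest_page_table, vmm_pmap):
    """Compose VPN->PPN (guest) with PPN->MPN (VMM) into VPN->MPN."""
    shadow = {}
    for vpn, ppn in guest_page_table.items():
        if ppn in vmm_pmap:            # skip PPNs with no backing machine page
            shadow[vpn] = vmm_pmap[ppn]
    return shadow


guest_page_table = {0: 10, 1: 11, 2: 12}   # VPN -> PPN, maintained by the guest
vmm_pmap = {10: 700, 11: 701}              # PPN -> MPN, maintained by the VMM
shadow = build_shadow(guest_page_table, vmm_pmap)
```

Here VPN 2 has no shadow entry because its PPN is not currently backed by a machine page; in a real system a reference to it would fault into the VMM.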
Memory Traces
Shadow Page Table
• Derived from primary page table in guest
• VMM must keep primary and shadow coherent
Trace = Coherency Mechanism
• Write-protect primary page table
• Trap guest writes to primary
• Update or invalidate corresponding shadow
• Transparent to guest
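A trace can be sketched as an interception point on guest page table writes. The class below is a simplified illustration, assuming flat dict page tables and an invalidate-then-refill policy; `TracedPageTable` and its methods are hypothetical names.

```python
class TracedPageTable:
    """Guest primary page table whose writes are intercepted by the VMM."""

    def __init__(self, pmap):
        self.primary = {}      # VPN -> PPN, owned by the guest
        self.shadow = {}       # VPN -> MPN, owned by the VMM
        self.pmap = pmap       # PPN -> MPN, owned by the VMM
        self.trace_hits = 0

    def guest_write(self, vpn, ppn):
        """Stands in for a guest store to a write-protected page table page."""
        self.trace_hits += 1           # the write traps into the VMM
        self.primary[vpn] = ppn        # VMM applies the write on the guest's behalf
        self.shadow.pop(vpn, None)     # invalidate the now-stale shadow entry

    def translate(self, vpn):
        """Hardware-style lookup; refill the shadow entry lazily on a miss."""
        if vpn not in self.shadow:
            self.shadow[vpn] = self.pmap[self.primary[vpn]]
        return self.shadow[vpn]
```

The guest only ever sees `primary`, so the mechanism stays transparent to it; the cost is one trap per page table write.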
Classical VMM Performance
Native Speed Except for Traps
• No overhead in direct execution
• Overhead = trap frequency × average trap cost
Trap Sources
• Most frequent: Guest page table traces
• Privileged instructions
• Memory-mapped device traces
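The overhead formula can be turned into a quick estimate. The numbers below are purely illustrative assumptions, not measurements from the lecture, and `trap_overhead_fraction` is a hypothetical helper.

```python
def trap_overhead_fraction(traps_per_second, avg_trap_cycles, cpu_hz):
    """Fraction of CPU time spent in traps: trap frequency x average trap cost."""
    return traps_per_second * avg_trap_cycles / cpu_hz


# Assumed workload: 100,000 traps/s at 3,000 cycles each on a 3 GHz processor.
overhead = trap_overhead_fraction(100_000, 3_000, 3_000_000_000)
print(f"{overhead:.0%} of CPU time spent handling traps")
```

With these assumed figures about a tenth of the processor goes to trap handling, which is why reducing the most frequent trap source matters more than shaving cycles off rare ones.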
x86 Virtualization Challenges
Not Classically Virtualizable
• x86 ISA includes instructions that read or modify privileged state
• But which don’t trap in unprivileged mode
Example: POPF instruction
• Pop top-of-stack into EFLAGS register
• EFLAGS.IF bit privileged (interrupt enable flag)
• POPF silently ignores attempts to alter EFLAGS.IF in unprivileged mode!
• So no trap to return control to VMM
Deprivileging not possible with x86!
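The POPF behavior can be modeled in a few lines. This is a deliberately simplified sketch of protected-mode semantics: only the IF bit (bit 9 of EFLAGS) is modeled, the `iopl` parameter is passed in rather than read from EFLAGS, and the function name is hypothetical.

```python
IF_BIT = 1 << 9   # EFLAGS.IF is bit 9 on x86


def popf(eflags, popped_value, cpl, iopl=0):
    """Simplified protected-mode POPF: return the new EFLAGS value.

    Only the IF bit is modeled precisely; other privileged bits are ignored.
    """
    if cpl <= iopl:
        return popped_value                      # privileged enough: IF is updated
    # CPL > IOPL: IF keeps its old value, and no fault is raised.
    return (popped_value & ~IF_BIT) | (eflags & IF_BIT)


old_flags = IF_BIT | 1                 # interrupts enabled, carry flag set
user_result = popf(old_flags, 0, cpl=3)     # deprivileged guest kernel tries to clear IF
kernel_result = popf(old_flags, 0, cpl=0)   # the same instruction at CPL 0
```

At CPL 3 the attempt to clear IF is dropped without any exception, so a classical VMM never gets the trap it would need to update the virtual interrupt flag.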
How to Virtualize x86?
Interpretation
• Problem – too inefficient
• x86 decoding slow
Code Patching
• Problem – not transparent
• Guest can inspect its own code
Binary Translation (BT)
• Approach pioneered by VMware
• Run any unmodified x86 OS in VM
Extend x86 Architecture
Software VMM: Binary Translation
Direct execute unprivileged guest application code
• Will run at full speed until it traps, we get an interrupt, etc.
“Binary translate” all guest kernel code, run it unprivileged
• Since x86 has non-virtualizable instructions, proactively transfer control to the VMM (no need for traps)
• Safe instructions are emitted without change
• For “unsafe” instructions, emit a controlled emulation sequence
• VMM translation cache for good performance
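The safe/unsafe split can be sketched over mnemonic strings. A real translator decodes x86 bytes and emits machine code; here the `UNSAFE` set, the `vmm_emulate_*` call names, and `translate_block` are all hypothetical stand-ins.

```python
# Hypothetical instruction classification; a real translator decodes x86 bytes.
UNSAFE = {"popf", "cli", "sti"}


def translate_block(basic_block):
    """Translate a list of mnemonic strings into a list of output instructions."""
    fragment = []
    for insn in basic_block:
        opcode = insn.split()[0]
        if opcode in UNSAFE:
            # Emit a controlled emulation sequence that acts on virtual CPU state.
            fragment.append(f"call vmm_emulate_{opcode}")
        else:
            fragment.append(insn)   # safe instructions pass through unchanged
    return fragment


translated = translate_block(["push %ebp", "cli", "mov %ebp, %eax", "popf"])
```

Because the unsafe instructions are rewritten before they run, control reaches the VMM's emulation code without relying on a hardware trap.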
VMware Translator Properties
Binary – input is x86 “hex”, not source
Dynamic – interleave translation and execution
On Demand – translate only what is about to execute (lazy)
System Level – makes no assumptions about guest code
Subsetting – full x86 to safe subset
Adaptive – adjust translations based on guest behavior
BT Mechanics
Each Translator Invocation
• Consume a basic block (BB)
• Produce a compiled code fragment (CCF)
Store CCF in Translation Cache
• Future reuse
• Capture working set of guest kernel
• Amortize translation costs
• Not “patching in place”
[Diagram: the translator takes a BB as input (x86 bytes 55 ff 33 c7 03 ...) and produces a CCF as output (55 ff 33 c7 03 ...)]
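The translation cache can be sketched as a dictionary keyed by guest address. This assumes an unbounded cache with no eviction or invalidation, which a real VMM would need; `TranslationCache`, `lookup`, and `fetch_block` are hypothetical names.

```python
class TranslationCache:
    """Maps the guest address of a basic block to its compiled code fragment."""

    def __init__(self, translator):
        self.translator = translator   # callable: basic block -> CCF
        self.cache = {}                # guest address of BB -> CCF
        self.translations = 0

    def lookup(self, guest_addr, fetch_block):
        """Return the CCF for guest_addr, translating on first use only."""
        if guest_addr not in self.cache:
            self.translations += 1
            self.cache[guest_addr] = self.translator(fetch_block(guest_addr))
        return self.cache[guest_addr]


# Toy guest code and a stand-in translator that just copies the block.
guest_code = {0x1000: ["push %ebp", "push (%ebx)"]}
tc = TranslationCache(lambda bb: list(bb))
for _ in range(3):
    ccf = tc.lookup(0x1000, guest_code.__getitem__)
```

The block is translated once and reused twice, which is how translation cost is amortized. The guest's own code at `0x1000` is never modified, matching the "not patching in place" point.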
Example: IDENT Translation
80403a69 push %ebp
80403a6a push (%ebx)
80403a6c mov (%ebx), ffffffff
80403a72 mov %edx, %esp
80403a74 mov %esp, 81c(%ebx)
80403a7a push %edx
80403a7b mov %ebp, %eax
80403a7d call 80460ba4
25555b0 push %ebp
25555b1 push (%ebx)
25555b3 mov (%ebx), ffffffff
25555b9 mov %edx, %esp
25555bb mov %esp, 81c(%ebx)
25555c1 push %edx
25555c2 mov %ebp, %eax
25555c4 push 80403a82
25555c9 int 3a
25555cb data: 80460ba4
The first listing is the input BB; the second listing (starting at 25555b0) is the output CCF.
• 25555c4: push return address
• 25555c9: invoke translator on callee
Adaptive BT
Translated Code Is Fast
• Mostly IDENT translations
• Runs “at speed”
Except Writes to Traced Memory
• Page fault (shown as !*!)
• Decode and interpret instruction
• Fire trace callbacks
• Resume execution
• Can take 1000’s of cycles
[Diagram: translation cache, with a trace fault (!*!) and a path that invokes the translator]
Adaptive BT: Fast Trace Handling
Detect and Track Trace Faults
Splice in TRACE Translation
• Execute memory access in software
• Avoid page fault
• No re-decoding
• Faster resumption
Faster Traces
• 10x performance improvement
• Adapts to runtime behavior
[Diagram: translation cache with a JMP to a TRACE translation, and a path that invokes the translator]
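The detect-track-splice cycle can be sketched with a per-instruction fault counter. The threshold value and all names below are assumptions for illustration; the slides do not say when the VMM decides to retranslate.

```python
from collections import Counter

TRACE_THRESHOLD = 3   # hypothetical; the slides do not give a threshold


class AdaptiveTranslator:
    def __init__(self):
        self.fault_counts = Counter()   # guest instruction address -> trace faults seen
        self.kind = {}                  # guest instruction address -> "IDENT" or "TRACE"

    def execute_traced_write(self, addr):
        """Return how a write to traced memory by this instruction was handled."""
        if self.kind.get(addr, "IDENT") == "TRACE":
            return "software"           # memory access done in software, no page fault
        self.fault_counts[addr] += 1    # IDENT translation faults into the VMM
        if self.fault_counts[addr] >= TRACE_THRESHOLD:
            self.kind[addr] = "TRACE"   # splice in the TRACE translation
        return "page_fault"


translator = AdaptiveTranslator()
outcomes = [translator.execute_traced_write(0x80403A6C) for _ in range(5)]
```

The first three executions take the slow page-fault path; once the instruction is identified as a frequent trace writer, later executions avoid the fault entirely.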
Software VMM Evaluation
Benefits
• Adaptation
• Fast traces
• Fast I/O emulation
• Flexibility
Costs
• Running translator
• Path lengthening
• System call slowdown
• Complexity
Hardware-Assisted VMM
Recent x86 Extension
• 1998 – 2005: Software-only VMMs using binary translation
• 2005: Intel and AMD start extending x86 to support virtualization
First-Generation Hardware
• Enables classical trap-and-emulate VMMs
• Intel VT, aka “Vanderpool Technology”
• AMD SVM, aka “Pacifica”
Performance
• VT/SVM help avoid BT, but not MMU ops (actually slower!)
• Main problem is efficient virtualization of MMU and I/O, not executing the virtual instruction stream
VT/SVM Architecture
Diagram
• Y-axis: old school x86 privilege (CPL)
• X-axis: virtualization privilege
Guest Mode
• Runs unmodified OS
• Sensitive operations “exit” (trap out) to host mode
VMCB
• Virtual Machine Control Block
• VMM-controlled, hardware-walked
• Buffers simple exits
[Diagram: CPL 0 through CPL 3 on the y-axis, shown once for host mode and once for guest mode on the x-axis]
Hardware-Assisted VMM
[Diagram: hardware-assisted direct execution runs in guest mode (CPL 0-3) and the VMM runs in host mode (CPL 0-3); a fault, trace, interrupt, I/O, etc. exits to the VMM, which then resumes the guest]
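The exit-and-resume cycle can be sketched as a dispatch loop. This is only a control-flow illustration: `run_guest` replays a scripted list of exit reasons instead of executing anything, and the handler names and return strings are hypothetical.

```python
def run_guest(events):
    """Stand-in for guest mode: yields the reason for each exit to host mode."""
    yield from events


def vmm_loop(events):
    """Handle each exit in host mode, then resume the guest."""
    handlers = {
        "fault": lambda: "emulated fault",
        "trace": lambda: "updated shadow state",
        "interrupt": lambda: "forwarded interrupt",
        "io": lambda: "emulated device access",
    }
    log = []
    for exit_reason in run_guest(events):     # guest mode -> host mode
        log.append(handlers[exit_reason]())   # VMM handles the exit
        # Continuing the loop models "resume guest".
    return log


handled = vmm_loop(["io", "trace", "interrupt"])
```

Everything between two exits runs without VMM involvement, so the cost of this model is concentrated in how often the loop body executes.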
Hardware-Assisted VMM Evaluation
Benefits
• Simplicity (no BT)
• Fast system calls
• No translator overheads
Costs
• Exits: 1000’s of cycles for traces and I/O
• No adaptation or software flexibility
• Stateless model
Future
• Hardware support for fast MMU virtualization
• Intel EPT, AMD NPT
What is Paravirtualization?
Full Virtualization
• No modifications to guest OS
• Excellent compatibility, good performance, but complex
Paravirtualization Exports Simpler Architecture
• Term coined by Denali project in ’01, popularized by Xen
• Modify guest OS to be aware of virtualization layer
• Remove non-virtualizable parts of architecture
• Avoid rediscovery of knowledge in hypervisor
• Excellent performance and simple, but poor compatibility
Ongoing Linux Standards Work
• “Paravirt Ops” interface between guest and hypervisor
• Small team from VMware, Xen, IBM LTC, etc.
Paravirtualization: Conceptual Diagram
[Diagram: Hardware / Hypervisor / Guest OS stacks, one for Full Virtualization and one for Paravirtualization; labels: “Hypercalls (GOOD)”, “System call interface”, “NOT GOOD!”]
VMware Vision: Transparent Paravirtualization
Same OS binary
[Diagram labels: Native, Xen 3.0.x, VMware ESX; Dom0, DomU, XenoLinux, VMI Linux, Windows, Solaris]
Further Reading
VMware Publications
• www.vmware.com/academic/resources.html
• A Comparison of Software and Hardware Techniques for x86 Virtualization (ASPLOS ’06)
• Fast Transparent Migration for Virtual Machines (USENIX ’05)
• Memory Resource Management in VMware ESX Server (OSDI ’02)
• Virtualizing I/O Devices on VMware Workstation’s Hosted VMM (USENIX ’01)
Additional Academic Publications
• Xen and the Art of Virtualization (SOSP ’03)
• Disco: Running Commodity Operating Systems on Scalable Multiprocessors (SOSP ’97)
• Many more …
Additional Topics
I/O Virtualization
Memory Management
I/O Virtualization Stack
Guest Device Driver
Virtual Device
• Model existing device, e.g. e1000
• Model an idealized device, e.g. vmxnet
Virtualization Layer
• Emulates the virtual device
• Remaps guest and real I/O addresses
• Multiplexes and drives physical device
• Provides additional features, e.g. transparent NIC teaming
Real Device
• Physical hardware, e.g. bcm5700
• Likely to be different than virtual device
[Diagram: guest OS device driver, device emulation, I/O stack, physical device driver]
I/O Virtualization Implementations
Emulated I/O
• Hosted or Split (device emulation alongside a Host OS / Dom0 / Parent Domain): VMware Workstation, VMware Server, VMware ESX Server (for slow devices), Xen, Microsoft Viridian, Virtual Server
• Hypervisor Direct: VMware ESX Server (storage and network)
Passthrough I/O
• A future option
• Many challenges
Passthrough I/O Virtualization
High Performance
• Guest drives device directly
• Minimizes CPU utilization
Enabled by HW Assists
• I/O-MMU for DMA isolation, e.g. Intel VT-d, AMD IOMMU
• Partitionable I/O device, e.g. PCI-SIG IOV spec
Challenges
• Hardware independence
• Migration, suspend/resume
• Memory overcommitment
[Diagram: three guest OS device drivers each reach a VF of one I/O device through the I/O MMU; the device manager in the virtualization layer is attached to the PF]
PF = Physical Function, VF = Virtual Function
Additional Topics
I/O Virtualization
Memory Management
Memory Management
Desirable capabilities
• Efficient memory overcommitment
• Accurate resource controls
• Exploit sharing opportunities
Challenges
• Allocations should reflect both importance and working set
• Best data to guide decisions known only to guest OS
• Guest and meta-level policies may clash
VMware Memory Management
Reclamation mechanisms
• Ballooning – guest driver allocates pinned PPNs, hypervisor deallocates backing MPNs
• Swapping – hypervisor transparently pages out PPNs, paged in on demand
• Page sharing – hypervisor identifies identical PPNs based on content, maps to same MPN copy-on-write
Allocation policies
• Proportional sharing – revoke memory from VM with minimum shares-per-page ratio
• Idle memory tax – charge VM more for idle pages than for active pages to prevent unproductive hoarding
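The two allocation policies combine into one victim-selection rule. The sketch below uses one formulation of the adjusted shares-per-page ratio, ρ = S / (P · (f + k · (1 − f))) with k = 1 / (1 − τ), as I recall it from the ESX Server memory management paper listed under Further Reading; treat the exact formula, the 75% tax rate, and all function names as assumptions.

```python
def adjusted_shares_per_page(shares, pages, active_fraction, tax_rate=0.75):
    """Shares-per-page ratio in which idle pages cost k times as much as active ones."""
    k = 1.0 / (1.0 - tax_rate)   # cost of an idle page relative to an active page
    return shares / (pages * (active_fraction + k * (1.0 - active_fraction)))


def pick_victim(vms, tax_rate=0.75):
    """Return the name of the VM with the minimum adjusted shares-per-page ratio."""
    return min(vms, key=lambda name: adjusted_shares_per_page(*vms[name], tax_rate))


# Hypothetical VMs: (shares, allocated pages, fraction of pages actively used).
vms = {"busy": (1000, 512, 1.0), "idle": (1000, 512, 0.25)}
victim = pick_victim(vms)
```

Both VMs hold the same shares and the same number of pages, so plain proportional sharing cannot tell them apart; the idle memory tax lowers the ratio of the mostly idle VM, and memory is revoked from it first.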
Ballooning
• Inflate balloon (+ pressure): guest OS may page out to virtual disk
• Deflate balloon (– pressure): guest OS may page in from virtual disk
• Guest OS manages memory
• Implicit cooperation
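Ballooning can be sketched from the guest's point of view. This toy model counts pages only and assumes the guest pages out exactly as much as it must; `BalloonedGuest` and its methods are hypothetical, and a real balloon driver works through the guest's own allocator.

```python
class BalloonedGuest:
    def __init__(self, total_pages):
        self.total_pages = total_pages
        self.balloon = 0        # pages pinned by the balloon driver
        self.in_use = 0         # pages holding guest data in memory
        self.paged_out = 0      # pages the guest pushed to its virtual disk

    def use(self, pages):
        self.in_use += pages
        self._rebalance()

    def set_balloon(self, target):
        """Hypervisor asks the driver to inflate or deflate to `target` pages."""
        self.balloon = target
        self._rebalance()
        return self.balloon     # pages whose backing MPNs the hypervisor may reclaim

    def _rebalance(self):
        # The guest's own policy decides what to evict; this model only counts pages.
        available = self.total_pages - self.balloon
        if self.in_use > available:
            self.paged_out += self.in_use - available    # may page out to virtual disk
            self.in_use = available
        else:
            paged_in = min(self.paged_out, available - self.in_use)
            self.in_use += paged_in                      # may page in from virtual disk
            self.paged_out -= paged_in


guest = BalloonedGuest(total_pages=100)
guest.use(80)
reclaimable = guest.set_balloon(40)   # inflate: the guest must fit in 60 pages
```

The hypervisor never chooses which guest pages to evict; it only changes the pressure, and the guest's own memory manager does the rest.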
Page Sharing
Motivation
• Multiple VMs running same OS, apps
• Collapse redundant copies of code, data, zeros
Transparent page sharing
• Map multiple PPNs to single MPN copy-on-write
• Pioneered by Disco [Bugnion ’97], but required guest OS hooks
Content-based sharing
• General-purpose, no guest OS changes
• Background activity saves memory over time
Page Sharing: Scan Candidate PPN
[Diagram: VM 1, VM 2, VM 3 and machine memory; the candidate page's contents are hashed (…2bd806af) and looked up in a hash table; a hint frame records Hash, VM, PPN, MPN]
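Content-based sharing with copy-on-write can be sketched as follows. This is a simplified model, not the ESX implementation: it uses SHA-1 from `hashlib` as a stand-in hash, never reuses MPN numbers, and omits hint frames; `PageSharer` and its methods are hypothetical names.

```python
import hashlib


class PageSharer:
    """Toy hypervisor page map with content-based sharing (all names hypothetical)."""

    def __init__(self):
        self.machine = {}     # MPN -> page contents
        self.refs = {}        # MPN -> number of (vm, ppn) mappings
        self.pmap = {}        # (vm, ppn) -> MPN
        self.by_hash = {}     # content hash -> candidate MPN
        self.next_mpn = 0

    def _alloc(self, data):
        mpn, self.next_mpn = self.next_mpn, self.next_mpn + 1
        self.machine[mpn], self.refs[mpn] = data, 1
        return mpn

    def _release(self, mpn):
        self.refs[mpn] -= 1
        if self.refs[mpn] == 0:
            del self.machine[mpn], self.refs[mpn]

    def add_page(self, vm, ppn, data):
        self.pmap[(vm, ppn)] = self._alloc(data)

    def scan(self, vm, ppn):
        """Scan one candidate PPN; return True if it was collapsed onto a shared MPN."""
        mpn = self.pmap[(vm, ppn)]
        digest = hashlib.sha1(self.machine[mpn]).digest()
        match = self.by_hash.get(digest)
        # A hash hit is only a hint: confirm with a full comparison before sharing.
        if match is not None and match != mpn and self.machine.get(match) == self.machine[mpn]:
            self.pmap[(vm, ppn)] = match
            self.refs[match] += 1
            self._release(mpn)
            return True
        self.by_hash[digest] = mpn
        return False

    def write(self, vm, ppn, data):
        """Guest write: break sharing first if the MPN is mapped more than once."""
        mpn = self.pmap[(vm, ppn)]
        if self.refs[mpn] > 1:
            self._release(mpn)
            self.pmap[(vm, ppn)] = self._alloc(data)   # private copy (copy-on-write)
        else:
            self.machine[mpn] = data
```

Scanning two VMs' zero-filled pages maps both PPNs to one MPN; a later `write` by either VM gives it a private copy and leaves the other VM's page untouched.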