Date post: | 30-Nov-2014 |
Sogang University Distributed Computing & Communication Lab.
Virtualization & Techniques
Kwon-yong Lee
Distributed Computing & Communication Lab. (URL: http://dcclab.sogang.ac.kr)
Dept. of Computer Science, Sogang University
Seoul, Korea
Tel : +82-2-3273-8783 Email : [email protected]
Outline
Virtualization Taxonomy
History of techniques
  Popek & Goldberg
  Classic Virtualization
    • Trap-and-Emulate
  x86 architecture
  Software Virtualization
    • Binary Translation
  Hardware Virtualization
    • Intel VT & AMD SVM
Comparison
Virtualization
Definition of Virtualization
  The abstraction of computer resources
  A technique for hiding the physical characteristics of computing resources from the way in which other systems, applications, or end users interact with those resources
Purposes
  Abstraction - To simplify the use of the underlying resource
  Replication - To create multiple instances of the resource
  Isolation - To separate the uses which clients make of the underlying resources
Abstraction
  Computer systems are built on levels of abstraction.
  Higher levels of abstraction hide details at lower levels.
  Ex) Files are an abstraction of a disk
Machine (1/3)
Different perspectives on what the machine is: OS developer
  ISA (Instruction Set Architecture)
  Major division between hardware and software
Machine (2/3)
Different perspectives on what the machine is: Compiler developer
  ABI (Application Binary Interface)
  User ISA + OS calls
Machine (3/3)
Different perspectives on what the machine is: Application programmer
  API (Application Program Interface)
  User ISA + library calls
Definition for Virtual Machine
By adding Virtualizing Software to a Machine, we can create a Virtual Machine (VM)
Processes or operating systems can run on a VM
Guest : The process or system that runs on a VM
Host : The underlying platform that supports the VM
Virtual Machine Taxonomy
Virtualizing Software
  Process VM
    • At the ABI or API level, atop the OS/HW combination
    • The virtualizing software emulates both user-level instructions and either OS or library calls.
    • Types: Emulator, High-Level Language VM
  System VM
    • Between the host hardware machine and the guest software
    • The VMM emulates the hardware ISA so that the guest software can potentially execute a different ISA from the one implemented on the host.
    • However, in many system VM applications, the VMM does not perform instruction emulation; rather, its primary role is to provide virtualized hardware resources.
    • Types: Classic system VM, Hosted VM, Whole-system VM, Kernel Virtualization, OS-level Virtualization
Process Virtual Machines
Process VMs
  Provide a virtual ABI or API environment for user applications
  Offer replication, emulation, and optimization
  Virtualizing software – Runtime software
Types of Process VMs (1/2)
Emulators
  Support one instruction set on hardware designed for another
  Interpreter
    • Fetches, decodes, and emulates the execution of individual source instructions.
    • Can be slow
  Dynamic Binary Translator (different ISA)
    • Blocks of source instructions converted to target instructions
    • Translated blocks cached to exploit locality.
    • Ex) IA-32 EL, Digital FX!32
  Dynamic Binary Optimizers (same ISA)
    • Optimize code on the fly
    • Same as emulators except source and target ISAs are the same.
    • Ex) HP Dynamo
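The difference between an interpreter and a block-caching translator can be sketched with a toy example. This is only an illustration under stated assumptions: the two-opcode guest "ISA", the `interpret` function, and the `BlockTranslator` class are all hypothetical and do not correspond to any real emulator.

```python
# Toy guest ISA: each instruction is a tuple (opcode, operand).
# Every name below is hypothetical and exists only for illustration.

def interpret(program, acc=0):
    """Interpreter: fetch, decode, and emulate one guest instruction at a time."""
    pc = 0
    while pc < len(program):
        op, arg = program[pc]      # fetch and decode
        if op == "add":            # emulate
            acc += arg
        elif op == "mul":
            acc *= arg
        else:
            raise ValueError(f"unknown opcode {op!r}")
        pc += 1
    return acc


class BlockTranslator:
    """Translates a whole block once and caches the result for later reuse."""

    def __init__(self):
        self.cache = {}        # guest block -> translated host callable
        self.translations = 0  # number of times translation actually ran

    def translate(self, block):
        self.translations += 1
        steps = []
        for op, arg in block:
            if op == "add":
                steps.append(lambda acc, k=arg: acc + k)
            elif op == "mul":
                steps.append(lambda acc, k=arg: acc * k)
            else:
                raise ValueError(f"unknown opcode {op!r}")

        def run(acc):
            for step in steps:
                acc = step(acc)
            return acc

        return run

    def execute(self, block, acc=0):
        key = tuple(block)
        if key not in self.cache:          # translate only on a cache miss
            self.cache[key] = self.translate(key)
        return self.cache[key](acc)


program = [("add", 2), ("mul", 3), ("add", 1)]
translator = BlockTranslator()
first = translator.execute(program)
second = translator.execute(program)   # served from the cache
```

Running the same block twice triggers only one translation, which is the locality effect the cached-block bullet refers to.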
Types of Process VMs (2/2)
High-Level Language VMs Ex) Pascal, JVM, CLR
System Virtual Machines
System VMs
  Provide a complete environment in which an operating system and many processes, possibly belonging to multiple users, can coexist.
  By using system VMs, a single-host hardware platform can support multiple, isolated guest operating system environments simultaneously.
  Virtualizing software – Virtual Machine Monitor (VMM)
Virtual Machine Monitor
  A virtualization system that partitions a single physical machine into multiple VMs
  In a system VM, the Virtual Machine Monitor (VMM) primarily provides platform replication.
  The VMM has access to, and manages, all the hardware resources.
  A guest operating system and its application processes are then managed under (hidden) control of the VMM.
Emulation & Optimization, Replication, Composition
  Emulation : Mix-and-match cross-platform portability
  Optimization : Usually done with emulation for platform-specific performance improvement
  Replication : Multiple VMs on single platform
  Composition : Form more complex flexible systems
Types of System VMs (1/5)
Classic system VMs (same ISA)
  VMM runs directly on the bare machine
    • Try to execute natively on the host ISA
    • VMM provides all device drivers
  Primary goal : high performance
  Example
    • Full Virtualization – Does not modify guest OS (ex. VMware ESX server)
    • Para-virtualization – Modifies guest OS (ex. Xen)
Types of System VMs (2/5)
Hosted VMs (same ISA)
  VMM runs on Host OS
    • Operates in process space
    • Relies on host OS to provide drivers
  Primary goal : Ease of construction/installation/acceptability
  Example
    • Full Virtualization – VMware Workstation, MS’s Virtual Server, VirtualBox
    • Para-virtualization – UML
Types of System VMs (3/5)
Whole-system VMs (different ISA)
  Host and Guest ISA are different
    • Hosted VM + Emulation
  Requires full emulation of Guest OS and its applications
  Example
    • Virtual PC for Mac
Types of System VMs (4/5)
Kernel Virtualization
  The Linux kernel runs the VMs just like any other user-space process.
  Ex) KVM
Types of System VMs (5/5)
OS-level Virtualization
  The kernel of an OS allows for multiple isolated user-space instances (instead of just one).
  Ex) Linux VServer, OpenVZ/Virtuozzo
History of techniques
Evolution Chain
[Figure: evolution chain of virtualization techniques. Node labels: Classic Virtualization (Popek & Goldberg) with System Virtualization and Trap-and-emulate; Enhancement through the VMM / Guest OS interface and the Hardware / VMM interface; Software Virtualization (VMware) with Binary Translation; New Approach; Para-virtualization (Xen); Hardware Support for Virtualization (Intel VT & AMD SVM)]
Classic Virtualization
Trap-and-Emulation
Software Virtualization
Binary translation (BT)
Hardware Virtualization
Intel VT & AMD SVM
Popek and Goldberg
  “Formal requirements for virtualizable third generation architectures,” Communications of the ACM, vol. 17, no. 7, 1974, pp. 412-421
Theorem
  For any conventional third generation computer, a virtual machine monitor may be constructed if the set of sensitive instructions for that computer is a subset of the set of privileged instructions.
Virtualizable Architecture
  An architecture is virtualizable if the sets of behavior and control sensitive instructions are subsets of the set of privileged instructions.
  The IA-32/x86 architecture is not virtualizable.
  In order for an architecture to be virtualizable, Popek and Goldberg determined that all sensitive instructions must also be privileged instructions. Intuitively, this means that a hypervisor must be able to intercept any instructions that change the state of the machine in a way that impacts other processes.
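The subset condition in the theorem can be written as a one-line set check. The sketch below is only an illustration: the instruction lists are small hypothetical samples, not a complete classification of any real ISA, and the function names are invented for this example.

```python
def is_virtualizable(sensitive, privileged):
    """Popek-Goldberg condition: every sensitive instruction is also privileged."""
    return set(sensitive) <= set(privileged)


def non_trapping_sensitive(sensitive, privileged):
    """Sensitive instructions that would NOT trap in user mode (the problem cases)."""
    return sorted(set(sensitive) - set(privileged))


# Hypothetical sample sets, chosen only to illustrate the check.
privileged = {"cli", "sti", "lgdt", "mov_cr3"}
sensitive_clean = {"cli", "mov_cr3"}               # all sensitive ones trap
sensitive_x86_like = {"cli", "mov_cr3", "popf"}    # popf is sensitive but does not trap

clean_ok = is_virtualizable(sensitive_clean, privileged)
x86_like_ok = is_virtualizable(sensitive_x86_like, privileged)
offenders = non_trapping_sensitive(sensitive_x86_like, privileged)
```

A single sensitive instruction outside the privileged set is enough to make the check fail, which is why one instruction such as popf matters for classic trap-and-emulate.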
Popek and Goldberg – VMM
Virtual Machine Monitor (VMM)
  An efficient, isolated duplicate of the real machine
  On a virtualizable architecture, a VMM works using a trap-and-emulate technique.
  Essential characteristics to be considered a VMM
    • Equivalence (Fidelity)
      – Software on the VMM executes identically to its execution on hardware, barring timing effects.
    • Performance
      – An overwhelming majority of guest instructions are executed by the hardware without the intervention of the VMM.
    • Safety
      – The VMM manages all hardware resources.
Popek and Goldberg – Instruction Types
Instruction types
  Privileged instructions
    • Privileged instructions are defined as those that may execute in a privileged mode, but will trap if executed outside this mode.
  Sensitive
    • Control sensitive instructions
      – Control sensitive instructions are those that attempt to change the configuration of resources in the system
      – Ex) Updating virtual to physical memory mappings, communicating with devices, or manipulating global configuration registers
    • Behavior sensitive instructions
      – Behave in a different way depending on the configuration of resources, including all load and store operations that act on virtual memory
      – Location sensitive : Execution behavior depends on location in memory
      – Mode sensitive : Execution behavior depends on the privilege mode
  Innocuous
    • An instruction that is not sensitive
Classic Virtualization
Important strategies from classical VMM implementation
De-privileging
Shadow structures
Traces
De-privileging (Ring Compression)
Run privileged guest OS code at user-level
  All privileged instructions can be made to trap when executed in an unprivileged context.
Privileged instructions trap and are emulated by the VMM
  The VMM intercepts traps from the de-privileged guest, and emulates the trapping instruction against the virtual machine state.
[Figure: de-privileging. Labels: Applications, OS, VMM, User mode, Kernel mode]
Background (1/6)
Basic Approach to Virtualization
Background (2/6)
Protection Mechanism – PL (Privilege Level)
  At any given time, an x86 CPU is running in a specific PL, which determines what code can and cannot do.
  Most modern x86 kernels use only two PLs, 0 and 3.
  An attempt to run a privileged instruction outside of ring 0 causes a general-protection exception, just as when a program uses an invalid memory address.
  Likewise, access to memory and I/O ports is restricted based on PL.
Background (3/6)
Protection Mechanism – PL (Privilege Level)
Background (4/6)
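Protection Mechanism – PL (Privilege Level)
  CPL (Current Privilege Level)
    • The PL of the currently running task
    • The value stored in bits 0-1 of the CS and SS selector registers
    • In general, the value in the CS selector register is regarded as the CPL
    • When a code, data, or stack region is accessed, the access is normally allowed or denied by comparing this privilege level with the DPL of the corresponding descriptor
    • When a program transfers control to a code segment with a different PL, the CPU changes the CPL
  DPL (Descriptor Privilege Level)
    • The PL held by a descriptor
    • Set when the descriptor is created
    • Applies to descriptors such as TSS, data segment, and call gate descriptors
    • If the CPL is less privileged than the DPL, the segment cannot be used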
Background (5/6)
Protection Mechanism – PL (Privilege Level)
  RPL (Requested Privilege Level)
    • Set in bits 0-1 of each segment selector
    • When a more privileged routine is used through a call gate, it records the original privilege level of the requester for that segment
      – A call gate lets less privileged code use more privileged code
        » When a PL 3 process executes a PL 0 routine through a call gate, it can temporarily access the PL 0 data area as well
      – If a PL 3 process called a PL 0 routine through a call gate, the data segment selector is marked as having been requested from PL 3, so that PL 0 data cannot be touched
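How CPL, RPL, and DPL combine can be sketched in a few lines. This is a minimal illustration, assuming the standard x86 rule for data-segment access (the access is allowed only if both CPL and RPL are numerically less than or equal to DPL); the function names and the sample selector values are hypothetical.

```python
def privilege_bits(selector):
    """Bits 0-1 of a segment selector hold a privilege level (0 = most privileged)."""
    return selector & 0b11


def cpl_from_cs(cs_selector):
    """The CPL is read from bits 0-1 of the CS selector."""
    return privilege_bits(cs_selector)


def data_access_allowed(cpl, data_selector, dpl):
    """Data-segment check: the weaker of CPL and RPL must still satisfy the DPL."""
    rpl = privilege_bits(data_selector)
    return max(cpl, rpl) <= dpl


# Hypothetical selector values, used only for illustration.
user_cs = 0x1B          # low two bits are 11 -> CPL 3
kernel_cs = 0x08        # low two bits are 00 -> CPL 0
kernel_data_rpl0 = 0x10 # selector requested by ring 0 code
kernel_data_rpl3 = 0x13 # same segment, but marked as requested from ring 3

direct_user_access = data_access_allowed(cpl_from_cs(user_cs), kernel_data_rpl0, dpl=0)
kernel_own_access = data_access_allowed(cpl_from_cs(kernel_cs), kernel_data_rpl0, dpl=0)
gate_on_behalf_of_user = data_access_allowed(cpl_from_cs(kernel_cs), kernel_data_rpl3, dpl=0)
```

The last case mirrors the call-gate scenario: the routine runs at CPL 0, but the RPL of 3 in the selector keeps it from touching PL 0 data on behalf of the PL 3 caller.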
Background (6/6)
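Protection Mechanism – PL (Privilege Level)
  Call Gate
    • The typical ways for a less privileged program to switch to a more privileged PL during execution are interrupts, exceptions, and call gates
      – Hardware interrupts and exceptions change the PL unconditionally, whereas software interrupts and call gates occur depending on the code and the situation.
    • Performed through the CALL instruction
      – When PL 3 code invokes a PL 0 routine with the CALL instruction, bits 0-1 of the CS selector become 00 (that is, the CPL is set to 0)
      – When the routine finishes and returns to PL 3 with the RET instruction, the CPL becomes 3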
Trap and Emulate (1/2)
Trap and Emulate (2/2)
Considered the only practical method for virtualization until recently
Related to processor virtualization
  Goal
    • Process normal instructions as fast as possible
    • Forward privileged instructions to emulation routines
Handling Instructions
  Normal instructions run directly on the processor.
  Privileged instructions trap into the VMM.
  The VMM emulates the effect of the privileged instructions for the guest OS.
Trap and Emulate – Example
Example: Virtualizing the Interrupt Flag with an Instruction Interpreter
CPUState
  GPR (General Purpose Registers)
  LR (Link Register)
  PC (Program Counter)
  IE (Interrupt Enable)
  IRQ (Interrupt ReQuest)
Trap and Emulate – Sequence
Handling Privileged Instructions
Instruction trap invokes VMM dispatcher.
Dispatcher calls instruction routine.
Changes mode to supervisor
Emulates instruction
Computes return target
Restores mode to user
Jumps to target
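The sequence above can be sketched as a small simulation. This is a minimal illustration under stated assumptions: the three-instruction guest "ISA" (`inc`, `cli`, `sti`), the dictionary-based virtual CPU, and the `VMM` class are all hypothetical, and a Python exception stands in for the hardware trap.

```python
PRIVILEGED = {"cli", "sti"}   # hypothetical privileged instructions


class PrivilegedInstructionTrap(Exception):
    """Stands in for the hardware trap raised by a de-privileged guest."""


def direct_execute(instr, vcpu):
    """Innocuous instructions run 'natively'; privileged ones trap."""
    if instr in PRIVILEGED:
        raise PrivilegedInstructionTrap(instr)
    if instr == "inc":
        vcpu["acc"] += 1
    else:
        raise ValueError(f"unknown instruction {instr!r}")


class VMM:
    def __init__(self):
        # Virtual CPU state: program counter, accumulator, interrupt-enable flag.
        self.vcpu = {"pc": 0, "acc": 0, "ie": True}
        self.traps = 0

    def emulate(self, instr):
        """Instruction routine: apply the effect to the virtual state only."""
        if instr == "cli":
            self.vcpu["ie"] = False
        elif instr == "sti":
            self.vcpu["ie"] = True

    def run(self, program):
        while self.vcpu["pc"] < len(program):
            instr = program[self.vcpu["pc"]]
            try:
                direct_execute(instr, self.vcpu)
            except PrivilegedInstructionTrap:
                self.traps += 1          # trap invokes the dispatcher
                self.emulate(instr)      # dispatcher calls the routine
            self.vcpu["pc"] += 1         # compute the return target and resume
        return self.vcpu


vmm = VMM()
state = vmm.run(["inc", "cli", "inc"])
```

Only the `cli` reaches the VMM; the two `inc` instructions never involve it, which is the performance property trap-and-emulate relies on.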
Classic Virtualization
Important strategies from classical VMM implementation
De-privileging
Shadow structures
Traces
Primary and Shadow Structures
The VMM derives shadow structures from guest-level primary structures.
  The VMM maintains an image of the guest registers, and refers to that image in instruction emulation as guest operations trap.
[Figure: primary and shadow structures. Labels: Applications, OS, VMM, User mode, Kernel mode, Primary page table, Shadow page table]
Memory Traces
Write-protect primary copies so that update operations cause page faults which can be caught, interpreted, and emulated.
To maintain coherency of shadow structures, VMMs typically use hardware page protection mechanisms to trap accesses to in-memory primary structures.
  Ex) Guest PTEs for which shadow PTEs have been constructed may be write-protected.
The page-protection technique is known as tracing.
  Classical VMMs handle a trace fault similarly to a privileged instruction fault by:
    • Decoding the faulting guest instruction
    • Emulating its effect in the primary structure
    • Propagating the change to the shadow structure
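The trace-fault handling steps can be sketched with a toy page table. This is an illustration only: the classes, the guest-to-host frame map, and the use of a Python exception in place of a hardware page fault are all assumptions made for the example.

```python
class TraceFault(Exception):
    """Stands in for the page fault raised by a write to a traced (write-protected) PTE."""


class GuestPageTable:
    """Primary structure, owned by the guest."""

    def __init__(self):
        self.entries = {}             # virtual page number -> guest frame number
        self.write_protected = set()  # entries the VMM is tracing

    def write(self, vpn, guest_frame):
        if vpn in self.write_protected:
            raise TraceFault(vpn)
        self.entries[vpn] = guest_frame


class TracingVMM:
    def __init__(self, primary, guest_to_host):
        self.primary = primary
        self.guest_to_host = guest_to_host  # hypothetical guest frame -> host frame map
        self.shadow = {}                    # shadow structure, used by the real MMU
        self.trace_faults = 0

    def build_shadow(self, vpn):
        """Derive the shadow PTE, then write-protect the primary PTE (install a trace)."""
        self.shadow[vpn] = self.guest_to_host[self.primary.entries[vpn]]
        self.primary.write_protected.add(vpn)

    def guest_write(self, vpn, guest_frame):
        try:
            self.primary.write(vpn, guest_frame)
        except TraceFault:
            self.trace_faults += 1
            # Emulate the effect in the primary structure ...
            self.primary.entries[vpn] = guest_frame
            # ... and propagate the change to the shadow structure.
            self.shadow[vpn] = self.guest_to_host[guest_frame]


primary = GuestPageTable()
primary.write(0, 1)                       # guest sets up a mapping before tracing
vmm = TracingVMM(primary, guest_to_host={1: 101, 2: 102})
vmm.build_shadow(0)
shadow_before = vmm.shadow[0]
vmm.guest_write(0, 2)                     # guest remaps the page: trace fault
shadow_after = vmm.shadow[0]
```

After the fault, primary and shadow agree again, which is the coherency the trace exists to maintain.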
Classic Virtualization (1/2)
Architectural Obstacles
  Traps are expensive. (~3000 cycles)
  Many traps are unavoidable. (ex. page faults)
  Not all architectures support trap-and-emulate. (x86)
x86 is not virtualizable.
  Not all privileged operations trap when run in user mode.
    • Dual-purpose instructions don’t trap.
    • Ex) popf
      – Pops the flags register, including the interrupt-enable flag, from the stack
      – Privileged mode : changes user (ALU) and system flags
      – User mode : changes user (ALU) flags only, but does not trap
  Some privileged state is visible in user mode.
    • Ex) A guest OS can observe the current privilege level (CPL) in the code segment selector (%cs).
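The popf problem can be illustrated with a simplified model of the flags register. This is a sketch under stated assumptions: only two flags are modeled (CF at bit 0 and IF at bit 9, their positions in EFLAGS), IOPL is assumed to be 0, and the `popf` function here is a hypothetical model rather than a full description of the instruction.

```python
CF = 1 << 0   # carry flag, a user (ALU) flag
IF = 1 << 9   # interrupt-enable flag, a system flag


def popf(current_flags, popped_value, cpl):
    """Simplified popf: returns the new flags value and never traps.

    Assumes IOPL = 0, so only ring 0 code may change IF.
    """
    if cpl == 0:
        return popped_value
    # User mode: ALU flags are updated, but the IF bit silently keeps its old value.
    return (popped_value & ~IF) | (current_flags & IF)


# A guest kernel tries to disable interrupts by popping a value with IF cleared.
flags_with_if = IF
wanted = CF                                          # IF cleared, CF set

native_result = popf(flags_with_if, wanted, cpl=0)         # on real hardware, ring 0
deprivileged_result = popf(flags_with_if, wanted, cpl=3)   # same code run de-privileged
```

De-privileged, the guest's attempt to clear IF is dropped without any trap, so a trap-and-emulate VMM never gets the chance to update the virtual interrupt flag.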
Classic Virtualization (2/2)
Enhancements
  Exploits flexibility in the VMM/guest OS interface
    • Reduce traps
    • → Paravirtualization (Xen)
  Exploits flexibility in the hardware/VMM interface
    • Hardware VM modes (IBM System/370)
    • Interpretive execution
      – A hardware execution mode for running guest OSes
    • → Intel VT & AMD SVM
IBM Systems Journal, vol. 18, no. 1, 1979, pp. 4-17.
x86 architecture
Historically lacked hardware support for virtualization
  Until recently, the x86 architecture has not permitted classical trap-and-emulate virtualization.
  VMMs for x86 have instead used binary translation of the guest kernel code.
  The software VMMs, VMware Workstation and Virtual PC, use binary translation to fully virtualize x86.
Intel’s VT & AMD’s Pacifica technologies
  Recently, the major x86 CPU manufacturers have announced architectural extensions to directly support virtualization in hardware.
  The transition from software-only VMMs to hardware-assisted VMMs provides an opportunity to examine the strengths and weaknesses of both techniques.
Classic Virtualization
Trap-and-Emulation
Software Virtualization
Binary translation (BT)
Hardware Virtualization
Intel VT & AMD SVM
Binary Translation (1/3)
Static Binary Translation
  Scaling compiler algorithms to whole programs
  Cross-ISA optimizations in heterogeneous multi-cores
  Cross-boundary optimization
  Applications of static binary translation
    • Profiling, code obfuscation, code diversification, interactive optimization
Dynamic Binary Translation
  Virtual machines or platform virtualization
  Dynamic optimization
  Performance models for applications running in virtual machines
  Security, resource management
Binary Translation (2/3)
Separating virtual state (VCPU) from physical state (CPU)
The guest executes on an interpreter instead of directly on a physical CPU.
Prevent leakage of privileged state, such as CPL, from the physical CPU into the guest computation
Correctly implement non-trapping instructions, like popf, by referencing the virtual CPL regardless of the physical CPL
Operating system running in VM is unmodified.
Binary Rewriting
  VMM scans guest OS memory for problematic instructions and rewrites them.
Binary Translation (3/3)
Characteristics
  Binary
    • Input is binary x86 code (machine code), not source code.
  Dynamic
    • Translation happens at runtime.
  On demand
    • Code is translated only when it is needed for execution.
  System level
    • The translation makes no assumption about guest code.
    • Rules are set by the x86 ISA.
  Sub-setting
    • Translates from the full x86 instruction set, including all privileged instructions, to a safe subset (mostly user-mode instructions)
  Adaptive
    • Translated code is adjusted in response to guest behavior to achieve efficiency
Simple Binary Translation
An interpreter, measured against Popek and Goldberg’s essential characteristics of a VMM
  Ensures fidelity and safety
  Fails to meet the performance bar
    • The fetch-decode-execute cycle of the interpreter may burn hundreds of physical instructions per guest instruction.
However, binary translation can combine the semantic precision of interpretation with high performance, yielding an execution engine that meets all of Popek and Goldberg’s criteria.
A VMM built around a suitable binary translator can virtualize the x86 architecture, and it is a VMM according to Popek and Goldberg.
Simple Example (1/8)
[Figure: the example source code is compiled into a binary (“hex”) representation, which is the input to the translator]
Intermediate representation (IR)
  Translator reads the guest’s memory at the address indicated by the guest program counter (PC), classifying the bytes as prefixes, opcodes or operands to produce IR objects
  Each IR object represents one guest instruction.
Translation Unit (TU) – basic block (BB)
  Translator accumulates IR objects into a TU, stopping at
    • 12 instructions
    • Terminating instruction (usually control flow)
  The fixed-size cap allows stack allocation of all data structures without risking overflow; in practice it is rarely reached since control flow tends to terminate TUs sooner.
[Figure: first TU]
Simple Example (3/8)
IDENT (for “identically”)
  Most code can be translated IDENT
  Preserving most compiler optimizations and only slightly increasing code size
Turn jge into two translator-invoking continuations, one for each of the successors (fall-through and taken-branch), yielding this translation:
[Figure: translation of the first TU, with instructions marked IDENT or non-IDENT; [ ] denotes a continuation]
Translator
  One TU → One compiled code fragment (CCF)
  Although we show CCFs in textual form with labels like isPrime’ to remind us that the address contains the translation of isPrime, in reality the translator produces binary code directly and tracks the input-to-output correspondence with a hash table.
Simple Example (5/8)
isPrime(49)
  jge is not taken (%ecx is 49)
  Proceed into the fallthrAddr case
  Invoke the translator on guest address nexti
[Figure: second TU and its translation, with instructions marked IDENT or non-IDENT]
Chaining optimization (employed by the translator)
  To speed up inter-CCF transfers
  Allowing one CCF to jump directly to another without calling out of the translation cache (TC)
  These chaining jumps replace the continuation jumps, which therefore are “execute once”.
  Moreover, it is often possible to elide chaining jumps and fall through from one CCF into the next.
Simple Example (7/8)
The interleaving of translation and execution continues for as long as the guest runs, with a decreasing proportion of translation as the TC gradually captures the guest’s working set.
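The interleaving of translation and execution can be sketched with a dictionary standing in for the translation cache. This is a toy illustration, not the VMM's actual design: the guest code format (an address mapped to an increment and a successor address), the `ToyTranslator` class, and its counters are all hypothetical.

```python
class ToyTranslator:
    def __init__(self, guest_code):
        # Hypothetical guest code: address -> (amount to add, next guest address).
        self.guest_code = guest_code
        self.tc = {}            # translation cache: guest address -> CCF (a callable)
        self.translations = 0   # how often the translator had to run

    def translate(self, addr):
        """Produce a compiled code fragment (CCF) for the block at addr."""
        self.translations += 1
        delta, next_addr = self.guest_code[addr]

        def ccf(state):
            state["acc"] += delta
            return next_addr    # guest address to continue at

        return ccf

    def run(self, entry, steps):
        state = {"acc": 0}
        addr = entry
        for _ in range(steps):
            ccf = self.tc.get(addr)           # hash table lookup
            if ccf is None:                   # miss: invoke the translator
                ccf = self.tc[addr] = self.translate(addr)
            addr = ccf(state)                 # execute the translated fragment
        return state["acc"]


# A two-block guest loop: block 0 adds 1 and jumps to 4, block 4 adds 2 and jumps to 0.
translator = ToyTranslator({0: (1, 4), 4: (2, 0)})
result = translator.run(entry=0, steps=10)
```

Ten block executions need only two translations here; once the loop (the working set) is in the cache, the translator is no longer invoked.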
Simple Example (8/8)
For isPrime(49), after looping the for loop for long enough to detect that 49 isn’t a prime, we end up with this code in the TC:
[Figure: resulting contents of the TC, consisting of the 1st CCF, 2nd CCF, 3rd CCF, and 4th CCF]
Non-IDENT (1/2)
While most instructions can be translated IDENT, there are several noteworthy exceptions:
  PC-relative addressing
    • Since the translator output resides at a different address than the input, the translator inserts compensation code to ensure correct addressing.
    • Small code expansion and slow-down
  Direct control flow
    • Since code layout changes during translation, control flow must be reconnected in the TC.
    • For direct calls, branches and jumps, the translator can do the mapping from guest address to TC address.
    • Not significant slowdown
Non-IDENT (2/2)
  Indirect control flow (jmp, call, ret)
    • Does not go to a fixed target, preventing translation-time binding
    • Instead the translated target must be computed dynamically, e.g., with a hash table lookup.
    • The resulting overhead varies by workload but is typically a single-digit percentage.
  Privileged instructions
    • We use in-TC sequences for simple operations.
    • These may run faster than native.
      – Ex) on a Pentium 4
        » cli (clear interrupts) takes 60 cycles
        » vcpu.flags.IF:=0 takes a handful of cycles
    • Complex operations like context switches call out to the runtime, causing measurable overhead due both to the callout and the emulation work.
Hybrid Approach
BT is not required for safe execution of most user code on most guest operating systems.
  By switching guest execution between BT mode and direct execution as the guest switches between kernel- and user-mode, we can limit BT overheads to kernel code and permit application code to run at native speed.
Adaptive Binary Translation (1/4)
BT VMM can outperform a classical VMM.
  Avoiding privileged instruction traps
  Ex) rdtsc – On a Pentium 4 CPU
    • Trap-and-emulate : 2030 cycles
    • Callout-and-emulate (Para-virtualization) : 1254 cycles
    • In-TC emulation (BT) : 216 cycles
However, while simple BT eliminates traps from privileged instructions, an even more frequent trap source remains: non-privileged instructions (e.g., loads and stores) accessing sensitive data such as page tables.
Adaptive Binary Translation (2/4)
“Innocent until proven guilty”
  Guest instructions start in the innocent state, ensuring maximal use of IDENT translations.
  During execution of translated code,
    • Detect instructions that trap frequently
    • Adapt their translation:
      – Retranslate non-IDENT to avoid the trap
      – Patch the original IDENT translation with a forwarding jump to the new translation
Process
  Privileged instructions – eliminated by simple BT
  Non-privileged instructions – eliminated by adaptive BT
    (a) Detect a CCF containing an instruction that traps frequently
    (b) Generate a new translation of the CCF to avoid the trap (perhaps inserting a call-out to an interpreter), and patch the original translation to execute the new translation
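The detect-and-adapt process can be sketched as a per-instruction state machine. This is a simplified illustration: the threshold value, the `AdaptiveTranslator` class, and the string results are hypothetical, and the real mechanism patches machine code with a forwarding jump rather than flipping a dictionary entry.

```python
TRAP_THRESHOLD = 3   # hypothetical: traps tolerated before an instruction is "guilty"


class AdaptiveTranslator:
    def __init__(self):
        self.kind = {}         # guest address -> "IDENT" or "SIMULATE"
        self.trap_counts = {}  # guest address -> traps observed so far
        self.traps_taken = 0
        self.callouts = 0

    def execute(self, addr, touches_sensitive_data):
        # Every instruction starts out innocent, i.e. translated IDENT.
        kind = self.kind.setdefault(addr, "IDENT")
        if kind == "SIMULATE":
            self.callouts += 1            # cheaper callout instead of a trap
            return "callout"
        if touches_sensitive_data:
            self.traps_taken += 1         # the IDENT translation traps
            count = self.trap_counts.get(addr, 0) + 1
            self.trap_counts[addr] = count
            if count >= TRAP_THRESHOLD:
                # (a) detected a frequent trapper; (b) switch to the new translation.
                self.kind[addr] = "SIMULATE"
            return "trap"
        return "native"


translator = AdaptiveTranslator()
guilty_results = [translator.execute(0x40, True) for _ in range(5)]
innocent_result = translator.execute(0x80, False)
```

The instruction at the hypothetical address 0x40 traps three times and is then served by callouts, while the instruction at 0x80 never traps and keeps its IDENT translation.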
Adaptive Binary Translation (3/4)
Adaptation from IDENT to SIMULATE
A control flow graph with an IDENT translation in ccf1, and arbitrary other control flow in the TC represented by ccf2, ccf3, and ccf4
The result of adapting from IDENT in ccf1 to translation type SIMULATE in ccf5.
After adaptation, we avoid taking a trap in ccf1 and instead execute a faster callout in ccf5.
The SIMULATE callout continues to monitor the behavior of the offending instruction. If the behavior changes and the instruction becomes innocent again, we switch the active translation type back to IDENT by removing the forwarding jump from ccf1 and inserting an opposing one in ccf5.
Adaptive Binary Translation (4/4)
The VMM uses adaptation
In a bimodal form
• Distinguish between innocent and guilty instructions
With the ability to adapt to a variety of situations
• Access to a page table
• Access to a particular device
• Access to the VMM’s address range
Classic Virtualization
Trap-and-Emulation
Software Virtualization
Binary translation (BT)
Hardware Virtualization
Intel VT & AMD SVM
Comparison