KVM Weather Report
Red Hat
Author Gleb Natapov
May 29, 2013
Part I: What is KVM
Section 1: KVM Features
KVM features
VT-x/AMD-V (hardware virtualization)
EPT/NPT (two dimensional paging)
CPU/memory overcommit
Scalability (160 vCPUs and 2 TB RAM tested)
Live Migration
KSM
Transparent huge pages
Security and isolation with sVirt and seccomp
Secure PCI device assignment using IOMMU
SVVP (Server Virtualization Validation Program)
PV IO using VirtIO
WHQL VirtIO drivers
Section 2: KVM Architecture
KVM is a Linux Subsystem
KVM VCPU Loop
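The slide's diagram can be sketched as a toy model: userspace repeatedly asks the kernel to run the vCPU and handles whatever exit reason comes back. This is an illustrative Python sketch, not the real ioctl-based loop; only the exit-reason names (KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_HLT) come from the KVM API, the rest is made up for illustration.

```python
# Toy model of the KVM VCPU loop: userspace (e.g. QEMU) repeatedly runs
# the vCPU and dispatches on the exit reason the kernel reports.
def run_vcpu(kvm_run, handle_io, handle_mmio):
    """kvm_run() stands in for the KVM_RUN ioctl; it returns (reason, data)."""
    while True:
        reason, data = kvm_run()          # guest executes until a #VMEXIT
        if reason == "KVM_EXIT_IO":       # port I/O -> emulate device in userspace
            handle_io(data)
        elif reason == "KVM_EXIT_MMIO":   # MMIO access -> emulate in userspace
            handle_mmio(data)
        elif reason == "KVM_EXIT_HLT":    # guest halted -> leave the loop
            return "halted"
        else:
            raise NotImplementedError(reason)
```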
Network Architecture
Emulated Devices
E1000
RTL8139
Native drivers
Compatibility over performance
VirtIO Devices
Paravirtualized
virtio-net
Performance over compatibility
Device Assignment
Native drivers
Compatibility and performance
No migration
Block Architecture
Section 3: Linux as a Hypervisor
Why Linux?
Scalable scheduler
Sophisticated memory management (NUMA/huge pages)
Hardware enablement for free
Scalable I/O stack (AIO capabilities are lacking)
Isolation (cgroups)
Security (seccomp/SELinux)
Tracing and performance monitoring
And more...
Part II: What's new
Section 4: APIC Virtualization and Virtual Interrupts
Problem: APIC Emulation is Inefficient
On each APIC access:
#VMEXIT
Instruction emulation (x2APIC mitigates this)
On each interrupt injection:
#VMEXIT
APIC state evaluation
Solution: Move APIC virtualization into CPU!
In three easy steps:
APIC register virtualization
Virtual interrupt delivery
Posted interrupt processing
APIC Register Virtualization
APIC-read reads from vAPIC page w/o causing #VMEXIT
APIC-write writes into vAPIC page w/o causing #VMEXIT
New APIC-write #VMEXIT
- Trap-like #VMEXIT for APIC register writes that should be handled by the VMM (no emulation needed)
Virtual Interrupt Delivery
New VMCS field "Guest interrupt status" with two subfields:
- Requesting virtual interrupt (RVI)
- Servicing virtual interrupt (SVI)
Pending interrupts are evaluated on:
- VM Entry
- TPR access
- EOI access
- self-IPI
- Posted interrupt processing
Recognized interrupts are delivered without #VMEXIT
- RVI/SVI are updated accordingly
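The RVI/SVI bookkeeping above can be modeled roughly as follows. This is a deliberately simplified toy (priority class is vector >> 4, and PPR handling is reduced to comparing against the in-service vector); the class and method names are invented for illustration.

```python
class VirtualAPIC:
    """Toy model of CPU-evaluated virtual interrupt delivery (RVI/SVI)."""
    def __init__(self):
        self.irr = set()   # pending (requested) vectors
        self.isr = []      # in-service vectors, most recent last

    @property
    def rvi(self):         # Requesting Virtual Interrupt: highest pending
        return max(self.irr) if self.irr else 0

    @property
    def svi(self):         # Servicing Virtual Interrupt: highest in service
        return self.isr[-1] if self.isr else 0

    def post(self, vector):
        self.irr.add(vector)

    def evaluate(self):
        """Run at VM entry / TPR access / EOI / self-IPI: deliver without #VMEXIT."""
        if self.irr and (self.rvi >> 4) > (self.svi >> 4):
            v = self.rvi
            self.irr.discard(v)
            self.isr.append(v)   # interrupt is now being serviced
            return v
        return None

    def eoi(self):
        if self.isr:
            self.isr.pop()
```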
Posted Interrupt Processing
Virtual interrupts are recorded in the Posted Interrupt Descriptor
Special Posted Interrupt Notification IPI
- If, when the IPI is received, the CPU is in guest mode, virtual interrupts from the Posted Interrupt Descriptor are transferred to the vAPIC page and processed by the CPU without a #VMEXIT
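A minimal sketch of the descriptor mechanics, assuming the standard layout (a 256-bit pending-request bitmap plus an outstanding-notification flag); the Python names are hypothetical:

```python
class PostedInterruptDescriptor:
    """Toy model of posted-interrupt processing.

    pir: pending-interrupt request bitmap (bit i set => vector i posted).
    on:  outstanding-notification flag, so a batch of posts triggers
         only one notification IPI.
    """
    def __init__(self):
        self.pir = 0
        self.on = False

    def post(self, vector):
        """Remote CPU posts a virtual interrupt; returns True if a
        notification IPI should be sent."""
        self.pir |= (1 << vector)
        send_ipi = not self.on
        self.on = True
        return send_ipi

    def process(self, virr):
        """On notification IPI while in guest mode: transfer PIR into the
        vAPIC IRR without any #VMEXIT."""
        for v in range(256):
            if self.pir & (1 << v):
                virr.add(v)
        self.pir = 0
        self.on = False
        return virr
```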
Result
Eliminates up to 50% of #VMEXITs on I/O intensive workloads
Section 5: Nested VMX Improvements
Problem: VMX Emulation is Expensive
Each L1 VMREAD/VMWRITE is emulated by L0
There are many of them for each VMLAUNCH
Solution: VMCS Shadowing
L1 has a shadow VMCS page linked from the main VMCS
L1's VMREAD/VMWRITE reads/writes the shadow VMCS page without a #VMEXIT
When L1 executes VMLAUNCH, L0 copies L2 VMCS values from the shadow VMCS page to VMCS02 (the VMCS used by L0 to run L2)
To minimize copying, only the most frequently used fields are shadowed; access to the rest generates a #VMEXIT as before
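The shadowing scheme above can be sketched as a toy model. The field names and the exit accounting here are invented for illustration; only the shadow-then-fold-on-VMLAUNCH flow mirrors the slide:

```python
# Fields assumed shadowed for this sketch (hypothetical selection).
SHADOWED = {"GUEST_RIP", "GUEST_RSP", "EXCEPTION_BITMAP"}

class NestedVMX:
    """Toy model of VMCS shadowing in nested VMX."""
    def __init__(self):
        self.shadow_vmcs = {}   # written by L1 without exiting to L0
        self.vmcs02 = {}        # the VMCS L0 actually uses to run L2
        self.vmexits = 0

    def l1_vmwrite(self, field, value):
        if field in SHADOWED:
            self.shadow_vmcs[field] = value   # no #VMEXIT
        else:
            self.vmexits += 1                 # rare field: emulated by L0
            self.vmcs02[field] = value

    def l1_vmlaunch(self):
        self.vmexits += 1                     # VMLAUNCH itself always exits
        self.vmcs02.update(self.shadow_vmcs)  # fold shadow values into VMCS02
```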
Result
Up to 40% fewer #VMEXITs
Nested VMX ongoing development
Nested EPT
A lot of bug fixes lately
Section 6: NUMA Improvements
Problem: Non-Optimal Memory Placement
Solution: NUMA-Aware Scheduler
Memory follows CPU
Periodically mark process memory as inaccessible
On a NUMA fault, migrate memory to where the task is running now
Task follows memory
Statistics of recent NUMA faults incurred on each node are kept per task
Scheduler tries to run the task where its memory is
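The "task follows memory" half can be sketched like this. This is a toy model, not the kernel's implementation; the decay factor and class names are assumptions for illustration:

```python
from collections import Counter

class TaskNumaStats:
    """Toy model of per-task NUMA fault statistics: track recent faults
    per node with periodic decay, and prefer the node with the most
    faults (i.e. where most of the task's memory appears to live)."""
    def __init__(self, decay=0.5):
        self.faults = Counter()   # node -> decayed fault count
        self.decay = decay

    def record_fault(self, node):
        self.faults[node] += 1

    def decay_stats(self):
        """Called periodically so old faults matter less than recent ones."""
        for node in self.faults:
            self.faults[node] *= self.decay

    def preferred_node(self):
        """The scheduler tries to run the task here."""
        return max(self.faults, key=self.faults.get) if self.faults else None
```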
Result
3-15% performance improvement.
Section 7: VDSO pvclock
Problem
Guest cannot use TSC directly as a clock source
- May be unsynchronised between CPU sockets
- Frequency may change due to migration
Linux uses kvmclock as a clock source instead
gettimeofday() requires a system call
Solution: Add kvmclock Support to VDSO
Map pvclock data structure into process memory
Run kvmclock code from VDSO
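The kvmclock code the VDSO runs boils down to scaling a TSC delta by the parameters the host publishes in the pvclock structure (a shift plus a 32.32 fixed-point multiplier). A sketch of that scaling, with the version-counter retry loop omitted for brevity:

```python
def pvclock_read(system_time_ns, tsc_timestamp, tsc_now, tsc_shift, tsc_to_system_mul):
    """Scale a raw TSC delta to nanoseconds the pvclock way:
    shift the delta, multiply by a 32.32 fixed-point factor, and add
    the host-provided system time base."""
    delta = tsc_now - tsc_timestamp
    if tsc_shift >= 0:
        delta <<= tsc_shift
    else:
        delta >>= -tsc_shift
    return system_time_ns + ((delta * tsc_to_system_mul) >> 32)
```

With this in process memory, clock reads never need to leave userspace.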
Result
clock_gettime() latency is reduced from ~500 to ~200 cycles
Section 8: Guest Spinlock Improvements
Pause Loop Exit (PLE) optimisations
Detect undercommit scenario:
ebizzy (rec/sec, higher is better):
      before   stdev   after    stdev   %improve
1x    2511.30  21.54   6051.80  170.25  140.982
2x    2679.40  332.44  2692.30  251.40  0.481
3x    2253.50  266.42  2192.16  178.97  -2.721

dbench (throughput in MB/sec, higher is better):
      before   stdev   after     stdev    %improve
1x    6677.40  638.50  10098.00  3449.70  51.226
2x    2012.67  64.76   2019.04   62.67    0.316
3x    1302.07  40.83   1292.75   27.05    -0.716
Pause Loop Exit (PLE) optimisations (Cont.)
Detect preempted vcpus:
ebizzy (rec/sec, higher is better):
      before   stdev   after    stdev   %improve
1x    5609.20  56.93   6263.70  64.70   11.668
2x    2071.90  108.48  2653.50  181.83  28.075
3x    1557.41  109.71  1993.50  166.31  28.000
4x    1254.75  91.29   1765.50  237.54  40.705

kernbench (exec time in sec, lower is better):
      before  stdev  after   stdev  %improve
1x    47.03   4.69   44.25   1.28   5.909
2x    96.00   7.18   91.26   7.35   4.944
3x    164.01  10.36  156.67  11.42  4.475
4x    212.57  23.73  204.48  13.29  3.808
Problem: Lock Waiter Preemption
Solution: Lock Waiter Preemption
Work in progress by Jiannan Ouyang [1]
Preemptable Ticket Spinlock
Downgrade a fair lock to an unfair lock automatically upon preemption, preserving fairness otherwise.
[1] http://people.cs.pitt.edu/~ouyang/files/publication/preemptable_lock-ouyang-vee13.pdf
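The preemptable ticket spinlock idea can be sketched as a single-threaded toy model: waiters normally honor FIFO ticket order, but one that has spun far longer than its queue distance warrants may take the lock out of order, so a preempted waiter cannot stall everyone behind it. The timeout heuristic and all names here are illustrative assumptions, not the paper's exact algorithm:

```python
class PreemptableTicketLock:
    """Toy model: fair (ticket-ordered) acquisition by default, with an
    unfair escape hatch after spinning past distance * timeout."""
    def __init__(self, timeout=1000):
        self.next_ticket = 0
        self.now_serving = 0
        self.held = False
        self.timeout = timeout

    def acquire(self, spins_observed):
        my_ticket = self.next_ticket
        self.next_ticket += 1
        distance = my_ticket - self.now_serving
        # Fair path: it is our turn and the lock is free.
        if distance == 0 and not self.held:
            self.held = True
            return "fair"
        # Unfair path: the holder/earlier waiter looks preempted -> barge in.
        if not self.held and spins_observed > distance * self.timeout:
            self.held = True
            return "unfair"
        return "spinning"

    def release(self):
        self.held = False
        self.now_serving += 1
```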
Section 9: More of Hyper-V Emulation
Relaxed timing
Disables Windows' watchdog.
Hyper-V Timers
Reference Time Counter
Per-partition reference time counter. Successive accesses return strictly monotonically increasing time values as determined by any and all virtual processors of a partition. Its rate is constant and unaffected by processor or bus-speed transitions or deep power-saving states.
Partition Reference Time
A reference time source that does not require an intercept into the hypervisor. This enlightenment is available only when the underlying platform supports an invariant processor TSC.
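With the partition reference time enlightenment, the guest computes reference time (in 100 ns units) from its own TSC using a scale and offset the hypervisor publishes, with no intercept. A sketch of that computation, assuming the usual TSC-page form of a 64.64 fixed-point scale:

```python
def partition_reference_time(tsc, tsc_scale, tsc_offset):
    """Hyper-V style reference time from the guest TSC:
    (tsc * scale) >> 64 + offset, where scale is a 64.64 fixed-point
    multiplier and the result is in 100 ns units.
    Runs entirely in the guest: no hypervisor intercept needed."""
    return ((tsc * tsc_scale) >> 64) + tsc_offset
```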
Section 10: PCI Device Assignment
PCI Device Assignment Improvements
Virtual Function I/O (VFIO) based KVM PCI device assignment
VFIO is a new UIO-like kernel driver that allows for a cleaner PCI device assignment architecture
Move to this model as the primary device assignment mechanism
More maintainable, better architecture, more secure interfacefor enabling PCI device assignment
Legacy PCI device assignment is deprecated and can now be compiled out via kernel configuration
Section 11: Live Migration Updates
Live Migration Improvements
Better support for big guests
xbzrle compression
Migration thread
Accurate migration downtime calculation
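XBZRLE compression, listed above, sends only the bytes of a page that changed since the last transfer. A toy sketch of the idea (the real QEMU wire format differs; this just shows the XOR-and-run-length principle):

```python
def xbzrle_encode(old, new):
    """XOR old and new page contents, then run-length encode the result
    as (zero_run_length, changed_bytes) pairs. Unchanged regions cost
    almost nothing on the wire."""
    xor = bytes(a ^ b for a, b in zip(old, new))
    out, i = [], 0
    while i < len(xor):
        zstart = i
        while i < len(xor) and xor[i] == 0:
            i += 1                      # skip the unchanged run
        nzstart = i
        while i < len(xor) and xor[i] != 0:
            i += 1                      # collect the changed run
        out.append((nzstart - zstart, xor[nzstart:i]))
    return out

def xbzrle_decode(old, encoded):
    """Apply the delta to the receiver's stale copy of the page."""
    page = bytearray(old)
    pos = 0
    for zrun, changed in encoded:
        pos += zrun
        for b in changed:
            page[pos] ^= b              # old ^ (old ^ new) == new
            pos += 1
    return bytes(page)
```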
Post-Copy Live Migration
Latest patches use a special device to trap guest memory access on the destination
- Swapping, THP, KSM, and NUMA balancing work only on anonymous memory
Proposed solution: MADV_USERFAULT & remap_anon_pages()
- Guest memory VMAs are marked with MADV_USERFAULT
- A guest access to an unmapped page causes a special notification to be delivered to QEMU
- QEMU receives the missing page from migration into a local buffer and remaps it into guest memory with remap_anon_pages()
- Guest memory stays anonymous
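The demand-paging flow above can be sketched as a toy model (the class and method names are invented; real post-copy uses kernel faults and the migration stream, not a dict lookup):

```python
class PostCopyMemory:
    """Toy model of post-copy demand paging: pages not yet migrated are
    'unmapped'; touching one raises a fault that is satisfied by fetching
    the page from the source, mimicking the MADV_USERFAULT flow."""
    def __init__(self, source_pages):
        self.source = source_pages   # pages still on the migration source
        self.mapped = {}             # pages already present on destination
        self.faults = 0              # userfault notifications delivered

    def read(self, page_no):
        if page_no not in self.mapped:
            # "Userfault": notify QEMU, which fetches the page over the
            # migration stream and remaps it (remap_anon_pages() analogue).
            self.faults += 1
            self.mapped[page_no] = self.source[page_no]
        return self.mapped[page_no]
```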
Section 12: Block
Block
Live storage migration merged
- Migration without shared storage
GlusterFS block driver merged
- Together with the upcoming Gluster 3.4 release, allows QEMU to bypass FUSE for increased performance
Microsoft Hyper-V VHDX format support is currently under development
- To ease migration to KVM
vhost-scsi
- Uses LIO (linux-iscsi.org) in-kernel SCSI target code to handle the SCSI protocol
- No QEMU userspace handling, no QEMU global mutex on the data path
- Code path is shorter; guest talks to the host kernel directly
- Higher performance (240K IOPS vs 12K IOPS for virtio-scsi)
Section 13: Networking
Networking
Multi-queue NIC through virtio-net
- Improves network performance and throughput for SMP guests
- Incoming traffic to a guest scales linearly
- Outgoing traffic sometimes regresses, since more queues mean more exits
Bridge zero copy transmit
- Zero-copy transmit from guest to external traffic using a network bridge
- Improves network transmit performance for large message sizes
- About 15% gain in CPU utilization
Section 14: virtio RNG
Virtio Random Number Generator (RNG)
Prevent entropy starvation in guests
Inject entropy from the host into the guest
- The default mode uses the host's /dev/random
- A HW RNG device or EGD (Entropy Gathering Daemon) can also serve as the source
Section 15: New Hardware Architectures
New Hardware Architectures
ARM32 is merged
ARM64 is on the way
MIPS32 trap and emulate is merged
MIPS-VZ is being worked on
Section 16: QEMU Consolidation
QEMU Consolidation
qemu-kvm is no more
The end. Thanks for listening.