  • KVM Weather Report

    Red Hat

    Author Gleb Natapov

    May 29, 2013

  • Part I: What is KVM

  • Section 1: KVM Features

  • KVM Features 4

    KVM features

    VT-x/AMD-V (hardware virtualization)

    EPT/NPT (two dimensional paging)

    CPU/memory overcommit

    Scalability (160 vCPUs and 2 TB RAM tested)

    Live Migration

    KSM

    Transparent huge pages

    Security and isolation with sVirt and seccomp

    Secure PCI device assignment using IOMMU

    SVVP (Server Virtualization Validation Program)

    PV IO using VirtIO

    WHQL VirtIO drivers

  • Section 2: KVM Architecture

  • KVM Architecture 6

    KVM is a Linux Subsystem

  • KVM Architecture 7

    KVM VCPU Loop
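
    Userspace (QEMU, or any other VMM) drives this loop through the /dev/kvm ioctl interface. Below is a minimal sketch of that userspace side, with error handling, register setup and the guest code itself left out:

    /* Sketch of the KVM vcpu run loop from userspace (error checks trimmed). */
    #include <fcntl.h>
    #include <linux/kvm.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    int main(void)
    {
        int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
        int vm  = ioctl(kvm, KVM_CREATE_VM, 0);

        /* Back guest "physical" memory with an anonymous mapping. */
        size_t mem_size = 2 << 20;
        void *mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        struct kvm_userspace_memory_region region = {
            .slot = 0, .guest_phys_addr = 0,
            .memory_size = mem_size, .userspace_addr = (uintptr_t)mem,
        };
        ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

        int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
        /* The shared kvm_run structure describes each exit to userspace. */
        int run_sz = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
        struct kvm_run *run = mmap(NULL, run_sz, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, vcpu, 0);

        /* ... load guest code into mem, set registers with KVM_SET_(S)REGS ... */

        for (;;) {
            ioctl(vcpu, KVM_RUN, 0);         /* enter guest mode */
            switch (run->exit_reason) {      /* reason for the #VMEXIT */
            case KVM_EXIT_HLT:
                return 0;
            case KVM_EXIT_IO:                /* emulate port I/O here */
                /* data lives at (char *)run + run->io.data_offset */
                break;
            case KVM_EXIT_MMIO:              /* emulate MMIO here */
                break;
            default:
                fprintf(stderr, "unhandled exit %u\n", run->exit_reason);
                return 1;
            }
        }
    }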

  • KVM Architecture 8

    Network Architecture

    Emulated Devices

    E1000

    RTL8139

    Native drivers

    Compatibility over performance

    VirtIO Devices

    Paravirtualized

    virtio-net

    Performance over compatibility (vring layout sketched after this slide)

    Device Assignment

    Native drivers

    Compatibility and performance

    No migration
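
    Why virtio beats the emulated NICs above: instead of trapping on every register access of a fake E1000, the guest and QEMU exchange buffers through shared rings in guest memory and only notify each other when needed. A sketch of that ring layout, paraphrased from the virtio specification (see linux/virtio_ring.h for the canonical definitions):

    #include <stdint.h>

    /* One buffer descriptor; descriptors can be chained via 'next'. */
    struct vring_desc {
        uint64_t addr;    /* guest-physical address of the buffer        */
        uint32_t len;     /* buffer length in bytes                      */
        uint16_t flags;   /* e.g. VRING_DESC_F_NEXT, VRING_DESC_F_WRITE  */
        uint16_t next;    /* index of the next descriptor in the chain   */
    };

    /* Driver (guest) -> device (host): buffers made available. */
    struct vring_avail {
        uint16_t flags;
        uint16_t idx;     /* where the guest will place the next entry   */
        uint16_t ring[];  /* descriptor head indices                     */
    };

    /* Device (host) -> driver (guest): buffers that were consumed. */
    struct vring_used_elem {
        uint32_t id;      /* head of the descriptor chain that was used  */
        uint32_t len;     /* number of bytes written into the buffers    */
    };

    struct vring_used {
        uint16_t flags;
        uint16_t idx;
        struct vring_used_elem ring[];
    };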

  • KVM Architecture 9

    Block Architecture

  • Section 3: Linux as a Hypervisor

  • Linux as a Hypervisor 11

    Why Linux?

    Scalable scheduler

    Sophisticated memory management (NUMA/huge pages)

    Hardware enablement for free

    Scalable I/O stack (AIO capabilities are lacking)

    Isolation (cgroups)

    Security (seccomp/SELinux; see the seccomp sketch after this list)

    Tracing and performance monitoring

    And more...
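
    To make the seccomp point concrete: QEMU's -sandbox option installs a syscall whitelist through libseccomp, so a compromised QEMU process cannot issue syscalls it never needs. A toy version of that pattern, with a whitelist far shorter than QEMU's real one:

    /* Build: cc sandbox.c -lseccomp */
    #include <seccomp.h>
    #include <unistd.h>

    int main(void)
    {
        /* Default action: kill the process on any syscall not allowed below. */
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
        if (!ctx)
            return 1;

        /* Tiny whitelist; QEMU's real list is much longer. */
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(brk), 0);
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(munmap), 0);
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

        if (seccomp_load(ctx) < 0)      /* the filter is active from here on */
            return 1;
        seccomp_release(ctx);

        write(1, "still alive\n", 12);  /* allowed */
        /* open("/etc/passwd", O_RDONLY) would now kill the process. */
        _exit(0);
    }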

  • Part II: What’s new

  • Section 4: APIC Virtualization and Virtual Interrupts

  • APIC Virtualization and Virtual Interrupts 14

    Problem: APIC Emulation is Inefficient

    On each APIC access:

    #VMEXIT

    Instruction emulation (x2APIC mitigates this)

    On each interrupt injection:

    #VMEXIT

    APIC state evaluation

  • APIC Virtualization and Virtual Interrupts 15

    Solution: Move APIC virtualization into CPU!

    In three easy steps:

    APIC register virtualization

    Virtual interrupt delivery

    Posted interrupt processing

  • APIC Virtualization and Virtual Interrupts 16

    APIC Register Virtualization

    APIC-read reads from vAPIC page w/o causing #VMEXIT

    APIC-write writes into vAPIC page w/o causing #VMEXIT

    New APIC-write #VMEXIT

    - Trap-like #VMEXIT for APIC register writes that should be handled by the VMM (no emulation needed)

  • APIC Virtualization and Virtual Interrupts 17

    Virtual Interrupt Delivery

    New VMCS field "Guest interrupt status" with two subfields:

    - Requesting virtual interrupt (RVI)

    - Servicing virtual interrupt (SVI)

    Pending interrupts are evaluated on:

    - VM Entry

    - TPR access

    - EOI access

    - self-IPI

    - Posted interrupt processing

    Recognized interrupts are delivered without #VMEXIT

    - RVI/SVI are updated accordingly

  • APIC Virtualization and Virtual Interrupts 18

    Posted Interrupt Processing

    Virtual interrupts are recorded in the Posted Interrupt Descriptor

    Special Posted Interrupt Notification IPI

    - If the CPU is in guest mode when the IPI is received, virtual interrupts from the Posted Interrupt Descriptor are transferred to the vAPIC page and processed by the CPU without #VMEXIT

  • APIC Virtualization and Virtual Interrupts 19

    Result

    Eliminates up to 50% of #VMEXITs on I/O intensive workloads

  • Section 5: Nested VMX Improvements

  • Nested VMX Improvements 21

    Problem: VMX Emulation is Expensive

    Each L1 VMREAD/VMWRITE is emulated by L0

    There are many of them for each VMLAUNCH

  • Nested VMX Improvements 23

    Solution: VMCS Shadowing

    L1 has a shadow VMCS page linked from the main VMCS

    L1’s VMREAD/VMWRITE reads from/writes into the shadow VMCS page without #VMEXIT

    When L1 executes VMLAUNCH, L0 copies L2 VMCS values from the shadow VMCS page to VMCS02 (the VMCS used by L0 to run L2)

    To minimize copying, only the most often used fields are shadowed; access to the rest generates a #VMEXIT as before

  • Nested VMX Improvements 24

    Result

    Up to 40% fewer #VMEXITs

  • Nested VMX Improvements 25

    Nested VMX ongoing development

    Nested EPT

    A lot of bug fixes lately

  • Section 6: NUMA Improvements

  • NUMA Improvements 27

    Problem: Non-Optimal Memory Placement

  • NUMA Improvements 28

    Solution: NUMA-Aware Scheduler

    Memory follows CPU

    Periodically mark process memory as inaccessible

    On a NUMA fault, migrate memory to where the task is running now

    Task follows memory

    Statistics of recent NUMA faults incurred on each node are kept per task

    Scheduler tries to run the task where its memory is
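
    The kernel does all of the above automatically and transparently. Purely to illustrate the "memory follows CPU" idea in userspace terms, here is a sketch that queries and then migrates a single page to the node of the CPU the task currently runs on, using the move_pages(2) syscall (libnuma assumed installed; a single-node machine will report node 0 throughout):

    /* Build: cc numa_follow.c -lnuma. Illustration only; NUMA balancing
     * in the kernel does the equivalent without any application changes. */
    #define _GNU_SOURCE
    #include <numa.h>      /* numa_available(), numa_node_of_cpu() */
    #include <numaif.h>    /* move_pages(), MPOL_MF_MOVE */
    #include <sched.h>     /* sched_getcpu() */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        if (numa_available() < 0)
            return 1;

        /* Allocate and touch one page so it has physical backing. */
        long psz = sysconf(_SC_PAGESIZE);
        void *page;
        if (posix_memalign(&page, psz, psz))
            return 1;
        memset(page, 0, psz);

        /* Query current placement: nodes == NULL means "just report". */
        int status = -1;
        move_pages(0, 1, &page, NULL, &status, 0);
        printf("page is on node %d\n", status);

        /* "Memory follows CPU": move the page to the node we run on now. */
        int target = numa_node_of_cpu(sched_getcpu());
        move_pages(0, 1, &page, &target, &status, MPOL_MF_MOVE);
        printf("after move_pages: node %d\n", status);
        return 0;
    }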

  • NUMA Improvements 29

    Result

    3-15% performance improvement.

  • Section 7VDSO pvclock

  • VDSO pvclock 31

    Problem

    Guest cannot use TSC directly as a clock source

    - May be unsynchronised between CPU sockets

    - Frequency may change due to migration

    Linux uses kvmclock as a clock source instead

    gettimeofday() does a system call

  • VDSO pvclock 32

    Solution: Add kvmclock Support to VDSO

    Map pvclock data structure into process memory

    Run kvmclock code from VDSO
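
    Roughly what the vDSO fast path does, written as a standalone sketch of the pvclock read algorithm (field layout per the pvclock ABI; the real implementation lives in arch/x86 and uses kernel helpers and proper barriers):

    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtsc() */

    /* Per-vcpu time record the host keeps up to date in guest memory. */
    struct pvclock_vcpu_time_info {
        uint32_t version;         /* odd while the host is updating it     */
        uint32_t pad0;
        uint64_t tsc_timestamp;   /* TSC value when the record was written */
        uint64_t system_time;     /* nanoseconds at that moment            */
        uint32_t tsc_to_system_mul;
        int8_t   tsc_shift;
        uint8_t  flags;
        uint8_t  pad[2];
    };

    /* Lock-free read: retry if the host updated the record underneath us. */
    static uint64_t pvclock_read_ns(volatile struct pvclock_vcpu_time_info *t)
    {
        uint32_t version;
        uint64_t ns;

        do {
            version = t->version;
            __sync_synchronize();
            uint64_t delta = __rdtsc() - t->tsc_timestamp;
            /* Scale TSC ticks to ns: ((delta << shift) * mul) >> 32. */
            if (t->tsc_shift >= 0)
                delta <<= t->tsc_shift;
            else
                delta >>= -t->tsc_shift;
            ns = t->system_time +
                 (uint64_t)(((unsigned __int128)delta * t->tsc_to_system_mul) >> 32);
            __sync_synchronize();
        } while ((t->version & 1) || version != t->version);

        return ns;
    }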

  • VDSO pvclock 33

    Result

    clock_gettime() is reduced from ~500 to ~200 cycles

  • Section 8Guest Spinlock Improvements

  • Guest Spinlock Improvements 35

    Pause Loop Exit (PLE) optimisations

    Detect undercommit scenario:

    ebizzy (rec/sec, higher is better)

            before     stdev    after      stdev    %improve
    1x      2511.30    21.54    6051.80    170.25    140.982
    2x      2679.40    332.44   2692.30    251.40      0.481
    3x      2253.50    266.42   2192.16    178.97     -2.721

    dbench (throughput in MB/sec, higher is better)

            before     stdev    after      stdev    %improve
    1x      6677.40    638.50   10098.00   3449.70    51.226
    2x      2012.67    64.76    2019.04    62.67       0.316
    3x      1302.07    40.83    1292.75    27.05      -0.716

  • Guest Spinlock Improvements 36

    Pause Loop Exit (PLE) optimisations (Cont.)

    Detect preempted vcpus:

    ebizzy (rec/sec, higher is better)

            before     stdev    after      stdev    %improve
    1x      5609.20    56.93    6263.70    64.70      11.668
    2x      2071.90    108.48   2653.50    181.83     28.075
    3x      1557.41    109.71   1993.50    166.31     28.000
    4x      1254.75    91.29    1765.50    237.54     40.705

    kernbench (exec time in sec, lower is better)

            before     stdev    after      stdev    %improve
    1x      47.03      4.69     44.25      1.28        5.909
    2x      96.00      7.18     91.26      7.35        4.944
    3x      164.01     10.36    156.67     11.42       4.475
    4x      212.57     23.73    204.48     13.29       3.808

  • Guest Spinlock Improvements 37

    Problem: Lock Waiter Preemption

  • Guest Spinlock Improvements 39

    Solution: Lock Waiter Preemption

    Work in progress by Jiannan Ouyang [1]

    Preemptable Ticket Spinlock

    Downgrade a fair lock to an unfair lock automatically upon preemption, preserving fairness otherwise.

    [1] http://people.cs.pitt.edu/~ouyang/files/publication/preemptable_lock-ouyang-vee13.pdf
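
    For reference, the fair lock in question is a plain ticket spinlock; the strict FIFO hand-off at "now serving" is exactly where a preempted waiter stalls everyone queued behind it. A sketch of the plain lock, with a comment marking where the preemptable variant relaxes it (this is not the paper's exact algorithm):

    #include <stdatomic.h>

    /* Plain (fair) ticket spinlock: waiters are served strictly in FIFO
     * order, so if the vCPU holding ticket N is preempted by the host,
     * every waiter with a later ticket spins uselessly until it runs again. */
    struct ticket_lock {
        atomic_uint next_ticket;   /* next ticket to hand out     */
        atomic_uint now_serving;   /* ticket currently allowed in */
    };

    static void ticket_lock(struct ticket_lock *l)
    {
        unsigned me = atomic_fetch_add(&l->next_ticket, 1);
        while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != me)
            ;  /* spin; a guest would execute PAUSE here (and trigger PLE) */
        /* The preemptable variant adds a timeout proportional to
         * (me - now_serving); once it expires the waiter may take the lock
         * out of order, so fairness is kept only while nobody is preempted. */
    }

    static void ticket_unlock(struct ticket_lock *l)
    {
        atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
    }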

  • Section 9: More of Hyper-V Emulation

  • More of Hyper-V Emulation 41

    Relaxed timing

    Disables Windows’ watchdog.

  • More of Hyper-V Emulation 42

    Hyper-V Timers

    Reference Time Counter

    Per-partition reference time counter. Successive accesses return strictly monotonically increasing time values as determined by any and all virtual processors of a partition. Its rate is constant and unaffected by processor or bus-speed transitions or deep power-saving states.

    Partition Reference Time

    A reference time source that does not require an intercept into the hypervisor. This enlightenment is available only when the underlying platform provides support for an invariant processor TSC.
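
    A sketch of how a guest consumes the partition reference time (following the Hyper-V TLFS and the Linux guest implementation): the hypervisor shares a "reference TSC page" holding a scale and offset, so reading the clock is just RDTSC plus arithmetic, with no intercept:

    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtsc() */

    /* Layout of the reference TSC page the hypervisor shares with the guest. */
    struct ref_tsc_page {
        volatile uint32_t tsc_sequence;   /* 0 = TSC page not usable */
        uint32_t reserved;
        volatile uint64_t tsc_scale;
        volatile int64_t  tsc_offset;
    };

    /* Reference time in 100 ns units, computed without exiting to the VMM.
     * Returns 0 to tell the caller to fall back to the reference counter MSR. */
    static uint64_t read_reference_time(struct ref_tsc_page *p)
    {
        uint32_t seq;
        uint64_t scale, tsc;
        int64_t offset;

        do {
            seq = p->tsc_sequence;
            if (seq == 0)
                return 0;                 /* enlightenment not available  */
            scale  = p->tsc_scale;
            offset = p->tsc_offset;
            tsc    = __rdtsc();
        } while (p->tsc_sequence != seq); /* host updated the page: retry */

        /* 64x64 -> 128-bit multiply, keep the upper 64 bits, add the offset. */
        return (uint64_t)(((unsigned __int128)tsc * scale) >> 64) + offset;
    }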

  • Section 10: PCI Device Assignment

  • PCI Device Assignment 44

    PCI Device Assignment Improvements

    Virtual Function IO based KVM PCI device

    VFIO is a new UIO-like kernel driver that allows for a cleaner PCI device assignment architecture

    Move to this model as the primary device assignment mechanism

    More maintainable, better architecture, more secure interface for enabling PCI device assignment

    Legacy PCI device assignment is deprecated and can now be compiled out via kernel configuration
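
    A condensed sketch of the VFIO userspace flow QEMU builds on, following the kernel's VFIO documentation (the IOMMU group number and PCI address are placeholders, and all error handling is omitted):

    #include <fcntl.h>
    #include <linux/vfio.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* A container represents one IOMMU address space. */
        int container = open("/dev/vfio/vfio", O_RDWR);

        /* The IOMMU group is the unit of ownership; "26" is a placeholder. */
        int group = open("/dev/vfio/26", O_RDWR);
        struct vfio_group_status status = { .argsz = sizeof(status) };
        ioctl(group, VFIO_GROUP_GET_STATUS, &status);   /* must be VIABLE */
        ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
        ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

        /* Map (guest) memory into the device's IOMMU address space. */
        void *buf = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = (uintptr_t)buf,
            .iova  = 0,
            .size  = 1 << 20,
        };
        ioctl(container, VFIO_IOMMU_MAP_DMA, &map);

        /* Get a device fd; regions and interrupts are driven via further ioctls. */
        int device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
        struct vfio_device_info info = { .argsz = sizeof(info) };
        ioctl(device, VFIO_DEVICE_GET_INFO, &info);
        return 0;
    }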

  • Section 11: Live Migration Updates

  • Live Migration Updates 46

    Live Migration Improvements

    Better support for big guests

    xbzrle compression (idea sketched after this list)

    Migration thread

    Accurate migration downtime calculation
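
    The xbzrle idea, sketched below (this is the concept, not QEMU's exact wire format): during iterative migration, a page that was already sent once is XORed with the cached previous copy, and the result, which is mostly zeros when little changed, is run-length encoded:

    #include <stddef.h>
    #include <stdint.h>

    /* XOR-based RLE: emit <unchanged-run><changed-run><changed bytes> records.
     * Returns bytes produced, or 0 if encoding would not beat the raw page. */
    static size_t xor_rle_encode(const uint8_t *old_page, const uint8_t *new_page,
                                 size_t page_size, uint8_t *out, size_t out_size)
    {
        size_t i = 0, o = 0;

        while (i < page_size) {
            /* Count unchanged bytes (XOR would be zero), capped by the 1-byte length. */
            size_t zrun = 0;
            while (zrun < 255 && i + zrun < page_size &&
                   old_page[i + zrun] == new_page[i + zrun])
                zrun++;
            /* Count the changed bytes that follow. */
            size_t nzrun = 0;
            while (nzrun < 255 && i + zrun + nzrun < page_size &&
                   old_page[i + zrun + nzrun] != new_page[i + zrun + nzrun])
                nzrun++;

            if (o + 2 + nzrun > out_size || o + 2 + nzrun >= page_size)
                return 0;                 /* not worth it: send the raw page */
            out[o++] = (uint8_t)zrun;     /* toy 1-byte run lengths          */
            out[o++] = (uint8_t)nzrun;
            for (size_t k = 0; k < nzrun; k++)
                out[o++] = new_page[i + zrun + k];
            i += zrun + nzrun;
        }
        return o;
    }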

  • Live Migration Updates 47

    Post-Copy Live Migration

    Latest patches use a special device to trap guest memory access on the destination

    - Swapping, THP, KSM, and NUMA balancing work only on anonymous memory

    Proposed solution: MADV_USERFAULT & remap_anon_pages()

    - Guest memory VMAs are marked with MADV_USERFAULT

    - A guest access to an unmapped page causes a special notification to be delivered to QEMU

    - QEMU receives the missing page from migration into a local buffer and remaps it into guest memory with remap_anon_pages()

    - Guest memory stays anonymous

  • Section 12: Block

  • Block 49

    Block

    Live storage migration merged

    - Migration without shared storage

    GlusterFS block driver merged

    - Together with the upcoming Gluster 3.4 release, allows QEMU to bypass FUSE for increased performance

    Microsoft Hyper-V VHDX format support is currently under development

    - To ease migration to KVM

    vhost-scsi

    - Uses the LIO (linux-iscsi.org) in-kernel SCSI target code to handle the SCSI protocol

    - No QEMU userspace handling, no QEMU global mutex on the data path

    - Code path is shorter; the guest talks to the host kernel directly

    - Higher performance (240K IOPS vs 12K IOPS for virtio-scsi)

  • Section 13: Networking

  • Networking 51

    Networking

    Multi-queue NIC through virtio-net

    - Improves network performance and throughput for SMP guests

    - Incoming traffic to a guest scales linearly

    - Outgoing traffic sometimes regresses, since more queues mean more exits

    Bridge zero copy transmit

    - Zero-copy transmit from guest to external traffic using the network bridge

    - Improves network transmit performance for large message sizes

    - About 15% gain in CPU utilization

  • Section 14: virtio RNG

  • virtio RNG 53

    Virtio Random Number Generator (RNG)

    Prevent entropy starvation in guests

    Inject entropy from host to the guest

    - The default mode uses the host’s /dev/random

    - Alternatively, a HW RNG device or EGD (Entropy Gathering Daemon) source can be used

  • Section 15: New Hardware Architectures

  • New Hardware Architectures 55

    New Hardware Architectures

    ARM32 is merged

    ARM64 is on the way

    MIPS32 trap and emulate is merged

    MIPS-VZ is being worked on

  • Section 16: QEMU Consolidation

  • QEMU Consolidation 57

    QEMU Consolidation

    qemu-kvm is no more: its KVM support has been merged into upstream QEMU

  • The end. Thanks for listening.

    What is KVM
        KVM Features
        KVM Architecture
        Linux as a Hypervisor

    What’s new
        APIC Virtualization and Virtual Interrupts
        Nested VMX Improvements
        NUMA Improvements
        VDSO pvclock
        Guest Spinlock Improvements
        More of Hyper-V Emulation
        PCI Device Assignment
        Live Migration Updates
        Block
        Networking
        virtio RNG
        New Hardware Architectures
        QEMU Consolidation

