
VMworld 2013: Extreme Performance Series: Network Speed Ahead

Description:
VMworld 2013. Lenin Singaravelu, VMware; Haoqiang Zheng, VMware. Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
Transcript
1. Extreme Performance Series: Network Speed Ahead. Lenin Singaravelu, VMware; Haoqiang Zheng, VMware. VSVC5596 #VSVC5596

2. Agenda
- Networking Performance in vSphere 5.5
- Network Processing Deep Dive
- Performance Improvements in vSphere 5.5
- Tuning for Extreme Workloads
- Virtualizing Extremely Latency Sensitive Applications
- Available Resources
- Extreme Performance Series Sessions

3. vSphere Networking Architecture: A Simplified View
(Diagram: PNIC drivers at the bottom; a network virtualization layer with vSwitch, vDS, VXLAN, NetIOC, teaming and DVFilter; a device emulation layer; VM vNICs on top; a vmknic (VMkernel TCP/IP) serving NFS/iSCSI, vMotion, management and FT traffic; SR-IOV VF and DirectPath pass-through paths, including DirectPath with vMotion.)

4. Transmit Processing for a VM
- One transmit thread per VM, executing all parts of the stack (ring entries to packets; switching, encapsulation, teaming, destination VM; packets to ring entries)
- The transmit thread can also execute the receive path for the destination VM
- Wakeup of the transmit thread happens through two mechanisms: an immediate, forcible wakeup by the VM via a VMkernel call (low delay, high CPU overhead), or an opportunistic wakeup by other threads or when the VM halts (potentially higher delay, low CPU overhead)

5. Receive Processing for a VM
- One thread per device; for NetQueue-enabled devices, one thread per NetQueue
- Each NetQueue processes traffic for one or more MAC addresses (vNICs); the NetQueue-to-vNIC mapping is determined by unicast throughput and FCFS
- vNICs can end up sharing a queue: low throughput, too many vNICs, or a queue type mismatch (LRO queue vs. non-LRO vNIC)

6. Advanced Performance Monitoring Using net-stats
- net-stats: single-host network performance monitoring tool since vSphere 5.1; runs in the ESXi console; net-stats -h for help and net-stats -A to monitor all ports (a usage sketch follows slide 9 below)
- Measures packet rates and drops at various layers (vNIC backend, vSwitch, PNIC) in a single place
- Identifies the VMkernel threads used for transmit and receive processing
- Breaks down networking CPU cost into interrupt, receive, vCPU and transmit threads
- PNIC stats: NetQueue allocation information, interrupt rate
- vNIC stats: coalescing and RSS information

7. vSphere 5.1 Networking Performance Summary
- TCP performance: 10GbE line rate to/from a 1-vCPU VM and an external host with TSO and LRO; up to 26.3 Gbps between two 1-vCPU Linux VMs on the same host; able to scale or maintain throughput even at 8X PCPU overcommit (64 VMs on an 8-core, HT-enabled machine)
- UDP performance: 0.7+ million PPS (MPPS) with a 1-vCPU VM, rising to 2.5+ MPPS with more VMs (over a single 10GbE link); a low loss rate at very high throughput might require tuning vNIC ring and socket buffer sizes
- Latency: 35+ us for ping -i 0.001 in the 1-vCPU, 1-VM case over 10GbE; can increase to hundreds of microseconds under contention
- Note: microbenchmark performance is highly dependent on CPU clock speed and the size of the last-level cache (LLC)

8. Agenda (same items as slide 2)

9. What's New in vSphere 5.5 Performance
- 80 Gbps on a single host
- Support for 40 Gbps NICs
- vmknic IPv6 optimizations
- Reduced CPU cycles/byte: vSphere native drivers, VXLAN offloads, dynamic queue balancing
- Experimental Packet Capture Framework
- Latency-Sensitivity feature
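Slide 6 points at net-stats for single-host monitoring; below is a minimal sketch of how it might be invoked from the ESXi shell. Only the -h and -A options named on the slide are used; the output path /tmp/net-stats.txt is an illustrative placeholder, and exact options and output fields vary by ESXi build, so check net-stats -h first.

    # ESXi shell, vSphere 5.1 or later
    net-stats -h                        # list the options supported by this build
    net-stats -A > /tmp/net-stats.txt   # gather stats for all ports (vNIC backends, vSwitch, PNICs)

The per-port output is where the packet rates, drop counters and VMkernel thread IDs described on slide 6 show up.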
10. 80 Gbps [1] on a Single Host
- 8 PNICs over 8 vSwitches; 16 Linux VMs running Apache
- 4x Intel E5-4650, 2.7 GHz (32 cores, 64 threads); IXIA XT80-V2 traffic generator
- HTTP GET of a 1 MB file: 75+ Gbps, 4.1 million PPS
- HTTP POST of a 1 MB file: 75+ Gbps, 7.3 million PPS [2]
[1] Why stop at 80 Gbps? vSphere allows a maximum of 8x 10GbE PNICs.
[2] Software LRO is less aggressive than TSO in aggregating packets.

11. 40 Gbps NIC Support
- Inbox support for Mellanox ConnectX-3, 40 Gbps over Ethernet
- Max throughput to a 1-vCPU VM: 14.1 Gbps receive, 36.2 Gbps transmit
- Max throughput to a single VM: 23.6 Gbps receive, with RSS enabled in the vNIC and PNIC
- Max throughput to a single host: 37.3 Gbps receive
- Setup: RHEL 6.3 + VMXNET3, 2x Intel Xeon, Mellanox MT27500, netperf TCP_STREAM workload

12. vmknic IPv6 Enhancements
- TCP checksum offload for transmit and receive
- Software Large Receive Offload (LRO) for TCP over IPv6
- Zero-copy receives between the vSwitch and the TCP/IP stack
- Test dirtying 48 GB of RAM (Intel Xeon E5-2667, 2.9 GHz; 4x 10 GbE links): 34.5 Gbps over IPv4, 32.5 Gbps over IPv6

13. Reduced CPU Cycles/Byte
- NetQueue allocation model for some PNICs changed from throughput-based to CPU-usage based; fewer NetQueues are used for low-traffic workloads
- TSO and checksum offload for VXLAN on some PNICs
- Native vSphere drivers for Emulex PNICs eliminate the vmklinux layer from the device drivers
- 10% - 35% lower CPU cycles/byte in the VMkernel

14. Packet Capture Framework
- New experimental Packet Capture Framework in vSphere 5.5, designed for capture at moderate packet rates
- Captures packets at one or more layers of the vSphere network stack: Vmxnet3Tx, EtherswitchDispatch, EtherswitchOutput and UplinkSnd on the transmit path; UplinkRcv, EtherswitchDispatch, EtherswitchOutput and Vmxnet3Rx on the receive path
- The --trace option timestamps packets as they pass through key points of the stack
- Useful for identifying sources of packet drops and latency, e.g., comparing UplinkRcv and Vmxnet3Rx to check for packet drops by a firewall; with --trace enabled, the difference between the timestamps at Vmxnet3Tx and UplinkSnd tells us whether NetIOC delayed a packet

15. Agenda (same items as slide 2)

16. Improve Receive Throughput to a Single VM
- A single receive thread can become the bottleneck at high packet rates (> 1 million PPS or > 15 Gbps)
- Use the VMXNET3 virtual device and enable RSS inside the guest
- Enable RSS in the physical NIC (only available on some PNICs)
- Add ethernetX.pnicFeatures = 4 to the VM's configuration parameters (see the .vmx sketch after slide 17)
- Side effect: increased CPU cycles/byte
- Result: receive throughput on a 40G PNIC improves from 14.1 Gbps to 23.6 Gbps

17. Improve Transmit Throughput with Multiple vNICs
- Some applications use multiple vNICs for very high throughput; the common transmit thread shared by all vNICs can become the bottleneck
- Add ethernetX.ctxPerDev = 1 to the VM's configuration parameters
- Side effect: increased CPU cycles/byte
- Result: UDP transmit rate improves from 0.9 MPPS to 1.41 MPPS
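A sketch of how the two advanced settings from slides 16 and 17 might look as .vmx entries (they can also be added through the VM's advanced configuration parameters in the vSphere Client, typically with the VM powered off). ethernet0 is an assumed vNIC index; apply the entries to whichever VMXNET3 vNIC carries the heavy traffic.

    # Slide 16: let the vNIC backend use multiple receive queues via RSS
    # (RSS must also be enabled in the guest and supported on the physical NIC)
    ethernet0.pnicFeatures = "4"

    # Slide 17: give this vNIC its own transmit thread instead of sharing the
    # per-VM transmit thread
    ethernet0.ctxPerDev = "1"

Both settings trade extra CPU cycles/byte for throughput, as noted on the slides, so they are worth applying only to VMs that actually hit the single-thread limits.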
18. Achieve Higher Consolidation Ratios
- Switch vNIC coalescing to static: ethernetX.coalescingScheme = static
- Reduces interrupts to the VM and vmkcalls from the VM for networking traffic; fewer interruptions => more efficient processing => more requests processed at lower cost
- Disable vNIC RSS in the guest for multi-vCPU VMs: at low throughput and low vCPU utilization, RSS only adds overhead
- Side effects: potentially higher latency for some requests and some workloads

19. Achieving Tens of Microseconds of Latency

20. The Realm of Virtualization Grows
- Commonly virtualized today: web services, app services, e-mail, desktops, databases, Tier 1 apps
- Still considered non-virtualizable: soft real-time apps, HPC, high frequency trading
- These are highly latency-sensitive applications with low latency/jitter requirements (10 us - 100 us)

21. The Latency Sensitivity Feature in vSphere 5.5
- Minimizes virtualization overhead to achieve near bare-metal performance
(Diagram: a default VM on the hypervisor vs. a latency-sensitivity VM brought closer to the physical hardware.)

22. Ping Latency Test
- Native to native: 18 us median latency, 32 us at the 99.99th percentile (jitter metric)
- Default VM to native: 35 us median, 557 us at the 99.99th percentile
- Latency-sensitive VM to native: 20 us median, 46 us at the 99.99th percentile, roughly a 10X reduction in jitter versus the default VM

23. Agenda
- Network Performance in vSphere 5.5
- Virtualizing Extremely Latency Sensitive Applications: Sources of Latency and Jitter; Latency Sensitivity Feature; Performance; Best Practices
- Available Resources
- Extreme Performance Series Sessions

24. Maximize CPU Reservation

25. Maximize Memory Reservation

26. Ping Latency Test (recap)
- Native to native: 18 us median latency, 32 us at the 99.99th percentile (jitter metric)
- Default VM to native: 35 us median, 557 us at the 99.99th percentile

27. Sources of Latency/Jitter
- CPU contention
- CPU scheduling overhead
- Networking stack overhead

28. System View from the CPU Scheduler's Perspective
- There is more than just the vCPUs of the VMs: each VM also has MKS and I/O worlds, and the host runs system threads (hostd, ssh, memory manager, I/O) and user threads
- CPU contention can occasionally occur even on an under-committed system
- Some system threads run at higher priority

29. Causes of Scheduling-Related Execution Delay
Between wakeup, start of execution and finish, the delay breaks down into:
- A: ready time, waiting for other contexts to finish
- B: scheduling overhead and world-switch overhead
- C: actual execution time
- D: overlap time (caused by interrupts, etc.)
- E: HT, power management and cache-related efficiency loss

30. Setting Latency-Sensitivity to HIGH

31. Reduce CPU Contention Using Exclusive Affinity
(Diagram: a vCPU timeline with delay components A-E, and the same vCPU given exclusive affinity to a PCPU while I/O and interrupt contexts run on other PCPUs.)

32. Reduce CPU Contention Using Exclusive Affinity (II)
- What about other contexts (vCPUs of other VMs, MKS, I/O, hostd, ssh, memory manager)? They share cores without exclusive affinity
- They may be contended and may cause jitter for the latency-sensitive VM

33. Use DERatio to Monitor CPU Contention

34. Side Effects of Exclusive Affinity
- There's no such thing as a free lunch: CPU cycles may be wasted, because the CPU will NOT be used by other contexts when the vCPU is idle
- Exclusive affinity is only applied when the VM's latency-sensitivity is HIGH and the VM has enough CPU allocation

35. Latency/Jitter from the Networking Stack
- The virtual networking stack executes more code; context switches bring scheduling delays and variance from coalescing; Large Receive Offload modifies TCP ACK behavior
- Mitigations: disable vNIC coalescing, disable LRO for vNICs, or use a pass-through device for networking traffic (a configuration sketch follows below)
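Slides 30-34 enable the latency-sensitivity feature, which is normally set per VM in the vSphere Web Client together with the full CPU and memory reservations from slides 24-25. Below is a sketch of what the equivalent .vmx entries are assumed to look like: sched.cpu.latencySensitivity is the assumed key behind the Web Client setting, and ethernet0 is an assumed vNIC index; the coalescing line follows the "disable vNIC coalescing" advice from slide 35.

    # Assumed .vmx equivalent of setting Latency Sensitivity = High in the UI;
    # exclusive affinity applies only with this plus a full CPU reservation (slide 34)
    sched.cpu.latencySensitivity = "high"

    # Slide 35: remove virtual interrupt coalescing on the latency-critical vNIC
    ethernet0.coalescingScheme = "disabled"

For consolidation rather than latency (slide 18), the same coalescingScheme knob is instead set to "static".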
36. Pass-Through Devices
- SR-IOV and DirectPath I/O allow a VM direct access to the device
- They bypass the virtual networking stack, reducing CPU cost and latency
- Pass-through NICs negate many benefits of virtualization: only some versions of DirectPath I/O allow sharing of devices and vMotion; SR-IOV allows sharing of devices but does not support vMotion; no support for NetIOC, FT, memory overcommit, HA, ...

37. Agenda (same items as slide 23)

38. Performance of the Latency Sensitivity Feature
- Single 2-vCPU VM to native, RHEL 6.2, RTT measured with ping -i 0.001
- Intel Xeon E5-2640 @ 2.50 GHz, Intel 82599EB PNIC
- Median reduced by 15 us over the default VM and by 6 us over SR-IOV
- 99.99th percentile lower than 50 us
- Performance gap to native is between 2 us and 10 us

39. Performance with Multiple VMs
- 4x 2-vCPU VMs on a 12-core host, same ping workload
- 4-VM performance is very similar to that of 1 VM
- Median reduced by 20 us over the default VM and by 6 us over SR-IOV
- 99.99th percentile ~75 us: 400+ us better than the default VM and 50 us better than SR-IOV

40. Extreme Performance with a SolarFlare PNIC
- 1 VM with a Solarflare SFN6122F-R7 PNIC; native measured with the same PNIC
- netperf TCP_RR workload, OpenOnload enabled for netperf and netserver
- Median RTT of 6 us; 99.99th percentile
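A sketch of the guest-side microbenchmarks behind slides 22, 26 and 38-40, assuming a Linux VM and a target host at the placeholder address 192.0.2.10. The ping interval matches the ping -i 0.001 test on slide 38 (sub-0.2 s intervals generally require root); the netperf line reproduces the TCP_RR request-response pattern from slide 40 and assumes netserver is already running on the target.

    # Round-trip latency, one echo request per millisecond (slides 22, 26, 38)
    ping -i 0.001 -c 100000 192.0.2.10

    # TCP request-response latency for 60 seconds (slide 40)
    netperf -H 192.0.2.10 -t TCP_RR -l 60

Median and 99.99th-percentile figures like those on the slides come from post-processing the per-request round-trip times rather than from the tools' default summary output.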

