Evolution of the netmap architecture
Luigi Rizzo, Università di Pisa
http://info.iet.unipi.it/~luigi/vale/
these slides at http://info.iet.unipi.it/~luigi/netmap/talk-coseners.html
Starting point
Packet I/O in commodity OS is not fast enough:
typically ~1 Mpps/core, vs 14.88 Mpps on 10 Gbit/s links
insufficient for software packet processing nodes
several ad-hoc or proprietary solutions (Click-based, PacketShader, DPDK)
We tried to come up with a more generic solution. Ideally:
as fast as (or better than) the others
device independent, OS independent
developer friendly
Netmap (2011)
10..40 times faster than raw sockets, PF_PACKET, bpf
14.88 Mpps with 1 core at 900 MHz
pcap library on top of netmap
as of 2012, also available on Linux
Next, get rid of hardware
Use netmap key ideas for virtual switches:
simplify experiments with high speed network applications
useful to interconnect VMs
VALE (2012)
Virtual Local Ethernet implementing an ethernet learning bridge
up to 20 Mpps (64 byte frames) or 70 Gbit/s (1500 byte frames)
And then, how fast is networking in VMs?
Focus is mostly on bulk TCP:
emulation of network devices is very poor
paravirtualized devices (vendor specific) are reasonably fast for TCP
20-30 Gbit/s using tricks (good ones)
We want to deal with high packet rates, too:
SDN must be implemented, not just Defined
some applications have high pps requirements
Fast QEMU (2013)
accelerated network I/O path in qemu
paravirtualized e1000
guest-guest rates on top of e1000: >1 Mpps with sockets, 5 Mpps with netmap
More at http://info.iet.unipi.it/~luigi/papers/20130520-rizzo-vm.pdf
Availability
netmap and VALE are in standard FreeBSD distributions (HEAD and stable/9)
same code also runs on Linux as an add-on module
support for multiple NICs (Intel, Realtek, Mellanox, Nvidia)
QEMU enhancements submitted to QEMU-dev
Netmap design principles
no requirement/reliance on special hardware features
amortize costs over large batches (syscalls)
remove unnecessary work (copies, mbuf alloc/free)
reduce runtime decisions (a single frame format)
modifying device drivers is permitted, as long as the code can be maintained
Netmap data structures and API
Access:
    open("/dev/netmap")
    ioctl(fd, NIOCREG, arg) --> disconnect datapath from the OS
    mmap(..., fd, 0)
Transmit:
    fill buffers, update netmap ring
    ioctl(fd, NIOCTXSYNC) queues the packets
Receive:
    ioctl(fd, NIOCRXSYNC) reports newly received packets
    process buffers, update netmap ring
poll() and select() are used for synchronization
POLLIN and POLLOUT select the rings to monitor
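The transmit sequence above can be sketched with a toy model. The structures below are simplified, hypothetical stand-ins for the real ones in <net/netmap_user.h> (field names, sizes, and the `tx_fill` helper are illustrative only): the application copies a frame into the current slot, advances the ring pointer, and would later issue NIOCTXSYNC to hand the filled slots to the kernel.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified, illustrative stand-ins for the netmap ring structures. */
#define RING_SIZE 8
#define BUF_SIZE  2048

struct nm_slot { uint16_t len; char buf[BUF_SIZE]; };

struct nm_ring {
    uint32_t cur;    /* next slot the application will fill */
    uint32_t avail;  /* slots still available to the application */
    struct nm_slot slot[RING_SIZE];
};

/* Application side of the TX path: fill one slot and advance the
 * ring; a (simulated) NIOCTXSYNC would later queue the packets. */
static int tx_fill(struct nm_ring *r, const char *pkt, uint16_t len)
{
    if (r->avail == 0)
        return -1;                     /* ring full, caller must sync */
    struct nm_slot *s = &r->slot[r->cur];
    memcpy(s->buf, pkt, len);
    s->len = len;
    r->cur = (r->cur + 1) % RING_SIZE; /* advance ring pointer */
    r->avail--;
    return 0;
}
```

Note that no per-packet system call appears here: batching many `tx_fill` calls per NIOCTXSYNC is exactly how netmap amortizes the syscall cost.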
Netmap performance
65 cycles/packet between the wire and userspace (14.88 Mpps at 900 MHz)
good scalability with CPU frequency and number of cores
Ported several apps (OVS, Click, ipfw, ...) with 5-10x speedup
netmap exposes bottlenecks in applications
some kernel functions may now run more efficiently in userspace than in the kernel
VALE, a high performance Virtual Local Ethernet
NIOCREG valeX:Y dynamically creates switch instances (X) and ports (Y)
same API as physical netmap ports, but separate memory regions
each switch runs the Learning Bridge algorithm (and now, also OVS)
Operation is sender-driven:
each incoming packet is dispatched to the correct destination(s)
writes are non-blocking, packets are dropped when the destination queue is full
reads are blocking
the cost of forwarding/copying is charged to the sender
VALE performance
Multi-stage forwarding to amortize locking and cache miss costs:
1. Fetch a batch of packets, prefetch payload
2. Compute destinations for each packet in the batch
3. Forward traffic iterating on interfaces
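Stage 2 of the pipeline might be sketched as follows. This is a toy model only: a linear-scan MAC table stands in for VALE's actual hash table, stages 1 and 3 (prefetch and per-interface copy) are omitted, and all names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BCAST 0xFF   /* pseudo-destination meaning "flood to all ports" */

struct pkt { uint8_t dst_mac[6]; uint8_t src_mac[6]; };

/* Toy MAC table: linear scan instead of VALE's hash table. */
struct entry { uint8_t mac[6]; int port; };
static struct entry table[64];
static int nentries;

static int lookup(const uint8_t *mac)
{
    for (int i = 0; i < nentries; i++)
        if (memcmp(table[i].mac, mac, 6) == 0)
            return table[i].port;
    return BCAST;                     /* unknown destination: flood */
}

static void learn(const uint8_t *mac, int port)
{
    if (lookup(mac) == BCAST && nentries < 64) {
        memcpy(table[nentries].mac, mac, 6);
        table[nentries].port = port;
        nentries++;
    }
}

/* Stage 2: learn source addresses and compute one destination per
 * packet in the batch, so stage 3 can iterate per interface. */
static void compute_dst(struct pkt *batch, int n, int src_port, int *dst)
{
    for (int i = 0; i < n; i++) {
        learn(batch[i].src_mac, src_port);
        dst[i] = lookup(batch[i].dst_mac);
    }
}
```

Computing all destinations before any copying is what lets stage 3 group work per output interface, amortizing the per-port lock over the whole batch.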
VALE performance
VALE: our software switch
NIC: switch/NIC-supported forwarding
TAP: linux/BSD bridge, in-kernel OVS
VALE achieves 18-20 Mpps / 70 Gbit/s
Virtual Machine network performance
Emulation of network peripherals is historically slow, due to the following:
mmio is 100 times more expensive than on bare metal, due to VM exits
interrupts can be expensive
VMM/host/switch performance
Paravirtualized peripherals (virtio, vmxnet, xenfront) address only one part of the problem:
VMM, host and switch can still be limiting
Our work
A set of host, guest, and VMM modifications that can be used depending on operational constraints:
1. proper interrupt moderation (VMM only)
2. send combining (guest only)
3. e1000 paravirtualization (host-guest)
btw, FreeBSD's DEVICE_POLLING (Rizzo 2001) is almost as good as paravirtualization!
4. use VALE instead of the linux bridge (host, VMM)
5. clean up the backend-frontend datapath (VMM)
Send combining
Send combining reduces VM exits on TX:
when there are pending interrupts, postpone transmissions until the next interrupt arrives
especially useful when combined with moderation
IMPLEMENTATION: ~100 lines, only in the guest device driver
PERFORMANCE: TX rate, Kpps (guest-guest, TAP backend)

    Function       1 VCPU      2 VCPU
    ----------------------------------
    moderation:    24 -> 90    65 -> 87
    send comb.:    24 -> 23    65 -> 301
    both:          24 -> 322   65 -> 334
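A minimal sketch of the idea, assuming an e1000-style driver where each write to the TX tail register (TDT) causes a VM exit. The simulation harness and all names below are hypothetical, not the actual guest driver code.

```c
#include <assert.h>

/* Toy model of send combining in an e1000-style guest driver:
 * writing TDT causes an expensive VM exit, so while an interrupt is
 * already expected we only queue, and flush on the next interrupt. */

static int tdt_writes;     /* counts (expensive) simulated VM exits */
static int pending_tx;     /* packets queued but not yet announced  */
static int irq_expected;   /* an interrupt is already on its way    */

static void write_tdt(void) { tdt_writes++; pending_tx = 0; }

static void driver_xmit(void)
{
    pending_tx++;
    if (irq_expected)
        return;            /* send combining: defer the mmio write */
    write_tdt();
    irq_expected = 1;      /* device will interrupt when TX is done */
}

static void driver_interrupt(void)
{
    irq_expected = 0;
    if (pending_tx) {      /* flush packets combined in the meantime */
        write_tdt();
        irq_expected = 1;
    }
}
```

With this scheme a burst of transmissions costs one VM exit up front plus one per interrupt, instead of one per packet, which is why it pairs so well with interrupt moderation.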
e1000 paravirtualization
Reduce VM exits using a shared memory mailbox (CSB) between the VCPU and the I/O thread:
the first I/O or interrupt acts as a "kick" to wake up a process on the other side;
later, guest driver and I/O thread chase each other, exchanging read/write positions through the CSB;
data buffers and descriptors are in shared memory (guest physical, host virtual)
IMPLEMENTATION
no need to create a brand new device, easy add-on for any modern NIC
~100 lines in the guest driver, ~100 lines in the frontend
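The kick/CSB protocol can be sketched as a single-threaded toy simulation. Field and function names are invented and the real CSB layout differs; the point is only that work is published through shared memory, and the expensive kick happens solely when the peer has gone idle.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the CSB mailbox: guest and host exchange ring indices
 * through shared memory; a "kick" (VM exit or interrupt) is needed
 * only to wake a peer that has stopped polling. */

struct csb {
    uint32_t guest_head;   /* written by guest: packets posted so far */
    uint32_t host_tail;    /* written by host: packets consumed so far */
    int host_running;      /* host I/O thread is still polling the CSB */
};

static int kicks;          /* counts simulated expensive wakeups */

static void guest_post(struct csb *c)
{
    c->guest_head++;               /* publish work via shared memory */
    if (!c->host_running) {        /* peer is idle: one kick wakes it */
        kicks++;
        c->host_running = 1;
    }
}

static void host_poll(struct csb *c)
{
    while (c->host_tail != c->guest_head)
        c->host_tail++;            /* consume ("transmit") packets */
    c->host_running = 0;           /* nothing left: go back to sleep */
}
```

At high packet rates the two sides keep chasing each other, `host_running` stays set, and kicks (like interrupts) almost completely disappear, matching the behavior described on the next slide.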
e1000 paravirtualization performance
almost completely removes kicks and interrupts as the packet rate increases
24 -> 492 Kpps (1 VCPU), 65 -> 507 Kpps (2 VCPU)
performance equivalent to that of virtio
does not incur the latency of interrupt moderation
reliability/portability/stability advantages
Cleaning up the hypervisor datapath
Data moves from frontend -> backend -> switch:
access guest descriptors and buffers
copy to hypervisor memory (backend)
pass to the switch
Initial throughput was limited to ~3 Mpps even before going through the switch:
removed unnecessary address translation
optimized copy routine
Current code can drive the switch at 10 Mpps
Guest-guest, TAP, sockets (TX rate, Kpps)

   itr \      normal         send_combining     paravirt.
            1 cpu  2 cpu     1 cpu  2 cpu     1 cpu  2 cpu
      0      24     65        23    301        492    507
      1      22     68        23    303
    100      80     79       322    334
   1000      90     87       293    323
Guest-guest, VALE, sockets (TX rate, Kpps; for each itr value, first row: TAP, second row: VALE)

   itr \      normal         send_combining     paravirt.
            1 cpu  2 cpu     1 cpu  2 cpu     1 cpu  2 cpu
      0      24     65        23    301        492    507
             27    112        27    650       1080   1080
      1      22     68        23    303
             25     97        24    670
    100      80     79       322    334
            129    125       850    860
   1000      90     87       293    323
            147    140       960    930

(~500 Kpps with 1500-byte frames)
~5 Mpps or 25 Gbit/s with netmap within the guest
Evolution of the netmap architecture
After this experience, we decided to add features that proved useful, while trying to avoid feature bloat or additions that impact performance:
"transparent mode" for netmap: packets not marked by the user application are passed to the other side
direct connection between VALE and NIC (as a real host bridge)
programmable destination lookup function in VALE (now we can hook to OVS)
indirect buffers and scatter-gather I/O in netmap (optimize bulk I/O)
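The programmable destination lookup could, for instance, take the shape of a replaceable function pointer that the bridge calls instead of its built-in learning-bridge lookup. The sketch below is purely illustrative and does not reflect the actual netmap/VALE interface; all names are invented.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical programmable-lookup hook for a VALE-like switch:
 * the bridge consults a replaceable policy function, so an external
 * module (e.g. an OVS datapath) can decide the destination port. */

typedef int (*lookup_fn)(const uint8_t *dst_mac, int src_port);

static lookup_fn bridge_lookup;     /* current forwarding policy */

/* Example policy: hard-wire ports 0 and 1 to each other. */
static int static_pair(const uint8_t *dst_mac, int src_port)
{
    (void)dst_mac;                  /* this policy ignores the MAC */
    return src_port == 0 ? 1 : 0;
}

/* The switch core only asks the hook where the packet goes;
 * the caller then copies the packet to that port. */
static int forward_one(const uint8_t *dst_mac, int src_port)
{
    return bridge_lookup(dst_mac, src_port);
}
```

Keeping the switch core oblivious to the policy is what makes hooking OVS (or any other classifier) possible without touching the fast path.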
Acknowledgements
Many contributions come from colleagues and former students:
Matteo Landi, Gaetano Catalli, Marta Carbone, Giuseppe Lettieri, Vincenzo Maffione, Michio Honda
Funding from EU Projects (CHANGE, OPENLAB) and companies (Netapp, Google).
http://info.iet.unipi.it/~luigi/vale/
[email protected]