Vyatta Network OS (vRouter)
March 1, 2017 SV Linux Users Group, San Jose, CA
Robyn Gutierrez, Sven-Thorsten Dietrich, Becca Nitzan
© 2017 Brocade Communications Systems, Inc.
Topics

• High Level Architecture Overview and Constraints
• Forwarding Performance Using Intel 2690 V2
  • Topo
  • Interface load distribution
    • No load, one flow
    • Multiple imbalanced flows
    • Multiple balanced flows
    • Multiple balanced flows with interface affinity configs
  • Hugepages
  • Know your HW - limitations when you're least expecting it!
    • then there's inter versus intra NIC
    • and power
• Forwarding Performance Using Intel 2690 V3
  • Topo
  • Tuning comparisons
• Out of the zone, and into the hack
Vyatta High Level Architecture

[Diagram: Vyatta High Performance User-Space Networking Architecture]
• Control Plane: CLI, REST, Netconf, GUI, Script API, AAA, Routing Protocols, Hybrid DevOps Data Model, vPlaned, FIB, Shadow Interfaces, Session State
• Data Plane (vPlane), on DPDK: IPv4/IPv6 Unicast, Firewall, Encrypt / Decrypt, Tunnels (GRE, mGRE), Multicast, QoS, NAT, etc.
• Linux Kernel: UIO / VFIO, AF_PACKET, Shadow Interfaces
• Hardware / Virtualization: Multi-Queue NICs (up to 40Gb), Storage, Console, USB, WAN
Why User-Space / Kernel Offload
Dataplane (basic) Packet Service Architecture

[Diagram: Vyatta High Performance User-Space Networking Architecture]
• Data Plane packet forwarder threads (DPDK): one per CPU ("CPU 1 pkt fwd" … "CPU 3 pkt fwd")
• Linux Kernel: UIO / VFIO
• Hardware / Virtualization: CPU0-CPU3, NIC1-NIC6
Packet Service Timing
Packet arrival / transmit average periods

Link Speed   Frame Size   Time / Packet
1 G          64           640 ns
10 G         64           64 ns
40 G         64           16 ns
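As a sanity check (not from the slide), the per-packet budget can be recomputed from the full 84-byte wire footprint of a minimum frame (64-byte frame + 8-byte preamble + 12-byte interframe gap); the table above rounds these figures down:

```shell
# Time on the wire per minimum-sized packet at 1G / 10G / 40G.
wire_bits=$((84 * 8))                                  # 672 bits per packet
t1g=$(awk -v b="$wire_bits" 'BEGIN { printf "%.1f", b / 1 }')
t10g=$(awk -v b="$wire_bits" 'BEGIN { printf "%.1f", b / 10 }')
t40g=$(awk -v b="$wire_bits" 'BEGIN { printf "%.1f", b / 40 }')
echo "1 G: ${t1g} ns   10 G: ${t10g} ns   40 G: ${t40g} ns"
```

So at 40G a forwarder has well under 20 ns per packet, which is why every cache miss and lock matters.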
Host and Hardware Tuning for Optimal Forwarding Performance
Nominal Forwarding Performance
Using a simple topology for vRouter performance analysis, with an Intel E5-2690 v2:
host_u5> grep "model name" /proc/cpuinfo | uniq
model name : Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
[Test topology]
• Spirent ports 11/2 and 11/3, 10G each (traffic)
• Host OS: Ubuntu 14.04.5; hypervisor: KVM
• Host NICs p1p1 / p2p1 passed through via SR-IOV (traffic); em1 bridged (mgmt)
• vRouter VM: 8 vCPUs, 8G RAM; interfaces dp0s5 / dp0s6 (traffic), dp0s2 (mgmt)
CLI output of the vRouter interface-to-CPU mapping, under no load and with one flow:

No load:
  Dataplane CPU activity
  Core  Interface  RX Rate    Interface  TX Rate
  --------------------------------------------------------
  1     dp0s5      0          [crypt]    0
  2     dp0s6      0          dp0s5      0
  3     dp0s6      0
  4     dp0s6      0
  5     dp0s2      6
  6     dp0s2      2
  7     dp0s5      0

One flow:
  Dataplane CPU activity
  Core  Interface  RX Rate    Interface  TX Rate
  --------------------------------------------------------
  1     dp0s5      0          [crypt]    0
  2     dp0s6      7.4M       dp0s5      7.4M
  3     dp0s6      0
  4     dp0s6      7.4M
  5     dp0s2      5
  6     dp0s2      2
  7     dp0s5      7.4M
With multiple flows per direction, two RX queues per interface are used, improving performance to 20.8 Mpps.

  Dataplane CPU activity
  Core  Interface  RX Rate    Interface  TX Rate
  --------------------------------------------------------
  1     dp0s5      3.5M       [crypt]    0
  2     dp0s6      6.9M       dp0s5      10.4M
  3     dp0s6      3.5M
  4     dp0s6      10.4M
  5     dp0s2      5
  6     dp0s2      1
  7     dp0s5      6.9M
However:
1) there are only 3 flows per direction, leading to a statistically imbalanced load
2) cpus 1 and 3 carry 3.5 Mpps, whereas cpus 2 and 7 carry 6.9 Mpps
With a statistically balanced number of flows, CPU load is more evenly distributed and performance rises to 23.2 Mpps.

  Dataplane CPU activity
  Core  Interface  RX Rate    Interface  TX Rate
  --------------------------------------------------------
  1     dp0s5      5.8M       [crypt]    0
  2     dp0s6      5.8M       dp0s5      11.6M
  3     dp0s6      5.8M
  4     dp0s6      11.6M
  5     dp0s2      6
  6     dp0s2      2
  7     dp0s5      5.8M
While taking into account:
1) mgmt interface dp0s2 is assigned to 2 cores (only 1 is needed for this test)
2) the crypto thread is sharing a CPU with a forwarding interface
3) these can be adjusted via configs for better performance
A brief diversion: recall what's in a minimum-sized IP packet over 10G Ethernet.

[Frame layout on the wire]
  preamble:         8 bytes
  interframe gap:  12 bytes
  mac destination:  6 bytes
  mac source:       6 bytes
  ether type:       2 bytes
  IP (minimum):    46 bytes
  CRC:              4 bytes

Consider:
• 84 bytes total taken up on the wire per minimum-sized IP packet
• theoretical max pps per direction is 14,880,952, ~29.76 Mpps total bidirectional
• ~70 bytes of overhead per packet, 26 bytes of data
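The 14,880,952 figure falls straight out of the 84-byte wire footprint; a one-liner check (not from the slides):

```shell
# Theoretical max pps in one direction on 10G Ethernet.
wire_bytes=$((8 + 12 + 6 + 6 + 2 + 46 + 4))   # preamble+IFG+L2 header+min IP+CRC = 84
max_pps=$((10000000000 / (wire_bytes * 8)))
echo "84-byte wire footprint -> ${max_pps} pps per direction"
```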
For a deterministic interface/CPU mapping, affinity bits can be configured per interface; as a result, rates approach line rate at 28.4 Mpps.

  Dataplane CPU activity
  Core  Interface  RX Rate    Interface  TX Rate
  --------------------------------------------------------
  1     dp0s2      6          dp0s2      1
        [crypt]    0
  2     dp0s5      7.1M
  3     dp0s5      7.1M
  4     dp0s5      14.2M
  5     dp0s6      7.1M
  6     dp0s6      7.1M
  7     dp0s6      14.2M

Notes:
• forwarding interfaces dp0s5 and dp0s6 each use 3 distinct CPUs (no longer overlapping cpu 1 or cpu 2)
• mgmt interface dp0s2 now shares 1 CPU with the crypto thread
• cpu 0 is retained for the control plane
• see backup slides for a vRouter config example
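For reference, an interface-affinity config along these lines produces the mapping above. This is an illustrative sketch only: the exact CLI syntax and keyword vary by vRouter release, and the real example is in the deck's backup slides.

```
set interfaces dataplane dp0s2 cpu-affinity 1
set interfaces dataplane dp0s5 cpu-affinity 2-4
set interfaces dataplane dp0s6 cpu-affinity 5-7
commit
```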
Making the load evenly distributed across multiple RX queues can more than double pps throughput.
Hugepages can impact performance by more than 50%
→ 11.3 Mpps without hugepages
→ 28.4 Mpps with hugepages

Host memory info:

u5_hm> cat /proc/meminfo | grep -i huge
AnonHugePages:     28672 kB
HugePages_Total:     120
HugePages_Free:      112    ← VM is using 8G
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB

u5_hm> free -g
              total   used   free   shared  buffers  cached
Mem:           157G   122G    34G     2.3M      73M    1.1G
-/+ buffers/cache:    121G    36G
Swap:           63G     0B    63G
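For reference, 1G hugepages like these are reserved on the host kernel command line at boot. The parameter names are standard Linux; the count of 120 matches this host, and the libvirt stanza is an illustrative sketch:

```
# /etc/default/grub on the host: reserve 120 x 1G hugepages
GRUB_CMDLINE_LINUX_DEFAULT="default_hugepagesz=1G hugepagesz=1G hugepages=120"
# then run update-grub and reboot

# and back the VM's memory with them, e.g. in the libvirt domain XML:
#   <memoryBacking>
#     <hugepages/>
#   </memoryBacking>
```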
Inter-NIC performance can be better than intra-NIC.

• Port-to-port on different NICs can be up to 10% better than port-to-port on the same NIC

[Diagram: server with a two-port NIC in each of slots 1-3; "do this": spread the two traffic ports across different NICs; "don't do that": use both ports of the same NIC]
Weird hardware limitations present themselves when you're least expecting it.

Here's one:
"Place low latency or high performing PCI-e card in slot 1, 2, 4, 5 or 6 (depending on the type of secondary riser board that might be installed)."

Ok... I guess we should avoid slot 3 then.

Yes indeed, we should avoid it:
→ bare metal, traffic bidirectional with min IPv4 packet size:
  - using slots 1 and 3: ~20 Mpps
  - using slots 1 and 2: ~29 Mpps (100% line rate)
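Slot restrictions like this usually trace back to PCIe lane width or NUMA locality; a read-only sysfs survey (standard Linux paths) shows where each NIC landed:

```shell
# For each PCI device that exposes a network interface, print its
# NUMA node and negotiated PCIe link width; a narrow link or a
# remote NUMA node is a likely culprit for a slow slot.
for dev in /sys/bus/pci/devices/*; do
  [ -d "$dev/net" ] || continue                       # NICs only
  node=$(cat "$dev/numa_node" 2>/dev/null)
  width=$(cat "$dev/current_link_width" 2>/dev/null)
  echo "$(basename "$dev"): numa_node=${node:-?} link_width=x${width:-?}"
done
```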
It's really just HW dependent, but for this particular case:

[Diagram repeated: "do this": NICs in slots 1 and 2; "don't do that": slot 3]

And, just to quote myself, we're back to this:
"Ironically, knowledge of host HW is mandatory for SDN configs where forwarding performance is concerned"
Power is another one; in particular, redundant power.

• With some HW, redundant power is essential
• Without it, periodic drops seriously impact performance
• Adding redundant power improved performance by 50%

• BIOS setting changes:
  • Dynamic Power Savings Mode → Static High Performance

And there are many other tweaks that can be made; only the ones with the biggest impact are included here.
A few simple HW adjustments can double pps throughput
Different performance tuning parameters matter when using the Intel E5-2690 V3.
[Test topology]
• Spirent, two 10G ports (traffic)
• Host OS: Debian 8.7; KVM/QEMU: KVM 2.1.2; libvirt: 1.2.9
• vRouter VM: 10 vCPUs, 16G RAM; interfaces dp0s6 / dp0s7 (traffic), em1 bridged (mgmt)
The tuning items that mattered on the V2-chipset platform are not as apparent on a more recent platform using the V3:

• The biggest impact came from pinning vCPUs to hyperthreaded siblings (a negative test): ~18% hit
• PCI passthrough versus SR-IOV: ~0.05% hit
• With versus without hugepages: in the noise, ~0.0004% hit
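The sibling pairs to avoid pinning across can be read straight from sysfs (standard Linux paths; the exact list is host-specific):

```shell
# One line per physical core: its hyperthread sibling list.
# Forwarding vCPUs should be pinned to entries from different lines,
# never to two threads of the same core.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
  cat "$cpu/topology/thread_siblings_list" 2>/dev/null
done | sort -u
```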
Tuning may or may not matter; it depends on the host HW.

PCI_PT => PCI Passthrough
SRIOV => Single Root I/O Virtualization
Optimal Forwarding Performance
Software Tuning for Optimal Forwarding Performance
Vyatta High-Performance Architecture
• NUMA / memory-bandwidth aware
• CPU topology aware
• Minimal TLB footprint / huge pages
• Tickless kernel
• No system calls or context switches
• Zero-copy
• Lockless fast table lookup and updates
• Real-Time processing to avoid packet drops ???
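Several of those bullets lean on host kernel boot parameters. A typical set for dedicating CPUs 1-3 to forwarding looks like this; the parameter names are standard Linux, and the CPU list is just an example matching the 4-CPU diagrams that follow:

```
isolcpus=1-3      # keep the scheduler from placing other tasks on CPUs 1-3
nohz_full=1-3     # tickless operation on the forwarding CPUs
rcu_nocbs=1-3     # offload RCU callbacks away from the forwarding CPUs
```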
Vyatta Real-Time Network Packet Processing
• What’s Real Time?
A: The working definition of a real-time system is "the delay between an event and the program response is known and bounded."
• Perfect -- this is exactly what we need! What could possibly go wrong?
A: Programming is a skill best acquired by practice and example rather than from books. – A. Turing
Vyatta Real-Time Network Packet Processing
• Becca: "when I drive 64 byte packets into the NICs at line-rate, my SSH session locks up and OSPF flaps"
• Sven: "Excellent. This proves that the real-time scheduler is working exactly as designed: all CPU cycles are devoted to forwarding packets."
• Becca: grumble…
Vyatta Real-Time Network Packet Processing
• Sven: "This is why we reserve CPU0 for control plane processes. That way the admin console always remains responsive."
• Becca: "I configured an admin console and it locks up too. And my SSH session is on the admin network and that locks up."
• Sven: "Are you sure you aren't driving traffic at line rate on the admin network, causing packets to be dropped and TCP timeouts?"
• Becca: "Yes. Files bug: Router won't boot, reboot & console / ssh non-responsive if high rate traffic is running."
Vyatta Real-Time Debugging

[ 242.150195] INFO: task sshd:6404 blocked for more than 120 seconds.
[ 242.225355]       Not tainted 3.14.51-1-amd64-vyatta #1
[ 242.288038] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
…
[ 242.382015] Call Trace:
[ 242.382022]  [<ffffffff811ed61a>] ? wait_transaction_locked+0x7a/0xb0
[ 242.382025]  [<ffffffff81092140>] ? finish_wait+0x90/0x90
[ 242.382029]  [<ffffffff811ed961>] ? start_this_handle+0x261/0x560
[ 242.382032]  [<ffffffff8114da6d>] ? __inode_permission+0x2d/0xb0
[ 242.382036]  [<ffffffff811b129f>] ? ext4_file_open+0x6f/0x1b0
[ 242.382039]  [<ffffffff8114da6d>] ? __inode_permission+0x2d/0xb0
[ 242.382043]  [<ffffffff811edf38>] ? jbd2__journal_start+0x128/0x1c0
[ 242.382046]  [<ffffffff811bc63c>] ? ext4_dirty_inode+0x2c/0x80
[ 242.382049]  [<ffffffff8116b009>] ? __mark_inode_dirty+0x39/0x240
[ 242.382052]  [<ffffffff8115cf09>] ? update_time+0x89/0xe0
[ 242.382055]  [<ffffffff8115a1ea>] ? dput+0x1a/0x110
[ 242.382057]  [<ffffffff8115cffd>] ? file_update_time+0x9d/0x100
[ 242.382059]  [<ffffffff811516c0>] ? do_last+0x2d0/0xf10
[ 242.382063]  [<ffffffff810e64ba>] ? __generic_file_aio_write+0x19a/0x3e0
[ 242.382065]  [<ffffffff810e675e>] ? generic_file_aio_write+0x5e/0xe0
[ 242.382068]  [<ffffffff811b1cae>] ? ext4_file_write+0xce/0x420
[ 242.382070]  [<ffffffff8118bf40>] ? __posix_lock_file+0x210/0x530
[ 242.382073]  [<ffffffff81142c8a>] ? do_sync_write+0x5a/0x90
[ 242.382075]  [<ffffffff811437fd>] ? vfs_write+0xbd/0x1f0
[ 242.382077]  [<ffffffff81143d0b>] ? SyS_write+0x4b/0xb0
[ 242.382080]  [<ffffffff81517ee7>] ? tracesys+0xdd/0xe2
Vyatta Real-Time Debugging

[ 241.917265] INFO: task vbash:6403 blocked for more than 120 seconds.
[ 241.993460]       Not tainted 3.14.51-1-amd64-vyatta #1
[ 242.150118] Call Trace:
[ 242.150122]  [<ffffffff81170c00>] ? do_thaw_one+0x60/0x60
[ 242.150125]  [<ffffffff81513c18>] ? io_schedule+0x88/0xd0
[ 242.150127]  [<ffffffff81170c09>] ? sleep_on_buffer+0x9/0x10
[ 242.150130]  [<ffffffff81514292>] ? __wait_on_bit+0x52/0x80
[ 242.150133]  [<ffffffff812271a8>] ? submit_bio+0x68/0x130
[ 242.150135]  [<ffffffff81170c00>] ? do_thaw_one+0x60/0x60
[ 242.150138]  [<ffffffff8151433c>] ? out_of_line_wait_on_bit+0x7c/0xa0
[ 242.150142]  [<ffffffff810921a0>] ? wake_atomic_t_function+0x30/0x30
[ 242.150149]  [<ffffffffa003a3a0>] ? squashfs_read_data+0x3a0/0x690 [squashfs]
[ 242.150153]  [<ffffffffa003a7f3>] ? squashfs_cache_get+0x163/0x3a0 [squashfs]
[ 242.150156]  [<ffffffffa003bb0c>] ? squashfs_readpage+0xac/0x8e0 [squashfs]
[ 242.150159]  [<ffffffff810ee0c8>] ? __alloc_pages_nodemask+0x158/0xaa0
[ 242.150163]  [<ffffffff810e5844>] ? add_to_page_cache_locked+0xc4/0x190
[ 242.150165]  [<ffffffff810f1208>] ? __do_page_cache_readahead+0x198/0x200
[ 242.150168]  [<ffffffff810f13ab>] ? ondemand_readahead+0x13b/0x2b0
[ 242.150170]  [<ffffffff810e5983>] ? pagecache_get_page+0x33/0x1e0
[ 242.150173]  [<ffffffff810e76e6>] ? generic_file_aio_read+0x4b6/0x6d0
[ 242.150176]  [<ffffffff81512ed5>] ? schedule_timeout+0x1c5/0x230
[ 242.150178]  [<ffffffff811428fa>] ? do_sync_read+0x5a/0x90
[ 242.150181]  [<ffffffff811439d5>] ? vfs_read+0xa5/0x180
[ 242.150184]  [<ffffffff81148a51>] ? kernel_read+0x41/0x60
[ 242.150187]  [<ffffffff8114a96b>] ? do_execve_common.isra.35+0x45b/0x610
[ 242.150190]  [<ffffffff8114ad27>] ? SyS_execve+0x27/0x40
[ 242.150193]  [<ffffffff81518289>] ? stub_execve+0x69/0xa0
Vyatta Real-Time Debugging

[ 241.442768] INFO: task auditd:4806 blocked for more than 120 seconds.
[ 241.519914]       Not tainted 3.14.51-1-amd64-vyatta #1
[ 241.676300] Call Trace:
[ 241.676306]  [<ffffffff811ed61a>] ? wait_transaction_locked+0x7a/0xb0
[ 241.676308]  [<ffffffff81015b95>] ? sched_clock+0x5/0x10
[ 241.676312]  [<ffffffff81092140>] ? finish_wait+0x90/0x90
[ 241.676314]  [<ffffffff811ed961>] ? start_this_handle+0x261/0x560
[ 241.676316]  [<ffffffff81015b95>] ? sched_clock+0x5/0x10
[ 241.676318]  [<ffffffff810845f6>] ? get_vtime_delta+0x16/0x80
[ 241.676320]  [<ffffffff81015b95>] ? sched_clock+0x5/0x10
[ 241.676322]  [<ffffffff81015b3d>] ? native_sched_clock+0x2d/0x80
[ 241.676324]  [<ffffffff81015b95>] ? sched_clock+0x5/0x10
[ 241.676326]  [<ffffffff81084ede>] ? arch_vtime_task_switch+0x6e/0x90
[ 241.676328]  [<ffffffff811edf38>] ? jbd2__journal_start+0x128/0x1c0
[ 241.676333]  [<ffffffff811bc63c>] ? ext4_dirty_inode+0x2c/0x80
[ 241.676335]  [<ffffffff8116b009>] ? __mark_inode_dirty+0x39/0x240
[ 241.676338]  [<ffffffff8115cf09>] ? update_time+0x89/0xe0
[ 241.676340]  [<ffffffff8115cffd>] ? file_update_time+0x9d/0x100
[ 241.676344]  [<ffffffff810e64ba>] ? __generic_file_aio_write+0x19a/0x3e0
[ 241.676346]  [<ffffffff810e675e>] ? generic_file_aio_write+0x5e/0xe0
[ 241.676349]  [<ffffffff811b1cae>] ? ext4_file_write+0xce/0x420
[ 241.676354]  [<ffffffff810b4a18>] ? do_futex+0x128/0xb10
[ 241.676357]  [<ffffffff8126978c>] ? __percpu_counter_sum+0x6c/0x80
[ 241.676359]  [<ffffffff811c5e12>] ? ext4_statfs+0x112/0x160
[ 241.676362]  [<ffffffff81142c8a>] ? do_sync_write+0x5a/0x90
[ 241.676364]  [<ffffffff811437fd>] ? vfs_write+0xbd/0x1f0
[ 241.676366]  [<ffffffff81143d0b>] ? SyS_write+0x4b/0xb0
[ 241.676369]  [<ffffffff81517ee7>] ? tracesys+0xdd/0xe2
Control / Data Plane Task Scheduling (Typical)

[Diagram: control plane tasks (audit, bash, bgp, rib, ospf, sshd, kworker/0-3) scheduled on CPU0; data plane packet forwarder threads pinned to CPU1-CPU3]
Control Task Migrates, Starts I/O on DP CPU

[Diagram: sshd has migrated off CPU0 onto a data plane CPU and started I/O while the forwarder there was idle]
Control Task Preempted and Indefinitely Blocked

[Diagram: traffic resumes; the packet forwarder preempts the migrated sshd, which now blocks indefinitely while holding I/O locks]
Vyatta Real-Time Debugging
Priority Live Lock!
1. Data plane forwarder on a CPU (1, 2 or 3) goes idle (traffic gap)
2. Scheduler moves a control process from CPU 0 to 1, 2, or 3
3. Control plane task does I/O (e.g. writes a log entry), blocks
4. Data plane forwarder goes active (traffic resumes)
5. Control process chain holds I/O locks, but the data plane forwarder does no I/O. No opportunity for PI or migration: other processes pile up on the same lock.
6. Solution: Stop traffic? Drop real-time?
Vyatta Packet Forwarder Performance
• RT scheduler requires special handling - see above
• So what happens if we just drop real-time?
Under a heavy 29 Mpps load the system gets "into the zone": fewer sleep/wake cycles make it better.

Which brings up a short diversion into IMIX distributions and pps:

→ Some IMIXes have a higher percentage of large packets; this one is ~1.1 Mpps on a 10G:
  imix_dnload_l3_pktsize = [48, 128, 256, 576, 1500]
  imix_dnload_l3_weight  = [25,   5,   3,   2,   65]

→ And some IMIXes have a higher percentage of small packets; this one is ~3.1 Mpps on a 10G:
  imix_upload_l3_pktsize = [48, 128, 256, 576, 1500]
  imix_upload_l3_weight  = [70,   5,   3,   2,   20]
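Those Mpps figures follow from the weighted average L3 size plus 38 bytes of per-packet Ethernet overhead (14-byte header + 4-byte CRC + 20 bytes preamble/IFG); a quick check with a small helper (the function name is mine):

```shell
# pps at 10G line rate for a weighted mix of L3 packet sizes,
# given as size:weight pairs.
imix_pps() {
  echo "$@" | tr ' ' '\n' | awk -F: '
    { bytes += $1 * $2; weight += $2 }
    END { avg = bytes / weight + 38        # add L2 header/CRC + preamble/IFG
          printf "%.2f Mpps\n", 10e9 / (avg * 8) / 1e6 }'
}
imix_pps 48:25 128:5 256:3 576:2 1500:65   # download mix -> 1.19 Mpps
imix_pps 48:70 128:5 256:3 576:2 1500:20   # upload mix   -> 3.15 Mpps
```

The results land on the slide's ~1.1 and ~3.1 Mpps figures.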
So when drops happen on IMIX, alarms go off.

We mostly run performance regressions with high pps rates and other loads, such as NAT, firewall, etc. The heavy lifters.

But it's a mistake to ignore the lightweights. For example, one issue, seemingly out of the blue, showed drops while running IMIX (this is forbidden):

%rate  %drop 1_flow  %drop 10_flow  %drop 1_flow up_dn_imix_same  %drop 10_flow up_dn_imix_same
-----  ------------  -------------  ----------------------------  -----------------------------
   20         0.000          0.018                         0.013                          0.069
   40         0.000          0.008                         0.003                          0.009
   60         0.000          0.000                         0.002                          0.066
   80         0.003          0.000                         0.000                          0.016
  100         0.000          0.001                         0.000                          0.008
…and the journey continues (Sven)!
Vyatta Packet Forwarder Performance
• By dropping real-time we encountered "scheduling fairness"
• The CFS scheduler penalizes CPU-bound SCHED_OTHER processes
• This would lead to the packet drops Becca observed
Control / Data Plane Task Scheduling (Typical)

[Diagram repeated from earlier: control plane tasks on CPU0; packet forwarder threads on CPU1-CPU3]
Dynamic Control Plane / Data Plane Resourcing

[Diagram: the control plane now spans CPU 0+1 (audit, bash, rib, bgp, ospf, sshd, kworkers); data plane packet forwarders run on CPU 2+3]
What’s New in OSS and Vyatta R&D?
Networking: 100 G and beyond
• No lack of small packets (Twitter, SMS, IoT messaging)
• More network queues and associated spectrum provisioned to drive network traffic

Processor Silicon
• CPU clock speed is on the long tail of the asymptote
• Semiconductor process shrinks are approaching single-atom wires
• Architecture tweaks gain in the 5-10% range per revision
• Moore's law per core count maps to software performance via Amdahl's law

Linux Kernel
• Context switch / system call overhead is essentially static
• The Unix socket / memory-copy model is not scaling at 10G; sidelined at 40 and 100
• Offload libraries replicate driver and netstack code; a long-game solution is needed
All Done!
Thank You
Sven-Thorsten Dietrich   [email protected]
Robyn Gutierrez          [email protected]
Becca Nitzan             [email protected]