Vyatta Network OS (vRouter)
March 1, 2017 SV Linux Users Group, San Jose, CA
Robyn Gutierrez, Sven-Thorsten Dietrich, Becca Nitzan
© 2017 Brocade Communications Systems, Inc.
Topics

• High Level Architecture Overview and Constraints
• Forwarding Performance Using Intel 2690 V2
  • Topo
  • Interface load distribution
    • No load, one flow
    • Multiple imbalanced flows
    • Multiple balanced flows
    • Multiple balanced flows with interface affinity configs
  • Hugepages
  • Know your HW - limitations when you're least expecting it!
    • then there's inter versus intra NIC
    • and power
• Forwarding Performance Using Intel 2690 V3
  • Topo
  • Tuning comparisons
• Out of the zone, and into the hack
Vyatta High Level Architecture

[Diagram: Vyatta High Performance User-Space Networking Architecture]
• Control Plane: CLI, REST, Netconf, GUI, Script API, AAA, Routing Protocols, Hybrid DevOps Data Model, vPlaned, FIB, Shadow Interfaces, Session State
• Data Plane (vPlane), on DPDK: IPv4/IPv6 Unicast, Firewall, Encrypt / Decrypt, Tunnels (GRE, mGRE), Multicast, QoS, NAT, etc.
• Linux Kernel: UIO / VFIO, AF_PACKET, Shadow Interfaces
• Hardware / Virtualization: Multi-Queue NICs (up to 40Gb), Storage, Console, USB, WAN
Why User-Space / Kernel Offload
Dataplane (basic) Packet Service Architecture

[Diagram: Vyatta High Performance User-Space Networking Architecture]
• Data Plane packet forwarder threads (DPDK): one per CPU ("CPU 1 pkt fwd" … "CPU 3 pkt fwd")
• Linux Kernel: UIO / VFIO
• Hardware / Virtualization: CPU0-CPU3, NIC1-NIC6
Packet Service Timing
Packet arrival / transmit average periods

Link Speed   Frame Size   Time / Packet
1 G          64           640 ns
10 G         64           64 ns
40 G         64           16 ns
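As a sanity check (not from the slide), the per-packet budget can be recomputed from the full 84-byte wire footprint of a minimum frame (64-byte frame + 8-byte preamble + 12-byte interframe gap); the table above rounds these figures down:

```shell
# Time on the wire per minimum-sized packet at 1G / 10G / 40G.
wire_bits=$((84 * 8))                                  # 672 bits per packet
t1g=$(awk -v b="$wire_bits" 'BEGIN { printf "%.1f", b / 1 }')
t10g=$(awk -v b="$wire_bits" 'BEGIN { printf "%.1f", b / 10 }')
t40g=$(awk -v b="$wire_bits" 'BEGIN { printf "%.1f", b / 40 }')
echo "1 G: ${t1g} ns   10 G: ${t10g} ns   40 G: ${t40g} ns"
```

So at 40G a forwarder has well under 20 ns per packet, which is why every cache miss and lock matters.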
Host and Hardware Tuning for Optimal Forwarding Performance
Nominal Forwarding Performance
Using a simple topology for vRouter performance analysis, with an Intel E5-2690 v2:
host_u5> grep "model name" /proc/cpuinfo | uniq
model name : Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
[Test topology]
• Spirent ports 11/2 and 11/3, 10G each (traffic)
• Host OS: Ubuntu 14.04.5; hypervisor: KVM
• Host NICs p1p1 / p2p1 passed through via SR-IOV (traffic); em1 bridged (mgmt)
• vRouter VM: 8 vCPUs, 8G RAM; interfaces dp0s5 / dp0s6 (traffic), dp0s2 (mgmt)
CLI output of the vRouter interface-to-CPU mapping, under no load and with one flow:

No load:
  Dataplane CPU activity
  Core  Interface  RX Rate    Interface  TX Rate
  --------------------------------------------------------
  1     dp0s5      0          [crypt]    0
  2     dp0s6      0          dp0s5      0
  3     dp0s6      0
  4     dp0s6      0
  5     dp0s2      6
  6     dp0s2      2
  7     dp0s5      0

One flow:
  Dataplane CPU activity
  Core  Interface  RX Rate    Interface  TX Rate
  --------------------------------------------------------
  1     dp0s5      0          [crypt]    0
  2     dp0s6      7.4M       dp0s5      7.4M
  3     dp0s6      0
  4     dp0s6      7.4M
  5     dp0s2      5
  6     dp0s2      2
  7     dp0s5      7.4M
With multiple flows per direction, two RX queues per interface are used, improving performance to 20.8 Mpps.

  Dataplane CPU activity
  Core  Interface  RX Rate    Interface  TX Rate
  --------------------------------------------------------
  1     dp0s5      3.5M       [crypt]    0
  2     dp0s6      6.9M       dp0s5      10.4M
  3     dp0s6      3.5M
  4     dp0s6      10.4M
  5     dp0s2      5
  6     dp0s2      1
  7     dp0s5      6.9M
However:
1) there are only 3 flows per direction, leading to a statistically imbalanced load
2) cpus 1 and 3 carry 3.5 Mpps, whereas cpus 2 and 7 carry 6.9 Mpps
With a statistically balanced number of flows, CPU load is more evenly distributed and performance rises to 23.2 Mpps.

  Dataplane CPU activity
  Core  Interface  RX Rate    Interface  TX Rate
  --------------------------------------------------------
  1     dp0s5      5.8M       [crypt]    0
  2     dp0s6      5.8M       dp0s5      11.6M
  3     dp0s6      5.8M
  4     dp0s6      11.6M
  5     dp0s2      6
  6     dp0s2      2
  7     dp0s5      5.8M
While taking into account:
1) mgmt interface dp0s2 is assigned to 2 cores (only 1 is needed for this test)
2) the crypto thread is sharing a CPU with a forwarding interface
3) these can be adjusted via configs for better performance
A brief diversion: recall what's in a minimum-sized IP packet over 10G Ethernet.

[Frame layout on the wire]
  preamble:         8 bytes
  interframe gap:  12 bytes
  mac destination:  6 bytes
  mac source:       6 bytes
  ether type:       2 bytes
  IP (minimum):    46 bytes
  CRC:              4 bytes

Consider:
• 84 bytes total taken up on the wire per minimum-sized IP packet
• theoretical max pps per direction is 14,880,952, ~29.76 Mpps total bidirectional
• ~70 bytes of overhead per packet, 26 bytes of data
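The 14,880,952 figure falls straight out of the 84-byte wire footprint; a one-liner check (not from the slides):

```shell
# Theoretical max pps in one direction on 10G Ethernet.
wire_bytes=$((8 + 12 + 6 + 6 + 2 + 46 + 4))   # preamble+IFG+L2 header+min IP+CRC = 84
max_pps=$((10000000000 / (wire_bytes * 8)))
echo "84-byte wire footprint -> ${max_pps} pps per direction"
```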
For a deterministic interface/CPU mapping, affinity bits can be configured per interface; as a result, rates approach line rate at 28.4 Mpps.

  Dataplane CPU activity
  Core  Interface  RX Rate    Interface  TX Rate
  --------------------------------------------------------
  1     dp0s2      6          dp0s2      1
        [crypt]    0
  2     dp0s5      7.1M
  3     dp0s5      7.1M
  4     dp0s5      14.2M
  5     dp0s6      7.1M
  6     dp0s6      7.1M
  7     dp0s6      14.2M

Notes:
• forwarding interfaces dp0s5 and dp0s6 each use 3 distinct CPUs (no longer overlapping cpu 1 or cpu 2)
• mgmt interface dp0s2 now shares 1 CPU with the crypto thread
• cpu 0 is retained for the control plane
• see backup slides for a vRouter config example
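For reference, an interface-affinity config along these lines produces the mapping above. This is an illustrative sketch only: the exact CLI syntax and keyword vary by vRouter release, and the real example is in the deck's backup slides.

```
set interfaces dataplane dp0s2 cpu-affinity 1
set interfaces dataplane dp0s5 cpu-affinity 2-4
set interfaces dataplane dp0s6 cpu-affinity 5-7
commit
```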
Making the load evenly distributed across multiple RX queues can more than double pps throughput.
Hugepages can impact performance by more than 50%
→ 11.3 Mpps without hugepages
→ 28.4 Mpps with hugepages

Host memory info:

u5_hm> cat /proc/meminfo | grep -i huge
AnonHugePages:     28672 kB
HugePages_Total:     120
HugePages_Free:      112    ← VM is using 8G
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB

u5_hm> free -g
              total   used   free   shared  buffers  cached
Mem:           157G   122G    34G     2.3M      73M    1.1G
-/+ buffers/cache:    121G    36G
Swap:           63G     0B    63G
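For reference, 1G hugepages like these are reserved on the host kernel command line at boot. The parameter names are standard Linux; the count of 120 matches this host, and the libvirt stanza is an illustrative sketch:

```
# /etc/default/grub on the host: reserve 120 x 1G hugepages
GRUB_CMDLINE_LINUX_DEFAULT="default_hugepagesz=1G hugepagesz=1G hugepages=120"
# then run update-grub and reboot

# and back the VM's memory with them, e.g. in the libvirt domain XML:
#   <memoryBacking>
#     <hugepages/>
#   </memoryBacking>
```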
Inter-NIC performance can be better than intra-NIC.

• Port-to-port on different NICs can be up to 10% better than port-to-port on the same NIC

[Diagram: server with a two-port NIC in each of slots 1-3; "do this": spread the two traffic ports across different NICs; "don't do that": use both ports of the same NIC]
Weird hardware limitations present themselves when you're least expecting it.

Here's one:
"Place low latency or high performing PCI-e card in slot 1, 2, 4, 5 or 6 (depending on the type of secondary riser board that might be installed)."

Ok... I guess we should avoid slot 3 then.

Yes indeed, we should avoid it:
→ bare metal, traffic bidirectional with min IPv4 packet size:
  - using slots 1 and 3: ~20 Mpps
  - using slots 1 and 2: ~29 Mpps (100% line rate)
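Slot restrictions like this usually trace back to PCIe lane width or NUMA locality; a read-only sysfs survey (standard Linux paths) shows where each NIC landed:

```shell
# For each PCI device that exposes a network interface, print its
# NUMA node and negotiated PCIe link width; a narrow link or a
# remote NUMA node is a likely culprit for a slow slot.
for dev in /sys/bus/pci/devices/*; do
  [ -d "$dev/net" ] || continue                       # NICs only
  node=$(cat "$dev/numa_node" 2>/dev/null)
  width=$(cat "$dev/current_link_width" 2>/dev/null)
  echo "$(basename "$dev"): numa_node=${node:-?} link_width=x${width:-?}"
done
```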
It's really just HW dependent, but for this particular case:

[Diagram repeated: "do this": NICs in slots 1 and 2; "don't do that": slot 3]

And, just to quote myself, we're back to this:
"Ironically, knowledge of host HW is mandatory for SDN configs where forwarding performance is concerned"
Power is another one; in particular, redundant power.

• With some HW, redundant power is essential
• Without it, periodic drops seriously impact performance
• Adding redundant power improved performance by 50%

• BIOS setting changes:
  • Dynamic Power Savings Mode → Static High Performance

And there are many other tweaks that can be made; only the ones with the biggest impact are included here.
A few simple HW adjustments can double pps throughput
Different performance tuning parameters matter when using the Intel E5-2690 V3.
[Test topology]
• Spirent, two 10G ports (traffic)
• Host OS: Debian 8.7; KVM/QEMU: KVM 2.1.2; libvirt: 1.2.9
• vRouter VM: 10 vCPUs, 16G RAM; interfaces dp0s6 / dp0s7 (traffic), em1 bridged (mgmt)
The tuning items that mattered on the V2-chipset platform are not as apparent on a more recent platform using the V3:

• The biggest impact came from pinning vCPUs to hyperthreaded siblings (a negative test): ~18% hit
• PCI passthrough versus SR-IOV: ~0.05% hit
• With versus without hugepages: in the noise, ~0.0004% hit
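The sibling pairs to avoid pinning across can be read straight from sysfs (standard Linux paths; the exact list is host-specific):

```shell
# One line per physical core: its hyperthread sibling list.
# Forwarding vCPUs should be pinned to entries from different lines,
# never to two threads of the same core.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
  cat "$cpu/topology/thread_siblings_list" 2>/dev/null
done | sort -u
```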
Tuning may or may not matter; it depends on the host HW.

PCI_PT => PCI Passthrough
SRIOV => Single Root I/O Virtualization
Optimal Forwarding Performance
Software Tuning for Optimal Forwarding Performance
Vyatta High-Performance Architecture
• NUMA / memory-bandwidth aware
• CPU topology aware
• Minimal TLB footprint / huge pages
• Tickless kernel
• No system calls or context switches
• Zero-copy
• Lockless fast table lookup and updates
• Real-Time processing to avoid packet drops ???
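Several of those bullets lean on host kernel boot parameters. A typical set for dedicating CPUs 1-3 to forwarding looks like this; the parameter names are standard Linux, and the CPU list is just an example matching the 4-CPU diagrams that follow:

```
isolcpus=1-3      # keep the scheduler from placing other tasks on CPUs 1-3
nohz_full=1-3     # tickless operation on the forwarding CPUs
rcu_nocbs=1-3     # offload RCU callbacks away from the forwarding CPUs
```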
Vyatta Real-Time Network Packet Processing
• What’s Real Time?
A: The working definition of a real-time system is "the delay between an event and the program response is known and bounded."
• Perfect -- this is exactly what we need! What could possibly go wrong?
A: Programming is a skill best acquired by practice and example rather than from books. – A. Turing
Vyatta Real-Time Network Packet Processing
• Becca: "when I drive 64 byte packets into the NICs at line-rate, my SSH session locks up and OSPF flaps"
• Sven: "Excellent. This proves that the real-time scheduler is working exactly as designed: all CPU cycles are devoted to forwarding packets."
• Becca: grumble…
Vyatta Real-Time Network Packet Processing
• Sven: "This is why we reserve CPU0 for control plane processes. That way the admin console always remains responsive."
• Becca: "I configured an admin console and it locks up too. And my SSH session is on the admin network and that locks up."
• Sven: "Are you sure you aren't driving traffic at line rate on the admin network, causing packets to be dropped and TCP timeouts?"
• Becca: "Yes. Files bug: Router won't boot, reboot & console / ssh non-responsive if high rate traffic is running."
Vyatta Real-Time Debugging

[ 242.150195] INFO: task sshd:6404 blocked for more than 120 seconds.
[ 242.225355]       Not tainted 3.14.51-1-amd64-vyatta #1
[ 242.288038] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
…
[ 242.382015] Call Trace:
[ 242.382022]  [<ffffffff811ed61a>] ? wait_transaction_locked+0x7a/0xb0
[ 242.382025]  [<ffffffff81092140>] ? finish_wait+0x90/0x90
[ 242.382029]  [<ffffffff811ed961>] ? start_this_handle+0x261/0x560
[ 242.382032]  [<ffffffff8114da6d>] ? __inode_permission+0x2d/0xb0
[ 242.382036]  [<ffffffff811b129f>] ? ext4_file_open+0x6f/0x1b0
[ 242.382039]  [<ffffffff8114da6d>] ? __inode_permission+0x2d/0xb0
[ 242.382043]  [<ffffffff811edf38>] ? jbd2__journal_start+0x128/0x1c0
[ 242.382046]  [<ffffffff811bc63c>] ? ext4_dirty_inode+0x2c/0x80
[ 242.382049]  [<ffffffff8116b009>] ? __mark_inode_dirty+0x39/0x240
[ 242.382052]  [<ffffffff8115cf09>] ? update_time+0x89/0xe0
[ 242.382055]  [<ffffffff8115a1ea>] ? dput+0x1a/0x110
[ 242.382057]  [<ffffffff8115cffd>] ? file_update_time+0x9d/0x100
[ 242.382059]  [<ffffffff811516c0>] ? do_last+0x2d0/0xf10
[ 242.382063]  [<ffffffff810e64ba>] ? __generic_file_aio_write+0x19a/0x3e0
[ 242.382065]  [<ffffffff810e675e>] ? generic_file_aio_write+0x5e/0xe0
[ 242.382068]  [<ffffffff811b1cae>] ? ext4_file_write+0xce/0x420
[ 242.382070]  [<ffffffff8118bf40>] ? __posix_lock_file+0x210/0x530
[ 242.382073]  [<ffffffff81142c8a>] ? do_sync_write+0x5a/0x90
[ 242.382075]  [<ffffffff811437fd>] ? vfs_write+0xbd/0x1f0
[ 242.382077]  [<ffffffff81143d0b>] ? SyS_write+0x4b/0xb0
[ 242.382080]  [<ffffffff81517ee7>] ? tracesys+0xdd/0xe2
Vyatta Real-Time Debugging

[ 241.917265] INFO: task vbash:6403 blocked for more than 120 seconds.
[ 241.993460]       Not tainted 3.14.51-1-amd64-vyatta #1
[ 242.150118] Call Trace:
[ 242.150122]  [<ffffffff81170c00>] ? do_thaw_one+0x60/0x60
[ 242.150125]  [<ffffffff81513c18>] ? io_schedule+0x88/0xd0
[ 242.150127]  [<ffffffff81170c09>] ? sleep_on_buffer+0x9/0x10
[ 242.150130]  [<ffffffff81514292>] ? __wait_on_bit+0x52/0x80
[ 242.150133]  [<ffffffff812271a8>] ? submit_bio+0x68/0x130
[ 242.150135]  [<ffffffff81170c00>] ? do_thaw_one+0x60/0x60
[ 242.150138]  [<ffffffff8151433c>] ? out_of_line_wait_on_bit+0x7c/0xa0
[ 242.150142]  [<ffffffff810921a0>] ? wake_atomic_t_function+0x30/0x30
[ 242.150149]  [<ffffffffa003a3a0>] ? squashfs_read_data+0x3a0/0x690 [squashfs]
[ 242.150153]  [<ffffffffa003a7f3>] ? squashfs_cache_get+0x163/0x3a0 [squashfs]
[ 242.150156]  [<ffffffffa003bb0c>] ? squashfs_readpage+0xac/0x8e0 [squashfs]
[ 242.150159]  [<ffffffff810ee0c8>] ? __alloc_pages_nodemask+0x158/0xaa0
[ 242.150163]  [<ffffffff810e5844>] ? add_to_page_cache_locked+0xc4/0x190
[ 242.150165]  [<ffffffff810f1208>] ? __do_page_cache_readahead+0x198/0x200
[ 242.150168]  [<ffffffff810f13ab>] ? ondemand_readahead+0x13b/0x2b0
[ 242.150170]  [<ffffffff810e5983>] ? pagecache_get_page+0x33/0x1e0
[ 242.150173]  [<ffffffff810e76e6>] ? generic_file_aio_read+0x4b6/0x6d0
[ 242.150176]  [<ffffffff81512ed5>] ? schedule_timeout+0x1c5/0x230
[ 242.150178]  [<ffffffff811428fa>] ? do_sync_read+0x5a/0x90
[ 242.150181]  [<ffffffff811439d5>] ? vfs_read+0xa5/0x180
[ 242.150184]  [<ffffffff81148a51>] ? kernel_read+0x41/0x60
[ 242.150187]  [<ffffffff8114a96b>] ? do_execve_common.isra.35+0x45b/0x610
[ 242.150190]  [<ffffffff8114ad27>] ? SyS_execve+0x27/0x40
[ 242.150193]  [<ffffffff81518289>] ? stub_execve+0x69/0xa0
Vyatta Real-Time Debugging

[ 241.442768] INFO: task auditd:4806 blocked for more than 120 seconds.
[ 241.519914]       Not tainted 3.14.51-1-amd64-vyatta #1
[ 241.676300] Call Trace:
[ 241.676306]  [<ffffffff811ed61a>] ? wait_transaction_locked+0x7a/0xb0
[ 241.676308]  [<ffffffff81015b95>] ? sched_clock+0x5/0x10
[ 241.676312]  [<ffffffff81092140>] ? finish_wait+0x90/0x90
[ 241.676314]  [<ffffffff811ed961>] ? start_this_handle+0x261/0x560
[ 241.676316]  [<ffffffff81015b95>] ? sched_clock+0x5/0x10
[ 241.676318]  [<ffffffff810845f6>] ? get_vtime_delta+0x16/0x80
[ 241.676320]  [<ffffffff81015b95>] ? sched_clock+0x5/0x10
[ 241.676322]  [<ffffffff81015b3d>] ? native_sched_clock+0x2d/0x80
[ 241.676324]  [<ffffffff81015b95>] ? sched_clock+0x5/0x10
[ 241.676326]  [<ffffffff81084ede>] ? arch_vtime_task_switch+0x6e/0x90
[ 241.676328]  [<ffffffff811edf38>] ? jbd2__journal_start+0x128/0x1c0
[ 241.676333]  [<ffffffff811bc63c>] ? ext4_dirty_inode+0x2c/0x80
[ 241.676335]  [<ffffffff8116b009>] ? __mark_inode_dirty+0x39/0x240
[ 241.676338]  [<ffffffff8115cf09>] ? update_time+0x89/0xe0
[ 241.676340]  [<ffffffff8115cffd>] ? file_update_time+0x9d/0x100
[ 241.676344]  [<ffffffff810e64ba>] ? __generic_file_aio_write+0x19a/0x3e0
[ 241.676346]  [<ffffffff810e675e>] ? generic_file_aio_write+0x5e/0xe0
[ 241.676349]  [<ffffffff811b1cae>] ? ext4_file_write+0xce/0x420
[ 241.676354]  [<ffffffff810b4a18>] ? do_futex+0x128/0xb10
[ 241.676357]  [<ffffffff8126978c>] ? __percpu_counter_sum+0x6c/0x80
[ 241.676359]  [<ffffffff811c5e12>] ? ext4_statfs+0x112/0x160
[ 241.676362]  [<ffffffff81142c8a>] ? do_sync_write+0x5a/0x90
[ 241.676364]  [<ffffffff811437fd>] ? vfs_write+0xbd/0x1f0
[ 241.676366]  [<ffffffff81143d0b>] ? SyS_write+0x4b/0xb0
[ 241.676369]  [<ffffffff81517ee7>] ? tracesys+0xdd/0xe2
Control / Data Plane Task Scheduling (Typical)

[Diagram: control plane tasks (audit, bash, bgp, rib, ospf, sshd, kworker/0-3) scheduled on CPU0; data plane packet forwarder threads pinned to CPU1-CPU3]
Control Task Migrates, Starts I/O on DP CPU

[Diagram: sshd has migrated off CPU0 onto a data plane CPU and started I/O while the forwarder there was idle]
Control Task Preempted and Indefinitely Blocked

[Diagram: traffic resumes; the packet forwarder preempts the migrated sshd, which now blocks indefinitely while holding I/O locks]
Vyatta Real-Time Debugging
Priority Live Lock!
1. Data plane forwarder on a CPU (1, 2 or 3) goes idle (traffic gap)
2. Scheduler moves a control process from CPU 0 to 1, 2, or 3
3. Control plane task does I/O (e.g. writes a log entry), blocks
4. Data plane forwarder goes active (traffic resumes)
5. Control process chain holds I/O locks, but the data plane forwarder does no I/O. No opportunity for PI or migration: other processes pile up on the same lock.
6. Solution: Stop traffic? Drop real-time?
Vyatta Packet Forwarder Performance
• RT scheduler requires special handling - see above
• So what happens if we just drop real-time?
Under a heavy 29 Mpps load the system gets "into the zone": fewer sleep/wake cycles make it better.

Which brings up a short diversion into IMIX distributions and pps:

→ Some IMIXes have a higher percentage of large packets; this one is ~1.1 Mpps on a 10G:
  imix_dnload_l3_pktsize = [48, 128, 256, 576, 1500]
  imix_dnload_l3_weight  = [25,   5,   3,   2,   65]

→ And some IMIXes have a higher percentage of small packets; this one is ~3.1 Mpps on a 10G:
  imix_upload_l3_pktsize = [48, 128, 256, 576, 1500]
  imix_upload_l3_weight  = [70,   5,   3,   2,   20]
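Those Mpps figures follow from the weighted average L3 size plus 38 bytes of per-packet Ethernet overhead (14-byte header + 4-byte CRC + 20 bytes preamble/IFG); a quick check with a small helper (the function name is mine):

```shell
# pps at 10G line rate for a weighted mix of L3 packet sizes,
# given as size:weight pairs.
imix_pps() {
  echo "$@" | tr ' ' '\n' | awk -F: '
    { bytes += $1 * $2; weight += $2 }
    END { avg = bytes / weight + 38        # add L2 header/CRC + preamble/IFG
          printf "%.2f Mpps\n", 10e9 / (avg * 8) / 1e6 }'
}
imix_pps 48:25 128:5 256:3 576:2 1500:65   # download mix -> 1.19 Mpps
imix_pps 48:70 128:5 256:3 576:2 1500:20   # upload mix   -> 3.15 Mpps
```

The results land on the slide's ~1.1 and ~3.1 Mpps figures.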
So when drops happen on IMIX, alarms go off.

We mostly run performance regressions with high pps rates and other loads, such as NAT, firewall, etc. The heavy lifters.

But it's a mistake to ignore the lightweights. For example, one issue, seemingly out of the blue, showed drops while running IMIX (this is forbidden):

%rate  %drop 1_flow  %drop 10_flow  %drop 1_flow up_dn_imix_same  %drop 10_flow up_dn_imix_same
-----  ------------  -------------  ----------------------------  -----------------------------
   20         0.000          0.018                         0.013                          0.069
   40         0.000          0.008                         0.003                          0.009
   60         0.000          0.000                         0.002                          0.066
   80         0.003          0.000                         0.000                          0.016
  100         0.000          0.001                         0.000                          0.008
…and the journey continues (Sven)!
Vyatta Packet Forwarder Performance
• By dropping real-time we encountered "scheduling fairness"
• The CFS scheduler penalizes CPU-bound SCHED_OTHER processes
• This would lead to the packet drops Becca observed
Control / Data Plane Task Scheduling (Typical)

[Diagram repeated from earlier: control plane tasks on CPU0; packet forwarder threads on CPU1-CPU3]
Dynamic Control Plane / Data Plane Resourcing

[Diagram: the control plane now spans CPU 0+1 (audit, bash, rib, bgp, ospf, sshd, kworkers); data plane packet forwarders run on CPU 2+3]
What’s New in OSS and Vyatta R&D?
Networking: 100 G and beyond
• No lack of small packets (Twitter, SMS, IoT messaging)
• More network queues and associated spectrum provisioned to drive network traffic

Processor Silicon
• CPU clock speed is on the long tail of the asymptote
• Semiconductor process shrinks are approaching single-atom wires
• Architecture tweaks gain in the 5-10% range per revision
• Moore's law per core count maps to software performance via Amdahl's law

Linux Kernel
• Context switch / system call overhead is essentially static
• The Unix socket / memory-copy model is not scaling at 10G; sidelined at 40 and 100
• Offload libraries replicate driver and netstack code; a long-game solution is needed
All Done!
Thank You
Sven-Thorsten Dietrich   [email protected]
Robyn Gutierrez          [email protected]
Becca Nitzan             [email protected]