VPP Host Stack: Transport and Session Layers
Florin Coras, Dave Barach
EFFICIENCY
PERFORMANCE
SOFTWARE DEFINED NETWORKING
CLOUD NETWORK SERVICES
LINUX FOUNDATION
VPP - A Universal Terabit Network Platform for Native Cloud Network Services
Superior Performance
Most Efficient on the Planet
Flexible and Extensible
Open Source
Cloud Native
Breaking the Barrier of Software Defined Network Services: 1 Terabit Services on a Single Intel® Xeon® Server!
Motivation: Container networking
FD.io Mini-Summit at KubeCon Europe 2018
[Diagram: PID 1234 calls send() and PID 4321 calls recv() through glibc; inside the kernel each side traverses FIFO, TCP, IP (routing) and device.]
Motivation: Container networking
[Diagram: PID 1234 (send()) and PID 4321 (recv()) each still traverse a kernel stack of FIFO, TCP, IP (routing) and device, but the two stacks are now joined through VPP over af_packet/tap interfaces. VPP runs Ethernet, IP4/6, MPLS, ACL, SR, VXLAN, LISP, etc., and reaches the outside through dpdk devices.]
Why not this?
[Diagram: PID 1234 (send()) and PID 4321 (recv()) exchange data through FIFOs directly with VPP, which runs Session, TCP, IP and DPDK.]
VPP Host Stack
[Diagram: an App talks to VPP over the Binary API and shares a shm segment holding rx/tx fifos with the Session layer, which sits above TCP and IP, DPDK.]
VPP Host Stack: Session Layer
§ Maintains per app state and conveys session events to/from the app
§ Allocates and manages sessions/segments/fifos
§ Isolates network resources via namespacing
§ Session lookup tables (5-tuple) and local/global session rule tables (filters)
§ Support for pluggable transport protocols
§ Binary/native C API for external/builtin applications
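The 5-tuple session lookup can be pictured with a small sketch. This is an illustration only, not VPP's session table: the class and method names are hypothetical, and the fallback from an exact match to a listener is an assumption about how a new connection would find its server.

```python
from typing import NamedTuple, Optional


class FiveTuple(NamedTuple):
    proto: str
    local_ip: str
    local_port: int
    remote_ip: str
    remote_port: int


class SessionTable:
    """Hypothetical lookup table: established sessions plus listeners."""

    def __init__(self):
        self._sessions = {}   # FiveTuple -> session handle
        self._listeners = {}  # (proto, local_ip, local_port) -> listener handle

    def add_listener(self, proto, ip, port, handle):
        self._listeners[(proto, ip, port)] = handle

    def add_session(self, key, handle):
        self._sessions[key] = handle

    def lookup(self, key) -> Optional[int]:
        # Exact 5-tuple match first (established session) ...
        if key in self._sessions:
            return self._sessions[key]
        # ... otherwise fall back to a listener on the local endpoint (assumed).
        return self._listeners.get((key.proto, key.local_ip, key.local_port))
```

A packet for an unknown connection resolves to the listener; once the session is added, the same key resolves to the session itself.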
VPP Host Stack: SVM FIFOs
§ Allocated within shared memory segments with or without file backing (ssvm/memfd)
§ Fixed position and size
§ Lock free enqueue/dequeue but atomic size increment
§ Option to dequeue/peek data
§ Support for out-of-order data enqueues
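A minimal single-threaded sketch of the fifo behaviour listed above: fixed size, in-order enqueue, dequeue/peek, and out-of-order enqueue at an offset past the tail. The class `ByteFifo` and its methods are hypothetical, not the svm fifo API, and plain Python integers stand in for the atomic size counter.

```python
class ByteFifo:
    """Fixed-size ring buffer sketch; no real shared memory or atomics."""

    def __init__(self, size):
        self.size = size
        self.buf = bytearray(size)
        self.head = 0      # next byte to dequeue
        self.tail = 0      # next in-order byte to enqueue
        self.cursize = 0   # readable bytes (the counter both sides would share)
        self.ooo = {}      # offset past tail -> length of out-of-order data

    def _write(self, pos, data):
        for i, byte in enumerate(data):
            self.buf[(pos + i) % self.size] = byte

    def _advance(self, n):
        # Move the tail by n, then absorb out-of-order segments it now touches.
        while n > 0:
            self.tail = (self.tail + n) % self.size
            self.cursize += n
            shifted = {off - n: ln for off, ln in self.ooo.items()}
            self.ooo = {off: ln for off, ln in shifted.items() if off > 0}
            n = max([off + ln for off, ln in shifted.items() if off <= 0] + [0])

    def enqueue(self, data):
        n = min(len(data), self.size - self.cursize)
        self._write(self.tail, data[:n])
        self._advance(n)
        return n

    def enqueue_at(self, offset, data):
        """Store data `offset` bytes past the tail without making it readable."""
        if offset + len(data) > self.size - self.cursize:
            return 0
        self._write(self.tail + offset, data)
        self.ooo[offset] = max(self.ooo.get(offset, 0), len(data))
        return len(data)

    def peek(self, n, offset=0):
        n = max(0, min(n, self.cursize - offset))
        return bytes(self.buf[(self.head + offset + i) % self.size] for i in range(n))

    def dequeue(self, n):
        data = self.peek(n)
        self.head = (self.head + len(data)) % self.size
        self.cursize -= len(data)
        return data
```

Out-of-order bytes stay invisible to the reader until the gap before them is filled, at which point the tail jumps past them in one step.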
VPP Host Stack: TCP
§ Clean-slate implementation
§ “Complete” state machine implementation
§ Connection management and flow control (window management)
§ Timers and retransmission, fast retransmit, SACK
§ NewReno congestion control, SACK based fast recovery
§ Checksum offloading
§ Linux compatibility tested with IWL TCP protocol tester
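For readers unfamiliar with NewReno, the window arithmetic can be sketched as follows. This follows the general shape of RFC 5681 and RFC 6582, not VPP's TCP code: the class name, the 1460-byte MSS and the 10-segment initial window are assumptions, and sequence number wrap-around, timeouts and SACK are left out.

```python
MSS = 1460  # assumed segment size in bytes


class NewReno:
    def __init__(self, initial_window=10 * MSS):
        self.cwnd = initial_window
        self.ssthresh = float("inf")
        self.dupacks = 0
        self.recover = None  # highest sequence sent when loss was detected

    def on_dupack(self, flight_size, snd_nxt):
        self.dupacks += 1
        if self.recover is not None:
            self.cwnd += MSS  # inflate the window while in fast recovery
        elif self.dupacks == 3:
            # Fast retransmit: halve the window and enter fast recovery.
            self.ssthresh = max(flight_size // 2, 2 * MSS)
            self.cwnd = self.ssthresh + 3 * MSS
            self.recover = snd_nxt

    def on_ack(self, acked_bytes, ack_seq):
        self.dupacks = 0
        if self.recover is not None:
            if ack_seq >= self.recover:
                # Full ACK: leave fast recovery and deflate the window.
                self.cwnd = self.ssthresh
                self.recover = None
            else:
                # Partial ACK: deflate by the acked amount, add back one MSS.
                self.cwnd -= acked_bytes
                if acked_bytes >= MSS:
                    self.cwnd += MSS
        elif self.cwnd < self.ssthresh:
            self.cwnd += min(acked_bytes, MSS)            # slow start
        else:
            self.cwnd += max(1, MSS * MSS // self.cwnd)   # congestion avoidance
```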
VPP Host Stack: more transports
§ SCTP
§ UDP
§ TLS
VPP Host Stack: Comms Library (VCL)
§ Comms library (VCL) apps can link against
§ LD_PRELOAD library for legacy apps
§ epoll
Application Attachment
[Diagram: the App uses the Binary API to attach to VPP, then bind (server) or connect (client). VPP's Session layer sits above TCP and IP, DPDK, and a shm segment is shared with the App.]
Session Establishment
[Diagram sequence: a Client app and a Server app, each talking over the Binary API to its own VPP instance (Session, TCP, IP, DPDK).]
§ Server: attach, bind; its VPP starts to listen
§ Client: attach, connect; its VPP opens the connection
§ The two TCP stacks perform the handshake
§ The server-side session layer sees a new client; the client-side session layer sees that the connect succeeded
§ The client receives a connect reply and the server an accept notify over the Binary API; each side gets a shm segment with rx/tx fifos
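The order of control messages above can be replayed with a toy model. Everything here is hypothetical: `ToyVpp` folds both VPP instances into one object, deques stand in for the Binary API, the TCP handshake is assumed to succeed instantly, and the fifo pairs are plain byte arrays rather than shared memory.

```python
from collections import deque


class ToyVpp:
    """Stand-in for the two VPP instances; models only control-plane events."""

    def __init__(self):
        self.events = {}     # app name -> queue of (event, fifos) tuples
        self.listeners = {}  # port -> server app name

    def attach(self, app):
        self.events[app] = deque()

    def bind(self, app, port):
        self.listeners[port] = app

    def connect(self, app, port):
        server = self.listeners.get(port)
        if server is None:
            # Nobody is listening: the client gets a failed connect reply.
            self.events[app].append(("connect reply", None))
            return
        # Handshake assumed complete: each side gets its own rx/tx fifo pair.
        self.events[server].append(
            ("accept notify", {"rx": bytearray(), "tx": bytearray()}))
        self.events[app].append(
            ("connect reply", {"rx": bytearray(), "tx": bytearray()}))


vpp = ToyVpp()
vpp.attach("server")
vpp.bind("server", 80)
vpp.attach("client")
vpp.connect("client", 80)
```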
Data Transfer
[Diagram: Client and Server apps, each with rx/tx fifos shared with its own VPP instance (Session, TCP, IP, DPDK). The sending app writes into its tx fifo and a tx write evt is signalled over the Binary API; VPP's TCP copies the data from the fifo to a buffer and provides congestion control and reliable transport. On the receiving side VPP copies the data to the rx fifo, an rx write evt is signalled over the Binary API, and the app reads from the fifo.]
Some rough numbers on an E2699: ~12Gbps/core (1.5k MTU), ~20Gbps/core (9k MTU), ~185k CPS (connections per second)!
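To put the rough figures above in per-packet terms, a back-of-the-envelope conversion helps. The sketch assumes every packet is exactly one MTU in size and ignores header overhead; the function names are illustrative.

```python
def packets_per_second(gbps, mtu_bytes):
    """Packet rate needed to sustain `gbps` with MTU-sized packets."""
    return gbps * 1e9 / (mtu_bytes * 8)


def ns_per_unit(rate_per_second):
    """Time budget per packet (or per connection) in nanoseconds."""
    return 1e9 / rate_per_second


pps_1500 = packets_per_second(12, 1500)   # 1.5k MTU case
pps_9000 = packets_per_second(20, 9000)   # 9k MTU case
print(f"1.5k MTU: {pps_1500:,.0f} pps, {ns_per_unit(pps_1500):,.0f} ns/packet")
print(f"9k MTU:   {pps_9000:,.0f} pps, {ns_per_unit(pps_9000):,.0f} ns/packet")
print(f"185k CPS: {ns_per_unit(185_000):,.0f} ns/connection")
```

Under these assumptions, ~12Gbps at a 1.5k MTU is about one million packets per second per core, or roughly a microsecond of core time per packet.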
Data Transfer: Dgram Transports
[Diagram: Client and Server apps, each talking over the Binary API to its own VPP instance (Session, UDP, IP, DPDK) and sharing a shm segment with rx/tx fifos. The server listens; the client “connects”.]
§ Data enqueued in fifo with dgram hdr
§ Data and dgram header dequeued from fifo
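Because the fifo is a byte stream, a per-datagram header is what preserves message boundaries. A minimal sketch of that idea follows; the header layout (payload length, IPv4 address, port) is a hypothetical one chosen for illustration, not VPP's dgram header format.

```python
import socket
import struct

# Hypothetical header: payload length, peer IPv4 address, peer port.
HDR = struct.Struct("!I4sH")


def enqueue_dgram(fifo, payload, peer_ip, peer_port):
    """Append header + payload to a byte fifo (a bytearray here)."""
    fifo.extend(HDR.pack(len(payload), socket.inet_aton(peer_ip), peer_port))
    fifo.extend(payload)


def dequeue_dgram(fifo):
    """Pop one whole datagram, or return None if it is not complete yet."""
    if len(fifo) < HDR.size:
        return None
    length, addr, port = HDR.unpack_from(fifo)
    end = HDR.size + length
    if len(fifo) < end:
        return None
    payload = bytes(fifo[HDR.size:end])
    del fifo[:end]
    return payload, socket.inet_ntoa(addr), port
```

Two datagrams enqueued back to back come out as two separate messages, each with its peer address, rather than as one merged byte run.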
Redirected Connections (Cut-through)
[Diagram sequence: Client and Server apps attached to the same VPP instance (Session, TCP, IP, DPDK) over the Binary API.]
§ Server: bind
§ Client: connect, which VPP redirects
§ VPP tracks these sessions, allocates ssvm segments and asks both peers to map them.
Throughput is memory bandwidth constrained: ~120Gbps!
Multi-threading for stream connections
[Diagram: App1 attached over the Binary API with two rx/tx fifo pairs; Core 0 and Core 1 each run their own Session, TCP and IP instances on top of DPDK.]
§ Connections/sessions “pinned” to a thread
§ Per-thread data structures/state
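One way to picture pinning with per-thread state is a deterministic mapping from connection to worker, with each worker owning its own session table. The hash-based choice below is an assumption made for the sketch; the slides do not say how a thread is selected, and all names are hypothetical.

```python
import zlib


class Worker:
    """One worker thread's private state; nothing here is shared or locked."""

    def __init__(self, index):
        self.index = index
        self.sessions = {}  # 5-tuple -> session state, touched by this worker only


def pick_worker(workers, five_tuple):
    # Assumed policy: a stable hash of the 5-tuple selects the worker.
    key = "|".join(str(field) for field in five_tuple).encode()
    return workers[zlib.crc32(key) % len(workers)]


def open_session(workers, five_tuple):
    worker = pick_worker(workers, five_tuple)
    worker.sessions[five_tuple] = {"rx": bytearray(), "tx": bytearray()}
    return worker
```

Since a connection always lands on the same worker, that worker can update the session without taking locks.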
Features: Namespaces
[Diagram: an App attaches over the Binary API and requests access to a vpp ns + secret. Namespaces ns1, ns2 and ns3 each have their own Session, TCP and IP instances, mapped onto fib tables fib1 and fib2.]
Namespaces are configured independently and associate applications with network layer resources like interfaces and fib tables.
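The attach-with-secret step can be sketched as a lookup that hands back the namespace's network resources only when the secret matches. The data layout, the integer secret, and the interface and table names below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppNamespace:
    ns_id: str
    secret: int
    fib_table: str   # hypothetical table name
    interface: str   # hypothetical interface name


def attach(namespaces, ns_id, secret):
    """Return the namespace an app may use, or raise if access is refused."""
    ns = namespaces.get(ns_id)
    if ns is None or ns.secret != secret:
        raise PermissionError(f"access to namespace {ns_id!r} refused")
    return ns


# Namespaces are configured independently of the apps that later attach.
NAMESPACES = {
    "ns1": AppNamespace("ns1", 1234, "fib1", "host-eth1"),
    "ns2": AppNamespace("ns2", 5678, "fib2", "host-eth2"),
}
```

An app that presents the right secret for ns1 is bound to fib1 and its interface; a wrong secret or unknown namespace is rejected before any session is created.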
Features: Session Tables
[Diagram: App1 attaches over the Binary API and requests access to global and/or local scope. Namespaces ns1 and ns2 each have an NS Local Session Table above TCP; a Global Session Table is associated with fib1.]
§ Both tables have a “rules table” that can be used for filtering
§ Local tables are namespace specific and can be used for egress filtering
§ Global tables are fib table specific and can be used for ingress filtering
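A rules table used as a filter can be sketched as an ordered list of prefix/port matches. The rule fields, the first-match policy, the port-0 wildcard and the default action are assumptions for the sake of the example, not the session layer's actual rule format.

```python
import ipaddress
from typing import NamedTuple


class Rule(NamedTuple):
    proto: str
    remote_prefix: str  # e.g. "10.0.0.0/8"
    remote_port: int    # 0 matches any port (assumed convention)
    action: str         # "allow" or "deny"


def evaluate(rules, proto, remote_ip, remote_port, default="allow"):
    """Return the action of the first matching rule, else the default."""
    addr = ipaddress.ip_address(remote_ip)
    for rule in rules:
        if (rule.proto == proto
                and addr in ipaddress.ip_network(rule.remote_prefix)
                and rule.remote_port in (0, remote_port)):
            return rule.action
    return default


# Egress-style example for a namespace-local table: block one subnet's port 22.
LOCAL_RULES = [
    Rule("tcp", "10.1.0.0/16", 22, "deny"),
    Rule("tcp", "10.0.0.0/8", 0, "allow"),
]
```

The same function could serve a global table for ingress filtering by evaluating the remote endpoint of an incoming connection instead of an outgoing one.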
TLS App
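[Diagram: the App exchanges data with the TLS App through rx/tx fifos; the TLS App holds a TLS context and a TLS Engine (openssl, mbedtls), and its own App Session exchanges data with TCP through a second pair of rx/tx fifos.]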
§ TLS App registers as transport at VPP init time
§ TLS protocol implementation handled by plugin “engines”. We support openssl and mbedtls
§ Client app registers key and certificate via api and requests tls as session transport
§ CA certs read at TLS app init time. Defaults to reading /etc/ssl/certs/ca-certificates.crt
§ Ping and Ray from Intel working on accelerating the openssl engine with QAT cards
Some rough OpenSSL numbers on an E2699: ~1Gbps/core (no hw accel)
Ongoing work
• Overall integration with k8s
  • Istio/Envoy
• TCP
  • Rx policer/tx pacer
  • TSO
  • New congestion control algorithms
  • PMTU discovery
  • Optimization/hardening/testing
Next steps – Get involved
• Get the Code, Build the Code, Run the Code
  • Session layer: src/vnet/session
  • TCP: src/vnet/tcp
  • SVM: src/svm
  • VCL: src/vcl
• Read/Watch the Tutorials
• Read/Watch VPP Tutorials
• Join the Mailing Lists
Thank you!
?
Florin Coras
email: ([email protected])
irc: florinc