
Kernel Networking Walkthrough

LinuxCon 2015, Seattle

Thomas Graf, Kernel & Open vSwitch Team, Noiro Networks (Cisco)

Agenda

● Getting packets from/to the NIC

● NAPI, Busy Polling, RSS, RPS, XPS, GRO, TSO

● Packet processing

● RX Handler, IP Processing, TCP Processing, TCP Fast Open

● Queuing from/to userspace

● Socket Buffers, Flow Control, TCP Small Queues

● Q&A

Touring the Network Stack

Image: expectation vs. reality

How does a packet get in and out of the Network Stack?

Receive & Transmit Process

Receive path: NIC → DMA → Ring Buffer → Parse L2 & IP → Parse TCP/UDP → Socket Buffer → read() by Task / Container

Transmit path: write() by Task / Container → Socket Buffer → Construct TCP/UDP → Construct IP → Route? (Local? / Forward) → Ring Buffer → NIC

(NIC | Network Stack, kernel space | Process, user space)

The 3 ways into the Network Stack

A. Interrupt driven: Ring Buffer → interrupt → Network Stack

B. NAPI based polling: Ring Buffer → poll() → Network Stack

C. Busy polling: Task calls busy_poll() on the Ring Buffer directly
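Path C is exposed to applications as a per-socket option. Below is a minimal sketch, assuming Linux >= 3.11 and CAP_NET_ADMIN, of opting a UDP socket into busy polling with SO_BUSY_POLL; the port number is an arbitrary example, and the corresponding global knobs are the net.core.busy_read / net.core.busy_poll sysctls.

/* Minimal sketch: opt a single UDP socket into busy polling (path C).
 * Assumes Linux >= 3.11 and CAP_NET_ADMIN; the port is an arbitrary
 * example and error handling is trimmed. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    /* Busy-poll the device ring for up to 50 microseconds on blocking
     * reads before falling back to the interrupt/NAPI path. */
    int busy_poll_usecs = 50;
    if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                   &busy_poll_usecs, sizeof(busy_poll_usecs)) < 0)
        perror("setsockopt(SO_BUSY_POLL)");

    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(7777) };
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        perror("bind");

    char buf[2048];
    ssize_t n = recv(fd, buf, sizeof(buf), 0);  /* busy-polls the NIC queue */
    printf("received %zd bytes\n", n);
    close(fd);
    return 0;
}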

RSS – Receive Side Scaling

● NIC distributes packets across multiple RX queues, allowing for parallel processing.

● Each RX queue has its own IRQ, which selects the CPU that runs the hardware interrupt handler.

Diagram: a hardware filter maps RX-queue-1..4 to CPU 1, CPU 2, CPU 1, CPU 2.
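Because each RSS queue owns its IRQ, the queue-to-CPU placement is controlled through ordinary IRQ affinity. A minimal sketch, assuming root and a hypothetical IRQ number 41 taken from /proc/interrupts for one RX queue:

/* Minimal sketch: pin one RX queue's IRQ to CPU 2 by writing a CPU mask
 * to /proc/irq/<N>/smp_affinity. The IRQ number (hypothetical 41 here)
 * is an assumption; run as root. */
#include <stdio.h>

int main(void)
{
    const char *path = "/proc/irq/41/smp_affinity"; /* hypothetical IRQ */
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return 1; }

    /* Hex CPU bitmap: 0x4 == CPU 2 only. */
    if (fputs("4\n", f) == EOF)
        perror("write");
    fclose(f);
    return 0;
}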

RPS – Receive Packet Steering

● Software filter to select CPU # for processing

● Use it to redo the queue → CPU mapping, or to distribute a single queue across multiple CPUs (configured per queue via sysfs; see the sketch below).

Diagram: a software filter steers packets from RX-queue-1..4 across CPUs 1–3.
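RPS is configured per RX queue through sysfs. A minimal sketch, assuming the device is eth0 and we want its first queue's packets spread over CPUs 1–3 (the device name and CPU mask are examples):

/* Minimal sketch: enable RPS for eth0's first RX queue, spreading its
 * packets across CPUs 1-3 (hex mask 0xe). Device and queue names are
 * assumptions; run as root. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/class/net/eth0/queues/rx-0/rps_cpus";
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return 1; }

    /* Bitmap of CPUs allowed to process packets from this queue:
     * 0xe = CPU 1 + CPU 2 + CPU 3. Writing 0 disables RPS again. */
    if (fputs("e\n", f) == EOF)
        perror("write");
    fclose(f);
    return 0;
}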

Hardware Offload

● RX/TX Checksumming

● Perform CPU-intensive checksumming in hardware.

● Virtual LAN filtering and tag stripping

● Strip the 802.1Q header and store the VLAN ID in the network packet metadata.

● Filter out unsubscribed VLANs.

● Segmentation Offload

Generic Receive Offload (ethtool -K eth0 gro on)

Diagram: NAPI-based GRO merges MTU-sized packets from the Ring Buffer into segments of up to 64K before they enter the Network Stack.

It's more efficient to process one 64K-byte packet than forty 1500-byte packets.
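Under the hood, ethtool -K toggles these flags through the SIOCETHTOOL ioctl. A minimal sketch using the legacy ethtool_value commands to query and then enable GRO on eth0 (the device name is an example; requires CAP_NET_ADMIN):

/* Minimal sketch of what "ethtool -K eth0 gro on" does underneath:
 * toggle GRO via the legacy SIOCETHTOOL ioctl. */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* assumed device */

    struct ethtool_value ev = { .cmd = ETHTOOL_GGRO };
    ifr.ifr_data = (char *)&ev;
    if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
        printf("GRO currently %s\n", ev.data ? "on" : "off");

    ev.cmd = ETHTOOL_SGRO;          /* enable GRO */
    ev.data = 1;
    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
        perror("ETHTOOL_SGRO");

    close(fd);
    return 0;
}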

Segmentation Offload (ethtool -K eth0 tso on, ethtool -K eth0 gso on)

Diagram: on transmit, the Network Stack hands segments of up to 64K toward the Ring Buffer; Generic Segmentation Offload (ethtool -K eth0 gso on) splits them into MTU-sized packets in software, while TCP Segmentation Offload (ethtool -K eth0 tso on) leaves the segmentation to the NIC.

How does a packet get through the Network Stack?

(c) Karen Sagovac

Packet Processing: Link Layer

Diagram: Ingress QoS → Protocol Handler → IPv4 / IPv6 / ARP / IPX / ... or Drop ("The Feast!")

RX Handler

Diagram: RX handlers hand packets to Open vSwitch, team, bonding, bridge, macvlan, and macvtap; a packet socket bound to ETH_P_ALL (as used by tcpdump) also taps packets at this point.
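The ETH_P_ALL tap is plain userspace API. A minimal sketch of the packet socket tcpdump builds on, printing the size and incoming interface of a few frames (requires CAP_NET_RAW):

/* Minimal sketch of the packet-socket tap used by tcpdump: an
 * AF_PACKET socket with ETH_P_ALL sees every frame the link layer
 * processes. Requires CAP_NET_RAW / root. */
#include <stdio.h>
#include <sys/socket.h>
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <linux/if_packet.h>  /* struct sockaddr_ll */
#include <arpa/inet.h>        /* htons */
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket(AF_PACKET)"); return 1; }

    unsigned char frame[65536];
    for (int i = 0; i < 5; i++) {
        struct sockaddr_ll from;
        socklen_t fromlen = sizeof(from);
        ssize_t n = recvfrom(fd, frame, sizeof(frame), 0,
                             (struct sockaddr *)&from, &fromlen);
        if (n < 0) { perror("recvfrom"); break; }
        printf("ifindex %d: %zd bytes\n", from.sll_ifindex, n);
    }
    close(fd);
    return 0;
}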

IP Processing

Diagram (receive): IP Handler → PREROUTING → Route Lookup → either Local Delivery (INPUT → L4: TCP, ...) or Forwarding (FORWARD → POSTROUTING → Link Layer).

Diagram (transmit): Local Output from user space → IPv4 Construction → Route Lookup → OUTPUT → POSTROUTING → Link Layer.
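The PREROUTING/INPUT/FORWARD/OUTPUT/POSTROUTING boxes are netfilter hook points. As an illustration, here is a sketch of a tiny kernel module that attaches at PREROUTING and merely counts IPv4 packets; it assumes a kernel >= 4.13, since the hook prototype and registration API differ on older kernels.

/* Sketch of a netfilter hook at the PREROUTING point from the diagram.
 * Assumes kernel >= 4.13; counts IPv4 packets and accepts everything. */
#include <linux/module.h>
#include <linux/atomic.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <net/net_namespace.h>

static atomic_t pkts = ATOMIC_INIT(0);

static unsigned int count_hook(void *priv, struct sk_buff *skb,
                               const struct nf_hook_state *state)
{
    atomic_inc(&pkts);
    return NF_ACCEPT;   /* never drop, just observe */
}

static const struct nf_hook_ops count_ops = {
    .hook     = count_hook,
    .pf       = NFPROTO_IPV4,
    .hooknum  = NF_INET_PRE_ROUTING,
    .priority = NF_IP_PRI_FIRST,
};

static int __init count_init(void)
{
    return nf_register_net_hook(&init_net, &count_ops);
}

static void __exit count_exit(void)
{
    nf_unregister_net_hook(&init_net, &count_ops);
    pr_info("saw %d packets at PREROUTING\n", atomic_read(&pkts));
}

module_init(count_init);
module_exit(count_exit);
MODULE_LICENSE("GPL");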

TCP Processing

Diagram: IP hands the packet to Receive TCP, which parses the TCP header, looks up the socket, and runs the socket filter. If the socket is locked, the packet is queued on the Backlog; if a task is already waiting for data, on the Prequeue (shifting processing from softirq to process context); otherwise it goes straight into the Receive Socket Buffer, from which the Task picks it up via poll()/read().

TCP Fast Open (net.ipv4.tcp_fastopen)

Regular (Client ↔ Server):
1st request: SYN → SYN+ACK → ACK+HTTP GET → Data (2x RTT)
2nd request: SYN → SYN+ACK → ACK+HTTP GET → Data (2x RTT)

Fast Open (Client ↔ Server):
1st request: SYN → SYN+ACK+Cookie → ACK+HTTP GET → Data (2x RTT)
2nd request: SYN+Cookie+HTTP GET → SYN+ACK+Data (1x RTT)
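From an application's point of view, Fast Open is a small change on each side. A minimal client sketch, assuming Linux >= 3.7, the client bit set in net.ipv4.tcp_fastopen, and a TFO-enabled HTTP server at the example address 192.0.2.1: the request is passed to sendto() with MSG_FASTOPEN instead of calling connect() first.

/* Minimal TFO client sketch: hand the request to sendto(MSG_FASTOPEN)
 * so it can ride in the SYN once a cookie is cached. Server address
 * is an example; first contact falls back to a normal handshake. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <unistd.h>

#ifndef MSG_FASTOPEN
#define MSG_FASTOPEN 0x20000000   /* fallback for older libc headers */
#endif

int main(void)
{
    const char req[] = "GET / HTTP/1.0\r\n\r\n";

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in srv = { .sin_family = AF_INET, .sin_port = htons(80) };
    inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr);

    /* Connect and send the request in one call; with a cached cookie
     * the data travels in the SYN (1x RTT). */
    ssize_t n = sendto(fd, req, strlen(req), MSG_FASTOPEN,
                       (struct sockaddr *)&srv, sizeof(srv));
    if (n < 0) { perror("sendto(MSG_FASTOPEN)"); return 1; }

    char buf[4096];
    n = read(fd, buf, sizeof(buf));
    printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}

On the server side, the listening socket opts in with setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen)), where qlen bounds the number of pending Fast Open requests.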

Memory Accounting & Flow Control

Socket Buffers & Flow Control (net.ipv4.tcp_{r|w}mem)

Transmit (e.g. ssh): on write(), if wmem is over its limit the caller blocks (or gets EWOULDBLOCK); otherwise wmem += packet-size as the data enters the Socket Buffer, and wmem -= packet-size once TCP/IP has handed the packet down to the TX Ring Buffer.

Receive (e.g. ssh): as packets arrive from the RX Ring Buffer through TCP/IP into the Socket Buffer, rmem += packet-size; if rmem is over its limit, the TCP window is reduced; rmem -= packet-size as the task reads the data.
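The rmem/wmem limits above are auto-tuned within the bounds given by net.ipv4.tcp_rmem and net.ipv4.tcp_wmem; an application can also pin them per socket, which disables autotuning for that socket. A minimal sketch (the buffer sizes are arbitrary examples):

/* Minimal sketch: inspect and override a socket's send/receive buffer
 * limits. Note the kernel reports back roughly double the requested
 * value to account for bookkeeping overhead. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

static void show(int fd, int opt, const char *name)
{
    int val = 0;
    socklen_t len = sizeof(val);
    if (getsockopt(fd, SOL_SOCKET, opt, &val, &len) == 0)
        printf("%s = %d bytes\n", name, val);
}

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    show(fd, SO_SNDBUF, "default SO_SNDBUF");
    show(fd, SO_RCVBUF, "default SO_RCVBUF");

    int snd = 256 * 1024, rcv = 256 * 1024;   /* arbitrary example sizes */
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &snd, sizeof(snd));
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcv, sizeof(rcv));

    show(fd, SO_SNDBUF, "pinned SO_SNDBUF");
    show(fd, SO_RCVBUF, "pinned SO_RCVBUF");

    close(fd);
    return 0;
}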

TCP Small Queues (net.ipv4.tcp_limit_output_bytes)

Diagram: several sockets (e.g. ssh and torrent) write() into their Socket Buffers and feed through TCP/IP into the Queuing Discipline, Driver, and TX Ring Buffer. TSQ caps each socket at max 128Kb in flight below the stack, so one bulk sender cannot monopolize the queues.

Q&A

Contact:

● E-Mail: [email protected]

● Twitter: @tgraf__

