Networking in the Linux Kernel

Mike Wilson, 15 March 2005
Washington University in St. Louis

Introduction

Overview of the Linux networking implementation.

Covered:
• Data path through the kernel
• Quality of Service features
• Hooks for extensions (netfilter, KIDS, protocol demux placement)
• VLAN tag processing
• Virtual interfaces

Not covered:
• Kernels prior to 2.4.20, or 2.6+
• Specific protocol implementations
• Detailed analysis of existing protocols, such as TCP. These are covered only in enough detail to see how they link to higher and lower layers.

OSI Model

The Linux kernel adheres closely to the OSI 7-layer networking model

OSI layer                             Linux implementation
Application, Presentation, Session    Application, above the socket (HTTP, SSH, etc.)
Transport                             TCP/UDP
Network                               Internet (IP)
Data Link, Physical                   Data Link (802.x, PPP, SLIP)

OSI Model (Interplay)

Layers generally interact in the same manner, no matter where placed

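Figure: passing data from layer N+1 down to layer N.
1. Layer N+1 starts with its own data (Layer N+1 Data).
2. Add a header and/or trailer (Layer N+1 Control).
3. Pass the result to layer N as raw data. Layer N+1 Control plus Layer N+1 Data together become Layer N Data.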
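The interplay above can be illustrated with a minimal user-space sketch. The function name `encapsulate` and its parameters are hypothetical and exist only for this illustration; it copies bytes for clarity, which is exactly what the kernel avoids by using the socket buffer described on the next slides.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: wrap the layer N+1 data with a header and a trailer
 * (the layer N+1 control information) so that layer N can treat the whole
 * result as opaque raw data.
 * Returns the total length written, or 0 if out is too small. */
size_t encapsulate(unsigned char *out, size_t out_cap,
                   const void *hdr, size_t hdr_len,
                   const void *payload, size_t payload_len,
                   const void *trl, size_t trl_len)
{
    size_t total = hdr_len + payload_len + trl_len;

    if (total > out_cap)
        return 0;
    memcpy(out, hdr, hdr_len);                          /* control: header  */
    memcpy(out + hdr_len, payload, payload_len);        /* layer N+1 data   */
    memcpy(out + hdr_len + payload_len, trl, trl_len);  /* control: trailer */
    return total;
}
```

Calling it twice, once per layer, nests the output of the upper layer inside the lower one: the lower layer never looks inside the bytes it was handed.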

Socket Buffer

Data passed along the data path through the Linux kernel is stored in sk_buff (socket buffer) structures. An sk_buff contains:

• Packet data
• Management information

• The sk_buff is first created incomplete, then filled in during passage through the kernel, both for received packets and for sent packets.

• Packet data is normally never copied. We just pass around pointers to the sk_buff and change structure members.

Socket Buffer

Figure: layout of struct sk_buff.
• next, prev, list: queue linkage. All sk_buffs are members of a queue.
• sk: the socket. Two device pointers record the associated device and the source device.
• head, data, tail, end: pointers into the packet data (MAC header, IP header, TCP header, data).
• Cloned sk_buffs share data, but not control.

struct sk_buff is defined in include/linux/skbuff.h.
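A simplified user-space model of the four data pointers may help. The `mini_skb` type and its functions are hypothetical stand-ins written for this illustration; they mirror the idea behind the kernel's skb_reserve(), skb_put(), skb_push() and skb_pull(), not their actual implementation.

```c
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for the four data pointers of an sk_buff. */
struct mini_skb {
    unsigned char *head;  /* start of the allocated buffer  */
    unsigned char *data;  /* start of the valid packet data */
    unsigned char *tail;  /* end of the valid packet data   */
    unsigned char *end;   /* end of the allocated buffer    */
};

int mini_skb_init(struct mini_skb *skb, size_t size)
{
    skb->head = malloc(size);
    if (!skb->head)
        return -1;
    skb->data = skb->tail = skb->head;
    skb->end = skb->head + size;
    return 0;
}

/* Leave headroom for headers that lower layers will prepend later.
 * Only valid on an empty buffer with len <= buffer size (not checked). */
void mini_skb_reserve(struct mini_skb *skb, size_t len)
{
    skb->data += len;
    skb->tail += len;
}

/* Append len bytes at the tail; returns where the caller writes them. */
unsigned char *mini_skb_put(struct mini_skb *skb, size_t len)
{
    unsigned char *old_tail = skb->tail;

    if ((size_t)(skb->end - skb->tail) < len)
        return NULL;
    skb->tail += len;
    return old_tail;
}

/* Prepend len bytes (a header) by moving data towards head. */
unsigned char *mini_skb_push(struct mini_skb *skb, size_t len)
{
    if ((size_t)(skb->data - skb->head) < len)
        return NULL;
    skb->data -= len;
    return skb->data;
}

/* Strip len bytes from the front once a layer has consumed its header. */
unsigned char *mini_skb_pull(struct mini_skb *skb, size_t len)
{
    if ((size_t)(skb->tail - skb->data) < len)
        return NULL;
    skb->data += len;
    return skb->data;
}
```

On the send side each layer pushes its header in front of the existing data; on the receive side each layer pulls its header off. In both directions only pointers move, so the payload bytes stay where they are.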

Socket Buffer

sk_buff features:

• Reference counts for cloned buffers
• Separate allocation pool and support
• Functions for manipulating the data space
• Very “feature-rich”: this is a very complex, detailed structure, encapsulating information from protocols at multiple layers

There are also numerous support functions for queues of sk_buffs.
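The reference counting behind cloned buffers can be sketched as follows. This is a user-space model under stated assumptions: `mini_buf`, `shared_data` and the `priority` field are hypothetical names chosen for the illustration, and the kernel's own clone routine, skb_clone(), is considerably more involved.

```c
#include <stdlib.h>

/* Packet bytes shared by all clones, plus a count of the clones using them. */
struct shared_data {
    int refcnt;
    unsigned char bytes[64];
};

/* Per-clone control block: every clone gets a private copy of these fields. */
struct mini_buf {
    struct shared_data *shared;
    int priority;              /* example of private control state */
};

struct mini_buf *mini_alloc(void)
{
    struct mini_buf *b = malloc(sizeof(*b));

    if (!b)
        return NULL;
    b->shared = calloc(1, sizeof(*b->shared));
    if (!b->shared) {
        free(b);
        return NULL;
    }
    b->shared->refcnt = 1;
    b->priority = 0;
    return b;
}

/* Clone: copy the control block, share the data, bump the reference count. */
struct mini_buf *mini_clone(const struct mini_buf *orig)
{
    struct mini_buf *c = malloc(sizeof(*c));

    if (!c)
        return NULL;
    *c = *orig;
    c->shared->refcnt++;
    return c;
}

/* Free one clone. Returns 1 if this call also released the shared data. */
int mini_free(struct mini_buf *b)
{
    int last = (--b->shared->refcnt == 0);

    if (last)
        free(b->shared);
    free(b);
    return last;
}
```

Changing `priority` on one clone leaves the other untouched, while a write to the shared bytes is visible through both, which matches "share data, but not control".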

Data Path Overview

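Figure: data path from hardware through the kernel to user space. The network device and its driver exchange packets through DMA rings. In softirq context, net_rx_action() passes received packets to the Layer 3 protocol demux (IP and other protocols), then to the Layer 4 protocol demux (TCP, UDP), and finally to the socket demux, which delivers to sockets in user space. A queue discipline sits between the protocol layers and the driver.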

OSI Layers 1&2 – Data Link

The code presented resides mostly in the following files:

• include/linux/netdevice.h

• net/core/skbuff.c

• net/core/dev.c

• net/dev/core.c

• arch/i386/kernel/irq.c

• drivers/net/net_init.c

• net/sched/sch_generic.c

• net/ethernet/eth.c (for layer 3 demux)

Data Link – Data Path

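Figure: data path at Layers 1 and 2, from hardware into the kernel.
1. The network device transfers packets by DMA into the DMA rings.
2. The driver's interrupt handler, net_interrupt() (covering net_rx, net_tx, net_error), calls netif_rx_schedule(), which adds the device pointer to the poll_queue.
3. In softirq context, net_rx_action() calls dev->poll() for the devices on the poll_queue.
4. dev->poll() calls netif_receive_skb(), which crosses from Layer 2 to Layer 3 and hands the packet to IP.
5. In the other direction, IP calls enqueue() on the queue discipline, which feeds the driver.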

Data Link – Features

• NAPI
  – Old API would reach interrupt livelock under 60 MBps
  – New API ensures earliest possible drop under overload
    1. Packet received at NIC
    2. NIC copies to DMA ring (struct sk_buff *rx_ring[])
    3. NIC raises an interrupt, and the interrupt handler calls netif_rx_schedule()
    4. Further interrupts are blocked
    5. Clock-based softirq calls net_rx_action(), which calls dev->poll()
    6. dev->poll() calls netif_receive_skb(), which does protocol demux (usually calling ip_rcv())
• Backward compatibility for non-DMA interfaces maintained
  – All legacy devices use the same backlog (equivalent to DMA ring)
  – Backlog queue is treated just like all other modern devices
• Per-CPU poll_list of devices to poll
  – Ensures no packet re-ordering necessary
• No memory copies in kernel: the packet stays in the sk_buff at the same memory location until passed to user space
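The receive steps above can be modelled in user space roughly as follows. All names (`mini_dev`, `poll_list`, `mini_rx_interrupt`, `mini_poll`, `mini_rx_action`) and the budget and quota parameters are assumptions made for this sketch; it shows the control flow only, not the kernel's actual data structures.

```c
#include <stddef.h>

#define MAX_DEVS 4

struct mini_dev {
    int rx_pending;   /* packets waiting in the DMA ring          */
    int irq_enabled;  /* are receive interrupts currently enabled */
    int scheduled;    /* is the device already on the poll list   */
    int delivered;    /* packets handed to the protocol layer     */
};

struct poll_list {
    struct mini_dev *devs[MAX_DEVS];
    size_t count;
};

/* Interrupt handler: block further receive interrupts, queue the device. */
void mini_rx_interrupt(struct poll_list *pl, struct mini_dev *dev)
{
    if (!dev->irq_enabled || dev->scheduled || pl->count == MAX_DEVS)
        return;
    dev->irq_enabled = 0;
    dev->scheduled = 1;
    pl->devs[pl->count++] = dev;
}

/* Device poll routine: deliver at most quota packets, return how many. */
int mini_poll(struct mini_dev *dev, int quota)
{
    int done = dev->rx_pending < quota ? dev->rx_pending : quota;

    dev->rx_pending -= done;
    dev->delivered += done;   /* stands in for netif_receive_skb() */
    return done;
}

/* Softirq handler: poll scheduled devices within an overall budget. */
void mini_rx_action(struct poll_list *pl, int budget, int quota)
{
    size_t i = 0;

    while (i < pl->count && budget > 0) {
        struct mini_dev *dev = pl->devs[i];

        budget -= mini_poll(dev, quota < budget ? quota : budget);
        if (dev->rx_pending == 0) {
            /* Ring drained: leave polling mode, re-enable interrupts. */
            dev->scheduled = 0;
            dev->irq_enabled = 1;
            pl->devs[i] = pl->devs[--pl->count];
        } else {
            i++;   /* still busy: stays listed for the next softirq */
        }
    }
}
```

Under overload the device simply stays on the poll list with interrupts blocked, so excess packets pile up (and are eventually dropped) in the DMA ring instead of consuming CPU time in interrupt handlers.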

Data Link – Transmission

1. Packet sent from IP layer to queue discipline
2. Any appropriate QoS in qdisc (discussed later)
3. qdisc notifies network driver when it’s time to send by calling hard_start_xmit()
   1. Place all ready sk_buff pointers in tx_ring
   2. Notify NIC that packets are ready to send
   3. NIC signals (via interrupt) when packet(s) successfully transmitted. (Highly variable on when interrupt is sent!)
   4. Interrupt handler queues transmitted packets for deallocation
4. At next softirq, all packets in completion_queue are deallocated
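The deferred deallocation in the transmit steps can be sketched like this. The `tx_state` structure and the three functions are hypothetical, and packets are represented by integer ids instead of sk_buff pointers to keep the example short.

```c
#include <stddef.h>
#include <string.h>

#define TX_RING_SIZE 8

struct tx_state {
    int ring[TX_RING_SIZE];        /* packet ids handed to the NIC      */
    size_t ring_len;
    int completion[TX_RING_SIZE];  /* transmitted, waiting to be freed  */
    size_t completion_len;
    int freed;                     /* number of packets deallocated     */
};

/* Driver transmit routine: place a ready packet on the ring.
 * Returns -1 if the ring is full, so the qdisc keeps the packet. */
int mini_start_xmit(struct tx_state *tx, int pkt)
{
    if (tx->ring_len == TX_RING_SIZE)
        return -1;
    tx->ring[tx->ring_len++] = pkt;
    return 0;
}

/* TX-complete interrupt: do not free here, only queue for deallocation. */
void mini_tx_interrupt(struct tx_state *tx)
{
    size_t moved = 0;

    while (moved < tx->ring_len && tx->completion_len < TX_RING_SIZE)
        tx->completion[tx->completion_len++] = tx->ring[moved++];
    /* Keep any packets that did not fit at the front of the ring. */
    memmove(tx->ring, tx->ring + moved,
            (tx->ring_len - moved) * sizeof(tx->ring[0]));
    tx->ring_len -= moved;
}

/* Softirq: deallocate everything on the completion queue in one pass. */
void mini_tx_softirq(struct tx_state *tx)
{
    tx->freed += (int)tx->completion_len;
    tx->completion_len = 0;
}
```

Splitting the work this way keeps the interrupt handler short: the comparatively expensive deallocation runs later, outside interrupt context.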

Data Link – VLAN Features

• Still dependent on individual NICs
  – Not all NICs implement VLAN filtering
    • A partial list is available at need (not included here)
  – For non-VLAN NICs, Linux filters in software and passes to the appropriate virtual interface for ingress prioritization and layer 3 protocol demux
    • net/8021q/vlan_dev.c (and others in this directory)
• Virtual interface passes through to real interface
  – No VID-based demux needed for received packets, as different VLANs are irrelevant to the IP layer.
  – Some changes in 2.6 (still need to research this)
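For reference, software VLAN filtering starts by reading the 802.1Q tag out of the Ethernet header. The sketch below is a stand-alone parser written for this illustration (`parse_vlan` and `vlan_info` are hypothetical names, not the code in net/8021q/); it relies only on the standard tag layout: TPID 0x8100 followed by a 16-bit field holding a 3-bit priority, one further flag bit, and a 12-bit VID.

```c
#include <stddef.h>
#include <stdint.h>

#define ETHERTYPE_8021Q 0x8100u  /* TPID that marks an 802.1Q-tagged frame */

struct vlan_info {
    int      tagged;     /* 1 if the frame carries an 802.1Q tag      */
    uint16_t vid;        /* 12-bit VLAN identifier                    */
    uint8_t  priority;   /* 3-bit priority                            */
    uint16_t ethertype;  /* EtherType of the encapsulated payload     */
};

static uint16_t read_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Parse the Ethernet header of frame.
 * Returns 0 on success, -1 if the frame is too short. */
int parse_vlan(const uint8_t *frame, size_t len, struct vlan_info *out)
{
    uint16_t type, tci;

    if (len < 14)
        return -1;
    type = read_be16(frame + 12);  /* follows the two 6-byte MAC addresses */
    if (type != ETHERTYPE_8021Q) {
        out->tagged = 0;
        out->vid = 0;
        out->priority = 0;
        out->ethertype = type;
        return 0;
    }
    if (len < 18)
        return -1;
    tci = read_be16(frame + 14);
    out->tagged = 1;
    out->priority = (uint8_t)(tci >> 13);   /* top 3 bits  */
    out->vid = tci & 0x0FFF;                /* low 12 bits */
    out->ethertype = read_be16(frame + 16);
    return 0;
}
```

The VID selects the virtual interface and the priority bits feed ingress prioritization; after that the inner EtherType is used for the usual layer 3 demux.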

OSI Layer 3: Internet

The code presented resides mostly in the following files:

• net/ipv4/ip_input.c – process packet arrivals

• net/ipv4/ip_output.c – process packet departures

• net/ipv4/ip_forward.c – process packet traversal

• net/ipv4/ip_fragment.c – IP packet fragmentation

• net/ipv4/ip_options.c – IP options

• net/ipv4/ipmr.c – IP multicast

• net/ipv4/ipip.c – IP over IP, also good virtual interface example

Internet: Data Path

Note: chart copied from DataTag’s “A Map of the Networking Code in the Linux Kernel”

Internet: Features

Netfilter hooks in many places

– INPUT, OUTPUT, FORWARD (iptables)

– NF_IP_PRE_ROUTING – ip_rcv()

– NF_IP_LOCAL_IN – ip_local_deliver()

– NF_IP_FORWARD – ip_forward()

– NF_IP_LOCAL_OUT – ip_build_and_send_pkt()

– NF_IP_POST_ROUTING – ip_finish_output()

• Connection tracking in IPv4, not in TCP/UDP/ICMP.
– Used for NAT, which must maintain connection state in violation of OSI layering

– Can also gather statistics for networking usage, but all of this functionality comes from the netfilter module
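The hook mechanism can be pictured as a per-hook-point list of callbacks that each return a verdict. The following is a user-space sketch with hypothetical names (`hook_table`, `register_hook`, `run_hooks`, `drop_telnet`); the real netfilter interface has more verdicts, priorities and arguments than are modelled here.

```c
#include <stddef.h>

enum verdict { VERDICT_DROP = 0, VERDICT_ACCEPT = 1 };

/* One entry per hook point named on the slide. */
enum hook_point {
    HOOK_PRE_ROUTING, HOOK_LOCAL_IN, HOOK_FORWARD,
    HOOK_LOCAL_OUT, HOOK_POST_ROUTING, HOOK_COUNT
};

struct packet { unsigned int dst_port; };

typedef enum verdict (*hook_fn)(struct packet *pkt);

#define MAX_HOOKS 4

struct hook_table {
    hook_fn fns[HOOK_COUNT][MAX_HOOKS];
    size_t count[HOOK_COUNT];
};

int register_hook(struct hook_table *t, enum hook_point hp, hook_fn fn)
{
    if (t->count[hp] == MAX_HOOKS)
        return -1;
    t->fns[hp][t->count[hp]++] = fn;
    return 0;
}

/* Run every hook registered at hp. The packet continues to okfn
 * (the next processing step) only if no hook drops it. */
enum verdict run_hooks(struct hook_table *t, enum hook_point hp,
                       struct packet *pkt, void (*okfn)(struct packet *))
{
    size_t i;

    for (i = 0; i < t->count[hp]; i++)
        if (t->fns[hp][i](pkt) == VERDICT_DROP)
            return VERDICT_DROP;
    if (okfn)
        okfn(pkt);
    return VERDICT_ACCEPT;
}

/* Example hook: drop anything addressed to the telnet port. */
enum verdict drop_telnet(struct packet *pkt)
{
    return pkt->dst_port == 23 ? VERDICT_DROP : VERDICT_ACCEPT;
}
```

In this model, a function such as ip_rcv() would end with a call like `run_hooks(table, HOOK_PRE_ROUTING, pkt, next_step)`, which is how a filter gets a chance to inspect the packet before processing continues.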

Socket Structure and System Call Mapping

The following files are useful:
• include/linux/net.h
• net/socket.c

There are two significant data structures involved, the socket and the net_proto_family. Both involve arrays of function pointers to handle each system call type that is relevant.

System Call: socket

1. From user space, an application calls socket(family, type, protocol)
2. The kernel calls sys_socket(), which calls sock_create()
3. sock_create() references net_families[family], an array of network protocol families, to find the corresponding protocol family, loading any modules necessary on the fly.
   • If a module has to be loaded, it is requested as “net-pf-<num>”, where the protocol family number is used directly in the string. For TCP, the family is PF_INET (was: AF_INET), and the type is SOCK_STREAM
   • Note: Linux has a hard limit of 32 protocol families. (These include PF_INET, PF_PACKET, PF_NETLINK, PF_INET6, etc.)
   • Layer 4 protocols are registered in inet_add_protocol() (include/net/protocol.h), and socket interfaces are registered by inet_register_protosw(). Raw IP datagram sockets are registered like any other Layer 4 protocol.
4. Once the correct family is found, sock_create() allocates an empty socket, obtains a mutex, and calls net_families[family]->create(). This is protocol-specific, and fills in the socket structure. The socket structure includes another function array, ops, which maps all system calls valid on file descriptors.
5. sys_socket() calls sock_map_fd() to map the new socket to a file descriptor, and returns it.
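The family lookup and the protocol-specific create() step can be sketched in user space. Every name below (`mini_proto_family`, `mini_sock_register`, `mini_sock_create`, the `demo_*` family) is hypothetical; the sketch keeps only the table-of-function-pointers idea and leaves out module loading, locking and file descriptors.

```c
#include <stddef.h>

#define NPROTO_SKETCH 32   /* mirrors the 32-family limit noted above */

struct mini_socket;

/* Per-socket operations, filled in by the family's create() function. */
struct mini_proto_ops {
    int (*bind)(struct mini_socket *sock, int port);
};

struct mini_socket {
    int type;
    int bound_port;
    const struct mini_proto_ops *ops;
};

struct mini_proto_family {
    int family;
    int (*create)(struct mini_socket *sock, int protocol);
};

static const struct mini_proto_family *families[NPROTO_SKETCH];

int mini_sock_register(const struct mini_proto_family *pf)
{
    if (pf->family < 0 || pf->family >= NPROTO_SKETCH || families[pf->family])
        return -1;
    families[pf->family] = pf;
    return 0;
}

int mini_sock_create(int family, int type, int protocol,
                     struct mini_socket *sock)
{
    if (family < 0 || family >= NPROTO_SKETCH || !families[family])
        return -1;   /* the kernel would first try to load a module here */
    sock->type = type;
    sock->bound_port = 0;
    sock->ops = NULL;
    return families[family]->create(sock, protocol);
}

/* A demo family registered under number 2 (an arbitrary choice here). */
static int demo_bind(struct mini_socket *sock, int port)
{
    sock->bound_port = port;
    return 0;
}

static const struct mini_proto_ops demo_ops = { demo_bind };

static int demo_create(struct mini_socket *sock, int protocol)
{
    (void)protocol;
    sock->ops = &demo_ops;   /* later calls dispatch through sock->ops */
    return 0;
}

const struct mini_proto_family demo_family = { 2, demo_create };
```

After `mini_sock_create()` succeeds, generic code never needs to know which family it is dealing with: it calls `sock->ops->bind(...)` and the family's own function runs.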

Other socket System Calls

Subsequent socket system calls are passed to the appropriate function in socket->ops[]. These include (exhaustive list):

• release
• bind
• connect
• socketpair
• accept
• getname
• poll
• ioctl
• listen
• shutdown
• setsockopt
• getsockopt
• sendmsg
• recvmsg
• mmap
• sendpage

Technically, Linux offers only one socket system call, sys_socketcall(), which multiplexes to all other system calls via the first parameter. This means that socket-based protocols could provide new and different system calls via a library and a mux, although this is never done in practice.
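The multiplexing idea is easy to show in miniature: one entry point, a call number as the first parameter, and a packed argument array. The call numbers, the `mux_*` functions and the tiny descriptor table below are all assumptions made for this sketch, not the kernel's definitions.

```c
#include <stddef.h>

/* Hypothetical call numbers selecting the operation. */
enum mux_call { MUX_SOCKET = 1, MUX_BIND = 2, MUX_LISTEN = 4 };

#define MAX_FDS 8
static int ports[MAX_FDS];   /* 0 = free slot, -1 = open but unbound */

long mux_socket(int family, int type, int protocol)
{
    int fd;

    (void)family; (void)type; (void)protocol;
    for (fd = 3; fd < MAX_FDS; fd++) {
        if (ports[fd] == 0) {
            ports[fd] = -1;
            return fd;
        }
    }
    return -1;
}

long mux_bind(int fd, int port)
{
    if (fd < 3 || fd >= MAX_FDS || ports[fd] != -1 || port <= 0)
        return -1;
    ports[fd] = port;
    return 0;
}

long mux_listen(int fd, int backlog)
{
    (void)backlog;
    return (fd >= 3 && fd < MAX_FDS && ports[fd] > 0) ? 0 : -1;
}

/* Single entry point: the first parameter picks the real operation,
 * the second carries that operation's arguments. */
long mux_socketcall(int call, const unsigned long *args)
{
    switch (call) {
    case MUX_SOCKET:
        return mux_socket((int)args[0], (int)args[1], (int)args[2]);
    case MUX_BIND:
        return mux_bind((int)args[0], (int)args[1]);
    case MUX_LISTEN:
        return mux_listen((int)args[0], (int)args[1]);
    default:
        return -1;   /* unknown call number */
    }
}
```

A user-space library would hide this behind ordinary-looking functions, which is why adding a new call number plus a library wrapper would be enough to expose a new operation.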

PF_PACKET

A brief word on the PF_PACKET protocol family

PF_PACKET creates a socket bound directly to a network device. The call may specify a packet type. All packets sent to this socket are sent directly over the device, and all incoming packets of this type are delivered directly to the socket. No processing is done in the kernel. Thus, this interface can be, and is, used to create user-space protocol implementations. (E.g., PPPoE uses this with packet type ETH_P_PPP_DISC)
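From user space, opening such a socket looks roughly like the sketch below. The helper names `packet_addr` and `open_packet_socket` are made up for this example; the system calls, headers and constants are the standard Linux ones. Opening the socket normally requires root or the CAP_NET_RAW capability, so the second function will return -1 for an unprivileged process.

```c
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>        /* htons */
#include <sys/socket.h>       /* socket, bind, PF_PACKET */
#include <linux/if_packet.h>  /* struct sockaddr_ll */
#include <net/ethernet.h>     /* ETH_P_* packet types */
#include <net/if.h>           /* if_nametoindex */

/* Fill in the link-layer address that binds a packet socket to one
 * device (by interface index) and one packet type. */
void packet_addr(struct sockaddr_ll *sll, unsigned int ifindex,
                 unsigned short ethertype)
{
    memset(sll, 0, sizeof(*sll));
    sll->sll_family = AF_PACKET;
    sll->sll_protocol = htons(ethertype);
    sll->sll_ifindex = (int)ifindex;
}

/* Open a raw packet socket on ifname for the given packet type.
 * Returns the file descriptor, or -1 on any failure. */
int open_packet_socket(const char *ifname, unsigned short ethertype)
{
    struct sockaddr_ll sll;
    unsigned int ifindex = if_nametoindex(ifname);
    int fd;

    if (ifindex == 0)
        return -1;                      /* no such interface */
    fd = socket(PF_PACKET, SOCK_RAW, htons(ethertype));
    if (fd < 0)
        return -1;                      /* typically a permission error */
    packet_addr(&sll, ifindex, ethertype);
    if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A PPPoE discovery client, for instance, could call `open_packet_socket("eth0", ETH_P_PPP_DISC)` (the interface name is only an example) and then read and write whole Ethernet frames on the returned descriptor.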

Quality of Service Mechanisms

Linux has two QoS mechanisms:

– Traffic Control
  • Provides for multiple queues and priority schemes within those queues between the IP layer and the network device
  • Defaults are 100-packet queues with 3 priorities and a FIFO ordering.
– KIDS (Karlsruhe Implementation architecture of Differentiated Services)
  • Designed to be component-extensible at runtime.
  • Consists of a set of components with similar interfaces that can be plugged together in almost arbitrarily complex constructions

Neither mechanism implements the higher-level traffic agreements, such as Traffic Conditioning Agreements (TCAs). MPLS is offered in Linux 2.6.

Traffic Control

Traffic Control consists of three types of components:

1. Queuing Disciplines
   • These implement the actual enqueue() and dequeue()
   • Also have child components
2. Filters
   • Filters classify traffic received at a Queuing Discipline into Classes
   • Normally children of a Queuing Discipline
3. Classes
   • These hold the packets classified by Filters, and have associated queuing disciplines to determine the queuing order.
   • Normally children of a Filter and parents of Queuing Disciplines

Components are connected into structures called “trees,” although technically they aren’t true trees because they allow upward (cyclical) links.

Traffic Control: Example

Figure: Queuing Discipline 1:0 exposes enqueue and dequeue. Below it, a list of Filters (ending in a Default) directs packets to Class 1:1 or Class 1:2. The classes hold the inner Queuing Disciplines 2:0 and 3:0.

This is a typical TC tree.

The top-level Queuing Discipline is the only access point from the outside, the “outer queue.” From external access, this is a single queue structure.

Internally, packets received at the outer queue are matched against each filter in order. The first match wins, with a final default case.

Dequeue requests to the outer queue are passed along recursively to the inner queues to find a packet ready for sending.
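A compact user-space model of such a tree is sketched below. The types and functions (`outer_qdisc`, `outer_enqueue`, `outer_dequeue`, `ssh_filter`) are hypothetical, each class holds a plain FIFO, and the dequeue policy (try the classes in order) is an assumption chosen for brevity; real queuing disciplines each define their own.

```c
#include <stddef.h>

#define QLEN 16
#define MAX_CLASSES 4

struct pkt { int id; unsigned int dst_port; };

/* Leaf queuing discipline: a plain FIFO. */
struct fifo { struct pkt items[QLEN]; size_t head, len; };

/* A filter maps a packet to a class index, or -1 for "no match". */
typedef int (*filter_fn)(const struct pkt *p);

/* Outer queuing discipline: filters plus one inner FIFO per class. */
struct outer_qdisc {
    filter_fn filters[MAX_CLASSES];
    size_t nfilters;
    struct fifo classes[MAX_CLASSES];
    size_t nclasses;
    int default_class;
};

static int fifo_enqueue(struct fifo *q, const struct pkt *p)
{
    if (q->len == QLEN)
        return -1;
    q->items[(q->head + q->len) % QLEN] = *p;
    q->len++;
    return 0;
}

static int fifo_dequeue(struct fifo *q, struct pkt *out)
{
    if (q->len == 0)
        return -1;
    *out = q->items[q->head];
    q->head = (q->head + 1) % QLEN;
    q->len--;
    return 0;
}

/* Enqueue: the first matching filter wins, otherwise the default class. */
int outer_enqueue(struct outer_qdisc *q, const struct pkt *p)
{
    int cls = q->default_class;
    size_t i;

    for (i = 0; i < q->nfilters; i++) {
        int m = q->filters[i](p);
        if (m >= 0 && (size_t)m < q->nclasses) {
            cls = m;
            break;
        }
    }
    return fifo_enqueue(&q->classes[cls], p);
}

/* Dequeue: pass the request down to the inner queues, in class order. */
int outer_dequeue(struct outer_qdisc *q, struct pkt *out)
{
    size_t i;

    for (i = 0; i < q->nclasses; i++)
        if (fifo_dequeue(&q->classes[i], out) == 0)
            return 0;
    return -1;
}

/* Example filter: send SSH traffic to class 0. */
int ssh_filter(const struct pkt *p)
{
    return p->dst_port == 22 ? 0 : -1;
}
```

From the outside only `outer_enqueue()` and `outer_dequeue()` are visible, so the driver sees a single queue even though packets are sorted into separate inner queues.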

Traffic Control (Cont’d)

The TC architecture supports a number of pre-built filters, classes, and disciplines, found in net/sched/. Files named cls_* are filters, whereas sch_* are disciplines (classes are collocated with disciplines).

Some disciplines:

• ATM
• Class-Based Queuing
• Clark-Shenker-Zhang
• Differentiated Services mark
• FIFO
• RED
• Hierarchical Fair Service Curve (SIGCOMM’97)
• Hierarchical Token Bucket
• Network Emulator (for protocol testing)
• Priority (3 levels)
• Generic RED
• Stochastic Fairness Queuing
• Token Bucket
• Equalizer (for equalizing line rates of different links)

KIDS

KIDS establishes 5 general component types (by interface):

• Operative Components – receive a packet and run an algorithm on it. The packet may be modified or simply examined. E.g., Token Buckets, RED, Shaper
• Queue Components – data structures used to enqueue/dequeue. Includes FIFO, “Earliest-Deadline-First” (EDF), etc.
• Enqueuing Components – enqueue packets based on special methods: tail-enqueue, head-enqueue, EDF-enqueue, etc.
• Dequeuing Components – dequeue based on special methods
• Strategic Components – strategies for dequeue requests. E.g., WFQ, Round Robin
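To make "receives a packet and runs an algorithm on it" concrete, here is a generic token bucket of the kind named above as an example of an operative component. It is a textbook sketch, not the KIDS component interface: the structure, function names and the choice of microsecond timestamps are all assumptions.

```c
#include <stdint.h>

struct token_bucket {
    uint64_t rate;     /* tokens (bytes) added per second  */
    uint64_t burst;    /* bucket capacity in bytes         */
    uint64_t tokens;   /* tokens currently available       */
    uint64_t last_us;  /* time of the last refill, in us   */
};

void tb_init(struct token_bucket *tb, uint64_t rate, uint64_t burst,
             uint64_t now_us)
{
    tb->rate = rate;
    tb->burst = burst;
    tb->tokens = burst;   /* start with a full bucket */
    tb->last_us = now_us;
}

/* Examine a packet of len bytes arriving at now_us.
 * Returns 1 (and consumes tokens) if it conforms, 0 otherwise.
 * Fractional tokens are discarded at each refill for simplicity. */
int tb_conforms(struct token_bucket *tb, uint64_t len, uint64_t now_us)
{
    uint64_t elapsed = now_us - tb->last_us;
    uint64_t added = elapsed * tb->rate / 1000000u;

    tb->tokens = (tb->tokens + added > tb->burst) ? tb->burst
                                                  : tb->tokens + added;
    tb->last_us = now_us;
    if (len > tb->tokens)
        return 0;
    tb->tokens -= len;
    return 1;
}
```

What happens to a non-conforming packet (drop, delay, or re-mark) is a separate decision, which is the kind of split the component types above are meant to allow.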

KIDS (Cont’d)

• KIDS has 8 different hook points in the Linux kernel, 5 at the IP layer and 3 at Layer 2:
– IP_LOCAL_IN – just prior to delivery to Layer 4

– IP_LOCAL_OUT – just after leaving Layer 4

– IP_FORWARD – packet being forwarded (router)

– IP_PRE_ROUTING – Packet newly arrived at IP layer from interface

– IP_POST_ROUTING – Packet routed from IP to Layer 2

– L2_INPUT_<dev> – Packet has just arrived from interface

– L2_ENQUEUE_<dev> – Packet is being queued at Layer 2

– L2_DEQUEUE_<dev> – Packet is being transmitted by Layer 2

