NUSE (Network Stack in Userspace) at #osio

Date post: 02-Jul-2015
Upload: hajime-tazaki
Description:
NUSE talk at #osio http://operatingsystems.io/
Network Stack in Userspace (NUSE). Hajime Tazaki, Ryo Nakamura (University of Tokyo). New Directions in Operating Systems, London, 2014
Transcript
Page 1: NUSE (Network Stack in Userspace) at #osio

Network Stack in Userspace (NUSE)

Hajime Tazaki, Ryo Nakamura

(University of Tokyo)

New Directions in Operating Systems, London, 2014

Page 2: NUSE (Network Stack in Userspace) at #osio

Motivation

Implementation of the Internet is not finished yet

Faster evolution of OSes (network stack)

OS personalization

Page 3: NUSE (Network Stack in Userspace) at #osio

I have a new Layer-3/4 protocol! Yay!

I have a new, great Layer-3/4 protocol! It will change the WORLD!

Replace network stack?

No: destroy my life?! (experimental? not tested?)

Yes: I wanna be your slave.

Slow evolution of network stack?

VM on personal device?

Page 4: NUSE (Network Stack in Userspace) at #osio

Virtual Machine?

Jon Howell, Galen Hunt, David Molnar, and Donald E. Porter, Living Dangerously: A Survey of Software Download Practices, no. MSR-TR-2010-51, May 2010

Poll: “When you download and run software, how often do you use a virtual machine (to reduce security risks)?”

Page 5: NUSE (Network Stack in Userspace) at #osio

Rekindling Network Protocol Innovation with User-Level Stacks

Michio Honda*, Felipe Huici*, Costin Raiciu†, Joao Araujo‡, Luigi Rizzo§‖

NEC Europe Ltd.*, Universitatea Politehnica Bucuresti†, University College London‡, Università di Pisa§, International Computer Science Institute, Berkeley, CA‖

{first.last}@neclab.eu, [email protected], [email protected], [email protected]

ABSTRACT

Recent studies show that more than 86% of Internet paths allow well-designed TCP extensions, meaning that it is still possible to deploy transport layer improvements despite the existence of middleboxes in the network. Hence, the blame for the slow evolution of protocols (with extensions taking many years to become widely used) should be placed on end systems.

In this paper, we revisit the case for moving protocols stacks up into user space in order to ease the deployment of new protocols, extensions, or performance optimizations. We present MultiStack, operating system support for user-level protocol stacks. MultiStack runs within commodity operating systems, can concurrently host a large number of isolated stacks, has a fall-back path to the legacy host stack, and is able to process packets at rates of 10Gb/s.

We validate our design by showing that our mux/demux layer can validate and switch packets at line rate (up to 14.88 Mpps) on a 10 Gbit port using 1-2 cores, and that a proof-of-concept HTTP server running over a basic userspace TCP outperforms by 18–90% both the same server and nginx running over the kernel’s stack.

Categories and Subject Descriptors

C.2.2 [Computer-communication Networks]: Network Protocols; D.4.4 [Operating Systems]: Communications Management

Keywords

transport protocols, operating systems, deployability

1. INTRODUCTION

The TCP/IP protocol suite has been mostly implemented in the operating system kernel since the inception of UNIX to ensure performance, security and isolation between user processes. Over time, new protocols and features have appeared (e.g., SCTP, DCCP, MPTCP, improved versions of TCP), many of which have become part of mainstream OSes and distributions. Fortunately, the Internet is still able to accommodate the evolution of protocols: a recent study [10] has shown that as many as 86% of Internet paths still allow TCP extensions despite the existence of a large number of middleboxes.

However, the availability of a feature does not imply widespread, timely deployment. Being part of the kernel, new protocols/extensions have system-wide impact, and are typically enabled or installed during OS upgrades. These happen infrequently not only because of slow release cycles, but also due to their cost and potential disruption to existing setups. If protocol stacks were embedded into applications, they could be updated on a case-by-case basis, and deployment would be a lot more timely.

[Figure 1: TCP options deployment over time. Y axis: ratio of flows (0.00 to 1.00); X axis: date (2007 to 2012); options: SACK, Timestamp, Windowscale; direction: inbound, outbound.]

For example, Mac OS, Windows XP and FreeBSD still use a traditional Additive Increase Multiplicative Decrease (AIMD) algorithm for TCP congestion control, while Linux and Windows Vista (and later) use newer algorithms that achieve better bandwidth utilization and mitigate RTT unfairness [21, 25]. From a user’s point of view there is no reason not to adopt such new algorithms, but they do not because it can only be done via OS upgrades that are often costly or unavailable. Even if they are available, OS default settings that disable such extensions or modifications can further hinder timely deployment.

Figure 1 shows another example, the usage of the three most pervasive TCP extensions: Window Scale (WS) [12], Timestamps (TS) [12] and Selective Acknowledgment (SACK) [16]*. For example, despite WS and TS being available since Windows 2000 and on by default since Windows Vista in 2006, as late as 2012 more than 30% and 70% of flows still did not negotiate these options (respectively), showing that it can take a long time to actually upgrade or change OSes and thus the network stacks in their kernels. We see wider deployment for SACK in 2007 (70%) compared to the other options thanks to it being on by default since Windows 2000, but even with this, 20% of flows still did not use this option as late as 2011. The argument remains

*We used a set of daily traces from the WIDE backbone network which provides connectivity to universities and research institutes in Japan [3].

ACM SIGCOMM Computer Communication Review 53 Volume 44, Number 2, April 2014

Slow evolution of network stack

Honda et al., Rekindling Network Protocol Innovation with User-Level Stacks, ACM SIGCOMM CCR, Vol. 44, Num. 2, April 2014

Page 6: NUSE (Network Stack in Userspace) at #osio

Meanwhile in Filesystem world...

There is,

Filesystem in Userspace (FUSE)

Userspace code can host new filesystems (sshfs, GmailFS, etc.)

Performance is bad, but it doesn’t matter

Flexibility and functionality do matter

http://fuse.sourceforge.net/

Page 7: NUSE (Network Stack in Userspace) at #osio

Alternatives

Container (LXC, OpenVZ, vimage)

share kernel with host operating system (no flexibility)

Library OS

From scratch: mtcp, Mirage, lwIP

Porting: OSv, Sandstorm, libuinet (FreeBSD), Arrakis (lwIP), OpenOnload (lwIP?)

Glue-layer: LKL (Linux-2.6), rumpkernel (NetBSD)

Page 8: NUSE (Network Stack in Userspace) at #osio

Network Stack in Userspace

Page 9: NUSE (Network Stack in Userspace) at #osio

What’s NUSE?

Network stack in Userspace

A library operating system

Library version of network stack (of monolithic kernel)

Linux (latest), FreeBSD (plan)

(UNIX) Process-based virtualization

[Diagram labels: nuse example; glibc; libnuse (TCP/IP, ARP/ndisc); userspace; kernel (kernel bypassed); raw sock, netmap, DPDK (etc); NIC]

Page 10: NUSE (Network Stack in Userspace) at #osio

Why NUSE?

minimized porting effort

Linux (net-next) changes frequently

fully functional network stack for netmap, DPDK (any kernel-bypass technology)

Page 11: NUSE (Network Stack in Userspace) at #osio

[Architecture diagram labels: Application; POSIX glue; Kernel layer (ARP, Qdisc, TCP, UDP, DCCP, SCTP, ICMP, IPv4, IPv6, Netlink, Bridging, Netfilter, IPSec, Tunneling); NUSE core; bottom halves/rcu/timer/interrupt; struct net_device; petit-scheduler; RAW, DPDK, netmap, ...; NIC]

How it works

1. (monolithic) kernel source

2. scheduler

3. POSIX glue: redirect system calls

4. network I/O: raw socket, DPDK, netmap, etc.

Page 12: NUSE (Network Stack in Userspace) at #osio

1) kernel build

patch to kernel tree

with new (hw independent) arch (arch/sim)

robust to (frequent) mainstream changes


Page 13: NUSE (Network Stack in Userspace) at #osio

2) scheduler

offer alternate context primitives

interrupts, timer, thread, bottom halves (tasklet, workqueue, waiter, etc.)

wrap with POSIX thread

easily debuggable

ucontext fiber for low overhead (not yet)

Page 14: NUSE (Network Stack in Userspace) at #osio

3) POSIX glue code

Hijack function calls

socket => nuse_socket

read => nuse_read

apps are not aware of the redirection

LD_PRELOAD=libnuse.so ..
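
The redirection on this slide can be sketched as a small dispatch wrapper. This is only an illustration under stated assumptions, not the code in libnuse: `stack_fd_register`, `is_stack_fd`, `stack_read` and `hijacked_read` are hypothetical names, and the "stack" is a stub that returns canned bytes. In a real LD_PRELOAD library the wrapper would be named `read` itself and would usually locate the host version through `dlsym(RTLD_NEXT, "read")`; here the fallback calls read(2) directly so that the sketch stays in one file.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_STACK_FDS 64

/* Hypothetical table: slot i is nonzero if descriptor i belongs to the
 * userspace stack rather than to the host kernel. */
static int stack_fds[MAX_STACK_FDS];

/* Mark a descriptor as owned by the userspace stack. Returns 0, or -1 if
 * the descriptor does not fit in the table. */
int stack_fd_register(int fd)
{
    if (fd < 0 || fd >= MAX_STACK_FDS)
        return -1;
    stack_fds[fd] = 1;
    return 0;
}

int is_stack_fd(int fd)
{
    return fd >= 0 && fd < MAX_STACK_FDS && stack_fds[fd];
}

/* Stub standing in for the stack's receive path (nuse_read on the slide).
 * It copies a fixed payload instead of real socket data. */
ssize_t stack_read(int fd, void *buf, size_t len)
{
    static const char payload[] = "from-userspace-stack";
    size_t n = sizeof(payload) - 1;

    (void)fd;
    assert(buf != NULL);
    if (len < n)
        n = len;
    memcpy(buf, payload, n);
    return (ssize_t)n;
}

/* The wrapper: descriptors owned by the userspace stack go to the stack,
 * everything else falls through to the host's read(2). */
ssize_t hijacked_read(int fd, void *buf, size_t len)
{
    if (is_stack_fd(fd))
        return stack_read(fd, buf, len);
    return read(fd, buf, len);
}
```

The application keeps calling the same function for files and sockets alike; only the descriptor table decides which implementation serves the call, which is why the application needs no changes.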


Page 15: NUSE (Network Stack in Userspace) at #osio

4) network I/O

connect NUSE to NIC

options

raw socket (default)

DPDK (if available)

netmap (if available)

Tap
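
One common way to make such backends interchangeable is a table of function pointers selected by name. The sketch below assumes that design; it is not taken from the NUSE source. `struct vif_ops`, `stub_write` and `vif_lookup` are hypothetical, the single stub writer only reports the frame length, and falling back to raw for an unknown name is an assumption based on the slide calling raw socket the default.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>

/* Hypothetical per-backend operations. A real backend would also need
 * open, read and close hooks. */
struct vif_ops {
    const char *name;
    /* Returns the number of bytes handed to the backend. */
    long (*write)(const void *frame, size_t len);
};

/* Stub writer shared by all entries: it sends nothing and only reports
 * the length it was given. */
long stub_write(const void *frame, size_t len)
{
    assert(frame != NULL);
    return (long)len;
}

static const struct vif_ops backends[] = {
    { "raw",    stub_write },   /* listed as the default on the slide */
    { "tap",    stub_write },
    { "netmap", stub_write },
    { "dpdk",   stub_write },
};

/* Pick a backend by name, ignoring case. Unknown or missing names fall
 * back to the first entry (raw). */
const struct vif_ops *vif_lookup(const char *viftype)
{
    size_t i;

    if (viftype != NULL) {
        for (i = 0; i < sizeof(backends) / sizeof(backends[0]); i++) {
            if (strcasecmp(viftype, backends[i].name) == 0)
                return &backends[i];
        }
    }
    return &backends[0];
}
```

With this shape, the stack's transmit path calls `ops->write(frame, len)` without knowing which backend sits underneath, so adding another kernel-bypass technology means adding one table entry.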


Page 16: NUSE (Network Stack in Userspace) at #osio

Usage

download

git clone git://github.com/libos-nuse/net-next-nuse

compile

make library ARCH=sim (NETMAP=yes) (DPDK=yes)

execute

sudo NUSECONF=nuse.conf ./nuse (application)

Page 17: NUSE (Network Stack in Userspace) at #osio

configs

# Interface definition.
interface eth0
    address 192.168.0.10
    netmask 255.255.255.0
    macaddr 00:01:01:01:01:01
    viftype TAP

interface p1p1
    address 172.16.0.1
    netmask 255.255.255.0
    macaddr 00:01:01:01:01:02

# route entry definition.
route
    network 0.0.0.0
    netmask 0.0.0.0
    gateway 192.168.0.1

Page 18: NUSE (Network Stack in Userspace) at #osio

(possible) use cases

New protocol deployment

Chrome + Linux mptcp (on NUSE)

Process-level virtual instance

% NUSE-linux-ovs | NUSE-freebsd-NAT | NUSE-router | NUSE-nginx

VM chaining via UNIX command line

Page 19: NUSE (Network Stack in Userspace) at #osio

Limitations (ongoing work)

no fork(2)/exec(2) support

no multi-process support

no sysctl/proc

(inefficient) thread scheduling

Page 20: NUSE (Network Stack in Userspace) at #osio

Experiments

1. Can we benefit from OS personalization?

present a custom (NUSE) kernel with an application (OS personalization)

2. How much overhead does NUSE add?

Simple performance measurements


Page 21: NUSE (Network Stack in Userspace) at #osio

Tested applications

working

ping, iperf, nginx (partially), sleep

need patches

nc, wget, dig, host

Page 22: NUSE (Network Stack in Userspace) at #osio

Setup: Performance measurement

          NUSE node                              Tx/Rx nodes
CPU       Xeon E5-2650v2 @ 2.60GHz (16 core)     Xeon L3426 @ 1.87GHz (8 core)
Memory    32GB                                   4GB
NIC       Intel X520                             Intel X520

[Topology diagram: Tx, NUSE and Rx nodes joined by two 10G links; labels: ping, flowgen (shown twice); vnstat (packet count)]

Page 23: NUSE (Network Stack in Userspace) at #osio

Host Tx (NUSE->Receiver)

ping (RTT, ms)

          avg      max      min
dpdk      2.610    8.000    0.156
netmap    0.370    0.494    0.252
raw       0.396    0.501    0.290
tap       0.397    0.538    0.303

[Bar chart: ping RTT (ms), axis 0 to 8, for dpdk, netmap, raw, tap]

[Bar chart: throughput (1024 byte, UDP) in Mbps, axis 0 to 500, for dpdk, netmap, raw, tap]

Page 24: NUSE (Network Stack in Userspace) at #osio

L3 Routing (Sender->NUSE->Receiver)

ping (RTT, ms)

          avg       max       min
dpdk      11.998    27.700    0.252
netmap    0.664     0.741     0.556
raw       0.663     0.761     0.575
tap       0.694     0.749     0.602

[Bar chart: ping RTT (ms), axis 0 to 30, for dpdk, netmap, raw, tap]

[Bar chart: throughput (1024 byte, UDP) in Mbps, axis 0 to 500, for netmap, raw, tap]

Page 25: NUSE (Network Stack in Userspace) at #osio

Discussions

performance is not so bad

we don’t care much about performance

network stack is fully functional

but supplemental tools are not sufficient

Page 26: NUSE (Network Stack in Userspace) at #osio

Network Simulator Integration (ns-3)

network stack + ns-3 network simulator

Direct Code Execution (DCE)

Established by Mathieu Lacage (2006)

part of ns-3 project

Features

reproducible (deterministic clock)

controllable (simulator’s facility)

http://www.nsnam.org/overview/projects/direct-code-execution/

Page 27: NUSE (Network Stack in Userspace) at #osio


Page 28: NUSE (Network Stack in Userspace) at #osio

NUSE vs DCE

                  NUSE                    DCE
kernel library    ARCH=sim                ARCH=sim
scheduler         (host) pthread          simulator’s scheduler (deterministic)
POSIX             hijack                  hijack
network I/O       raw/netmap/DPDK/tap     ns3::NetDevice
execution         LD_PRELOAD              dlmopen(3), single proc/multi-instances

shared

Page 29: NUSE (Network Stack in Userspace) at #osio

DCE Architecture

[Architecture diagram labels: 1) Core layer (Virtualization Core layer; memory: Heap, Stack); 2) Kernel layer (ARP, Qdisc, TCP, UDP, DCCP, SCTP, ICMP, IPv4, IPv6, Netlink, Bridging, Netfilter, IPSec, Tunneling; bottom halves/rcu/timer/interrupt; struct net_device); 3) POSIX layer; Application (ip, iptables, quagga); DCE; ns-3 (network simulation core); ns-3 application; ns-3 TCP/IP stack]

Page 30: NUSE (Network Stack in Userspace) at #osio

Bug reproducibility

[Topology diagram labels: mobile node; Wi-Fi links to AP1 and AP2; handoff; Home Agent; correspondent node; ping6]

(gdb) b mip6_mh_filter if dce_debug_nodeid()==0
Breakpoint 1 at 0x7ffff287c569: file net/ipv6/mip6.c, line 88.
<continue>
(gdb) bt 4
#0  mip6_mh_filter (sk=0x7ffff7f69e10, skb=0x7ffff7cde8b0) at net/ipv6/mip6.c:109
#1  0x00007ffff2831418 in ipv6_raw_deliver (skb=0x7ffff7cde8b0, nexthdr=135) at net/ipv6/raw.c:199
#2  0x00007ffff2831697 in raw6_local_deliver (skb=0x7ffff7cde8b0, nexthdr=135) at net/ipv6/raw.c:232
#3  0x00007ffff27e6068 in ip6_input_finish (skb=0x7ffff7cde8b0) at net/ipv6/ip6_input.c:197

Page 31: NUSE (Network Stack in Userspace) at #osio

Debugging

Memory error detection

among distributed nodes

in a single process

using Valgrind

==5864== Memcheck, a memory error detector
==5864== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==5864== Using Valgrind-3.6.0.SVN and LibVEX; rerun with -h for copyright info
==5864== Command: ../build/bin/ns3test-dce-vdl --verbose
==5864==
==5864== Conditional jump or move depends on uninitialised value(s)
==5864==    at 0x7D5AE32: tcp_parse_options (tcp_input.c:3782)
==5864==    by 0x7D65DCB: tcp_check_req (tcp_minisocks.c:532)
==5864==    by 0x7D63B09: tcp_v4_hnd_req (tcp_ipv4.c:1496)
==5864==    by 0x7D63CB4: tcp_v4_do_rcv (tcp_ipv4.c:1576)
==5864==    by 0x7D6439C: tcp_v4_rcv (tcp_ipv4.c:1696)
==5864==    by 0x7D447CC: ip_local_deliver_finish (ip_input.c:226)
==5864==    by 0x7D442E4: ip_rcv_finish (dst.h:318)
==5864==    by 0x7D2313F: process_backlog (dev.c:3368)
==5864==    by 0x7D23455: net_rx_action (dev.c:3526)
==5864==    by 0x7CF2477: do_softirq (softirq.c:65)
==5864==    by 0x7CF2544: softirq_task_function (softirq.c:21)
==5864==    by 0x4FA2BE1: ns3::TaskManager::Trampoline(void*) (task-manager.cc:261)
==5864==  Uninitialised value was created by a stack allocation
==5864==    at 0x7D65B30: tcp_check_req (tcp_minisocks.c:522)
==5864==

Page 32: NUSE (Network Stack in Userspace) at #osio

Fine-grained parameter coverage

Code coverage measurement with DCE

With fine-grained network, node, protocol parameters

Page 33: NUSE (Network Stack in Userspace) at #osio

Continuous integration


http://ns-3-dce.cloud.wide.ad.jp/jenkins/job/daily-net-next-sim/

Page 34: NUSE (Network Stack in Userspace) at #osio

Summary

NUSE (Network Stack in Userspace)

OS personalization (fast evolution, easy deployment)

DCE (Direct Code Execution)

Flexible network experiment/test with deterministic clock

Page 35: NUSE (Network Stack in Userspace) at #osio

github: https://github.com/libos-nuse/net-next-nuse

DCE: http://bit.ly/ns-3-dce

twitter: @thehajime


Page 36: NUSE (Network Stack in Userspace) at #osio

Backups

Page 37: NUSE (Network Stack in Userspace) at #osio

Potentials of Userspace Networking

High-performance networking

Useful debugging facilities

Operating system personalization


Page 38: NUSE (Network Stack in Userspace) at #osio

1) kernel build

build kernel source tree w/ the patch

make menuconfig ARCH=sim

make library ARCH=sim

➔ libnuse-linux-3.17-rc1.so


Page 39: NUSE (Network Stack in Userspace) at #osio

Example: How timer works

[Diagram labels: add_timer(); timer_list; TIMER_SOFTIRQ; run_timer_softirq(); timer handler; timer thread (timer_create(2))]
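
One reading of the timer diagram is that `add_timer()` queues an entry on a list and a host-side timer thread later runs the expired handlers. The sketch below illustrates that idea only. All `sketch_*` names are hypothetical, the structure is a simplified stand-in for the kernel's `struct timer_list`, and the host timer thread (timer_create(2) on the slide) is left out: the caller passes the current tick by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct timer_list. */
struct sketch_timer {
    unsigned long expires;            /* absolute tick at which to fire */
    void (*function)(unsigned long);  /* timer handler */
    unsigned long data;               /* argument passed to the handler */
    struct sketch_timer *next;
};

/* Pending timers, sorted by expiry with the earliest first. */
static struct sketch_timer *pending;

/* Example handler used for demonstration: adds its argument to a sum. */
unsigned long sketch_fired_sum;

void sketch_sum_handler(unsigned long data)
{
    sketch_fired_sum += data;
}

/* Counterpart of add_timer(): insert into the sorted pending list. */
void sketch_add_timer(struct sketch_timer *t)
{
    struct sketch_timer **pp = &pending;

    assert(t != NULL && t->function != NULL);
    while (*pp != NULL && (*pp)->expires <= t->expires)
        pp = &(*pp)->next;
    t->next = *pp;
    *pp = t;
}

/* Counterpart of run_timer_softirq(): run every handler that is due at
 * tick `now` and return how many fired. Tick wraparound is ignored. */
int sketch_run_timers(unsigned long now)
{
    int fired = 0;

    while (pending != NULL && pending->expires <= now) {
        struct sketch_timer *t = pending;

        pending = t->next;   /* unlink first so the handler may re-arm */
        t->next = NULL;
        t->function(t->data);
        fired++;
    }
    return fired;
}
```

Keeping the list sorted means each run only inspects the head until it finds a timer that is not yet due, so an idle tick costs a single comparison.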

Page 40: NUSE (Network Stack in Userspace) at #osio

3) POSIX glue code


https://github.com/libos-nuse/net-next-nuse/blob/nuse/arch/sim/nuse-glue.c

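extern int sim_sock_socket (int, int, int, struct socket **);

/* Replacement for socket(2): the socket is created inside the NUSE stack. */
int socket (int family, int type, int proto)
{
  sim_update_jiffies ();
  /* Allocate and zero the stack-side socket object. */
  struct socket *kernel_socket = sim_malloc (sizeof (struct socket));
  memset (kernel_socket, 0, sizeof (struct socket));
  /* Note: ret is not checked in this excerpt. */
  int ret = sim_sock_socket (family, type, proto, &kernel_socket);
  /* The index in g_fd_table is the descriptor returned to the application. */
  g_fd_table[curfd++] = kernel_socket;
  sim_softirq_wakeup ();
  return curfd - 1;
}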

Page 41: NUSE (Network Stack in Userspace) at #osio

Tx callgraph

sendmsg () (socket API)
sim_sock_sendmsg () (NUSE)
sock_sendmsg ()
ip_send_skb ()
ip_finish_output2 ()
dst_neigh_output () (ex-kernel)
neigh_resolve_output ()
arp_solicit ()
dev_queue_xmit ()
sim_dev_xmit () (NUSE)
nuse_vif_raw_write ()

Page 42: NUSE (Network Stack in Userspace) at #osio

Rx callgraph

vNIC rx

start_thread () (pthread)
nuse_netdev_rx_trampoline ()
nuse_vif_raw_read () (NUSE)
sim_dev_rx ()
netif_rx () (ex-kernel)

softirq rx

start_thread () (pthread)
do_softirq () (NUSE)
net_rx_action ()
process_backlog () (ex-kernel)
__netif_receive_skb_core ()
ip_rcv ()
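
The two receive call graphs suggest a two-stage design: one context reads frames from the virtual NIC and queues them, and a second context drains the queue and feeds the protocol code. The sketch below illustrates that split in a single thread. It is an assumption-laden simplification: every `sketch_*` name is hypothetical, the queue is a small fixed ring, delivery only counts bytes where the real path reaches `ip_rcv()`, and the locking that two pthreads would need is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define BACKLOG_MAX 8

struct sketch_frame {
    const void *data;
    size_t len;
};

/* Fixed-size ring standing in for the per-CPU backlog queue. */
static struct sketch_frame backlog[BACKLOG_MAX];
static size_t backlog_head, backlog_count;
static int rx_softirq_pending;

/* Total bytes "delivered" to the protocol layer by the drain stage. */
size_t sketch_delivered_bytes;

/* Stage 1 (vNIC rx, netif_rx() in the call graph): queue the frame and
 * flag that receive work is pending. Returns 0, or -1 if the frame was
 * dropped because the ring is full. */
int sketch_netif_rx(const void *data, size_t len)
{
    size_t tail;

    assert(data != NULL);
    if (backlog_count == BACKLOG_MAX)
        return -1;
    tail = (backlog_head + backlog_count) % BACKLOG_MAX;
    backlog[tail].data = data;
    backlog[tail].len = len;
    backlog_count++;
    rx_softirq_pending = 1;
    return 0;
}

/* Stage 2 (softirq rx): if work is pending, drain the ring in arrival
 * order and return the number of frames handled. */
int sketch_do_softirq(void)
{
    int handled = 0;

    if (!rx_softirq_pending)
        return 0;
    rx_softirq_pending = 0;
    while (backlog_count > 0) {
        /* Stand-in for handing the frame to ip_rcv(). */
        sketch_delivered_bytes += backlog[backlog_head].len;
        backlog_head = (backlog_head + 1) % BACKLOG_MAX;
        backlog_count--;
        handled++;
    }
    return handled;
}
```

Separating the stages keeps the NIC-facing reader short: it only copies a frame reference into the ring, while the heavier protocol processing runs in the other context.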

