
BEBA

Behavioural Based Forwarding

Deliverable Report

D3.3 – BEBA Data Software Design, Implementation and Acceleration

Project co-funded by the European Commission within the Horizon 2020 (H2020) Programme

DISSEMINATION LEVEL
PU  Public  X
PP  Restricted to other programme participants (including the Commission Services)
RE  Restricted to a group specified by the consortium (including the Commission Services)
CO  Confidential, only for members of the consortium (including the Commission Services)

Deliverable title: BEBA Data Software Design, Implementation and Acceleration
Version: 1.0
Grant Agreement: 644122
Due date of deliverable (month): January 2016
Actual submission date of the deliverable (dd/mm/yyyy): 01/02/2016
Start date of project (dd/mm/yyyy): 01/01/2015
Duration of the project: 27 months
Work Package: WP3
Task: T3.3
Leader for this deliverable: NEC
Other contributing partners: 6WIND, CNIT, CESNET
Authors: F. Huici, F. Schmidt (NEC), V. Puš, M. Špinler, P. Benacek (CESNET), G. Procissi, N. Bonelli, N. Blefari, M. Bonola, S. Pontarelli (CNIT), Q. Monnet (6WIND)
Deliverable reviewer(s): F. Huici (NEC)
Deliverable abstract: This deliverable presents a number of acceleration frameworks aimed at implementing a high-performance BEBA software switch.
Keywords: High performance, software switch, implementation

REVISION HISTORY

Revision  Date        Author    Organisation  Description
1.0       31/01/2016  F. Huici  NEC           Full deliverable

PROPRIETARY RIGHTS STATEMENT
This document contains information which is proprietary to the BEBA consortium. Neither this document nor the information contained herein shall be used, duplicated or communicated by any means to any third party, in whole or in part, except with the prior written consent of the BEBA consortium. This restriction legend shall not be altered or obliterated on or from this document.

STATEMENT OF ORIGINALITY
This deliverable contains original unpublished work except where clearly indicated otherwise. Acknowledgement of previously published material and of the work of others has been made through appropriate citation, quotation or both.

Table of Contents

1. Introduction
2. ofsoftswitch Acceleration with PFQ
2.1. The ofsoftswitch Dataplane Implementation
2.2. ofsoftswitch Acceleration Directions
2.2.1. Generic I/O Acceleration
2.2.2. Multi-core Packet Processing
2.3. PFQ I/O Framework
2.3.1. High Speed Packet I/O
2.3.2. The Processing Layer
2.3.3. Application Programming Interfaces
2.3.4. PFQ Performance
2.4. Conclusions and Future Work
3. Netmap-accelerated Open vSwitch
3.1. mSwitch Architecture
3.2. Base Performance
3.3. Accelerating Open vSwitch with mSwitch (mSwitch-OVS)
3.4. Evaluation Results
3.5. Conclusions and Future Steps
4. DPDK @ 100Gb
4.1. Using the szedata2 Device
4.2. Straight Zero-copy Interface (SZE2)
4.2.1. SZE2 Architecture
4.2.2. Format of Data Segments
4.3. PMD Implementation Details
4.3.1. Initialization of RX and TX Queues
4.3.2. RTE Ethernet Device API
4.3.3. Receiving Packets
4.3.4. Sending Packets
4.4. Performance Measurements
4.5. Conclusions and Future Work
5. OpenState on Open vSwitch
5.1. Porting the BEBA Architecture to OVS
5.1.1. Implementing OpenState into OVS: Several Possible Approaches
5.1.2. Reusing OVS Features to Implement BEBA's Basic Forwarding Abstraction
5.2. Improving the OVS-based BEBA Switch
5.2.1. Accelerating Packet Processing
5.2.2. New Paradigm, New Features: eBPF
5.3. Conclusions and Future Work
6. Conclusions
7. Bibliography

1. Introduction

In this deliverable we report on a number of on-going activities by several BEBA partners with a common goal: the implementation and acceleration of a switch supporting BEBA's abstractions. The approach here is not to design and implement a software switch from scratch; instead, we target a couple of well-known projects in order to maximize the chances of BEBA's concepts making it out into real-world deployments. Further, since we are not re-inventing the wheel, we can concentrate the project's resources on improving the performance of existing frameworks in order to implement a BEBA switch that can cope with real-world traffic loads.

Along these lines, we are targeting two specific, existing platforms: ofsoftswitch and Open vSwitch. Regarding the former, while its implementation is very clean and thus a suitable candidate for implementing BEBA's abstractions, its performance is known to be less than optimal. As a result, we are working on leveraging the PFQ packet I/O acceleration framework to improve ofsoftswitch. With respect to Open vSwitch, we are taking a two-pronged approach: on one hand, we are working on porting to it the BEBA switch proof-of-concept work done by CNIT in WP2; in parallel, we are making progress towards accelerating Open vSwitch with the mSwitch modular switch, which is based on the netmap packet I/O acceleration framework. Finally, as a complementary activity, BEBA is working towards being able to accelerate software switches with specialized hardware (i.e., FPGAs). More specifically, we have implemented DPDK support for a COMBO-100G hardware accelerator card that allows us to reach 100 Gb/s speeds.

The rest of this document is organized as follows. In section 2 we describe our efforts towards accelerating ofsoftswitch with PFQ. Section 3 then discusses the acceleration of Open vSwitch using the mSwitch modular switch. Next, section 4 describes the implementation of DPDK support for the COMBO-100G card and section 5 the porting of BEBA's proof-of-concept work to Open vSwitch. Finally, section 6 concludes.


2. ofsoftswitch Acceleration with PFQ

The software switch ofsoftswitch [1] is one of the candidate platforms to integrate the advanced network functionalities proposed in BEBA. ofsoftswitch is an OpenFlow 1.3 compatible user-space software switch implementation based on the Ericsson TrafficLab 1.1 softswitch implementation, with changes in the forwarding plane to support OpenFlow 1.3. While the ofsoftswitch implementation is very clean and thus provides a suitable platform for the integration of BEBA functionalities, its data plane performance is known to be very limited. In the following, the main reasons for ofsoftswitch's low performance are presented together with a set of alternative plans for its acceleration. One of these alternatives relies on the use of the PFQ I/O framework, which is presented next by describing its architecture and showing its ability to accelerate new and legacy applications.

2.1. The ofsoftswitch Dataplane Implementation

The overall architecture of ofsoftswitch is shown in Figure 2.1 and reflects the underlying OF principles. Data plane and control plane are handled by two distinct processes:

•   the ofprotocol is in charge of communicating with the external OF controller and of setting OF configurations;

•   the ofdatapath implements the actual switching operations. The ofdatapath module is designed as a single-process application whose structure is depicted in Figure 2.2.

Figure 2.1: ofsoftswitch architecture


Figure 2.2: The ofdatapath software module

The netdev library shipped with the package implements an abstraction for network devices. As such, it contains all functions for opening/closing devices, receiving/transmitting packets, managing queueing disciplines, reading stats from devices, and so on. However, at the link layer, the netdev library relies on standard Linux AF_PACKET sockets for receiving and transmitting packets. Two main performance limitations emerge from this design:

•   The use of AF_PACKET sockets is well known to be inefficient in terms of I/O speed in the case of multi-gigabit network cards. In addition, it makes the software switch platform-dependent (AF_PACKET sockets are available on Linux only);

•   The overall software switch runs in a single process/thread fashion, hence it cannot take advantage of multi-core acceleration.

2.2. ofsoftswitch Acceleration Directions

These limitations provide the natural areas of investigation for accelerating the switching platform. As such, two nearly independent directions can be pursued: one is to provide a more efficient I/O framework for traffic handling, and the second is to extend the software suite to support multi-core processing.

2.2.1. Generic I/O Acceleration

The first step to improve the data plane performance of ofsoftswitch consists of replacing the underlying AF_PACKET sockets for packet I/O operations. To this aim, the netdev abstraction built around the Linux sockets must be modified in order to virtualize packet I/O operations (Figure 2.3), thus enabling the use of software-accelerated frameworks such as PF_RING ZC [2], netmap [3], DPDK [4] or PFQ [5].

Figure 2.3: Virtualized netdev in ofdatapath

However, the functions in the new virtualized netdev library must be given a concrete implementation to cope with each specific acceleration engine running the actual I/O. An interesting intermediate step consists of providing a virtualized implementation on top of the standard pcap library (Figure 2.4). Indeed, the pcap library is a cross-platform interface for packet I/O and it is supported by all of the aforementioned accelerated engines.


As such, the netdev virtualization on top of the pcap interfaces allows:

•   a platform-independent build of ofdatapath;
•   the transparent selection of different I/O accelerating engines.

Figure 2.4: pcap layer in ofdatapath
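A minimal sketch of what such a pcap-backed receive routine could look like is given below; the type and function names (netdev_pcap, netdev_pcap_recv) are illustrative assumptions, not the actual ofsoftswitch netdev API:

    #include <pcap/pcap.h>
    #include <string.h>

    struct netdev_pcap {
        pcap_t *handle;               /* opened elsewhere with pcap_open_live() */
    };

    static int
    netdev_pcap_recv(struct netdev_pcap *dev, void *buf, size_t size)
    {
        struct pcap_pkthdr *hdr;
        const u_char *data;

        /* pcap_next_ex() returns 1 on success, 0 on timeout, negative on error */
        if (pcap_next_ex(dev->handle, &hdr, &data) != 1)
            return -1;
        if (hdr->caplen > size)
            return -1;                /* frame does not fit the caller's buffer */
        memcpy(buf, data, hdr->caplen);
        return (int)hdr->caplen;      /* number of bytes received */
    }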

2.2.2. Multi-core Packet Processing

As already mentioned, the current ofsoftswitch implementation is built as a single-process application. However, today's commodity PCs are equipped with multi-core CPUs, while multi-gigabit network cards support multiple hardware queues for parallel packet processing. Therefore, the second natural direction for performance improvement is to modify the software switch architecture to take full advantage of parallelism. Although this activity is theoretically orthogonal to the previously presented generic I/O acceleration, the two aspects are somewhat connected in practice, as we will elaborate. The migration from a single process to a fully fledged multithreaded application can be addressed in several steps. The first approach is graphically depicted in Figure 2.5, where multiple instances of both the ofdatapath and ofprotocol processes (roughly, multiple instances of the overall application) run in parallel on different cores.


Figure 2.5: Multiple independent instances of ofsoftswitch

From the controller point of view, the system appears as a number of distinct OF switches. From the data plane point of view, each instance of ofdatapath acts as an independent switch that operates on a portion of the overall traffic. However, the way traffic is distributed across switches requires an extra in-kernel layer to be provided by the packet I/O engine. Out of the above-mentioned packet I/O accelerators, the only one supporting fine-grained programmable packet distribution is PFQ. The details about PFQ will be given in the next section; at this stage, it is sufficient to point out that PFQ is capable of distributing traffic across groups of native or pcap sockets according to a wide and customizable set of dispatching policies specified through its own language (pfq-lang). Although pure stateless switching may be served by simple packet distribution across cores, such as that provided by RSS technology, BEBA's stateful processing requires more refined procedures to guarantee full isolation among multiple instances of the datapath. In particular, each instance of the datapath must be able to elaborate packets in isolation with respect to its extended Finite State Machines, with no data sharing across processes. Packet distribution, in turn, strictly depends on the OF configuration itself, and such granularity cannot be achieved by coarse dispatching policies such as RSS.

Figure 2.6: Multiple instances of ofdatapath with single control module

Figure 2.6 represents the next evolution of the previous scheme (Figure 2.5), where the multiple instances of the ofprotocol controller are gathered in a single instance. This requires partial code modification, but the advantages are twofold:

1.   from the OF controller point of view, the switch announces itself as a single node;
2.   the ofprotocol will be able to directly instrument the underlying PFQ layer instead of using a third-party application. This will allow dynamic (re)configuration of the packet distribution process on the basis of either messages from the BEBA controller or internal table elaboration.


Figure 2.7: Multi-threaded ofsoftswitch

The final natural step (Figure 2.7) consists of grouping multiple instances of ofdatapath into a single multithreaded application where each thread runs in isolation and possibly shares (when unavoidable) common state with other concurrent thread instances. Notably, while the previous schemes allow most of the original code to be reused, this last scheme requires major code rewriting (e.g., enforcing const-correctness and function re-entrancy, turning global data into per-thread data, etc.), especially as far as the data plane is concerned.
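As a minimal illustration of the last point (turning global data into per-thread data), consider the following sketch; the structure and function are hypothetical, not actual ofsoftswitch code:

    #include <stddef.h>
    #include <stdint.h>

    struct flow_stats {
        uint64_t pkts;
        uint64_t bytes;
    };

    /* Before: a single global counter shared (unsafely) by all datapath threads:
     *     static struct flow_stats stats;
     * After: one private copy per thread, so the fast path needs no locking. */
    static __thread struct flow_stats stats;

    static void account_packet(size_t len)
    {
        stats.pkts++;                 /* touches only this thread's copy */
        stats.bytes += len;
    }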

2.3. PFQ I/O Framework

This section provides an overview of the PFQ I/O framework and describes its practical use in accelerating both native and legacy network applications.

The architecture of PFQ as a whole is shown in Figure 2.8. In a nutshell, PFQ is a Linux kernel module designed to retrieve packets from one or more traffic sources, process them by means of functional blocks (the λi blocks in the picture) and finally deliver them to one or more endpoints.


Figure 2.8: PFQ system at-a-glance

Traffic sources (at the bottom end of the figure) are represented by Network Interface Cards (NICs) or, in the case of multi-queue cards, by single hardware queues of network devices.

Endpoints, instead, can be sockets from user-space applications (top end of the figure), network devices, or even the networking stack of the operating system itself (right end of the figure) for ordinary operations (e.g., traditional routing, switching, etc.). The low-level management of NICs is left entirely under OS control, as the network device drivers are not modified.

A typical PFQ scenario is the acceleration of multi-threaded network applications in charge of monitoring traffic over one or more network interfaces. Each thread opens a single socket to the group of devices to monitor and receives a quota (or all) of the packets tapped from the selected interfaces. On their way to the application, packets come across functional blocks that may implement part of the application's processing machinery. The execution of this early-stage processing is instantiated by the application itself through a functional DSL. As a result of this elaboration stage, packets are finally delivered to the selected endpoints.

It is worth noticing that PFQ does not bypass the Linux kernel, but simply sits alongside it. Packets directed to networking applications are treated exclusively by PFQ, whereas packets devoted to other system operations can be transparently passed to the kernel stack, on a per-packet basis.


Figure 2.9: PFQ software stack

Figure 2.9 depicts the complete software stack of the PFQ package. The kernel module includes a reusable pool of socket buffers and the implementation of a functional language along with the related processing engine. In user-space, the stack includes native libraries for C++11 and C, as well as bindings for Haskell, the accelerated pcap library and two implementations of the eDSL for both C++ and Haskell.

2.3.1. High Speed Packet I/O

Overall, the design philosophy of PFQ is to leave the network device drivers and their interface toward the operating system totally untouched. While the use of vanilla drivers allows complete compatibility with a large number of network devices, it can raise performance issues. Indeed, the standard OS handling of packet capture cannot guarantee decent performance on high-speed links, and successful projects like PF_RING ZC or netmap have proven the effectiveness of driver modifications.

However, the software acceleration techniques implemented in PFQ make it possible to achieve top-class capture figures while retaining full compliance with normal driver data structures and operations. Basically, the main internal mechanisms adopted by PFQ to improve packet I/O are:

•   Interception and replacement of the OS functions invoked by the device driver with accelerated routines. Kernel operations triggered by the arrival of a packet are bypassed and the packet itself gets under the control of PFQ. This procedure does not require any modification to the source code of NIC drivers, which, in turn, only need to be compiled against a PFQ header to overload at compile time the relevant kernel functions that i) pass packets to the kernel (namely netif_receive_skb, netif_rx and napi_gro_receive) and ii) are in charge of allocating memory for the packet (e.g., netdev_alloc_skb, alloc_skb, dev_alloc_skb, etc.). The whole operation is made easier by the pfq-omatic tool included in the PFQ package, which automates the compilation and only needs the original source code of the vendor device drivers (a sketch of this idea is given after this list).

•   Using pools of skbuffs, namely pre-allocated caches of packets (Figure 2.10). This guarantees full compatibility with standard driver operations and allows packet delivery to the OS (when it acts as an endpoint) while accelerating skbuff memory allocations.

Figure 2.10: Pool of skbuffs

•   Use of batch queues (one per active core). Batch queues are standard FIFO systems used to hold packets before they are processed. Batch processing proves to be very effective for at least two reasons. The first is that batching always improves the temporal locality of memory accesses, which, in turn, reduces the probability of cache misses. The second (and biggest) reason is the dramatic amortization of the cost of the atomic operations involved in packet processing.
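To make the interception mechanism of the first bullet concrete, the following sketch shows the general idea of compile-time overloading through a header included when recompiling the unmodified driver source; the pfq_* replacement names are assumptions for illustration, not the actual PFQ symbols:

    /* Hypothetical interception header: driver calls to the kernel packet
     * entry points resolve to PFQ-provided routines instead. */
    struct sk_buff;
    struct napi_struct;

    int pfq_netif_receive_skb(struct sk_buff *skb);
    int pfq_napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);

    #define netif_receive_skb(skb)       pfq_netif_receive_skb(skb)
    #define napi_gro_receive(napi, skb)  pfq_napi_gro_receive(napi, skb)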

2.3.2. The Processing Layer

Packets backlogged in the batch queues wait their turn to be processed by functional engines. Each functional engine executes up to 64 distinct computations, instantiated by upstream applications through a functional language, on a per-packet basis.

Computations are executed in kernel space despite being instantiated in user space through the specially-developed embedded Domain Specific Language named pfq-lang. The use of computations is specifically targeted at offloading upstream applications by providing an early stage of in-kernel processing.

All computations instantiated on a functional engine are executed sequentially on each packet of the batch queue and in parallel with respect to the other instances of the same computations running on the other cores.

Currently, PFQ engines integrate about a hundred primitive functions that can be roughly classified as: protocol filters, conditional functions, logging functions, forwarding functions (to kernel or to NIC) and fanout functions. The last category is particularly relevant, as it defines which application endpoints will receive packets (and how), through the concept of socket groups. Each endpoint opens a socket and registers the socket to a group. The group is bound to a set of data sources (physical devices or a specific subset of their hardware queues). In addition, each group defines its own (unique) computation; hence, each socket participating in the group receives packets processed by the same computation in the functional engine.

The endpoints participating in the same group receive packets according to the fanout primitives introduced earlier. The two basic delivery modes are:

•   Broadcast: a copy of each packet is sent to all of the sockets of the group;

•   Steering: packets are delivered to the group sockets by using a hash-based load balancing algorithm. Both the algorithm and the hash keys are defined by the application through the computation instantiated in the functional engine. For example, the function steer_flow spreads traffic according to a symmetric hash that preserves the coherency of bi-directional flows, while the function steer_ip steers traffic according to a hash over the source and destination IP address fields, effectively balancing at the granularity of mega-flows.

Further, PFQ provides the additional abstraction of classes, defined as a subset of sockets of the group (in fact, a subgroup) that receive specific traffic as a result of in–kernel computations. Again, sockets belonging to the same class may receive traffic either in broadcast (Deliver) mode or in load balancing (Dispatch) mode.

Group access policies. Although the concept of group allows different sockets to participate in and share common monitoring operations, for security and privacy reasons not all sockets should be able to freely access any active group. To this aim, PFQ implements three different access policies, so that, at the time of their creation, groups can be (a usage sketch follows the list):

•   private: the group can only be joined by the socket that created the group;

•   restricted: the group can be joined by sockets belonging to the process that created the group (hence the group is open to different threads of execution of the process);

•   shared: the group is publicly joinable by any active socket on the system.
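As an illustration, joining a shared group from a native C application might look as follows; the function and constant names (pfq_join_group, Q_CLASS_DEFAULT, Q_POLICY_GROUP_SHARED) follow the conventions of PFQ's C library but should be treated as assumptions here:

    // Join group 42 with the "shared" access policy
    int gid = 42;
    if (pfq_join_group(p, gid, Q_CLASS_DEFAULT, Q_POLICY_GROUP_SHARED) < 0) {
            printf("error: %s\n", pfq_error(p));
            return -1;
    }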

2.3.3. Application Programming Interfaces

PFQ is a polyglot framework that exposes native libraries for the C and C++11 languages, as well as bindings for Haskell. Moreover, PFQ includes an embedded Domain Specific Language (pfq-lang) that allows users to program the kernel-space computations from within C++ and Haskell network applications. Finally, for compatibility with a large number of traditional legacy applications, PFQ also exposes an adaptation interface towards the standard pcap library.


Native APIs. Native PFQ libraries include a rich set of functions to control the underlying PFQ mechanisms, to handle traffic capture/transmission and to retrieve statistics. In particular, the native API functions are in charge of handling the following operations:

•   group and class management;
•   socket control;
•   socket parameter setup;
•   packet capture;
•   packet transmission;
•   statistics and counters;
•   in-kernel computation injection.

The following code shows a practical example of using the PFQ native library:

    #include <pfq/pfq.h>

    // Open PFQ socket
    pfq_t *p = pfq_open(64, 4096, 1024);
    if (p == NULL) {
            printf("error: %s\n", pfq_error(p));
            return -1;
    }

    // Bind socket to capture interface/HW queue
    if (pfq_bind(p, "eth1", Q_ANY_QUEUE) < 0) {
            printf("error: %s\n", pfq_error(p));
            return -1;
    }

    // Bind socket to transmission interface/HW queue using kthread index number
    if (pfq_bind_tx(p, dev, queue, kthread) < 0) {
            fprintf(stderr, "%s\n", pfq_error(p));
            return -1;
    }

    // Enable socket for capture/transmission
    if (pfq_enable(p) < 0) {
            printf("error: %s\n", pfq_error(p));
            return -1;
    }

    // Invoke dispatch callback on each packet
    for (;;) {
            int many = pfq_dispatch(p, dispatch_callback, 1000000, NULL);
    }
    ...

    // Send a copy of a packet through socket p
    pfq_send(p, packet, sizeof(packet), 1, 1);

    // Close PFQ socket
    pfq_close(p);


The pfq-lang language. The in-kernel processing computations executed by functional engines can be instantiated by applications themselves through the pfq-lang functional language, an embedded DSL that provides the formal grammar for composing the elementary primitives running in the functional engine. A computation expressed as a pfq-lang program can be either a single function or the composition of two or more operations. pfq-lang provides a rich set of primitive operations as well as conditional functions and predicates to implement basic control flow. As an example of a pfq-lang expression, a simple computation that filters IP packets and dispatches them to a group of endpoints (e.g., sockets) by means of a steering algorithm is written as:

composition = ip >-> steer_ip

where ip is a filter that drops all packets but IP ones, and steer_ip is a function that performs a symmetric hash on the IP source and destination addresses. pfq-lang implements filters for the most common protocols, such as ip, udp, tcp, icmp, ip6, icmp6, rtp (heuristic), gtp (both v1 and v2) and so forth. In addition, each filter is complemented with a predicate whose name begins with is_ or has_ by convention. Conditional functions allow changing the behavior of the computation depending on a property of the processed packet, as in the following example:

composition = ip >-> when is_tcp forward "eth1" >-> steer_flow

This computation drops all non-IP packets, forwards a copy of TCP packets to eth1, and then dispatches packets to the group of registered PFQ sockets in steering mode.

Libpcap adaptation layer. Legacy applications using the pcap library can also be accelerated through the pcap adaptation layer, which has been extended to support PFQ sockets. As an example (very useful in accelerating ofsoftswitch), the availability of the pcap interface allows multiple instances of single-threaded legacy applications to run in parallel, as PFQ groups can be joined by separate processes.

However, in order to keep full compatibility with legacy applications, the pcap adaptation layer is designed to maintain the original pcap semantics and leaves the APIs unchanged. Therefore, the specific options needed by the PFQ native libraries, such as the ones associated with group/class handling, computation instantiation, etc., are passed to PFQ through either environment variables or configuration files.

Pcap acceleration is activated depending on the name of the physical interfaces: if they begin with the pfq: prefix, the library automatically switches to PFQ sockets; otherwise it falls back to traditional sockets. In addition, multiple capture devices can be specified by interposing a colon (:) between the interface names (e.g., pfq:eth0:eth1). It is worth noticing that PFQ is totally transparent to legacy pcap applications running on top of it.
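For instance, opening PFQ-backed devices from a legacy-style application requires nothing beyond the prefixed device name; a minimal sketch:

    #include <pcap/pcap.h>
    #include <stdio.h>

    char errbuf[PCAP_ERRBUF_SIZE];

    /* The "pfq:" prefix selects PFQ-backed capture; the pcap calls
     * themselves are completely standard. */
    pcap_t *h = pcap_open_live("pfq:eth0:eth1", 65535, 1, 100, errbuf);
    if (h == NULL) {
            fprintf(stderr, "pcap_open_live: %s\n", errbuf);
            return -1;
    }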

In practice, to run on top of PFQ, an arbitrary pcap application such as tcpdump can equivalently i) be compiled against the pfq-pcap library, or ii) if previously compiled against the shared version of the traditional pcap library, be executed by preloading the pfq-pcap library through the environment variable LD_PRELOAD.

The following example shows four sessions of tcpdump sniffing TCP packets from network interfaces eth0 and eth1. The four sessions run in parallel on group 42 and each receives a load-balanced quota of traffic that preserves flow coherency. The (first) master process sets the group number, the computation (steer_flow) and the binding to devices. The three additional instances specify only the PFQ group, since the computation, network devices and filters must be the same for all four sessions belonging to the same group (42).

    PFQ_GROUP=42 PFQ_LANG="main = steer_flow" \
        tcpdump -n -i pfq:eth0:eth1 tcp
    PFQ_GROUP=42 tcpdump -n -i pfq
    PFQ_GROUP=42 tcpdump -n -i pfq
    PFQ_GROUP=42 tcpdump -n -i pfq

2.3.4. PFQ Performance

This section presents two examples of the performance acceleration provided by PFQ. The first experiment aims at evaluating the effectiveness of PFQ in accelerating the pcap library. A direct comparison with the classic pcap library is possible, in that the underlying Linux AF_PACKET socket can take advantage of the multi-queue support provided by RSS technology if interrupt affinity is configured accordingly.

Figure 2.11: libpcap acceleration

Figure 2.11 shows the results achieved by using a small multi-threaded pcap application that simply counts packets, when running on top of the standard AF_PACKET socket and on top of PFQ, for different packet sizes and different RSS values. The performance improvement is evident and clearly demonstrates that PFQ can effectively accelerate legacy network applications traditionally based on the pcap library. In addition, it is worth noting that the multi-queue support provided by the Linux kernel does not significantly improve capturing performance, which indeed barely reaches a 2 Mp/s rate with 4 hardware queues (RSS = 4).

The second experiment refers to packet transmission and aims at assessing the effectiveness of PFQ in accelerating the well-known traffic generator Ostinato. Ostinato is a highly configurable open source traffic generator that supports a wide variety of protocol templates for packet crafting. It is based on a client-server architecture; the server (drone) runs each engine as a single thread and uses the pcap library for packet transmission.

Figure 2.12: Ostinato acceleration

Figure 2.12 shows the result of the experiment. Ostinato was first run alone with the optimal value of 4 hardware queues (although, as shown in the previous example, the number of hardware queues used by the standard AF_PACKET socket does not make a significant difference). The results show that Ostinato alone reaches close to full-rate generation speed only for 1500B packets. In all other cases its performance is significantly far from the theoretical maximum.

The use of PFQ significantly accelerates the application, although line rate is achieved only for packet sizes of at least 128B. However, even in the worst case, PFQ brings Ostinato above a 10 Mp/s transmission rate (i.e., an acceleration factor slightly larger than 7) with 3 transmitting kernel threads and an affinity setup that keeps those threads from running on the same cores as the Ostinato drone itself. The figure also shows that no significant improvement is gained by increasing the number of transmitting cores beyond 4.

2.4. Conclusions and Future Work

In this section we reported on the ongoing research activities for the acceleration of ofsoftswitch, one of the possible candidates for implementing the BEBA advanced functionalities. Starting from the analysis of the main reasons for its known low data plane performance, we identified distinct directions for software acceleration together with a step-by-step roadmap.


The plan essentially focuses on the use of PFQ to speed up I/O operations as well as to enable the migration towards a multi-core architecture. The use of PFQ is described along with its main features and justified by its performance results in accelerating legacy applications. The described actions to accelerate ofsoftswitch will be sequentially undertaken until the end of WP3 activities. In addition, we intend to explore new theoretical directions in parallel traffic dispatching in order to efficiently handle advanced stateful processing on multi-core CPUs.


3. Netmap-accelerated Open vSwitch

Open vSwitch (OVS) is the de-facto standard for software switches, and so an obvious candidate for implementing the BEBA abstractions. However, while widely used, OVS is infamous for yielding relatively low performance (although efforts are on-going to improve this). In this section we describe our work to transparently accelerate OVS by leveraging mSwitch [15], a high performance, modular software switch based on the netmap API. We first describe mSwitch's architecture and give some baseline measurements for it. We then describe how we can, with a relatively small number of code changes, turn OVS' datapath into a module compatible with mSwitch. With this in place, we provide a performance evaluation of this accelerated OVS, which we call mSwitch-OVS. Finally, we discuss future work, including plans to further accelerate packet I/O between NICs and processes/VMs by leveraging the ptnetmap framework [17].

3.1. mSwitch Architecture

mSwitch [15] is based around the principle of a split data plane, consisting of a switch fabric in charge of moving packets quickly between ports, and the switch logic, the modular part of mSwitch which looks at incoming packets and decides what their destination port(s) should be. This split allows mSwitch to provide a fast fabric that yields high throughput, while giving users the ability to program the logic without having to worry about the intricacies of high performance packet I/O.

Figure 3.1: mSwitch architecture: the switch fabric handles efficient packet delivery between ports, while the switch logic (forwarding decisions, filtering etc.) is implemented through loadable kernel modules.

Figure 3.1 shows mSwitch's overall architecture. mSwitch can attach virtual or physical interfaces as well as the protocol stack of the system's operating system. From the switch's point of view, all of these are abstracted as ports, each of which is given a unique index in the switch. When a packet arrives at a port, mSwitch switches it by relying on the (pluggable) packet processing stage to tell it which port(s) to send the packet to. In addition, multiple instances of mSwitch can be run simultaneously on the same system, and ports can be created and connected to a switch instance dynamically.

Virtual ports are accessed from user space using the netmap API: they can be connected to virtual machines (e.g., QEMU instances), or generic netmap-enabled applications. The netmap API uses shared memory between the application and the kernel, but each virtual port uses a separate address space to prevent interference between clients. Physical ports are also connected to the switch using a modified version of the netmap API, providing much higher performance than the default device drivers. Each port can also contain multiple packet rings, which can be assigned to separate CPU cores in order to scale the performance of a port. Ports connected to the host network stack are linked to a physical (NIC) port. Traffic that the host stack would have sent to the NIC is diverted to mSwitch instead, and from here can be passed to the NIC, or to one of the virtual ports, depending on the packet processing logic running on the switch. Likewise, traffic from other ports can reach the host stack if the switch logic so decides.

Packet forwarding is always performed within the kernel and in the context of the thread that generated the packet. This is a user application thread (for virtual ports), a kernel thread (for packets arriving on a physical port), or either for packets coming from a host port, depending on the state of the host stack protocols. Several sending threads may thus contend for access to destination ports. mSwitch always copies data from the source to the destination netmap buffer. In principle, zero-copy operation could be possible (and cheap) between physical ports and/or the host stack, but this seems an unnecessary optimization: for small packet sizes, once packet headers are read (to make a forwarding decision), the copy comes almost for free; for large packets, the packet rate is much lower, so the copy cost (in terms of CPU cycles and memory bandwidth) is not a bottleneck. Cache pollution might be significant, though, so we may revise this decision in the future.
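To give a feel for the modular part, the sketch below outlines what an mSwitch-style switch logic module could look like: the fabric calls a per-packet lookup that returns the destination port index. All names and the special return codes are illustrative assumptions, not the actual mSwitch kernel API:

    #include <stdint.h>

    #define MSW_DROP      0xff        /* assumed special destination codes */
    #define MSW_BROADCAST 0xfe

    struct mswitch_logic {
        const char *name;
        /* Return the destination port index for the frame in buf[0..len). */
        uint8_t (*lookup)(const uint8_t *buf, unsigned len, uint8_t src_port);
    };

    /* Trivial logic: flood every valid frame to all ports but the source. */
    static uint8_t
    flood_lookup(const uint8_t *buf, unsigned len, uint8_t src_port)
    {
        (void)buf;
        (void)src_port;
        if (len < 14)                 /* runt frame: no full Ethernet header */
            return MSW_DROP;
        return MSW_BROADCAST;
    }

    static struct mswitch_logic flood_logic = {
        .name   = "flood",
        .lookup = flood_lookup,
    };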

3.2. Base Performance

In this experiment we run mSwitch on a server with a 12-core Intel Xeon CPU (3.2 GHz with Turbo Boost), 32GB of DDR3-1600 RAM (quad channel) and a dual-port, 10 Gbit/s Intel X520-T2 NIC. An additional machine is used for external traffic sources/sinks as needed. In this section the operating system is FreeBSD 10, though mSwitch also runs on Linux and the OVS module is Linux-based. Further, we disable hyper-threading and enable direct cache access (DCA). To generate and count packets we use pkt-gen, a fast generator that uses the netmap API and so can drive both NICs and mSwitch's virtual ports. Packet sizes in the text and graphs do not include the 4-byte Ethernet checksum; this is for consistency with the virtual port experiments, where there is no CRC.


Figure 3.2: Throughput between 10 Gbit/s NICs and virtual ports for different batch sizes.

To derive the baseline performance of the switch fabric, we use a simple switch logic that for each input port returns a statically configured output port; the processing cost (just an indirect function call) is thus practically negligible. We then evaluate mSwitch's throughput for different packet sizes and combinations of NICs and virtual ports. Forwarding rates between NICs are bounded by the CPU or the NIC hardware, and mSwitch achieves higher forwarding rates with increasing CPU frequency (Figure 3.2 (a)) and number of CPU cores (Figure 3.2 (b)). Note that the rates for 1 CPU core in Figure 3.2 (b) are slightly lower than those in Figure 3.2 (a) due to costs associated with our use of Intel's Flow Director, in charge of distributing packets to multiple CPU cores. Packet forwarding between virtual ports exhibits similar characteristics, except that the rates are not bounded by the NIC hardware. Packet rates scale fairly linearly with an increasing number of rings and CPU cores, achieving a maximum of about 75 million packets per second for minimum-sized packets, a rate of 390 Gb/s for maximum-sized ones, and a maximum rate of 466 Gb/s for 8K and bigger frames.

3.3. Accelerating Open vSwitch with mSwitch (mSwitch-OVS)

We used the code of OVS 2.4 (the current release) to implement the switch logic of what we term mSwitch-OVS. We modified only OVS' datapath, which is implemented as a Linux kernel module and consists of about 10,000 LoC. This required the implementation of some glue code (approximately 100 LoC) to create an mSwitch instance on startup, and an additional 400 LoC (including 70 modified lines) to hook the OVS code to the mSwitch switching logic. In essence, mSwitch-OVS replaces OVS' datapath, which normally uses Linux's standard packet I/O, with mSwitch's fast packet I/O. As a result, we can avoid expensive, per-packet sk_buff allocations and deallocations.


3.4. Evaluation Results

In this section we evaluate the performance of mSwitch-OVS in three different scenarios: when forwarding packets between two 10 Gb/s NICs; when forwarding packets between two virtual ports (e.g., two processes); and when forwarding packets from an input NIC, up to a virtual port, and back out to another NIC. For these experiments we use a computer with an Intel Xeon E5-1620 v2 3.7 GHz CPU (4 cores) and a two-port Intel X540 10 Gb/s card. We use a separate server running netmap as traffic generator and sink.

Figure 3.3: Throughput when forwarding packets between two NICs through the accelerated version of Open vSwitch (mSwitch-OVS).

Figure 3.3 shows the results for the first scenario, where packets arrive at a NIC and mSwitch-OVS immediately forwards them back out through the outgoing NIC. As a reference point, the work in [16] reports that the kernel version of Open vSwitch can forward around 300 Kp/s for minimum-sized packets; that number is almost certainly an underestimate by now given improvements over the last year, but it gives a rough idea of what OVS can do out of the box. Compared to this, we see important improvements: mSwitch-OVS can forward packets at rates of about 4.5 Mp/s, significantly higher than standard OVS.


Figure 3.4: Throughput when forwarding packets from NIC to virtual port and back to another NIC through mSwitch-OVS.

Next, we perform measurements when forwarding packets from a NIC, up to a virtual port (which runs a simple application that bounces packets back) and out to another NIC. Figure 3.4 shows the results: mSwitch-OVS is able to forward packets at almost 4 Mp/s for minimum-sized packets.

Figure 3.5: Throughput when forwarding packets from NIC to a QEMU virtual machine and back to another NIC through mSwitch-OVS.


In the final experiment, we forward packets from a NIC to a virtual port and back to a NIC as in the previous experiment, except this time we attach a QEMU virtual machine to the virtual port instead of the simple bounce application. As shown in Figure 3.5, this scenario causes the throughput to plummet to about 500 Kp/s, demonstrating the I/O bottleneck between the switch and the virtual machine.

3.5. Conclusions and Future Steps

We have described the mSwitch modular switch framework and our work to turn Open vSwitch into a module for it. Because mSwitch is optimized for performance, this porting effort allows us to (mostly transparently) accelerate Open vSwitch. As the results show, however, forwarding traffic up to a VM and back yields rather poor performance. Given that one of the main uses of software switches nowadays is as back-ends to virtualization technologies (something that would certainly also be the case for the BEBA switch), it is paramount that we remove this bottleneck. Towards this goal, we are looking into applying ptnetmap [17], which implements virtual device passthrough for high speed VM networking and is based on the netmap API, to our mSwitch framework. Once this performance issue is addressed, the aim is to merge this work with the proof-of-concept BEBA implementation being developed by other partners.


4. DPDK @ 100Gb

With the intention of creating a high-speed prototype of the BEBA switch, careful attention must be paid to all performance aspects of the packet processing chain. The interface between hardware and software is a place where performance is crucially important, and an inefficient implementation may result in poor overall performance. We use a COMBO-100G1 hardware accelerator [6] by Netcope Technologies to evaluate the interface from the FPGA to the software running on the host CPU. The COMBO-100G card uses a Xilinx Virtex-7 FPGA for basic packet receiving and transmitting, as well as for additional packet processing tasks such as packet header parsing, matching, filtering, editing, etc. The COMBO-100G1 uses a custom SZE2 interface to transfer data over PCI Express with minimal bus overhead. We show that it is possible to implement a layer that converts data from a bus-friendly format (SZE2) to a CPU-friendly format (DPDK). The use of an FPGA accelerator is in line with the FPGA-based hardware proof-of-concept prototype described in D2.2.

The szedata2 poll mode driver (PMD) creates a layer between DPDK and the SZE2 interface. The PMD is implemented as a PMD_VDEV type, which means that the PCI device is not recognized during the PCI probing phase of DPDK application initialization. Instead, a virtual device is initialized according to command-line options.

4.1. Using the szedata2 Device

The EAL command-line option --vdev is used to create a virtual device as follows:

    --vdev "eth_szedata20,dev_path=/dev/szedataII0,rx_ifaces=0x1,tx_ifaces=0x1"

The --vdev option has to contain:

-   A unique name for the virtual device (for example eth_szedata20), where eth_szedata2 is the name of the driver and 0 is the index of the device.
-   The path to the SZE2 character device file (for example /dev/szedataII0), specified as the value for the key dev_path.
-   The mask identifying the RX DMA channels used by the virtual device (for example 0x1), specified as the value for the key rx_ifaces. Each bit in the mask represents one DMA channel.
-   The mask identifying the TX DMA channels used by the virtual device (for example 0x1), specified as the value for the key tx_ifaces. Each bit in the mask represents one DMA channel.

4.2. Straight Zero-copy Interface (SZE2)

The SZE2 interface provides fast, straight zero-copy DMA transfers between an FPGA-based network interface card and the host system RAM. Data are transferred between ring buffers placed in the FPGA-based NIC and ring buffers placed in RAM. DMA transfers are managed by DMA controllers in the FPGA-based NIC. Initialization of the ring buffers and communication between the FPGA-based NIC and the host system are arranged by kernel modules. The API of the libsze2 library provides access to the SZE2 operations from user-space applications.

4.2.1.   SZE2 Architecture

The FPGA-based network interface card can generally contain several RX and TX DMA channels. Every DMA channel has its own ring buffer in the NIC, its own ring buffer in the host RAM, and a set of pointers which define the start and end of data in both ring buffers. The structure of the ring buffer in RAM is depicted in Figure 4.1. The ring buffer is composed of DMA-capable memory pages which contain packet data. These pages are linked into a continuous ring buffer through descriptors. Descriptors are 64-bit values, i.e. pointers to memory pages. There are two types of descriptors, distinguished by the least significant bit:

1.   Descriptors pointing to memory blocks with packet data.
2.   Descriptors pointing to memory blocks with more descriptors.

The last descriptor in a memory block with descriptors is of type 2 and points to the next memory block with descriptors. Since the DMA controller needs access to the descriptors, memory blocks with descriptors are also DMA-capable memory pages. The structure of the ring buffer (shown in Figure 4.1) determines the way packets can be manipulated. The packets have to be processed and released one by one, because a single unprocessed packet would block the reception of new packets. The area of unprocessed packets has to be continuous.
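As an illustration only, walking the descriptor chain could look like the sketch below; the helper phys_to_virt() and the assignment of LSB values to descriptor types are our assumptions, not part of the SZE2 specification.

uint64_t desc = *desc_ptr;

if ((desc & 1ULL) == 0) {
        /* assumed type 1: descriptor points to a DMA page with packet data */
        process_data_page(phys_to_virt(desc & ~1ULL)); /* hypothetical helper */
        desc_ptr++;
} else {
        /* assumed type 2: last descriptor of a block, pointing to the next
         * DMA page that holds further descriptors */
        desc_ptr = phys_to_virt(desc & ~1ULL);         /* hypothetical helper */
}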

Figure 4.1: SZE2 ring buffer architecture

4.2.2.   Format of Data Segments

The segments of data (i.e., packets) transferred through SZE2 generally have variable length. A header is attached before the data part of every segment. Both the header part and the data part are aligned to eight-byte address boundaries. If the length of the header or of the data block is not a multiple of eight, blank space is inserted between the parts so that every segment starts at an eight-byte boundary. The header contains:

1.   Length of the whole segment (segment size, 2-byte value): header length aligned to eight bytes + data length.

2.   Length of the hardware header (header size, 2-byte value): header length without the first 4 bytes (segment size and header size), not aligned to eight bytes.

3.   Additional application-specific control information.

The placement of SZE2 segments in a ring buffer is shown in Figure 4.2.

Figure 4.2: SZE2 data segments
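To make the segment layout concrete, the following is a minimal parsing sketch based solely on the description above; the structure and function names are ours, and the fields are assumed to be little-endian.

#include <stdint.h>

/* Hedged sketch: fixed part of a SZE2 segment header, as described above */
struct sze2_hdr {
        uint16_t seg_size; /* aligned header length + data length */
        uint16_t hdr_size; /* hw header length, excluding these 4 bytes */
        /* application-specific control information follows */
};

/* Return a pointer to the packet data of the segment at 'seg' and
 * store the data length in '*data_len'. */
static inline const uint8_t *
sze2_segment_data(const uint8_t *seg, uint16_t *data_len)
{
        const struct sze2_hdr *h = (const struct sze2_hdr *)seg;
        /* 4 header bytes + hw header, padded up to an 8-byte boundary */
        uint16_t hdr_aligned = (4 + h->hdr_size + 7) & ~7;

        *data_len = h->seg_size - hdr_aligned;
        /* the next segment starts at seg + ((h->seg_size + 7) & ~7) */
        return seg + hdr_aligned;
}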

4.3.   PMD Implementation Details

The szedata2 poll mode driver (PMD) is linked against the libsze2 library, whose API provides operations for DMA channel initialization and data transfers through the SZE2 layer.

4.3.1.   Initialization of RX and TX Queues

One RX (and TX) queue is created for each RX (and TX) DMA channel during the poll mode driver initialization routine, according to the values from the --vdev command-line option. For each queue, a SZE2 device is opened and initialized by calls to the functions szedata_open(), szedata_subscribe3() and szedata_start(). The path to the SZE2 character device file obtained from the --vdev command-line option is passed as a parameter to szedata_open(). The following code snippet illustrates the calling of these functions during the initialization routine:

internals->rx_queue[num_sub].sze = szedata_open(internals->sze_dev);
if (internals->rx_queue[num_sub].sze == NULL)
        return -1;
ret = szedata_subscribe3(internals->rx_queue[num_sub].sze, &rx, &tx);
if (ret) {
        szedata_close(internals->rx_queue[num_sub].sze);
        internals->rx_queue[num_sub].sze = NULL;
        return -1;
}
ret = szedata_start(internals->rx_queue[num_sub].sze);
if (ret) {
        szedata_close(internals->rx_queue[num_sub].sze);
        internals->rx_queue[num_sub].sze = NULL;
        return -1;
}

Closing is done by a call to the function szedata_close().
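For completeness, the corresponding queue release could look like the following sketch (mirroring the initialization snippet above; the actual driver may structure this differently):

if (internals->rx_queue[num_sub].sze != NULL) {
        szedata_close(internals->rx_queue[num_sub].sze);
        internals->rx_queue[num_sub].sze = NULL;
}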

4.3.2.   RTE Ethernet Device API

Not all functions from the Ethernet Device API are implemented by the driver. The exported functions are shown in the following structure definition:

static struct eth_dev_ops ops = {
        .dev_start = eth_dev_start,
        .dev_stop = eth_dev_stop,
        .dev_close = eth_dev_close,
        .dev_configure = eth_dev_configure,
        .dev_infos_get = eth_dev_info,
        .rx_queue_setup = eth_rx_queue_setup,
        .tx_queue_setup = eth_tx_queue_setup,
        .rx_queue_release = eth_queue_release,
        .tx_queue_release = eth_queue_release,
        .link_update = eth_link_update,
        .stats_get = eth_stats_get,
        .stats_reset = eth_stats_reset,
        .mac_addr_set = eth_mac_addr_set,
};

4.3.3.   Receiving Packets

There are two callback functions for retrieving packets:

•   eth_szedata2_rx()
•   eth_szedata2_rx_scattered()

The first function handles only packets which can be stored in one mbuf. The second function handles all packets, chaining mbufs if needed. The decision as to which function is set as the callback for the RTE Ethernet API is made according to the configuration option enable_scatter.
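A plausible sketch of this selection, assuming the usual DPDK convention of assigning the burst callback during device configuration (the exact location in the driver may differ):

if (dev->data->dev_conf.rxmode.enable_scatter == 1) {
        dev->data->scattered_rx = 1;
        dev->rx_pkt_burst = eth_szedata2_rx_scattered;
} else {
        dev->data->scattered_rx = 0;
        dev->rx_pkt_burst = eth_szedata2_rx;
}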

Packet data in DPDK are stored in mbuf structures. As the PMD has no control over the releasing of received packets, and SZE2 segments have to be released before further segments can be processed, packets have to be copied from the SZE2 ring buffers into mbuf structures during the receive routine. While it may seem inefficient to copy every single packet from the continuous SZE2 buffer to a separate mbuf, our measurements show that this overhead is merely 12.5% (RX) and 16.7% (TX) of CPU time for 64B packets. This was measured with no real packet processing application running; the relative overhead in a real deployment will therefore be even smaller, and is certainly acceptable. The function szedata_rx_lock_data() from the libsze2 API is called in the RX callback function to retrieve the area of the ring buffer with data ready to be processed, as follows:

sze->ct_rx_lck = szedata_rx_lock_data(sze_q->sze, ~0U);

If there is an area with unprocessed packets, it is locked and retrieved by the call shown above. After that, the SZE2 segments are parsed step by step. The packet data is stored in an mbuf structure as shown in the following code snippet, where the callback function processes only non-scattered packets.

if (packet_size <= buf_size) {
        /* sze packet will fit in one mbuf, go ahead and copy */
        rte_memcpy(rte_pktmbuf_mtod(mbuf, void *),
                        packet_ptr1, packet_len1);
        if (packet_ptr2 != NULL) {
                rte_memcpy((void *)(rte_pktmbuf_mtod(mbuf, uint8_t *) +
                                packet_len1), packet_ptr2, packet_len2);
        }
        mbuf->data_len = (uint16_t)packet_size;
        mbuf->pkt_len = packet_size;
        mbuf->port = sze_q->in_port;
        bufs[num_rx] = mbuf;
        num_rx++;
        num_bytes += packet_size;
} else {
        /*
         * sze packet will not fit in one mbuf,
         * scattered mode is not enabled, drop packet
         */
        RTE_LOG(ERR, PMD,
                "SZE segment %d bytes will not fit in one mbuf "
                "(%d bytes), scattered mode is not enabled, "
                "drop packet!!\n",
                packet_size, buf_size);
        rte_pktmbuf_free(mbuf);
}

This code snippet shows the callback function processing all packets.

if (packet_size <= buf_size) {
        /* sze packet will fit in one mbuf, go ahead and copy */
        rte_memcpy(rte_pktmbuf_mtod(mbuf, void *),
                        packet_ptr1, packet_len1);
        if (packet_ptr2 != NULL) {
                rte_memcpy((void *)
                        (rte_pktmbuf_mtod(mbuf, uint8_t *) +
                        packet_len1), packet_ptr2, packet_len2);
        }
        mbuf->data_len = (uint16_t)packet_size;
} else {
        /*
         * sze packet will not fit in one mbuf,
         * scatter packet into more mbufs
         */
        struct rte_mbuf *m = mbuf;
        uint16_t len = rte_pktmbuf_tailroom(mbuf);

        /* copy first part of packet */
        /* fill first mbuf */
        rte_memcpy(rte_pktmbuf_append(mbuf, len),
                packet_ptr1, len);
        packet_len1 -= len;
        packet_ptr1 = ((uint8_t *)packet_ptr1) + len;

        while (packet_len1 > 0) {
                /* fill new mbufs */
                m->next = rte_pktmbuf_alloc(sze_q->mb_pool);
                if (unlikely(m->next == NULL)) {
                        rte_pktmbuf_free(mbuf);
                        /*
                         * Restore items from sze structure
                         * to state after unlocking (eventually
                         * locking).
                         */
                        sze->ct_rx_lck = ct_rx_lck_backup;
                        sze->ct_rx_rem_bytes = ct_rx_rem_bytes_backup;
                        sze->ct_rx_cur_ptr = ct_rx_cur_ptr_backup;
                        goto finish;
                }
                m = m->next;
                len = RTE_MIN(rte_pktmbuf_tailroom(m), packet_len1);
                rte_memcpy(rte_pktmbuf_append(mbuf, len),
                        packet_ptr1, len);
                (mbuf->nb_segs)++;
                packet_len1 -= len;
                packet_ptr1 = ((uint8_t *)packet_ptr1) + len;
        }

        if (packet_ptr2 != NULL) {
                /* copy second part of packet, if exists */
                /* fill the rest of currently last mbuf */
                len = rte_pktmbuf_tailroom(m);
                rte_memcpy(rte_pktmbuf_append(mbuf, len),
                        packet_ptr2, len);
                packet_len2 -= len;
                packet_ptr2 = ((uint8_t *)packet_ptr2) + len;

                while (packet_len2 > 0) {
                        /* fill new mbufs */
                        m->next = rte_pktmbuf_alloc(sze_q->mb_pool);
                        if (unlikely(m->next == NULL)) {
                                rte_pktmbuf_free(mbuf);
                                /*
                                 * Restore items from sze
                                 * structure to state after
                                 * unlocking (eventually
                                 * locking).
                                 */
                                sze->ct_rx_lck = ct_rx_lck_backup;
                                sze->ct_rx_rem_bytes = ct_rx_rem_bytes_backup;
                                sze->ct_rx_cur_ptr = ct_rx_cur_ptr_backup;
                                goto finish;
                        }
                        m = m->next;
                        len = RTE_MIN(rte_pktmbuf_tailroom(m), packet_len2);
                        rte_memcpy(rte_pktmbuf_append(mbuf, len),
                                packet_ptr2, len);
                        (mbuf->nb_segs)++;
                        packet_len2 -= len;
                        packet_ptr2 = ((uint8_t *)packet_ptr2) + len;
                }
        }
}
mbuf->pkt_len = packet_size;
mbuf->port = sze_q->in_port;
bufs[num_rx] = mbuf;
num_rx++;
num_bytes += packet_size;

When all packets from the locked area are processed, the area is unlocked (and free to receive new packets) by a call to the function szedata_rx_unlock_data() as follows:

szedata_rx_unlock_data(sze_q->sze, sze->ct_rx_lck_orig);
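Putting the pieces together, the RX burst routine follows a lock, parse, unlock pattern; the schematic below summarizes it (the loop condition and the parsing helper are hypothetical placeholders for the snippets shown above):

sze->ct_rx_lck = szedata_rx_lock_data(sze_q->sze, ~0U); /* lock ready data */
while (num_rx < nb_pkts && segments_remain(sze)) {      /* hypothetical */
        parse_next_segment(sze, &packet_ptr1, &packet_len1,
                        &packet_ptr2, &packet_len2);    /* hypothetical */
        /* copy the segment into one or more mbufs, as shown above */
}
szedata_rx_unlock_data(sze_q->sze, sze->ct_rx_lck_orig); /* release area */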

4.3.4.   Sending Packets

The callback function eth_szedata2_tx() for sending packets handles both scattered and non-scattered packets. The function szedata_tx_lock_data() from the libsze2 API is called to get a free area from the SZE2 ring buffer as follows:

lck = szedata_tx_lock_data(sze_q->sze,
        RTE_ETH_SZEDATA2_TX_LOCK_SIZE,
        sze_q->tx_channel);

While the locked area is large enough, packets can be copied to the ring buffer. The locked area can be divided into two parts. Information from the mbuf structure is parsed, and the SZE2 segment header and packet data are copied to the ring buffer. The following code snippet shows how the packet is stored into the ring buffer when it fits into one part of the locked area:

/* write packet length at first 2 bytes in 8B header */
*((uint16_t *)dst) = htole16(
        RTE_SZE2_PACKET_HEADER_SIZE_ALIGNED + pkt_len);
*(((uint16_t *)dst) + 1) = htole16(0);

/* copy packet from mbuf */
tmp_dst = ((uint8_t *)(dst)) +
        RTE_SZE2_PACKET_HEADER_SIZE_ALIGNED;
if (mbuf_segs == 1) {
        /*
         * non-scattered packet,
         * transmit from one mbuf
         */
        rte_memcpy(tmp_dst,
                rte_pktmbuf_mtod(mbuf, const void *),
                pkt_len);
} else {
        /* scattered packet, transmit from more mbufs */
        struct rte_mbuf *m = mbuf;
        while (m) {
                rte_memcpy(tmp_dst,
                        rte_pktmbuf_mtod(m, const void *),
                        m->data_len);
                tmp_dst = ((uint8_t *)(tmp_dst)) +
                        m->data_len;
                m = m->next;
        }
}

dst = ((uint8_t *)dst) + hwpkt_len;
unlock_size += hwpkt_len;
lock_size -= hwpkt_len;
rte_pktmbuf_free(mbuf);
num_tx++;
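The hwpkt_len used above is the on-ring footprint of one TX segment. Based on the alignment rules from section 4.2.2, one plausible computation is the following (RTE_ALIGN_CEIL is a standard DPDK macro; the exact expression in the driver may differ):

/* 8B SZE2 header plus packet data, padded to an 8-byte boundary */
hwpkt_len = RTE_ALIGN_CEIL(
        RTE_SZE2_PACKET_HEADER_SIZE_ALIGNED + pkt_len, 8);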

The following code snippet is responsible for storing a packet into the ring buffer when the packet has to be copied into two parts of a locked area:

/* write packet length at first 2 bytes in 8B header */
*((uint16_t *)dst) = htole16(RTE_SZE2_PACKET_HEADER_SIZE_ALIGNED +
        pkt_len);
*(((uint16_t *)dst) + 1) = htole16(0);

/*
 * If the raw packet (pkt_len) is smaller than lock_size
 * get the correct length for memcpy
 */
write_len = pkt_len < lock_size - RTE_SZE2_PACKET_HEADER_SIZE_ALIGNED ?
        pkt_len :
        lock_size - RTE_SZE2_PACKET_HEADER_SIZE_ALIGNED;

rem_len = hwpkt_len - lock_size;

tmp_dst = ((uint8_t *)(dst)) +
        RTE_SZE2_PACKET_HEADER_SIZE_ALIGNED;
if (mbuf_segs == 1) {
        /*
         * non-scattered packet,
         * transmit from one mbuf
         */
        /* copy part of packet to first area */
        rte_memcpy(tmp_dst,
                rte_pktmbuf_mtod(mbuf, const void *),
                write_len);
        if (lck->next)
                dst = lck->next->start;
        /* copy part of packet to second area */
        rte_memcpy(dst,
                (const void *)(rte_pktmbuf_mtod(mbuf, const uint8_t *) +
                write_len),
                pkt_len - write_len);
} else {
        /* scattered packet, transmit from more mbufs */
        struct rte_mbuf *m = mbuf;
        uint16_t written = 0;
        uint16_t to_write = 0;
        bool new_mbuf = true;
        uint16_t write_off = 0;

        /* copy part of packet to first area */
        while (m && written < write_len) {
                to_write = RTE_MIN(m->data_len,
                        write_len - written);
                rte_memcpy(tmp_dst,
                        rte_pktmbuf_mtod(m, const void *),
                        to_write);
                tmp_dst = ((uint8_t *)(tmp_dst)) + to_write;
                if (m->data_len <= write_len - written) {
                        m = m->next;
                        new_mbuf = true;
                } else {
                        new_mbuf = false;
                }
                written += to_write;
        }

        if (lck->next)
                dst = lck->next->start;

        tmp_dst = dst;
        written = 0;
        write_off = new_mbuf ? 0 : to_write;

        /* copy part of packet to second area */
        while (m && written < pkt_len - write_len) {
                rte_memcpy(tmp_dst, (const void *)
                        (rte_pktmbuf_mtod(m, uint8_t *) + write_off),
                        m->data_len - write_off);
                tmp_dst = ((uint8_t *)(tmp_dst)) +
                        (m->data_len - write_off);
                written += m->data_len - write_off;
                m = m->next;
                write_off = 0;
        }
}

dst = ((uint8_t *)dst) + rem_len;
unlock_size += hwpkt_len;
lock_size = lock_size2 - rem_len;
lock_size2 = 0;
rte_pktmbuf_free(mbuf);
num_tx++;

When the whole burst is copied to the ring buffer or there is not enough space for the next packet, the function szedata_tx_unlock_data() from the libsze2 API is called as follows:

szedata_tx_unlock_data(sze_q->sze, lck, unlock_size);

After this, the corresponding part of the ring buffer is handed over to the NIC and transmitted.
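Schematically, the TX burst routine thus mirrors the RX side with a lock, copy, unlock pattern (a sketch; the space check is simplified):

lck = szedata_tx_lock_data(sze_q->sze,
        RTE_ETH_SZEDATA2_TX_LOCK_SIZE, sze_q->tx_channel);
for (i = 0; i < nb_pkts; i++) {
        if (lock_size < hwpkt_len)      /* no space left for this packet */
                break;
        /* write the 8B SZE2 header and copy the mbuf data, as shown above */
}
szedata_tx_unlock_data(sze_q->sze, lck, unlock_size);   /* hand off to NIC */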

4.4.   Performance Measurements

We have measured the performance of the implementation on a dual Intel Xeon E5-2660 v3 system running at 2.60 GHz. Full details can be found in [7]; we present only the most important results here. In the graphs below, the C/T notation denotes the number of physical CPU cores (C) and active packet processing threads (T). RX throughput reaches the theoretical maximum of 100 Gb/s Ethernet when using four or more cores (Figure 4.3).

Figure 4.3: RX Throughput

TX throughput gets close to the theoretical limit when using eight cores (Figure 4.4).

Figure 4.4: TX Throughput

Finally, the combined RX and TX throughput is also close to the theoretical limit of the 100 Gb/s Ethernet interface when eight CPU cores are active (Figure 4.5).

Figure 4.5: RX+TX Throughput

4.5.   Conclusions and Future Work

We introduced and described the details of our DPDK implementation for the COMBO-100G1 platform. We use the SZE2 interface for fast data transfers from our implementation platform over PCI Express with minimal overhead. We also showed that it is possible to implement a layer converting data from a bus-friendly format (SZE2) to a CPU-friendly format (DPDK) without losing high throughput (we are able to process packets at close to the theoretical limits).

This adaptation layer is fully implemented and integrated into the DPDK library. The next step is to use our DPDK implementation in real BEBA applications.

5.   OpenState on Open vSwitch

5.1.   Porting the BEBA Architecture to OVS

5.1.1.   Implementing OpenState in OVS: Several Possible Approaches

In WP2, a proof of concept of the BEBA switch was implemented (by CNIT) and tested. It was based on ofsoftswitch13, a reference switch originally developed by CPqD (Centro de Pesquisa e Desenvolvimento em Telecomunicações, Brazil) [8] and designed to provide a simple implementation of the features described by OpenFlow 1.3. ofsoftswitch13 is primarily targeted at developers willing to test new switch functionalities. It has a simple code base, which makes it suitable for a proof of concept, but it lacks the performance of production-quality software such as Open vSwitch. Open vSwitch (OVS) is an open source, performance-oriented multilayer software switch [9]. It supports standard management interfaces and is designed to enable effective network automation through programmatic extensions. It features a standalone, multi-threaded fast dataplane, and also supports the userspace I/O drivers of the DPDK libraries. The stability, the widespread adoption, and particularly the performance of this software make it an ideal choice for software acceleration of the BEBA architecture. Several approaches can be considered to implement the OpenState extensions on top of OVS. With ofsoftswitch13, the BEBA basic forwarding abstraction was developed as an OpenFlow experimenter extension; a first possibility is thus to reuse and adapt this code for OVS. A second approach relies on features already implemented in OVS: indeed, stateful processing is already supported to some extent in Open vSwitch.

5.1.2.   Reusing OVS Features to Implement BEBA's Basic Forwarding Abstraction

We prefer to reuse existing OVS features to port the BEBA architecture to OVS. This choice is motivated by two reasons. First, the code base of OVS is more complex than that of ofsoftswitch13, and porting the previously developed extensions may not be trivial. Second, we intend to propose the BEBA features to the Open vSwitch community as part of the dissemination task (T7), but it seems unlikely that they would accept new code implementing functionality that can already be obtained with the existing base. The features allowing us to envision stateful processing are some of the Nicira vendor extensions which, as of this writing, are only known to be implemented by Open vSwitch. We use the following ones:

learn() action

When a packet matches a flow containing a learn() action, it triggers the addition or the modification (if a flow with similar matching fields exists), in any OpenFlow table, of the flow provided as an argument. This enables dynamic addition or modification of the flow table without requiring the assistance of the controller. Note, though, that it cannot be used to delete flows or to update the (either soft or hard) timeouts of existing flows.

registers

Registers are temporary storage for intermediate metadata. OVS handles eight registers, which are set to 0 for all packets when they enter the switch. They can be used as arguments for actions, but also to store a state number between two submissions of a packet to the OpenFlow tables.

resubmit() action

The resubmit() action is used to re-search an OpenFlow table (the current table or any other one) for a flow matching the packet, and to execute the associated actions. In our case it can be used to perform the actions associated with a given state once this state has been determined, or once a new flow has been learnt.

These features enable us to implement the equivalent of the state table of BEBA's abstraction layer. To illustrate their use, we provide a detailed example. Port knocking has already been introduced as a demonstration for the BEBA switch prototype in deliverable D2.2 (section 4.2); here we unroll the steps of its execution on an OVS switch. The topology used is the following:

+------------+               +------------+               +------------+
|            |               |            |               |            |
|   Sender   |  10.100.0.1   | OVS switch |  10.100.0.2   |  Receiver  |
|            |---------------| + fastpath |---------------|            |
|            |               |            |               |            |
+------------+               +------------+               +------------+

Sender wants to open a UDP connection to port 22 on the Receiver host. The port knocking sequence is: ports 10, 11, 12, 13, then 22 (UDP). For each state we give the output of the following command, run on the switch:

ovs-ofctl dump-flows br0

The output comes from an Ubuntu 14.04 virtual machine; some non-relevant fields were removed. The following indicators have been added to the left of the flows to visualize the evolution of the tables:

>    first pass of the packet
>>   second pass
>>>  third pass
!!   new rule
^^   new rule, replacing the one above

A "pass" represents, in this document, the passage of a packet through a flow table. There may be several passes for a given packet if the set of actions of the matched flow includes one or more resubmit() actions.

Here are the two initial tables, set up by the controller. Table 0 is associated with BEBA's state table in the abstraction model, while table 1 is roughly equivalent to the flow table.

     table=0, priority=100, arp actions=FLOOD
     table=0, priority=10, reg1=0 actions=load:0x1->NXM_NX_REG1[], resubmit(,0), resubmit(,1)
     table=0, priority=10, reg1=0x1 actions=load:0->NXM_NX_REG0[]
     table=1, priority=100, udp, reg0=0, tp_dst=10 actions=learn(table=0, priority=100, eth_type=0x800, NXM_OF_IP_SRC[], reg1=0x1, load:0x1->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x1, tp_dst=11 actions=learn(table=0, priority=100, eth_type=0x800, NXM_OF_IP_SRC[], reg1=0x1, load:0x2->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x2, tp_dst=12 actions=learn(table=0, priority=100, eth_type=0x800, NXM_OF_IP_SRC[], reg1=0x1, load:0x3->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x3, tp_dst=13 actions=learn(table=0, priority=100, eth_type=0x800, NXM_OF_IP_SRC[], reg1=0x1, load:0x4->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x4, tp_dst=22 actions=output:2
     table=1, priority=10, ip actions=learn(table=0, priority=100, eth_type=0x800, NXM_OF_IP_SRC[], reg1=0x1, load:0->NXM_NX_REG0[])
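For reference, the controller installs these rules over the control channel, but an equivalent entry can also be added by hand with ovs-ofctl. For instance, the first learn() rule of table 1 could be installed with a command along these lines (the exact quoting may vary with the shell):

ovs-ofctl add-flow br0 "table=1,priority=100,udp,reg0=0,tp_dst=10,actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0x1->NXM_NX_REG0[])"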

We note that ARP packets are flooded every time (they match the first rule, which has the highest priority). Thus, from now on we focus on IP packets only. Let us start with the port knocking sequence. Sender sends the first packet of the secret sequence, e.g. with the command:

echo -n '.' | nc -u -q1 10.100.0.2 <port>

This first packet arrives with reg1 == 0, thus the switch sets reg1 to 1 and resubmits the packet first to table 0 (then to table 1).

     table=0, priority=100, arp actions=FLOOD
>    table=0, priority=10, reg1=0 actions=load:0x1->NXM_NX_REG1[], resubmit(,0), resubmit(,1)
     table=0, priority=10, reg1=0x1 actions=load:0->NXM_NX_REG0[]
     table=1, priority=100, udp, reg0=0, tp_dst=10 actions=learn(table=0, priority=100, eth_type=0x800, NXM_OF_IP_SRC[], reg1=0x1, load:0x1->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x1, tp_dst=11 actions=learn(table=0, priority=100, eth_type=0x800, NXM_OF_IP_SRC[], reg1=0x1, load:0x2->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x2, tp_dst=12 actions=learn(table=0, priority=100, eth_type=0x800, NXM_OF_IP_SRC[], reg1=0x1, load:0x3->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x3, tp_dst=13 actions=learn(table=0, priority=100, eth_type=0x800, NXM_OF_IP_SRC[], reg1=0x1, load:0x4->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x4, tp_dst=22 actions=output:2
     table=1, priority=10, ip actions=learn(table=0, priority=100, eth_type=0x800, NXM_OF_IP_SRC[], reg1=0x1, load:0->NXM_NX_REG0[])

We can observe that the rule hit by the packet has two resubmit actions; resubmitting to table 0 might look useless at first. It would be faster to have this rule directly load the correct state number into register reg0 and only submit to table 1, and then to update that rule with the next state number. But in OVS it is not possible to have a resubmit action inside a learn action (and hence to update at runtime the rule we hit here). This is why we have the double resubmit in this rule: the first resubmit to table 0 loads the state number, the second resubmit to table 1 matches the flow associated with this state number. On the second pass through table 0, the first packet now has reg1 == 1, so state number 0 is loaded into reg0; after that, the packet is resubmitted to table 1 (the flow table).

     table=0, priority=100, arp actions=FLOOD
     table=0, priority=10, reg1=0 actions=load:0x1->NXM_NX_REG1[], resubmit(,0), resubmit(,1)
>>   table=0, priority=10, reg1=0x1 actions=load:0->NXM_NX_REG0[]
     table=1, priority=100, udp, reg0=0, tp_dst=10 actions=learn(table=0, priority=100, eth_type=0x800, NXM_OF_IP_SRC[], reg1=0x1, load:0x1->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x1, tp_dst=11 actions=learn(table=0, priority=100, eth_type=0x800, NXM_OF_IP_SRC[], reg1=0x1, load:0x2->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x2, tp_dst=12 actions=learn(table=0, priority=100, eth_type=0x800, NXM_OF_IP_SRC[], reg1=0x1, load:0x3->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x3, tp_dst=13 actions=learn(table=0, priority=100, eth_type=0x800, NXM_OF_IP_SRC[], reg1=0x1, load:0x4->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x4, tp_dst=22 actions=output:2
     table=1, priority=10, ip actions=learn(table=0, priority=100, eth_type=0x800, NXM_OF_IP_SRC[], reg1=0x1, load:0->NXM_NX_REG0[])

As this packet is part of the correct port knocking sequence (UDP, port 10), it triggers the first rule of table 1, and a new rule is learnt in table 0. No further action is required, and the packet is dropped.

     table=0, priority=100, arp actions=FLOOD
!!   table=0, priority=100, ip, reg1=0x1, nw_src=10.100.0.1 actions=load:0x1->NXM_NX_REG0[]
     table=0, priority=10, reg1=0 actions=load:0x1->NXM_NX_REG1[], resubmit(,0), resubmit(,1)
     table=0, priority=10, reg1=0x1 actions=load:0->NXM_NX_REG0[]
>>>  table=1, priority=100, udp, reg0=0, tp_dst=10 actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0x1->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x1, tp_dst=11 actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0x2->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x2, tp_dst=12 actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0x3->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x3, tp_dst=13 actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0x4->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x4, tp_dst=22 actions=output:2
     table=1, priority=10, ip actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0->NXM_NX_REG0[])

Then the second packet of the sequence arrives (UDP, port 11). The first pass through table 0 is identical: it hits the rule with the two resubmit actions. But on the second pass, the packet triggers the newly learnt rule, loading state number 1 into reg0, and thus reaches the rule associated with state 1 in table 1. This adds another rule replacing the previous one in table 0: now, instead of loading the number associated with state 1 into reg0, state 2 will be used. The packet is dropped.

     table=0, priority=100, arp actions=FLOOD
>>   table=0, priority=100, ip, reg1=0x1, nw_src=10.100.0.1 actions=load:0x1->NXM_NX_REG0[]
^^   table=0, priority=100, ip, reg1=0x1, nw_src=10.100.0.1 actions=load:0x2->NXM_NX_REG0[]
>    table=0, priority=10, reg1=0 actions=load:0x1->NXM_NX_REG1[], resubmit(,0), resubmit(,1)
     table=0, priority=10, reg1=0x1 actions=load:0->NXM_NX_REG0[]
     table=1, priority=100, udp, reg0=0, tp_dst=10 actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0x1->NXM_NX_REG0[])
>>>  table=1, priority=100, udp, reg0=0x1, tp_dst=11 actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0x2->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x2, tp_dst=12 actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0x3->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x3, tp_dst=13 actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0x4->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x4, tp_dst=22 actions=output:2
     table=1, priority=10, ip actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0->NXM_NX_REG0[])

The third packet of the sequence (UDP, port 12) likewise passes twice through table 0, triggers the newly learnt rule, and reaches the flow matching state 2 in table 1. Again, this updates the state-assignment rule in table 0, which now loads state 3. Then the packet is dropped.

     table=0, priority=100, arp actions=FLOOD
>>   table=0, priority=100, ip, reg1=0x1, nw_src=10.100.0.1 actions=load:0x2->NXM_NX_REG0[]
^^   table=0, priority=100, ip, reg1=0x1, nw_src=10.100.0.1 actions=load:0x3->NXM_NX_REG0[]
>    table=0, priority=10, reg1=0 actions=load:0x1->NXM_NX_REG1[], resubmit(,0), resubmit(,1)
     table=0, priority=10, reg1=0x1 actions=load:0->NXM_NX_REG0[]
     table=1, priority=100, udp, reg0=0, tp_dst=10 actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0x1->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x1, tp_dst=11 actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0x2->NXM_NX_REG0[])
>>>  table=1, priority=100, udp, reg0=0x2, tp_dst=12 actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0x3->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x3, tp_dst=13 actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0x4->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x4, tp_dst=22 actions=output:2
     table=1, priority=10, ip actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0->NXM_NX_REG0[])

The same thing happens with the fourth packet (UDP, port 13): the state number is increased again.

     table=0, priority=100, arp actions=FLOOD
>>   table=0, priority=100, ip, reg1=0x1, nw_src=10.100.0.1 actions=load:0x3->NXM_NX_REG0[]
^^   table=0, priority=100, ip, reg1=0x1, nw_src=10.100.0.1 actions=load:0x4->NXM_NX_REG0[]
>    table=0, priority=10, reg1=0 actions=load:0x1->NXM_NX_REG1[], resubmit(,0), resubmit(,1)
     table=0, priority=10, reg1=0x1 actions=load:0->NXM_NX_REG0[]
     table=1, priority=100, udp, reg0=0, tp_dst=10 actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0x1->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x1, tp_dst=11 actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0x2->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x2, tp_dst=12 actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0x3->NXM_NX_REG0[])
>>>  table=1, priority=100, udp, reg0=0x3, tp_dst=13 actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0x4->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x4, tp_dst=22 actions=output:2
     table=1, priority=10, ip actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0->NXM_NX_REG0[])

Finally, a UDP packet sent to port 22 passes twice through table 0 and is sent to the flow associated with state 4 in table 1. No rule is updated in the state table this time: instead of loading a new state number, the flow has a simple output:2 action that sends the packet towards the Receiver. This packet is therefore forwarded, and the port knocking sequence is complete.

     table=0, priority=100, arp actions=FLOOD
>>   table=0, priority=100, ip, reg1=0x1, nw_src=10.100.0.1 actions=load:0x4->NXM_NX_REG0[]
>    table=0, priority=10, reg1=0 actions=load:0x1->NXM_NX_REG1[], resubmit(,0), resubmit(,1)
     table=0, priority=10, reg1=0x1 actions=load:0->NXM_NX_REG0[]
     table=1, priority=100, udp, reg0=0, tp_dst=10 actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0x1->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x1, tp_dst=11 actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0x2->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x2, tp_dst=12 actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0x3->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x3, tp_dst=13 actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0x4->NXM_NX_REG0[])
>>>  table=1, priority=100, udp, reg0=0x4, tp_dst=22 actions=output:2
     table=1, priority=10, ip actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0->NXM_NX_REG0[])

If at any step an IP packet "breaks" the correct sequence (in the example below, the third packet of the sequence is anything but UDP on port 12), then the low-priority catch-all IP flow of table 1 is triggered. This flow reinitializes the rule loading the state number in table 0 (now loading 0 into reg0), causing the port knocking process to start again from scratch.

     table=0, priority=100, arp actions=FLOOD
>>   table=0, priority=100, ip, reg1=0x1, nw_src=10.100.0.1 actions=load:0x2->NXM_NX_REG0[]
^^   table=0, priority=100, ip, reg1=0x1, nw_src=10.100.0.1 actions=load:0->NXM_NX_REG0[]
>    table=0, priority=10, reg1=0 actions=load:0x1->NXM_NX_REG1[], resubmit(,0), resubmit(,1)
     table=0, priority=10, reg1=0x1 actions=load:0->NXM_NX_REG0[]
     table=1, priority=100, udp, reg0=0, tp_dst=10 actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0x1->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x1, tp_dst=11 actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0x2->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x2, tp_dst=12 actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0x3->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x3, tp_dst=13 actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0x4->NXM_NX_REG0[])
     table=1, priority=100, udp, reg0=0x4, tp_dst=22 actions=output:2
>>>  table=1, priority=10, ip actions=learn(table=0,priority=100,eth_type=0x800,NXM_OF_IP_SRC[],reg1=0x1,load:0->NXM_NX_REG0[])

With similar setups, it is possible to emulate the BEBA abstraction layer for the other use cases. In this example we have directly provided the flows in their "final form", as they are interpreted by the switch (that is to say, we instructed the switch to register the initial flows displayed in the first output trace). Manually handling the learn actions and the register operations is complex and error prone, and very difficult to transpose to production applications. For widespread adoption of such a switch, some work has to be performed upstream. In particular, the SDN controller used with the OVS switch should be designed so as to enable fast programming of these features and to set up the flows accordingly in the switch over the control path. Fortunately, the adaptation of the controller has already been realized as part of task T2: this is the Ryu controller wrapper for BEBA (see deliverable D2.2, section 4.2). This wrapper can be reused to control an OVS-based switch. Thus, it is possible for the network administrator to use the BEBA abstraction layer to easily describe new applications and to make advanced use of stateful processing while limiting the risk of errors. Still, there is some work to do, since bending OVS features towards stateful processing as we currently do has some limitations. Some of them seem to be related to OVS's internal functioning and could probably be patched, while others are due to OpenFlow's specification. Two major issues have been observed:

•   There seems to be a race condition inside OVS: for the port knocking use case, sending the messages of the correct sequence in a burst sometimes results in the state being reset to 0 (and thus port 22 being unreachable). This could be an issue related to multi-threading and parallelism.

•   The granularity of idle and hard timeouts in OVS is one second, while BEBA use cases need microsecond resolution (the idle age is reset each time the flow is hit; the hard age is not). This is because, in this model, the state entries are based on OpenFlow entries, and the OpenFlow specification defines the timeouts in seconds [10].

Also, three minor issues can be summarized as follows:

•   There is no support for FSMs with "dynamic timeouts" on transitions (there is no way to change the value of timeouts set at the creation of flows). The learn action cannot modify these timeouts, since it acts as a modify request in the same way as the command ovs-ofctl --strict mod-flows would do; per the OpenFlow specification, timeouts are left unchanged for such requests.

•   Transitions with both timeouts (idle and hard) are not supported. There is no direct way to know which timeout would be exceeded first.

•   The learn action executed from the group table does not work (a group table consists of group entries; the ability for a flow entry to point to a group enables OpenFlow to represent additional methods of forwarding [10]). This could be an OVS bug.

At the time of this writing, the investigation of these issues is still in progress.

5.2.   Improving the OVS-based BEBA Switch

5.2.1.   Accelerating Packet Processing

Open vSwitch is production-oriented software with good performance. Assuming we manage to solve the aforementioned issues, we will obtain a fast BEBA-compliant switch. Furthermore, it has been experimentally verified that applications such as port knocking run seamlessly with the fast path included in the 6WINDGate™ product from 6WIND, whose architecture is shown in Figure 5.1. OVS processing can thus be offloaded within a DPDK environment, even when acting as a BEBA switch.

Figure 5.1: 6WINDGate™ architecture

Additional experiments are to be conducted in order to provide statistics about performance in terms of packet throughput, as well as memory usage. In parallel with our efforts to improve performance, there is also a need to review the security of the system. In particular, resilience and scalability under heavy load are paramount, since the BEBA architecture is expected to run applications such as use case UC12: protection against DDoS. Our purpose is not to focus on improving the security of the switch itself, but rather to ensure that using the BEBA abstraction layer over an OVS switch does not expand its attack surface. This evaluation will be closely related to performance monitoring: we intend to analyze the memory overhead induced by the use of the "state" tables, as well as the processing time needed for the creation (through learn() actions) of these rules.

5.2.2.   New Paradigm, New Features: eBPF

Aside from usability, another way to unlock new possibilities for the administrator of the switch could reside in technologies such as P4 and eBPF. Indeed, the following was stated in BEBA's deliverable D2.2 (section 3.3.3):

One of the major limitations of OpenFlow is that it requires either a software, firmware or even hardware update in order to support new actions (BEBA supports new encapsulation protocols such as GENEVE, and new headers for existing protocols such as NSH or GBP for VXLAN). In order to regain agility with SDN dataplanes, eBPF is being defined so as to allow the injection of arbitrary dataplane actions. eBPF is an extension of BPF with some just-in-time CPU optimizations. As demonstrated by some ongoing work on OVS [11], eBPF can be used to enhance OVS's dataplanes, but it suffers from a generic performance issue due to the need for a BPF engine in the datapath (kernel, fast path, ASICs). Most of the CPU (or ASIC) clock cycles of eBPF programs are spent in packet header parsing, so there is an attempt to mix eBPF and P4 with several objectives:

•   Transform P4’s OVS needs toward eBPF (http://openvswitch.org/support/slides/p4.pdf),

•   Have some native P4 blocks (ASIC blocks, C based fast path/dataplane blocks) that can pre-define some packet processing without the use of eBPF.

These two options in the vSwitching dataplane should make it possible to create new SDN features with fair performance (eBPF only), while ultimate performance can be achieved using the same P4 description hard-wired (hard-coded) into the dataplane.

More specifically, BPF (Berkeley Packet Filter) was designed in 1992 to act as a socket filter and allows working with raw link-layer packets. It consists of a virtual machine executing bytecode, and of a specific language somewhat close to assembly. It is possible to write programs in P4 or in a subset of C and to compile them into BPF language. This language has been designed to allow easy manipulation of captured packets. It also results from strong concerns regarding safety and security: a BPF program always terminates, and there can be no infinite loop. In order to attain this goal, several mechanisms have guided the design of BPF. Some examples include:

•   Backward jumps are not allowed (no loops).
•   Program length is limited to 4096 instructions.
•   The instruction set has some built-in safety (no exposed stack pointer; instead, the load instruction has a mem modifier).
•   Dynamic packet boundary checks are provided.

Such measures enable user-defined BPF programs to be run by the in-kernel BPF machine. Running inside the kernel allows, in many situations, for performance gains, since undesired packets can be dropped before even being copied to user space. In addition, BPF programs can be compiled right before execution (JIT, just-in-time compiling), hence benefiting from CPU optimizations. Thus BPF is a very efficient tool to deal with packets, with performance roughly equivalent to native x86 code, but it has a limited set of instructions.

Several extensions have been added to BPF over the years; they are not to be confused with the extended version, eBPF, which has been evolving since 2013. eBPF extends BPF's possibilities but preserves its safety measures, making it safe to run in the kernel of production systems. It redefines and extends the set of instructions, relying on a common subset of several assembly languages. It can map some memory space so that it can be shared between user and kernel space. It also makes it possible to call certain kernel functions from programs. JIT compiling is preserved: for instance, on x86 systems, the code is turned by the user space compiler into some "simplified x86 assembly", which is in turn verified in the kernel, and then each "simplified" instruction is translated into real x86 by the JIT compiler. The process is efficient, as:

•   All registers map one-to-one.
•   Most instructions map one-to-one.
•   The BPF call instruction maps to the x86 call instruction.

These powerful features make eBPF suitable not only for packet filtering, but also for general networking, event tracing, kernel optimizations, and other possible use cases [12]. The BEBA abstraction layer could also benefit from the flexibility, the expressiveness and the performance of eBPF. Transposing its architecture to an eBPF program would provide an interesting reference against which to benchmark the version based on "vanilla" OVS. We intend to experiment in this direction, and to develop a proof of concept of an eBPF-running BEBA switch. In this regard, adding eBPF support to the OVS software would seem the most direct way to do so. However, a number of reasons hold us back:

•   Implementing eBPF in OVS is far from trivial.

•   OVS core developers have already considered adding eBPF support [11], and have actually implemented most of it on a dedicated Git branch [13]; but it has not been merged so far, and there must be good reasons for that. We believe it best to let them lead the way in this direction.

•   OVS does not include a fast path, and eBPF actions would take more time to be launched.

A fast path built on top of the standard stack and of OVS is available with the 6WINDGate™ product; it relies on standard Linux tools for management. As it happens, eBPF support has recently been added to the tc command (traffic control, part of the iproute2 package) [14]. Thus we intend to:

•   Verify that the BEBA abstraction layer can be expressed in the form of eBPF programs, which have their own limitations;

•   Port the eBPF virtual machine directly into the fast path and control the injection of an eBPF BEBA switch through the tc command (see Figure 5.2).

Figure 5.2: eBPF inclusion in BEBA switch—system architecture
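As a sketch of the intended workflow, a restricted-C program could be compiled to an eBPF object with clang and attached to the ingress path with tc roughly as follows (the source file name is hypothetical, and the exact filter options depend on the iproute2 version):

clang -O2 -target bpf -c beba_ebpf.c -o beba_ebpf.o
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: bpf obj beba_ebpf.o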

Assuming this works, it will provide significant results to compare against the BEBA implementation based on OVS learn() actions. It will help identify the best approach in terms of performance, and the feedback proposed to the Open vSwitch community could pave the way for eBPF integration into OVS.

5.3.   Conclusions and Future Work

Switching from ofsoftswitch13 to a production-oriented virtual switch such as OVS enables us to obtain much better performance. Thanks to the work performed by CNIT, we know that we can run the BEBA architecture on Open vSwitch with little to no modification of the code base: use cases such as port knocking have been successfully tested on the "vanilla" OVS version. Furthermore, those applications also appear to run seamlessly with a fast path solution, thus benefiting from further software acceleration. This integration with Open vSwitch and with the fast path is not one hundred percent complete, though; a number of issues remain to be solved and are still work in progress. Besides addressing these, we mean to provide numerical results so as to evaluate the memory overhead as well as the actual performance gain obtained through the use of OVS and of the fast path, respectively. After that, we intend to explore other acceleration possibilities with eBPF, an alternative technology on which we could potentially base the BEBA abstraction layer.

With high-performance features such as JIT compiling, we hope that eBPF, once in the fast path, will constitute a stepping stone to attaining better processing speeds.

6.   Conclusions

In this deliverable we have described a number of ongoing activities targeted at implementing BEBA's switch abstraction using existing software switch frameworks such as ofsoftswitch13 and Open vSwitch, as well as implementation and evaluation results aimed at accelerating the performance of these software switches. The work done so far is promising: for instance, in this document we have shown that porting Open vSwitch to mSwitch (we call it mSwitch-OVS) gives us a significant speed-up, while DPDK support for the COMBO-100G1 hardware accelerator card allows us to almost reach the theoretical maximum at 100 Gb/s. As future work we are looking into a number of directions. Regarding the mSwitch-OVS effort, we are planning to leverage the ptnetmap framework to accelerate communications between the switch and virtual ports/virtual machines, a common and useful scenario in today's cloud and data center deployments. With respect to ofsoftswitch13, the next step is to adapt it to PFQ and to add multi-core support. We further intend to explore new theoretical directions in parallel traffic dispatching in order to efficiently handle advanced stateful processing on multi-core CPUs. Regarding the COMBO-100G1 work, the adaptation layer is fully implemented and integrated into the DPDK library; we are now concentrating on using our DPDK implementation in real BEBA applications. The final main activity for future work is the integration of CNIT's proof-of-concept implementation into Open vSwitch. We also intend to explore other acceleration possibilities with eBPF, an alternative technology on which we could potentially base the BEBA abstraction layer.

7.   Bibliography  

[1]   ofsoftswitch13 project homepage: https://github.com/CPqD/ofsoftswitch13
[2]   PF_RING project homepage: http://www.ntop.org/products/packet-capture/pf_ring/pf_ring-zc-zero-copy/
[3]   L. Rizzo, "netmap: a novel framework for fast packet I/O," in Proc. of USENIX ATC '12. USENIX Association, 2012, pp. 1–12.
[4]   DPDK project homepage: http://dpdk.org
[5]   PFQ project homepage: http://www.pfq.io
[6]   Netcope. NFB-100G1 FPGA Board. [Online] https://www.invea.com/en/products-and-services/fpga-cards/nfb-100g
[7]   Liberouter. DPDK performance measurement using SZEDATA2 PMD. [Online] https://www.liberouter.org/wp-content/uploads/2015/09/pmd_szedata2_dpdk_measurement-2015-09-11.pdf
[8]   ofsoftswitch13 project homepage. [Online] https://github.com/CPqD/ofsoftswitch13
[9]   Open vSwitch project homepage. [Online] http://openvswitch.org
[10]   Open Networking Foundation. OpenFlow Switch Specification, version 1.4.0. [Online] https://www.opennetworking.org/images/stories/downloads/sdn-resources/onf-specifications/openflow/openflow-spec-v1.4.0.pdf
[11]   J. Gross. OVS Micro Summit Notes. ovs-dev mailing list. [Online] October 2014. http://openvswitch.org/pipermail/dev/2014-October/047421.html
[12]   A. Starovoitov. BPF – in-kernel virtual machine. [Online] 2015. https://events.linuxfoundation.org/sites/events/files/slides/bpf_collabsummit_2015feb20.pdf
[13]   B. Pfaff. OVS commit on a non-master branch: Implement p4-to-bpf translation, a BPF interpreter, and flow extraction. [Online] 2015. https://github.com/blp/ovs-reviews/commit/54f921adac7b1d44f3f691edbade277badff1f4b
[14]   tc-bpf(8) man page. [Online] https://git.kernel.org/cgit/linux/kernel/git/shemminger/iproute2.git/tree/man/man8/tc-bpf.8
[15]   M. Honda, F. Huici, G. Lettieri and L. Rizzo, "mSwitch: A Highly-Scalable, Modular Software Switch," ACM SIGCOMM Symposium on SDN Research (SOSR), June 2015.
[16]   L. Rizzo, M. Carbone and G. Catalli, "Transparent acceleration of software packet forwarding using netmap," INFOCOM 2012.
[17]   S. Garzarella, G. Lettieri and L. Rizzo, "Virtual device passthrough for high speed VM networking," IEEE/ACM ANCS 2015, May 2015.

