Making a Packet-value Based AQM on a Programmable Switch for Resource-sharing and Low Latency
Ludwig Toresson
Faculty of Health, Science and Technology
Subject: Computer Science
Points: 30 hp
Supervisor: Andreas Kassler
Examiner: Karl-Johan Grinnemo
Date: 210125
Making a Packet-value Based AQM on a
Programmable Switch for Resource-sharing
and Low Latency
Ludwig Toresson
<ludwig [email protected]>
© 2021 The author(s) and Karlstad University
Abstract
A rapidly growing number of advanced applications running over the internet require
ultra-low latency and high throughput. Bufferbloat is one of the best-known problems,
adding delay in the form of packets being enqueued into large buffers before being
transmitted. It has been addressed by the development of various Active Queue Management
(AQM) schemes that control how large the queue buffers are allowed to grow. Another
important aspect today is how the available bandwidth can be shared between applications
with different priorities. The Per-Packet Value (PPV) concept has been presented as a
solution for resource-sharing: packets are marked according to predefined marking policies,
and the packet value is taken into consideration in drop/mark decisions, so that higher
packet values are prioritized at bottleneck links.
In this thesis, a design of a packet value-based AQM on a programmable Barefoot
Tofino switch is presented. It uses a combination of the Proportional Integral controller
Enhanced (PIE) AQM scheme and the PPV concept to make drop decisions when queuing
delay is detected. Packet value statistics are collected in the P4 programmable data
plane to maintain knowledge of the distribution of packet values. With the dropping
probability calculated through the PIE AQM scheme, a decision can be made about which
packets should be dropped.
An evaluation shows that with the implemented PV AQM, a low queuing delay can
be achieved by dropping an appropriate number of packets. It also shows that the PV
AQM controls the resource-sharing between different traffic flows according to a predefined
marking policy.
Keywords— PPV, PIE, SDN, AQM, Resource-sharing
Sammanfattning
There is a rapidly growing number of advanced applications running over the internet
that require extremely low latency and high throughput. Bufferbloat is one of the
best-known problems, causing delay in the form of packets being placed in large buffers
before being forwarded. This has been addressed by the development of various Active
Queue Management (AQM) schemes to control how large the queue buffers may grow.
Another important aspect today is how the available bandwidth can be shared between
applications with different priorities. The Per-Packet Value (PPV) concept has been
presented as a solution for resource-sharing by marking packets according to predefined
marking policies. The packet value is taken into account when making drop/mark
decisions, which leads to higher packet values being prioritized at bottleneck links.
In this thesis, a design of a packet value-based AQM on a programmable Barefoot
Tofino switch is presented. It uses a combination of the Proportional Integral controller
Enhanced (PIE) AQM scheme and the PPV concept to make drop decisions when queuing
delay is detected. Packet value statistics are collected in the P4 programmable data
plane to maintain knowledge of the distribution of packet values. With the probability
calculated through the PIE AQM scheme, a decision can be made about which packets
should be dropped.
An evaluation shows that with this implemented AQM, a low queuing delay can be
achieved by dropping an appropriate number of packets. It also shows that the AQM
controls the resource-sharing between different traffic flows according to a predefined
marking policy.
Acknowledgement
I would like to thank Prof. Andreas Kassler, my supervisor at Karlstad University, for his
support, for the possibility to work on this project, and for his great feedback and ideas.
I would also like to thank Jonathan Langlet for providing initial guidance to get started
with the hardware setup; my colleague Maher Shaker for making this project more enjoy-
able and for his support; Szilveszter Nadas at Ericsson Research for his feedback and
knowledge; and, finally, my family for their never-ending support.
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Objectives and Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Ethics and Sustainability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 4
2.1 Software-Defined Networking . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 P4: Programming the data plane . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.1 Architecture Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.2 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.3 Control Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.4 Deparsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Related Work 9
3.1 PIE AQM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 PPV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.3 PVPIE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4 Design of the PV AQM 13
4.1 Design Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.2 Design Challenges and Decisions . . . . . . . . . . . . . . . . . . . . . . . . 15
4.3 PV AQM Design Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3.1 Uniform Packet Value Histograms . . . . . . . . . . . . . . . . . . . 19
4.3.2 Packet Value Distribution and ECDF . . . . . . . . . . . . . . . . . 21
5 Implementation 22
5.1 Data Plane Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.1.1 P4 Ingress Parser . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.1.2 P4 Ingress Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.1.3 P4 Ingress Deparser . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.1.4 P4 Egress Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.2 Control Plane Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.2.1 CTV Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.2.2 Inverse ECDF Update . . . . . . . . . . . . . . . . . . . . . . . . . 27
6 Tools 28
6.1 Control Plane Interaction API . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.2 Pipeline Traffic Manager API . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.3 Iperf3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.4 Flent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.5 Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
7 Control Plane Measurements 30
7.1 Histogram Registers vs. Counters . . . . . . . . . . . . . . . . . . . . . . . 31
7.2 CTV Calculations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
7.3 Packet Value Distribution Update . . . . . . . . . . . . . . . . . . . . . . . 38
8 Evaluation 39
8.1 Evaluation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
8.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
8.2.1 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
8.2.2 Queuing delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
8.2.3 Resource-sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
8.3 Evaluation of Traffic Without AQM . . . . . . . . . . . . . . . . . . . . . . 43
8.4 Evaluation of the PV AQM With Uniform Ranges . . . . . . . . . . . . . . 45
9 Conclusion 51
10 Future Work 51
References 52
A Throughput and delay with up to 40 flows per TVF 57
B Silver flow with 8 times less throughput 58
C Reading Register Instances 59
D Pseudo code 60
D.1 Linear search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
D.2 Binary search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
D.3 ECDF Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
List of Figures
2.1 P4 packet processing pipeline. Courtesy of Menth et al. [1]. . . . . . . . . 6
2.2 Parser example state diagram . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1 Throughput value functions. Courtesy of ELTE [2]. . . . . . . . . . . . . . 11
3.2 PVPIE scheme. Courtesy of Laki et al. [3]. . . . . . . . . . . . . . . . . . . 13
4.1 Abstract overview of the control plane loops. . . . . . . . . . . . . . . . . . 18
4.2 Data- and control plane interaction. Interval TA: update ECDF; interval TB: update CTV. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3 Uniform packet value ranges . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.4 ECDF diagram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.1 P4 Ingress processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
7.1 Reading 128 register values from hardware versus from software. . . . . . . 33
7.2 Reading 512 register values from hardware versus from software. . . . . . . 34
7.3 Difference between reading 512 values from a register versus counter. . . . 34
7.4 CTV with linear search. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
7.5 CTV with linear search (scaled). . . . . . . . . . . . . . . . . . . . . . . . 36
7.6 CTV calculations with binary search (scaled). . . . . . . . . . . . . . . . . 37
7.7 Total time to update CTV. . . . . . . . . . . . . . . . . . . . . . . . . . . 37
7.8 Time to get 256 histogram counters. . . . . . . . . . . . . . . . . . . . . . 38
7.9 Time to update the ECDF curve. . . . . . . . . . . . . . . . . . . . . . . . 39
8.1 Evaluation setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
8.2 Marker TVF functions. Courtesy of Maher Shaker [4]. . . . . . . . . . . . . 42
8.3 Multiplying the number of flows with two every 15 seconds without AQM. 44
8.4 Queue delay without AQM. . . . . . . . . . . . . . . . . . . . . . . . . . . 45
8.5 Throughput when multiplying the number of flows every 15 seconds. . . . . 47
8.6 Throughput per TVF when multiplying the number of flows every 15 seconds. 47
8.7 Queue delay when multiplying the number of flows every 15 seconds. . . . 48
8.8 CTVs when multiplying the number of flows every 15 seconds. . . . . . . . 49
8.9 ECDF for a different number of flows. . . . . . . . . . . . . . . . . . . . . . 50
A.1 Multiplying the number of flows with 2 every 15 seconds starting with 5
flows per TVF. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
B.1 2 gold, 2 silver flows using a silver TVF with 8 times less throughput than
the gold TVF. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
C.1 Reading register instances from switch hardware. Reading times grow linearly with the number of instances. . . . . . . . . . . . . . . . . . . . . . . 59
List of Tables
8.1 Client/server setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
8.2 Evaluation setup parameters. . . . . . . . . . . . . . . . . . . . . . . . . . 41
8.3 PV AQM parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
1 Introduction
Today's Internet is very complex, with many different applications running over it, and
these applications need different constraints. One major problem discussed for years has
been the bufferbloat problem [5], where buffer queues on the internet are allowed to grow,
which creates unnecessary delays. Applications like remote brain surgery and financial
trading require ultra-low delays, and networking devices such as switches handling regular
TCP without any effective AQM at bottlenecks can cause users considerable delays. Various
AQM schemes such as PI2 [6], RED [7], CoDel [8] and PIE [9] have been developed as
solutions to the bufferbloat problem, removing much of the queue delay added by large
queue buffers.
Quality of Service (QoS) management [10, 11] and resource-sharing are another important
research area. Different types of network traffic have different throughput and latency
requirements. Per-Packet Value (PPV) [12] is a concept created for QoS management and
resource-sharing, defining how the available bandwidth should be shared among traffic
flows. With PPV, packets are marked at an edge node with a value according to
operator-defined marking policies, which can assign different priorities to different traffic
types. This concept has been combined with the PIE AQM scheme into a packet-value
based PIE AQM (PVPIE) that provides both queue delay control and resource-sharing:
at the AQM node, the packet values are taken into account when making drop decisions.
Implementing the PVPIE concept on a fixed-function commodity switch would be difficult
due to the non-configurable packet processing pipeline. This thesis instead builds on the
emerging concept of Software-Defined Networking (SDN). With data plane programmable
switches, a programmer can modify how packets are processed through the packet
processing pipeline without any modification of the hardware [13].
These concepts are possible due to the programmable switches now being developed and
released. The programmability stems from the reconfigurable match-action pipeline, where
packet processing can be defined to perform different tasks depending on the application.
Previously, the core scheduling functionality was not modifiable; at most, a switch offered
a choice among a few scheduling algorithms. With programmable switches, however, new
algorithms can be developed from scratch, released publicly, and implemented by whoever
wants to apply them. Multiple recent papers [14, 15, 16, 17] present the possibilities
offered by the programmable Barefoot Tofino [18] switch. These papers cover different
networking concepts, which shows the flexibility of implementation opportunities for
programmable switching functionality.
1.1 Motivation
The motivation for this thesis relates to previous implementations of a PVPIE AQM,
which has so far only been implemented and tested in simulated [3] or emulated [19]
environments. In this thesis, a proposed design and implementation of a PVPIE AQM on
a programmable Barefoot Tofino switch is tested and evaluated. The finished implementation
shows the near-future possibilities of flexible and programmable networking devices,
where decisions about network functionality are transferred from the device manufacturers
to the network operators.
1.2 Objectives and Goals
The objective of this thesis is to implement a PV AQM, using the PVPIE concept, on
a programmable Barefoot Tofino switch. This entails studying the architecture of the
target switch, i.e., the Barefoot Tofino switch, to design a solution that provides
resource-sharing and low queuing delay. Programmable networking devices have architectural
limitations, such as the number of processing cycles available for each packet processed
by the device. When defining the data plane (the packet processing pipeline), the
possibilities are limited so that the device can sustain line rate, e.g., 100 Gb/s in the case
of the Barefoot Tofino switch. Consequently, complex operations such as statistical packet
analysis can be offloaded to the control plane (the local CPU on the Barefoot Tofino
switch) instead.
In this thesis the following goals will be accomplished:
• Introduce a design for the PV AQM that can be implemented on the Barefoot Tofino
switch, e.g., what functionalities (algorithms, memory accesses, etc) can be performed
in the data plane, and what needs to be performed in the control plane.
• Present how the PV AQM can be implemented on the Barefoot Tofino switch, e.g.,
the re-configurable data plane, and how functionalities are offloaded to the control
plane.
• Evaluate the precision of the implemented PV AQM on the Barefoot Tofino switch
with respect to resource-sharing and queuing delay.
1.3 Ethics and Sustainability
From an ethical perspective, it is important to notify users about the data that is collected
from packet headers in the data plane. Network operators managing and re-configuring
programmable devices such as the Barefoot Tofino switch need to comply with, for example,
data protection laws such as the GDPR [20].
Another important ethical dilemma is the concern about net neutrality [21] over the
internet. One of the goals of this thesis is to achieve resource-sharing by applying network
operator-defined marking policies that prioritize certain traffic flows. It is important to
understand that such resource-sharing policies can be seen as going against the concept
of net neutrality.
From a sustainability standpoint, the PV AQM implementation will provide one more
solution to the bufferbloat problem. It will also provide traffic scheduling that applies
resource-sharing through the per-packet value concept without losing any of the available
bandwidth.
The programmability of networking devices such as the Barefoot Tofino switch enables
re-configuration, which improves sustainability through the longevity of such devices.
If a networking device needs modification for a specific purpose, no new hardware has to
be bought: the network operator can re-configure the device by applying software changes
instead of buying a new application-specific device.
1.4 Thesis outline
In Section 2, shorter descriptions of the concepts needed to grasp the extent of the thesis
will be presented. In Section 3, the concepts of PPV and PIE will be presented. In Section
4, the design decisions will be introduced, with the related challenges. In Section 5, the
final implementation will be presented in detail. In Section 6, all of the essential tools
used during the thesis will be presented. In Section 7, time measurements are presented
for API executions (Read/Write) and algorithms. In Section 8, the evaluation of the PV
AQM implementation will be presented. The results of resource-sharing and queuing delay
will be shown for when the PV AQM actively controls the traffic flow. In Section 9,
conclusions are drawn from the results, together with proposed work that could further
develop and optimize the PV AQM.
2 Background
In this section, the background needed to understand the scope of the thesis is presented:
the concept of SDN, in which programmability is applied to networking devices to allow
flexible networking functionality, and the P4 programming language, which allows
networking functionality to be defined in an abstract way.
2.1 Software-Defined Networking
In traditional networking, devices are so-called fixed-function: hardware-based and
application-specific, designed and manufactured for a single purpose. In contrast, SDN [22]
is a network architecture that moves the control plane logic out of the forwarding devices
to a centralized location (the controller). By logically centralizing network state
management, many new applications and use-cases become possible. The data plane focuses
on forwarding packets, while the control plane configures the network, telling the data
plane how to handle traffic, e.g., by setting flow tables and data handling policies. In
classic networks, routing decisions are made in a decentralized fashion by the individual
devices; with SDN, these decisions can be centralized.
2.2 P4: Programming the data plane
P4 [23, 24] is a programming language created for the purpose of defining the data plane of
programmable networking devices. This language can be used to define packet processing
in, for example, switches, routers, and Network Interface Cards (NICs). The data plane of
a programmable device will be defined during initialization by the P4 language, in contrast
to traditional devices where the data plane will be fixed-function and not re-configurable.
Recently, researchers have designed ASICs with re-configurable hardware, based on the
concept of Reconfigurable Match Tables (RMTs) [25]. With these configurable ASICs
and the P4 language and compiler, the hardware logic can be modified to perform new
functionalities. The P4 compiler also generates the P4 runtime API through which the
control plane accesses the tables and other objects defined in the P4 code.
2.2.1 Architecture Model
An abstract view of the P4 packet processing pipeline is presented in Figure 2.1. The first
important object in the pipeline is the parser, where packet data is extracted from the
incoming packet. After the parser there is a match-action block called the ingress pipeline,
where packets can be modified and where forwarding rules are applied to decide to which
output buffer the packet should be sent. When the packet has been dequeued from the
buffer, it is processed by a second match-action block called the egress pipeline, where
further modifications can be applied if needed. The last object in the P4 pipeline is the
deparser, where packet data is inserted into the packet before it is sent.
Figure 2.1: P4 packet processing pipeline. Courtesy of Menth et al. [1].
The P4 architecture consists of multiple objects that are used to define the P4 packet
processing pipeline. Among the most important are the header structures, which hold
information about each packet's header fields and sizes. The architecture also provides
multiple extern objects, constructs that can be accessed through an API but are not
themselves programmable. Examples of such objects are counters (counting packets or
packet sizes), digests (structures for sending data from the data plane to the control plane),
etc. These objects are target dependent: an extern object that exists on the Barefoot
Tofino [18] switch may not exist on, for example, the Netronome SmartNIC [26], another
P4 programmable networking device. Another useful part of the P4 architecture is
user-defined metadata, i.e., data structures defined through the P4 language for each
packet. The architecture also provides intrinsic metadata, which carries information about
each packet, for example the time a packet has spent in a queue/buffer, or the time at
which the packet was enqueued or dequeued. Tables are another user-defined object, used
to match a key value to an output value. Tables can, for example, apply forwarding rules,
such as deciding on which output port a packet with a specific destination IP address
should be sent. Finally, the P4 architecture lets a control flow be defined for the packet
processing pipeline, i.e., parsing, ingress processing, egress processing, checksum
calculation, deparsing, etc.
2.2.2 Parsing
P4 uses a construct called a parser, which functions as a state machine that collects
data fields from incoming network packets. A parser begins in a state called "start" and
has two finishing states, "accept" and "reject", in which the packet is accepted or rejected,
respectively.
See Figure 2.2 for an example parser. As seen in the figure, the starting state of the
P4 parser checks the hdr.ethernet.etherType field to see if the next header type is 0x800,
which corresponds to the IPv4 header. The parser then checks whether the IPv4 header
length is 5; if not, it continues parsing additional IPv4 header option fields until it finally
ends up in the accept state. Once the parser has finished in the accept state, the header
fields and metadata are accessible from the packet processing pipeline.
Figure 2.2: Parser example state diagram
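The state machine in Figure 2.2 can be mimicked in ordinary code. The sketch below is a simplified Python illustration, not P4: only the etherType and IPv4 IHL checks from the figure are modeled, and option parsing is reduced to skipping the extra 32-bit words.

```python
import struct

ETHERTYPE_IPV4 = 0x0800

def parse(packet: bytes):
    """Mimic the parser state machine from Figure 2.2.

    Returns ("accept", headers) or ("reject", reason)."""
    if len(packet) < 14:
        return "reject", "truncated ethernet header"
    # The "start" state inspects hdr.ethernet.etherType (bytes 12-13).
    ethertype = struct.unpack("!H", packet[12:14])[0]
    if ethertype != ETHERTYPE_IPV4:
        return "reject", "not IPv4"
    if len(packet) < 15:
        return "reject", "truncated IPv4 header"
    ihl = packet[14] & 0x0F          # IPv4 header length in 32-bit words
    if ihl < 5 or len(packet) < 14 + ihl * 4:
        return "reject", "bad IPv4 header length"
    # ihl == 5: no options; ihl > 5: "parse" (skip) the option words,
    # after which the machine reaches the accept state.
    headers = {"etherType": ethertype, "ihl": ihl,
               "payload_offset": 14 + ihl * 4}
    return "accept", headers
```

As in the P4 parser, reaching "accept" makes the extracted header fields available to the rest of the processing, while any failed check short-circuits to "reject".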
2.2.3 Control Blocks
Within a control block, fields such as header or metadata fields can be used and
manipulated. Match-actions can be called to match values in tables to a specific output
value. It is common to instantiate an ingress and an egress control block for switching
functionality; in the ingress control block, for example, a forwarding match-action is
applied to send packets to the correct output port. Control blocks are instantiated with
a name, input/output parameters, constants, variables, match tables, and actions.
2.2.4 Deparsing
The deparser in the P4 programming language constructs the packet that is to be sent
out from the programmable networking device. Depending on what was extracted from
the packet during parsing, data fields and headers can either be emitted into the packet
or left out. During deparsing, some or all headers can be emitted into the packet again,
depending on the purpose of the data plane processing. For example, an additional header
on top of the Ethernet header can be extracted during parsing and then not emitted during
deparsing, thereby removing it from the outgoing packet.
3 Related Work
In this section, the work related to this thesis is presented. The PPV and PIE AQM
concepts are introduced in separate sections to explain the reasoning behind each. Finally,
the PVPIE concept is introduced, in which the lightweight PIE AQM is combined with
the PPV resource-sharing concept to maintain a low queuing delay while applying
resource-sharing policies that prioritize traffic flows.
3.1 PIE AQM
The Proportional Integral Controller Enhanced (PIE) AQM [9] computes a dropping
probability p, and packets are dropped at random with this probability during enqueuing.
The dropping is done to trigger TCP [27] congestion control [28]: when the TCP sender
detects a drop through lost acknowledgement messages, it slows down its sending rate
(the TCP congestion window). The main aim of the PIE algorithm is to maintain a certain
target queuing delay by observing whether the queue is growing or shrinking; if it is
growing, the dropping probability should intuitively be increased in order to maintain the
desired target queuing delay.

The dropping probability p is updated in three steps. First (1), the current queuing
delay is estimated using Little's law: cur_del = q_len / avg_d_r, where q_len is the
length of packets in the queue and avg_d_r is the average drain rate at which packets are
dequeued. Second (2), the dropping probability is updated through the following formula:
p = p + α(cur_del − tar_del) + β(cur_del − old_del), where α determines how strongly
the deviation of the current queuing delay (cur_del) from the target queuing delay
(tar_del) affects the dropping probability, and β similarly determines the effect of the
deviation of the current queuing delay from the old queuing delay (old_del). Third (3),
the old queuing delay (old_del) is updated to the newly calculated queuing delay
(cur_del).
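The three update steps above can be sketched in Python. This is a minimal illustration, not the full PIE specification: the α, β, and target-delay values are illustrative assumptions, not the tuned constants from the PIE proposal.

```python
# Minimal sketch of the PIE probability update described above.
# alpha, beta and tar_del are illustrative values, not PIE's defaults.

class PieController:
    def __init__(self, alpha=0.125, beta=1.25, tar_del=0.015):
        self.alpha = alpha        # weight of deviation from target delay
        self.beta = beta          # weight of delay trend (growing/shrinking)
        self.tar_del = tar_del    # target queuing delay in seconds
        self.old_del = 0.0        # queuing delay from the previous interval
        self.p = 0.0              # dropping probability

    def update(self, q_len, avg_d_r):
        """Run one PIE update: q_len in bytes, avg_d_r in bytes/s."""
        # (1) Estimate the current queuing delay with Little's law.
        cur_del = q_len / avg_d_r if avg_d_r > 0 else 0.0
        # (2) Update the dropping probability from the two deviations.
        self.p += self.alpha * (cur_del - self.tar_del) \
                + self.beta * (cur_del - self.old_del)
        self.p = min(max(self.p, 0.0), 1.0)  # clamp to a valid probability
        # (3) Remember the current delay for the next interval.
        self.old_del = cur_del
        return self.p
```

Note that starting from old_del = 0, the very first update can raise p through the β (trend) term even when the measured delay equals the target; afterwards, p only grows while the delay exceeds the target or keeps increasing.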
3.2 PPV
PPV [12] is the concept of applying resource-sharing policies by marking packets with
packet values. The value expresses the relative importance of one flow over another; for
example, different flows can have different throughput or delay requirements. Packet
values are considered at a resource node in the network where flows share a bottleneck,
providing resource-sharing by scheduling or dropping packets based on the packet value.
The packet value can be as simple as a representation of a user's subscription level: a
user with a gold membership gets higher throughput than a user with, for example, a
silver membership. Packets with higher packet values are let through and transmitted at
a resource node, while packets with lower packet values are dropped or delayed when the
resource node is fully utilized.
The PPV concept uses Throughput Value Functions (TVFs) to apply different
resource-sharing policies to different flows. Examples of such TVFs can be seen in
Figure 3.1, which shows four different TVFs. A TVF maps a throughput value to the
packet value that will be marked into the packet. The throughput is calculated
independently for each flow to create fairness between flows: if a single flow marked with
the gold TVF in the figure has a higher throughput than the other gold-marked flows, it
will receive lower packet values. By default, higher throughput maps to lower packet
values, which leads to fairness between flows when the packet values are taken into account
in dropping decisions at a resource node. Another important aspect of the PPV concept
is that the throughput used for selecting a packet value is not the measured value itself
but a random one: a value is drawn uniformly between 0 and the calculated throughput,
and this value is used to look up the packet value in the TVF. This creates fairness
between different TVFs. Packets marked with the gold TVF will not always get a higher
packet value, but they will always have a greater chance of being marked with a higher
packet value than packets marked with, for example, the silver TVF. Thanks to the
randomly drawn value, flows marked with the silver TVF cannot be starved at the resource
node by the gold-marked flows.
Figure 3.1: Throughput value functions. Courtesy of ELTE [2].
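The marking step can be sketched as follows. This is a hedged illustration: the two TVF shapes below are invented linear placeholders, not the actual curves from Figure 3.1, and the units are assumptions.

```python
import random

# Hypothetical TVFs mapping a throughput sample (Mbps) to a 16-bit
# packet value. Real deployments use operator-defined curves such as
# those in Figure 3.1; these linear shapes are placeholders only.
def gold_tvf(rate_mbps):
    return max(0, 60_000 - int(rate_mbps) * 400)

def silver_tvf(rate_mbps):
    return max(0, 30_000 - int(rate_mbps) * 400)

def mark_packet(flow_rate_mbps, tvf, rng=random):
    """Pick the packet value for one packet of a flow.

    A throughput value is drawn uniformly in [0, flow_rate_mbps) so
    that a high-rate flow sometimes still receives a high packet value,
    which prevents lower-priority flows from being starved."""
    sample = rng.uniform(0.0, flow_rate_mbps)
    return tvf(sample)
```

For the same throughput sample, a gold-marked packet never gets a lower value than a silver-marked one, yet both spread over a range of values because of the random draw, exactly the fairness property described above.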
3.3 PVPIE
In this section, the concept of combining the PPV resource-sharing with the PIE AQM is
explained under the combined acronym PVPIE [3]. PVPIE follows the original PIE
specification, in which packets are dropped or ECN-marked at random according to a
calculated dropping probability p. In addition, packets are prioritized by their packet
value during the enqueuing phase: a packet with a high packet value is less likely to be
dropped than a packet with a lower packet value. The PVPIE concept can be applied to
achieve resource-sharing between different flows marked with different TVFs (shown in
Figure 3.1), while at the same time maintaining a low queuing delay.
A Congestion Threshold Value (CTV) is calculated at a time t based on the packet value
distribution observed during a recent time interval. The CTV is calculated with the
following formula:

• CTV(t) = ECDF^(-1)_[t−γT, t)(p(t))

The formula applies the inverse of an Empirical Cumulative Distribution Function (ECDF)
to the calculated dropping probability; the ECDF is computed over the window [t − γT, t)
and updated regularly at the time interval γ. If the packet value of a received packet is
less than the calculated CTV, the packet is dropped; otherwise, it is let through. Note
also that if the number of packets received during a time interval is less than 1/p, the
CTV is set to 0, and thus no packets are dropped. Figure 3.2 shows the PVPIE scheme:
at each time interval γ, the dropping probability p is calculated by the PIE controller,
and a new ECDF is calculated to describe the current distribution of packet values.
Finally, the calculated dropping probability p is mapped to a packet value V through the
ECDF, which is used as the current CTV; packets with packet values lower than the CTV
are dropped.
Figure 3.2: PVPIE scheme. Courtesy of Laki et al. [3].
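The per-packet rule of PVPIE can be sketched in a few lines of Python. The ECDF is simplified here to a plain list of cumulative probabilities indexed by packet value, and all names are illustrative rather than taken from an actual implementation:

```python
def ctv_from_ecdf(ecdf, p):
    """Inverse-ECDF lookup: the smallest packet value whose
    cumulative probability reaches the dropping probability p."""
    for pv, cum in enumerate(ecdf):
        if cum >= p:
            return pv
    return len(ecdf) - 1

def should_drop(packet_value, ctv):
    # PVPIE drops packets whose value is below the threshold.
    return packet_value < ctv
```

Under a CTV of 1, a packet with value 0 would be dropped while a packet with value 2 would pass.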
4 Design of the PV AQM
In this section, the design of the PV AQM is presented, along with the challenges of designing a suitable PV AQM and the decisions taken to solve them. Ideally, the PV AQM would be implemented with P4 code in the data plane only. The design needs to be suitable for the Barefoot Tofino switch, where the P4 code must, at compile time, respect the limitations of the target real-time system. The Barefoot Tofino switch only allows P4 programs to be compiled if they can provide packet processing at high speeds (i.e., 100 Gbps).
4.1 Design Overview
The initial goals of the AQM were to follow and implement the PVPIE concept on a
Barefoot Tofino switch in the data plane, as described in Section 3.3. This would entail
the following functional blocks:
• For every packet processed in the data plane:
– Count the packet size in bytes in a memory slot for the marked packet value, to continuously maintain statistics of the distribution of received packet values. This means having a separate memory slot for each packet value, in which all packets with that packet value are counted, giving a history of how many bytes of each packet value have been processed in the data plane. These memory slots are, for the rest of the thesis, referred to as histograms. As defined by the packet value marker [4], the packet value is marked into a 16-bit header field and can support up to 65536 unique packet values, which makes it necessary to design the AQM to count 65536 unique values. Each histogram has to be large enough to count the packet sizes of all packets received during the time interval T, at which point the histograms are used to calculate a new ECDF and CTV, and finally reset to 0. On the Barefoot Tofino switch architecture, the possible histogram sizes are 8-bit, 16-bit, or 32-bit. To avoid overflow when counting the packet sizes, each histogram can be defined as 32-bit, which makes it possible to count at least 4 GB for each packet value, compared to 64 KB with 16-bit histograms.
• Every T ms interval:
– Collect all histograms stored in memory to calculate an ECDF that describes the distribution of the packet values received during the last T ms. Each packet value has to be correlated with a probability describing how large a percentage of the total number of bytes received is marked with that packet value or lower (i.e., the cumulative probability from 0 up to that packet value).
– Calculate a dropping probability with the PIE formula: p = p + α·(cur_del − ref_del) + β·(cur_del − old_del).
– To evaluate the PIE formula, values from the previous calculation have to be stored in memory for use the next time the calculation is computed (i.e., the previously calculated dropping probability and the previous queuing delay). The other variables, such as α, β, and the target queuing delay, do not need to be updated or changed, which is why they can be defined as constants.
– When the new dropping probability has been calculated, it is used to find a new threshold value below which the AQM should drop packet values. In the rest of the thesis, this threshold value is called the Congestion Threshold Value (CTV). To calculate and update a new CTV, the dropping probability should be matched to the point in the ECDF that allows the AQM to drop approximately p percent of the incoming packets.
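The PIE probability update described above can be sketched as follows. The constants α, β, and the reference delay are placeholder tuning values for illustration, not the values used in the thesis:

```python
# Illustrative PIE controller step, following the formula in the text.
ALPHA = 0.125         # assumed proportional gain
BETA = 1.25           # assumed integral gain
REF_DELAY_MS = 15.0   # assumed target queuing delay

def pie_update(prev_p, cur_delay_ms, old_delay_ms):
    """p = p + alpha*(cur_del - ref_del) + beta*(cur_del - old_del),
    clamped to the valid probability range [0, 1]."""
    p = prev_p + ALPHA * (cur_delay_ms - REF_DELAY_MS) \
               + BETA * (cur_delay_ms - old_delay_ms)
    return min(max(p, 0.0), 1.0)
```

The previous probability and the previous delay are the two pieces of state that, as noted above, must be kept between invocations.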
4.2 Design Challenges and Decisions
As seen in the design overview, different functionalities and packet processing operations need to interact in order to implement the PVPIE concept. These operations could all be implemented in the data plane itself; alternatively, several of them can be moved to the control plane. On programmable devices such as Tofino, the data plane has restricted functionality that limits the complexity of operations, due to the real-time requirements of the target platform. The data plane is designed to handle a defined packet processing pipeline at 100 Gbps, which limits the number of processing cycles that can be spent on a single packet. Consequently, complex operations such as the ECDF calculations could be outsourced to the control plane. On the other hand, it is most natural to keep functionality in the data plane where it is most effective. For example, maintaining packet and traffic statistics requires counter operations to be executed on every single packet; such operations should therefore ideally be implemented in the data plane itself.
For example, when supporting many different packet values (e.g., 65536), a naive design would maintain histogram counters per individual packet value. Consequently, updating packet statistics in the data plane would require 65536 stateful memory cells, such as registers or counters, to be maintained and updated per queue. Synchronizing such a large amount of stateful memory from the data plane to the control plane may lead to large latency if the processing of those traffic statistics is outsourced to the control plane. This is due to two reasons. First, control plane processing is significantly slower than the data plane. Second, transferring the content of the stateful memory to the control plane may take several milliseconds (ms), too long for the control plane loops required in the design. Therefore, packet values are not counted in individual memory cells; instead, packet values are grouped into fewer memory cells (see Section 4.3.1 for the concept of uniform packet value histograms).
The first major challenge was to determine which parts of the PVPIE concept could be implemented in the data plane on Tofino. Following the PVPIE paper [3] presented in Section 3.3, a number of operations, like reading/writing memory (CTV, histograms, etc.) and mathematical calculations (the PIE algorithm), have to be completed in the data plane. The least complex part of PVPIE, which has to be done for each processed packet, is to count the size of a packet marked with a packet value into the correct histogram, in order to keep a history of the current distribution of received packet values. This is not a demanding operation. The challenge comes when it is time to update the currently used threshold value. The goal is to calculate a new ECDF and a new CTV every T ms in the data plane. This had been shown to be possible in Mininet [29], where a simulated switch can be defined with P4 but there is no limitation on what can fit inside the packet processing pipeline. In contrast, on a real programmable switch, these challenges became apparent early in the project.
What was initially planned, a strictly data plane controlled PV AQM, would not be possible within the scope of the project. Instead, functionalities such as the ECDF and CTV calculations, and the associated memory writes and reads, had to be moved from the data plane to the control plane, where the limitations are not as strict. In the data plane, they would have to be implemented in P4, which has limited expressibility and functionality. In the control plane, by contrast, any functionality can be implemented in C or Python, which allows for flexibility with the complex mathematical operations needed to, for example, calculate a distribution function for the received packet values.
In Figure 4.1, the proposed design of the control plane operations is presented. In the initial PVPIE paper [3], the CTV is calculated only when a new ECDF has been calculated. For the PV AQM design, two separate control plane loops are instead executed at different time intervals. The ECDF is calculated at a larger time interval TA, because it entails reading values from the data plane through API calls and then calculating a cumulative probability for each packet value to produce the ECDF. At a shorter time interval TB, a new CTV is calculated by matching the calculated dropping probability to a packet value in the ECDF. This allows the PV AQM to react more quickly to changes in the queuing delay without having to recalculate a new ECDF, which would take much longer.
In Figure 4.2, an overview of the planned PV AQM design is presented. It is separated into two parts: the operations completed in the control plane and the operations completed in the data plane. When a packet is received by the switch, a condition checks whether it is time to update the CTV or the ECDF, at which point a digest with the port and queue ID is sent to the control plane. A digest is an extern object in P4 used as a mechanism to send a message from the data plane to the control plane. When the digest is received in the control plane, a condition is checked that either initiates an update of the CTV (i.e., interval TB) or an update of the ECDF (i.e., interval TA) together with an update of the CTV. This condition depends on how many digests have been received in the control plane for a specific port and queue. In the
Figure 4.1: Abstract overview of the control plane loops.
figure, the histogram counters are shown; they are read through an API function to get the current distribution of packet values needed for an update of the ECDF. For the CTV update, the figure also shows the queuing delay register, which is read to obtain the current queuing delay from the data plane. After the update, the calculated CTV is written with an API function and used as the current threshold value in the data plane.
Figure 4.2: Data- and control plane interaction. Interval TA: update ECDF; interval TB: update CTV.
4.3 PV AQM Design Concepts
In this section, the concepts used for the design of the PV AQM are presented. The first concept, called uniform packet value histograms, is used to keep packet value statistics in the data plane; its main purpose is to limit the time it takes to read 65536 unique packet value statistics from the data plane to the control plane. This section also presents in more detail what the ECDF is and how it is used to correlate a dropping probability with a CTV.
4.3.1 Uniform Packet Value Histograms
Packet value histograms is the name used for the counters that count packet sizes in bytes for all packet values. When an update of the packet value distribution is executed, the control plane reads all counters corresponding to a specific port and queue. Due to the added delay of reading one counter per allowed packet value (65536 of them), these histograms are divided over fewer histogram counters. Which histogram counter a packet value corresponds to is decided by range-match table rules on the packet value.
When the control plane is initialized, ranges are decided that determine which packet values are counted into which histogram counter. At initialization, the ranges are equally wide: if 65536 different packet values are divided into 256 equally wide ranges, there are 256 (i.e., 65536/256) packet values per histogram counter.
As an example, in Figure 4.3, assume there are 4 histograms (counters 1, 2, 3, and 4), each counting 64 packet values. With equally wide ranges, counter 1 counts packet values 0 through 63, counter 2 counts packet values 64 through 127, and so on.
Figure 4.3: Uniform packet value ranges
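The uniform range mapping of Figure 4.3 can be sketched in Python; the constants are illustrative (65536 packet values and 256 counters as in the example above), and on the switch this mapping is realized with range-match table rules rather than arithmetic:

```python
NUM_PACKET_VALUES = 65536   # 16-bit packet value field
NUM_HISTOGRAMS = 256        # illustrative number of histogram counters

def histogram_index(packet_value):
    """Map a packet value to its histogram counter, mimicking the
    equally wide range-match rules installed at initialization."""
    width = NUM_PACKET_VALUES // NUM_HISTOGRAMS
    return packet_value // width
```

With 256 counters, each counter covers 256 consecutive packet values: values 0 through 255 map to counter 0, values 256 through 511 to counter 1, and so on.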
4.3.2 Packet Value Distribution and ECDF
Within an interval that fits the time constraints of the control plane, the histogram counters of packet values for a specific queue are collected from the data plane. From all of these histograms, a new ECDF is calculated.
In Figure 4.4, a uniform packet value distribution is converted into an ECDF and plotted. This is not how it would usually look in reality; it serves only as a simple example. Suppose this is the ECDF after an update, and the dropping probability is calculated to be 50%. The CTV algorithm then looks up which packet value correlates with a dropping probability of 50%. In this example, with a uniform distribution, it would be around 65536 · 0.5, i.e., half of the maximum allowed packet value. In theory, this would lead to 50% of all received packets being dropped.
Figure 4.4: ECDF diagram.
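Constructing the ECDF from the histogram byte counts amounts to a normalized cumulative sum. A minimal sketch, with a made-up eight-range uniform distribution as in the example above:

```python
def compute_ecdf(histograms):
    """Cumulative share of bytes per packet value range."""
    total = sum(histograms)
    ecdf, cumulative = [], 0
    for count in histograms:
        cumulative += count
        ecdf.append(cumulative / total)
    return ecdf

# A uniform distribution over 8 ranges: a dropping probability of
# 50% lands at the middle of the value range, as in the example.
ecdf = compute_ecdf([100] * 8)
```

With the uniform input, the cumulative probabilities rise in equal steps of 0.125, so the 50% point falls at the fourth of the eight ranges.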
5 Implementation
In this section, the final implementation of the PV AQM is presented. As mentioned in the design section, because of the limitations found, the implementation is divided into two parts: a control plane and a data plane. P4 was used to access the programmability of the Tofino [18] switch and to define the packet processing pipeline, while Python was used for the control plane interaction due to the available API needed for accessing data (reading and writing registers, counters, etc.) in the data plane.
5.1 Data Plane Implementation
In this section, each part of the implemented P4 packet processing pipeline is presented in detail. The different parts, such as parsing, ingress and egress processing, and deparsing, are presented in the order in which they are executed in the data plane.
5.1.1 P4 Ingress Parser
The ingress parser extracts the data needed by the ingress processing pipeline. The PV AQM needs specific header fields from the packet to function correctly. The most important header that needs to be extracted is the IPv4 header, which holds essential fields for forwarding, like the source and destination IP addresses. The IPv4 header also contains a 16-bit identification field [30], which usually holds information about the group of fragments a packet belongs to. In the PV AQM implementation, this IPv4 identification field holds the 16-bit packet value, which has been marked by the marker at an earlier stage. By default, this is where the packet value marker inserts the packet value, but in a more practical implementation the packet value should probably be inserted into an additional header or field that does not interfere with already established header fields like the IPv4 identification field.
5.1.2 P4 Ingress Processing
The ingress processing control block is where most of the AQM implementation resides,
apart from the control plane functionalities.
The following packet processing has been done within the ingress processing control
block (see Figure 5.1):
1. An exact match table is applied, and a destination IP address is matched to an egress
port number. This is what applies the forwarding rules for the test-bed used during
testing and evaluation.
2. An exact match table is applied to check if the packet is sent from an IP address
that gets marked with a packet value. If it hits, the rest of the AQM functionalities
are activated. This is used to apply the PV AQM functionalities only to flows that
are marked with a packet value.
3. An exact match table is applied to match the egress port and queue ID to an identification number. This identification is used to read the correct CTV for the port and queue, which allows the implementation to use individual CTVs for unique ports.
4. A register action is called to check whether a specific time (e.g., 1 ms in this implementation) has passed since the previous update of the CTV. It returns either one (time to update) or zero (not enough time has passed). This value is stored in a metadata field that tells the P4 ingress deparser whether a digest should be sent to the control plane to initiate an update of the CTV.
5. A second register action is called to get the index of the counter object in which to count the packet size. The action returns 0, 1, 2, or 3, corresponding to one of the four defined counters. The four counters are alternated between to remove the added delay of waiting for a counter to be reset by the control plane before counting into it again.
6. A range match table is applied to match the received packet value to an index in
the counter. This table consists of multiple packet value ranges, with each range
corresponding to a unique counter index (e.g., a unique packet value histogram).
7. A counter action is called on the correct counter index. This action adds the packet's size in bytes to the current value in the counter, so that the counter holds the statistics needed to calculate an ECDF describing the distribution of received packet values in bytes.
8. The ingress global timestamp metadata field is stored into a header field. This makes it possible to emit the header during deparsing and to access the timestamp in the egress processing block, where it is used to calculate the queuing delay (presented in Section 5.1.4).
9. A register action is called on the register holding the current CTV. The action returns one if the packet value marked in the packet is less than the CTV, and zero otherwise.
10. If the returned value is one, the packet is marked to be dropped; otherwise, it is enqueued.
Figure 5.1: P4 Ingress processing.
5.1.3 P4 Ingress Deparser
The ingress deparser has only one purpose besides emitting the packet headers: it checks whether the previously mentioned metadata field is set to one, which indicates that it is time for a CTV update. If so, a digest is sent with the port and queue ID for which the update should be performed.
5.1.4 P4 Egress Processing
The purpose of the egress processing block is to store the current queuing delay in a register. It also writes the delay into a packet header field to enable end-host post-processing of experimental data for traffic analysis. The first part of the egress processing applies an exact match index table on the egress port to find the correct register index at which to update the queuing delay. The queuing delay is then calculated by subtracting the timestamp sent from the ingress processing block from the current timestamp in the egress processing, and is stored in a register by calling a register action. The last part of the egress processing writes the delay to the packet header. This is done by right-shifting the 32-bit queuing delay value by eight bits and overwriting the 16-bit IPv4 identification field with this value. The value corresponds to the current queuing delay in nanoseconds divided by 256 (i.e., 2^8 = 256). This allows for analysis of delays of up to about 17 ms, in contrast to 0.07 ms without the bit shift.
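The delay encoding can be illustrated with a short Python sketch (function names are made up for illustration):

```python
def encode_delay(delay_ns):
    """Right-shift the 32-bit nanosecond delay by 8 so it fits the
    16-bit IPv4 identification field (units of 256 ns)."""
    return (delay_ns >> 8) & 0xFFFF

def decode_delay(field_value):
    """Recover the approximate delay in nanoseconds at the end host."""
    return field_value * 256

# Largest representable delay: 0xFFFF * 256 ns, about 16.8 ms.
max_delay_ns = 0xFFFF * 256
```

The shift trades resolution (256 ns granularity) for range, which is what extends the measurable delay from about 0.07 ms to about 17 ms.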
5.2 Control Plane Implementation
In this section, the control plane functionalities are presented in more detail. The two main purposes of the control plane are to collect packet value statistics to calculate an ECDF, and to calculate a dropping probability that is matched to a CTV which is then written to the data plane. All of the following functionalities are programmed in Python in combination with the available bfrt Python API.
5.2.1 CTV Update
The CTV update is the action of updating the threshold value used in the data plane to decide whether a packet should be dropped. The action is triggered every time a digest message is received in the control plane, which is approximately every 1 ms due to the data plane update timer. When a digest is received, the port and queue ID are unpacked from the digest and used to read (with the API register read function) the current queuing delay from the data plane at the index correlating with the IDs. When the queuing delay has been collected and converted into ms by dividing it by 1000, the CTV update function is called. The called function performs the following:
1. Calculate a new dropping probability with the PIE formula: p = p + α·(cur_del − ref_del) + β·(cur_del − old_del).
2. Check whether the calculated dropping probability is out of bounds (i.e., p < 0 or p > 1), and reset it to the closest boundary if so.
3. Calculate the classic TCP dropping probability according to the PI2 [6] formula: p = (p/2)^2. This is used in the PV AQM to restrict the dropping probability to a maximum of 25%; during testing, it worked better not to allow large, aggressive changes in the dropping probability.
4. Store the current queuing delay and dropping probability in an array data structure, to be used during the next CTV update as the previous queuing delay and previous dropping probability.
5. Binary search (see Appendix D.2 for pseudocode) through the current ECDF to find
the suitable CTV for the current packet value distribution.
6. Write (API register write function) the new CTV to the data plane for use as the
new threshold value for the particular port and queue.
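The squaring of step 3 and the search of step 5 can be sketched as follows. This is a sketch consistent with the description above, not the thesis' actual code (its pseudocode is in Appendix D.2), and the ECDF is simplified to a list of cumulative probabilities indexed by packet value:

```python
def classic_drop_prob(p):
    """PI2-style squaring: with p in [0, 1], the result is
    capped at 25%."""
    return (p / 2) ** 2

def find_ctv(ecdf, p):
    """Binary search for the smallest packet value whose cumulative
    probability in the ECDF reaches the dropping probability p."""
    lo, hi = 0, len(ecdf) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if ecdf[mid] < p:
            lo = mid + 1
        else:
            hi = mid
    return lo
```

Since the ECDF is monotonically non-decreasing, the binary search runs in logarithmic time, which keeps the per-digest CTV update cheap compared to recomputing the ECDF.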
5.2.2 Inverse ECDF Update
The ECDF update is the action where a new cumulative probability function is generated from the current packet value distribution collected from the histogram counters in the data plane. This update is triggered in the control plane approximately every n digests, because a counting variable is incremented by 1 for every digest received. In a practical implementation, a more sensible approach would be a separate timer for the ECDF update, with the data plane sending different digest messages depending on whether the CTV or the ECDF should be updated; for the current test-bed and experiments conducted, however, this was not necessary.
The ECDF update performs the following:
1. Check which counter is currently used for counting packet values into histogram counters. This is possible due to an index (identifying the counter in use) stored in a Python data structure.
2. Synchronize (API counter synchronize function) the counter values from data plane
switch hardware to control plane local software.
3. Write (API register write function) a new index value to the data plane, which tells
it to start counting in a new counter.
4. Read (API counter read function) all packet value histograms from the currently
synchronized counter.
5. Calculate a new ECDF (see Appendix D.3 for pseudocode) with the collected packet
value histograms and store it in a python array structure for use during the CTV
update.
6. Reset (API counter write function) all of the counter values for the previously used
counter.
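The counter rotation implied by steps 1, 3, and 6 can be modeled in plain Python; the API calls are replaced by in-memory state, and the class name is made up for illustration:

```python
NUM_COUNTERS = 4  # the implementation rotates between four counters

class CounterRotation:
    """Tracks which counter the data plane is counting into, so the
    previous one can be synchronized, read, and reset without
    pausing the per-packet counting."""

    def __init__(self):
        self.active = 0

    def swap(self):
        """Advance to the next counter; return the one that is now
        safe to synchronize, read, and reset."""
        previous = self.active
        self.active = (self.active + 1) % NUM_COUNTERS
        return previous
```

Swapping before reading is what removes the delay of waiting for a reset: the data plane immediately counts into the new counter while the control plane processes the old one.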
6 Tools
In this section, the various tools used during the thesis are presented. Two main API tools have been used: one to manage the interaction between the control and data plane, and another to configure the traffic manager (e.g., port or queue configuration). This section also covers the traffic generators that were used, and the programming language used to script and analyze the captured traffic.
6.1 Control Plane Interaction API
Via a connection between the control- and data plane, the control plane can use API calls
to modify P4 extern objects and tables that are used by the data plane. These API calls
are accessed through a Python run-time client running on the switch. In this client, Python scripts can be run, which was used to update the CTV in the PV AQM implementation.
Some examples of how the API functions look:
• Register/Counter functions:
– program_name.control_block_name.register_name.mod(index, value)
– program_name.control_block_name.register_name.get(index)
– program_name.control_block_name.counter_name.operation_counter_sync()
• Table:
– program_name.control_block_name.table_name.add_with_hit(match_value, output_value)
– program_name.control_block_name.table_name.delete(match_value)
6.2 Pipeline Traffic Manager API
The pipeline traffic manager is a part of the control plane functionalities that can modify
and configure, for example, the number of egress queues, queue lengths, and queue priority.
For a multiple queue implementation, the traffic manager could be used to set up the
multiple queue capability. Two queues can be allocated, where one queue can manage the
low latency dependent L4S [31] flows, and the other manages all other traffic flows.
6.3 Iperf3
Iperf3 [32] is available both as a Python library and as a command-line tool on Linux. It can be used to start servers and clients to create network traffic, and allows for dynamic traffic with multiple transport protocols, parallel streams, and binding to specific ports or interfaces. In this project, the iperf tool has been used to add multiple TCP flows in order to analyze how the PV AQM reacts to a varying number of TCP flows.
6.4 Flent
Flent [33] is a flexible network tester that has been used throughout this project both to debug and to evaluate the PV AQM. With Flent, it is possible to generate many different traffic scenarios. When starting Flent, a test can be specified to run, for example, several TCP flows in a single direction or bi-directionally. If desired, UDP and HTTP traffic can also be added to test how different transport protocols react. Flent also has a GUI to analyze and display complex graphs of throughput and latency, with CDF curves, box diagrams, etc.
6.5 Python
Python [34] is an object-oriented programming language that has been used as the main programming language for the control plane interaction. Python was chosen because of its ease of use and because a control plane API already exists for the Tofino switch. Python is also easy to script with and has a large number of available libraries; for example, there are network analysis libraries that can split large PCAP (Packet Capture) files into multiple PCAP files per TCP flow or source IP address.
7 Control Plane Measurements
In this section, multiple measurements are presented. The main purpose of the section is to give the reasoning behind why specific objects (e.g., registers and counters) were used to keep packet value statistics. The section also introduces time measurements of control plane functionalities (e.g., the ECDF and CTV calculations) to approximate the time interval (see Figure 4.1 for the control plane update intervals) at which these functionalities can be executed.
7.1 Histogram Registers vs. Counters
One primary concern that became apparent later in the project is the limited rate at which histogram registers/counters can be read from the data plane to the control plane, and likewise how many bytes of information can be sent with digests from the data plane to the control plane during each packet processing pipeline. These limits restrict the number of unique packet values that can be used for the ECDF calculations. A possible solution is to read specific histograms from the control plane when the packet value distribution needs to be updated. This imposes extra delay, as reading individual values from the control plane is more time consuming than reading them in the data plane. The positive aspect of reading from the control plane is that it does not slow down the packet processing pipeline: reading the histograms from the data plane would entail recirculating packets multiple times through the packet processing pipeline to read a large number of values, whereas when reading from the control plane, packets can flow through the switch without interruption.
To store and count the number of bytes transferred with a specific packet value, there are two choices: the register extern and the counter extern. The counter extern can only be read from the control plane, while the register extern can be read from either the control or the data plane. The register is thus more flexible, since any data accessible from the data plane can be stored in a register, while the counter only counts packets or packet sizes. For this project, where the only necessary use case is to count packet sizes in bytes per packet value, either extern would suffice, so it is important to choose the better-fitted one. For the implementation, the most important criterion is the speed at which the histogram values can be read from the data plane to the control plane to update the ECDF.
Both the register and the counter have multiple API functions available through the bfrt Python API. For each extern, there are two ways to read the histograms: reading the values straight from the hardware, or synchronizing the values from the hardware to local software and then reading them. The difference is that during synchronization, all values stored in the corresponding extern are transferred to local control plane memory, where they can be read more quickly; in contrast, when reading from hardware, the API read function fetches a single instance/value per function call, which is less efficient.
In Figure 7.1, reading 128 register values from the hardware is compared with syncing the register and reading the 128 values from software. The experiment was conducted by reading 128 different indices of a register multiple times and calculating an average. As seen in the figure, there is a slight difference between the reading speeds: reading 128 register values from hardware is a couple of ms quicker than reading them from software, most likely due to the added delay of calling the API function to synchronize the values before reading them. It is important to mention that further evaluation is needed of how the synchronization API function works. The API function possibly has to be used with a callback function (i.e., a function defined by the programmer that is called after the synchronization has completed). In this project, a callback function was not used; instead, the synchronization function is called and the values are read right away. This could mean that the histograms read are previously cached data, and that with a callback function, additional delay would be added to the measurements.
Figure 7.1: Reading 128 register values from hardware versus from software.
In Figure 7.2, a more expected result can be seen: syncing and reading from software increases the speed at which register values can be read, and would be the most efficient way of reading a larger number of register values. For larger registers, the more efficient approach is thus to synchronize and then read the values. However, as shown in Figure 7.1, at some point the synchronization of values from hardware to software is actually not efficient when reading fewer values.
Returning to the criteria and purpose of the packet value histograms, it is important to compare whether counters can be used instead of registers to read packet value histograms from the data plane to the control plane. Figure 7.3 depicts two measurements: reading 512 values from a register versus reading 512 values from a counter. As seen, there is a significant difference in reading speed between a counter and a register. The counter can be instantiated to count packet sizes in bytes, which makes it the most efficient alternative for reading the packet value histograms.
Figure 7.2: Reading 512 register values from hardware versus from software.
Figure 7.3: Difference between reading 512 values from a register versus counter.
7.2 CTV Calculations
In this section, the speed at which the control plane can calculate a new CTV and write
it to the data plane, to be used as the current dropping threshold, is measured.
Each measurement covers the following operations:
1. The time starts when a digest with port- and queue ID is received in the control
plane.
2. The current queuing delay is read from the data plane for the specific port and queue.
The delay is read with an API function to get a single register instance, which takes
around 0.05 ms (see Appendix C for a register reading measurement).
3. A new dropping probability is calculated by using the PIE controller algorithm.
4. The dropping probability is used to match a new CTV in the ECDF by binary/linear
searching (see Appendix D.2 and D.1 for pseudocode).
5. The time stops when the API function to write the new CTV to the data plane has
been called.
To be clear, these time measurements are conducted specifically in the control plane.
In reality, two small delays are added: first, the time it takes for the digest to be sent
from the data plane to the control plane, and second, the time from when the API function
is called until the CTV register has actually been modified in the data plane. These
delays were not added to the measurement, as they do not add any delay to the control plane
functionality, but they do affect the actual time it takes to update the CTV.
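As a rough sketch, steps 2 to 5 can be expressed as follows. The PIE update rule and the lower-bound ECDF lookup follow the description above, but all names, the clamping of the probability, and the omitted register read are illustrative assumptions rather than the exact thesis implementation:

```python
# PIE parameters as in Table 8.3; the queuing-delay read (step 2) is assumed
# to happen before update_ctv is called.
TARGET_MS = 2.0            # target queuing delay
ALPHA, BETA = 0.3125, 3.125

drop_prob = 0.0            # PIE state kept between updates
old_delay_ms = 0.0

def find_ctv(ecdf, prob):
    """Binary search for the smallest packet value whose ECDF entry >= prob."""
    lo, hi = 0, len(ecdf)
    while lo < hi:
        mid = (lo + hi) // 2
        if ecdf[mid] < prob:
            lo = mid + 1
        else:
            hi = mid
    return lo if lo < len(ecdf) else -1

def update_ctv(cur_delay_ms, ecdf):
    """PIE probability update (step 3) followed by the ECDF lookup (step 4)."""
    global drop_prob, old_delay_ms
    drop_prob += ALPHA * (cur_delay_ms - TARGET_MS) \
               + BETA * (cur_delay_ms - old_delay_ms)
    drop_prob = min(max(drop_prob, 0.0), 1.0)   # clamp to [0, 1] (assumption)
    old_delay_ms = cur_delay_ms
    return find_ctv(ecdf, drop_prob)
```

The returned index is the new CTV written to the data plane; the data plane then drops packets whose value falls below it.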
Before measuring the total time it takes to update the CTV, the most time-consuming
part of the update is measured on its own. Figures 7.4 and 7.5 plot how fast the correct
CTV is found by matching a probability to a CTV in the ECDF. Measurements are taken
during real-time traffic to show how fast the calculations are when constant traffic is
running through the switch. As seen in Figure 7.4, there are many outliers, reaching
upwards of 40 ms at most. These outliers are likely due to how the CTV is found in
the ECDF, which is done through linear searching (see Appendix D.1 for pseudocode).
When the dropping probability reaches higher percentages, more steps through the ECDF
have to be taken to find the correct CTV. In Figure 7.5, the same results are plotted
with the y-axis scaled to show that most of the measured times are between 0 and 0.1 ms,
which most likely corresponds to the dropping probability being zero or very close to zero.
Figure 7.4: CTV with linear search. Figure 7.5: CTV with linear search (Scaled).
In Figure 7.6, the search algorithm was changed to binary search (see Appendix D.2
for pseudocode), in which each step through the ECDF halves the number of possible
values left to match. As seen, the calculation times are much more stable than with
linear searching, even during the fluctuation of TCP traffic building up the queue and
then slowing down again, which causes the calculated dropping probability to fluctuate
as well. The times are now stable enough that the CTV is typically found in around 0.25 ms.
Figure 7.6: CTV calculations with binary search (Scaled).
In Figure 7.7, the total time (the operations stated at the start of this section) of
updating the CTV is shown. As seen, the large majority of the times for updating the
CTV are below 1 ms, which makes it possible to update the CTV every millisecond without
overflowing the control plane with digests.
Figure 7.7: Total time to update CTV.
7.3 Packet Value Distribution Update
In this section, measurements are taken on different parts of the packet value distribution
update. The first part to measure is how fast the packet value histograms can be read
from the data plane. The histogram counter size will be set to 256, which would allow the
control plane to quickly fetch the packet value statistics needed to update the ECDF. In
Figure 7.8, the times for getting 256 counter instances are presented. As seen, the times
vary much more than in the earlier measurements comparing registers and counters. The
reason is either that the measurements are now taken during real traffic, which likely
slows down the syncing and reading of counters, or that the control plane in parallel
receives digests from the data plane for which the CTV has to be updated.
Figure 7.8: Time to get 256 histograms counters.
The second important part of the packet value distribution update is the calculation of
a new ECDF (see Appendix D.3 for pseudocode). In Figure 7.9, the times collected for
calculating an ECDF over 65536 packet values are presented. As seen, the times are similar
to reading 256 counter values from the data plane, between 20 and 30 ms. With these
results, an approximate update interval can be set to above 50 ms, because reading
the counter takes about 25 ms plus an additional 25 ms for updating the ECDF. This
update interval is just an approximation and can easily be modified by changing
a parameter in the control plane, which activates an update of the ECDF depending on
the number of digests that have been received (see Section 5.2.2).
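The digest-count trigger mentioned above can be sketched as a small counter in the control plane; the class and parameter names are illustrative:

```python
# Sketch of the digest-driven ECDF refresh trigger described in Section 5.2.2.
ECDF_UPDATE_EVERY = 50    # digests between ECDF refreshes (Table 8.3)

class UpdateScheduler:
    def __init__(self, update_every=ECDF_UPDATE_EVERY):
        self.update_every = update_every
        self.digests_seen = 0

    def on_digest(self):
        """Count a received digest; return True when the ECDF should be
        recomputed from the data plane counters."""
        self.digests_seen += 1
        if self.digests_seen >= self.update_every:
            self.digests_seen = 0
            return True
        return False
```

Every digest still triggers a CTV update; only every 50th additionally triggers the more expensive counter read and ECDF recalculation.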
Figure 7.9: Time to update the ECDF curve.
8 Evaluation
In this section, three primary questions have to be answered. The first question concerns
the actual purpose of the PIE scheme: to achieve lower latency by reducing queuing delay.
Can lower queuing delay be achieved without losing bandwidth compared to the throughput
achieved before adding the PV AQM? The second question concerns the PPV concept. Can
marking packets with a packet value be used together with the PV AQM to achieve resource
sharing between different users? The third question: Can the bottleneck link be fully
utilized while at the same time allowing a specific user to get either lower or higher
throughput compared to another user?
8.1 Evaluation Setup
To evaluate the PV AQM correctly, a few requirements have to be met. The major one is
that a bottleneck has to be created to build up a queuing delay. The queuing delay is
needed to test that a lower delay can be achieved by dropping packets when a high enough
queuing delay is present. A single FIFO queue is used for the bottleneck port on the
Tofino switch. The port is not configured with the traffic manager, i.e., the default
configuration is used, which allows the queue to grow long enough to create a queuing
delay (in this case, above 14 ms). The dropping of packets is also the central part that
makes the resource-sharing functionality work. The packet value marker marks the packets
following a TVF. For the evaluation, packets are sent from two clients to the marker (see
Figure 8.1 for an overview of the setup). The first client (IP 10.0.0.1) is marked
according to a gold TVF, while the other client (IP 10.0.0.2) is marked according to a
silver TVF. By default, this gives one of the clients a higher chance of getting its
packets marked with a higher packet value. Packets marked with a higher packet value have
a lower chance of being dropped at the AQM bottleneck when a queuing delay is present.
When TCP traffic is sent from both clients, the client marked with the silver curve starts
sending fewer packets when drops are noticed. Because of this, a lower throughput is seen
for the client marked with lower packet values, while the other client gets a higher
throughput.
Figure 8.1: Evaluation setup.
Each client/server has the following hardware setup:
Ethernet: Intel 2x10 Gbps NIC 7-series
CPU: Intel i7 3.4 GHz
TCP implementation: CUBIC
Table 8.1: Client/server setup.
The setup parameters that will be used in the evaluation:
Bottleneck on Tofino (Gbps): 1
Nr. of sent TCP flows (gold-silver): 1-1, 2-2, 4-4, 8-8
Propagation delay (ms): 0
Table 8.2: Evaluation setup parameters.
The marker TVFs used for the evaluation are shown in Figure 8.2. As seen in the figure,
when the TCP flows marked with silver have less than 10 Mbps, the silver flows get
2 times less throughput. If the throughput for silver flows goes above 10 Mbps, silver
flows get 4 times less throughput.
Figure 8.2: Marker TVF functions. Courtesy of Maher Shaker [4].
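The marking itself can be sketched as follows. In the PPV concept, each packet gets the TVF evaluated at a uniformly random rate below the flow's measured throughput, so a flow's packets carry a spread of values. The two curves below are rough illustrative stand-ins for the gold and silver TVFs of Figure 8.2, not the exact functions used by the marker:

```python
import random

# Illustrative TVFs (assumptions, not the curves of Figure 8.2). The silver
# curve is derived from the gold curve so that silver traffic is worth 2x
# less below 10 Mbps and 4x less above it.
def gold_tvf(r_mbps):
    return max(0, 37500 - int(500 * r_mbps))

def silver_tvf(r_mbps):
    if r_mbps <= 10.0:
        return gold_tvf(2.0 * r_mbps)
    return gold_tvf(20.0 + 4.0 * (r_mbps - 10.0))

def mark_packet(tvf, flow_rate_mbps):
    """PPV marking: evaluate the TVF at a uniformly random rate below the
    flow's measured throughput, so packets carry a spread of values."""
    return tvf(random.uniform(0.0, flow_rate_mbps))
```

At a bottleneck, dropping all packets below a single threshold value then automatically gives the gold flow the intended throughput advantage.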
8.2 Evaluation Metrics
In this section, the important metrics used for the evaluation of the PV AQM are
introduced. There are three main aspects that are important to analyze: throughput,
queuing delay, and resource-sharing.
8.2.1 Throughput
Throughput is a metric for the speed at which data is sent over the internet. Usually,
throughput is represented in one of two ways: either bits per second (bps) with an
appropriate prefix, or packets per second (pps). For this evaluation, it is essential
both to see the scale of resource-sharing and to make sure that the total throughput
corresponds to the bottleneck link. The latter is important to verify that the
implementation does not limit the speed at which the switch would be able to send
without the PV AQM.
8.2.2 Queuing delay
Queuing delay is one of several delays that may appear on the internet, together with
propagation, transmission, and processing delay. Queuing delay arises in the various
queues on the internet where packets wait before being sent towards their next destination.
In this evaluation, queuing delay is a significant part of what makes the implementation
work because of its importance in calculating a CTV. It is also an important metric to
evaluate due to its significance in reducing end-to-end latency. As described in
Section 5.1.4, the queuing delay is encoded into the IPv4 identification header field,
which is extracted from a PCAP file and used to evaluate the queuing delay.
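The encoding can be sketched as follows; the helper names, and the assumption that the delay is stored in microseconds, are illustrative (the thesis only states that the delay is encoded into the field):

```python
import struct

# Sketch of encoding/decoding the queuing delay in the 16-bit IPv4
# identification field, as used for the PCAP-based evaluation.
def encode_delay(delay_us):
    """Clamp the delay to 16 bits and pack it in network byte order."""
    return struct.pack('!H', min(delay_us, 0xFFFF))

def decode_delay(ipv4_header):
    """The identification field occupies bytes 4-5 of the IPv4 header."""
    (ident,) = struct.unpack_from('!H', ipv4_header, 4)
    return ident
```

The post-processing step walks the captured packets, applies `decode_delay` to each IPv4 header, and plots the resulting per-packet delays.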
8.2.3 Resource-sharing
Resource-sharing is the concept of sharing the total available bandwidth between, for
example, users, TCP flows, or subscription levels. In this evaluation, this metric is
shown as the throughput difference between a user whose packets are marked according to
the gold TVF and a user whose packets are marked according to the silver TVF.
8.3 Evaluation of Traffic Without AQM
To make a proper evaluation of the PV AQM, baseline graphs first need to be created to
which the results can be compared. To do this, tests were run with the Tofino switch only
forwarding packets, without any dropping of packets by the AQM. Instead, the only drops
that can occur are packets dropped by the switch due to a queue being fully utilized.
The Netronome marker still marks packets, but the PV AQM does not drop any packet.
In Figure 8.3, the throughput results are shown when multiplying the number of TCP
flows by two every 15 seconds, starting with one silver flow and one gold flow. Even
though there is no AQM interaction, the packets are still marked by the marker to show
that no resource-sharing occurs when the PV AQM is not active and dropping packets. As
seen, there is no priority for packets marked with the gold TVF compared to the silver
TVF.
Figure 8.3: Multiplying the number of flows with two every 15 seconds without AQM.
In Figure 8.4, the queuing delays are collected from a test without any AQM on the
switch. It shows the well-known saw-tooth behavior where the queue builds up to its
maximum capacity. At that point, packets are dropped, and the TCP traffic slows down by
reducing the sending window before speeding up again, according to TCP Cubic. The figure
displays the queue delay during a 10 s period when sending 16 TCP flows. The maximum
queue delay that may occur is approximately 14 ms.
Figure 8.4: Queue delay without AQM.
8.4 Evaluation of the PV AQM With Uniform Ranges
In this section, the PV AQM implementation with uniform histograms is evaluated. As
mentioned earlier, the essential part to evaluate is the resource-sharing, in which
differently marked traffic flows should share the available throughput. In the following
figures, a test has been run over 60 s. Two clients send traffic to a single server
through the AQM, where the only bottleneck link is situated. The first client is marked
according to the gold TVF shown in Figure 8.2, while the second client is marked
according to the silver TVF, which has lower priority. Every 15 s, the number of TCP
flows for each client is multiplied by two (i.e., at the start a single TCP flow is
started for each client, after 15 s another flow is started for each client, then two
new flows are started for each client, and so on).
The following parameters were used for the PV AQM throughout this evaluation:
Target delay (ms): 2
Alpha: 0.3125
Beta: 3.125
CTV update (ms): 1
ECDF update (received digests): 50
Table 8.3: PV AQM parameters.
In Figure 8.5, the throughput for all flows is shown in light blue and light red, while
the average throughput for each client's TCP flows is shown in dark blue and dark red.
As seen, the throughput is not stable when there is only one TCP flow per client. This
is expected, as drops make high-throughput TCP flows slow down more drastically than
low-throughput flows. At a later stage, when more flows have been added, the throughputs
of the two clients are more stable, and the resource-sharing aspect of the AQM can be
seen more clearly.

In Figure 8.5, it is hard to see how close the flows are to the desired throughput for
each TVF. In Figure 8.6, the throughput per TVF marking is shown when multiplying the
flows by two every 15 s. This is done by calculating the throughput for each client,
where, as mentioned earlier, one client is marked with the gold TVF while the other is
marked with the silver TVF. This figure shows how the number of flows running through
the PV AQM impacts the actual achieved throughput. When only one flow per TVF is sent,
a significant difference between the TVFs can be seen, but the result is still far from
the theoretical desired throughput. Between seconds 45 and 60, when sending eight flows
per TVF, the throughput is smoother and close to the desired value. As mentioned, this
is because dropping packets has limited control over two high-speed TCP flows compared
to 16 low-speed TCP flows.
Figure 8.5: Throughput when multiplying the number of flows every 15 seconds.
Figure 8.6: Throughput per TVF when multiplying the number of flows every 15 seconds.
As mentioned earlier, the target queuing delay for the PV AQM was set to 2 ms. In
Figure 8.7, the queuing delay is displayed over the 60 s test described earlier. At the
start, the queuing delay grows to its maximum of just above 14 ms, which initiates
tail-drop. After about a second or less, the PV AQM has collected enough packet values
and updated the ECDF, which can now be used to decide which packets to drop according
to the current distribution of packet values. One part that is important to note for all
evaluation scenarios is that the PV AQM has just been initialized. Ideally, when running
the PV AQM, statistics would already have been collected, and the initial burst in
queuing delay should disappear. This can be concluded by looking at the rest of the
test, disregarding the first second: even though more TCP flows are added, no large
burst in queuing delay is seen. From this point on, the queuing delay stabilizes at
around 2 ms. Small bursts can still be seen in the queuing delay throughout the test,
caused either by new flows being started or by the distribution of received packet
values not being close enough to the current ECDF. In the latter case, the CTV is set
too low and not enough packets are dropped.
Figure 8.7: Queue delay when multiplying the number of flows every 15 seconds.
In Figure 8.8, the evolution of CTVs is plotted for different numbers of TCP flows.
As seen when comparing the values, the most common threshold value gets larger when
more TCP flows are added. This is due to the marker having a higher probability of
marking packets with a higher packet value when the throughput rate of each flow is
lower. This is why, with two flows, the most common value is around 25000, while with
16 flows, a CTV around 30000 is more common. Note that the y-axis is scaled to show all
CTVs between 20000 and 40000. In reality, the most common CTV is 0 (i.e., forwarding
all packets), but the more interesting values are above 0, when the AQM actively drops
packets below the threshold value.
Figure 8.8: CTVs when multiplying the number of flows every 15 seconds.
The previous figures show how the threshold value that is set for the data plane
increases with the number of TCP flows that are sent. The reason is that the received
packet values are higher when each flow has a lower throughput, as defined by the TVF
and marking concept. In Figure 8.9, four ECDFs are plotted for different numbers of
flows sent through the switch. The data points for the different numbers of flows were
not collected during a single continuous traffic scenario. Instead, four separate tests
were run, each with a different number of flows, which shows more precisely how the
marked packet values change depending on the number of flows sent. The data used to
calculate the ECDFs are the packet value statistics accumulated over the whole run. The
results show that all packet values received from the marker are between 20000 and
37500, as defined by the gold and silver TVFs presented in Figure 8.2. These ECDFs
represent the distribution of packet values that have been collected from the histogram
counter in the data plane. The CTV update uses these stored ECDFs to match a newly
calculated dropping probability to a CTV. This allows the AQM to drop the correct packet
values, and the right amount of packets, by keeping a history of the current distribution
of packet values.
Figure 8.9: ECDF for a different number of flows.
9 Conclusion
This thesis has presented the design and implementation of a PV AQM for resource-sharing
and low latency on the programmable Barefoot Tofino switch. With knowledge about the
limitations of the programmable switch architecture, it is possible to configure and
manage the processing of packets to control the traffic flow. The PV AQM can be moved
onto a Tofino switch and run without any software beyond what is already present on the
switch. To achieve the resource-sharing shown previously, a packet value marker with
predefined TVFs is needed. With the achieved resource-sharing, different TCP connections
or flows can be prioritized over others, decided by the packet values marked on packets
according to a specific TVF.

One important aspect that has to be mentioned is the limitations of the current PV AQM.
In a practical scenario, the AQM would have to scale to handle many ports at the same
time. The current evaluation only focuses on a single port, but in reality the AQM would
have to keep packet value statistics from multiple ports while simultaneously calculating
new CTVs and ECDFs for all of them. This could be a problem because of the latency of
control and data plane interaction, and because of the limited processing capabilities
in the control plane, which cannot handle all ports at the same time.
10 Future Work
Due to the limited amount of time left in the project, there was not enough time to
implement the control plane interaction using one of the lower-level APIs. Using one of
these APIs instead of bfrt python or grpc python could reduce the processing time of the
algorithms. It could also reduce the DMA (Direct Memory Access) times for reading,
modifying, and resetting registers, counters, and tables. This would be interesting for
further increasing the performance of the PV AQM and making it more practical for
industrial use, where it can be applied to a large number of ports.
References
[1] M. Menth, H. Mostafaei, D. Merling, and M. Haberle, "Implementation and evaluation of activity-based congestion management using P4 (P4-ABC)," Future Internet, vol. 11, p. 159, 07 2019.
[2] ELTE, "PPV - Per Packet Value." http://ppv.elte.hu/, 2020.
[3] S. Laki, G. Gombos, S. Nadas, and Z. Turanyi, "Take your own share of the pie," in Proceedings of the Applied Networking Research Workshop, pp. 27–32, 2017.
[4] M. Shaker, "A dataplane programmable traffic marker using packet value concept," 2020.
[5] J. Gettys and K. Nichols, "Bufferbloat: Dark buffers in the internet," Queue, vol. 9, no. 11, pp. 40–54, 2011.
[6] K. De Schepper, O. Bondarenko, I.-J. Tsang, and B. Briscoe, "PI2: A linearized AQM for both classic and scalable TCP," in Proceedings of the 12th International Conference on emerging Networking EXperiments and Technologies, pp. 105–119, 2016.
[7] S. Floyd and V. Jacobson, "Random early detection gateways for congestion avoidance," IEEE/ACM Transactions on Networking, vol. 1, no. 4, pp. 397–413, 1993.
[8] K. Nichols and V. Jacobson, "Controlling queue delay," Communications of the ACM, vol. 55, no. 7, pp. 42–50, 2012.
[9] R. Pan, P. Natarajan, C. Piglione, M. S. Prabhu, V. Subramanian, F. Baker, and B. VerSteeg, "PIE: A lightweight control scheme to address the bufferbloat problem," in 2013 IEEE 14th International Conference on High Performance Switching and Routing (HPSR), pp. 148–155, IEEE, 2013.
[10] R. Rajkumar, C. Lee, J. Lehoczky, and D. Siewiorek, "A resource allocation model for QoS management," in Proceedings Real-Time Systems Symposium, pp. 298–307, IEEE, 1997.
[11] F. Agboma and A. Liotta, "QoE-aware QoS management," in Proceedings of the 6th International Conference on Advances in Mobile Computing and Multimedia, pp. 111–116, 2008.
[12] S. Nadas, Z. R. Turanyi, and S. Racz, "Per packet value: A practical concept for network resource sharing," in 2016 IEEE Global Communications Conference (GLOBECOM), pp. 1–7, IEEE, 2016.
[13] A. Sivaraman, S. Subramanian, M. Alizadeh, S. Chole, S.-T. Chuang, A. Agrawal, H. Balakrishnan, T. Edsall, S. Katti, and N. McKeown, "Programmable packet scheduling at line rate," in Proceedings of the 2016 ACM SIGCOMM Conference, pp. 44–57, 2016.
[14] X. Jin, X. Li, H. Zhang, N. Foster, J. Lee, R. Soule, C. Kim, and I. Stoica, "NetChain: Scale-free sub-RTT coordination," in 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), pp. 35–49, 2018.
[15] X. Jin, X. Li, H. Zhang, R. Soule, J. Lee, N. Foster, C. Kim, and I. Stoica, "NetCache: Balancing key-value stores with fast in-network caching," in Proceedings of the 26th Symposium on Operating Systems Principles, pp. 121–136, 2017.
[16] R. Joshi, T. Qu, M. C. Chan, B. Leong, and B. T. Loo, "BurstRadar: Practical real-time microburst monitoring for datacenter networks," in Proceedings of the 9th Asia-Pacific Workshop on Systems, pp. 1–8, 2018.
[17] N. Feamster and J. Rexford, "Why (and how) networks should run themselves," arXiv preprint arXiv:1710.11583, 2017.
[18] Barefoot Networks, "Barefoot Tofino." https://www.barefootnetworks.com/technology/#tofino, 2020.
[19] S. Nadas, G. Gombos, F. Fejes, and S. Laki, "A Congestion Control Independent L4S Scheduler," in Proceedings of the Applied Networking Research Workshop, pp. 45–51, 2020.
[20] P. Voigt and A. Von dem Bussche, "The EU General Data Protection Regulation (GDPR)," A Practical Guide, 1st Ed., Cham: Springer International Publishing, 2017.
[21] J. Kramer, L. Wiewiorra, and C. Weinhardt, "Net neutrality: A progress report," Telecommunications Policy, vol. 37, no. 9, pp. 794–813, 2013.
[22] K. Benzekki, A. El Fergougui, and A. Elbelrhiti Elalaoui, "Software-defined networking (SDN): a survey," Security and Communication Networks, vol. 9, no. 18, pp. 5803–5833, 2016.
[23] P4 Language Consortium et al., "P4_16 language specification," Version, vol. 1, no. 0, p. 16, 2017.
[24] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese, et al., "P4: Programming protocol-independent packet processors," ACM SIGCOMM Computer Communication Review, vol. 44, no. 3, pp. 87–95, 2014.
[25] P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McKeown, M. Izzard, F. Mujica, and M. Horowitz, "Forwarding metamorphosis: Fast programmable match-action processing in hardware for SDN," ACM SIGCOMM Computer Communication Review, vol. 43, no. 4, pp. 99–110, 2013.
[26] Netronome, "Agilio CX SmartNICs." https://www.netronome.com/products/agilio-cx/, 2020.
[27] J. Postel et al., "Transmission control protocol," 1981.
[28] M. Allman, V. Paxson, W. Stevens, et al., "TCP congestion control," 1999.
[29] Mininet, "Introduction to Mininet." https://github.com/mininet/mininet/wiki/Introduction-to-mininet, 2020.
[30] J. Postel et al., "Internet protocol," 1981.
[31] B. Briscoe, K. De Schepper, and M. Bagnulo, "Low latency, low loss, scalable throughput (L4S) internet service: Architecture," Internet Engineering Task Force, Internet-Draft draft-briscoe-tsvwg-l4s-arch-02, 2017.
[32] J. Dugan, S. Elliott, B. A. Mah, J. Poskanzer, and K. Prabhu, "iperf3, tool for active measurements of the maximum achievable bandwidth on IP networks," URL: https://github.com/esnet/iperf, 2014.
[33] T. Høiland-Jørgensen, C. A. Grazia, P. Hurtig, and A. Brunstrom, "Flent: The flexible network tester," in Proceedings of the 11th EAI International Conference on Performance Evaluation Methodologies and Tools, pp. 120–125, 2017.
[34] "Welcome to Python.org." https://www.python.org/doc/.
[34] “Welcome to python.org.” https://www.python.org/doc/.
Acronyms
API Application Programmable Interface
AQM Active Queue Management
CTV Congestion Threshold Value
ECN Explicit Congestion Notification
ECDF Empirical Cumulative Distribution Function
FIFO First In First Out
HTTP Hypertext Transfer Protocol
ID Identification
IP Internet Protocol
IPv4 Internet Protocol version 4
L4S Low Latency Low Loss Scalable Throughput
NIC Network Interface Card
P4 Programming Protocol-Independent Packet Processors
PPV Per-Packet Value
PV Packet Value
PIE Proportional Integral Controller Enhanced
QoS Quality of Service
SDN Software-Defined Networking
TCP Transmission Control Protocol
TVF Throughput Value Function
UDP User Datagram Protocol
A Throughput and delay with up to 40 flows per TVF
Figure A.1: Multiplying the number of flows with 2 every 15 seconds, starting with 5 flows per TVF.
B Silver flow with 8 times less throughput
Figure B.1: 2 gold, 2 silver flows using a silver TVF with 8 times less throughput than the gold TVF.
C Reading Register Instances
Figure C.1: Reading register instances from switch hardware. Reading times grow linearly with the number of instances.
D Pseudo code
D.1 Linear search
def linear_search(a, item):
    # Return the index of the first ECDF entry >= item, or -1 if none.
    for i in range(len(a)):
        if a[i] >= item:
            return i
    return -1
D.2 Binary search
def binary_search(a, item):
    # Lower-bound search: return the index of the first ECDF entry >= item,
    # or -1 if no entry is large enough. This matches the result of the
    # linear search on a sorted ECDF, since the dropping probability rarely
    # equals an ECDF entry exactly.
    beg, end = 0, len(a) - 1
    ret = -1
    while beg <= end:
        mid = (beg + end) // 2
        if a[mid] >= item:
            ret = mid
            end = mid - 1
        else:
            beg = mid + 1
    return ret
D.3 ECDF Calculation
def calculate_ecdf(a, bin_len):
    # a: packet value histogram counters, one per histogram range.
    # bin_len: the number of packet values covered by each range.
    # Returns the ECDF over all packet values, spreading each counter
    # evenly across the packet values in its range.
    s = sum(a)
    ret = [0.0] * (len(a) * bin_len)
    if s == 0:
        return ret
    idx = 0
    part_sum = 0.0
    for i in range(len(a)):
        pkts_per_pv = a[i] / bin_len
        for j in range(bin_len):
            part_sum += pkts_per_pv
            ret[idx] = part_sum / s
            idx += 1
    return ret