Making a Packet-value Based AQM on a Programmable Switch for Resource-sharing and Low Latency
Ludwig Toresson
Faculty of Health, Science and Technology
Subject: Computer Science
Points: 30 hp
Supervisor: Andreas Kassler
Examiner: Karl-Johan Grinnemo
Date: 210125
Making a Packet-value Based AQM on a
Programmable Switch for Resource-sharing
and Low Latency
Ludwig Toresson
<ludwig [email protected]>
© 2021 The author(s) and Karlstad University
Abstract
A rapidly growing number of advanced applications running over the internet require
ultra-low latency and high throughput. Bufferbloat is one of the best-known problems,
adding delay in the form of packets being enqueued into large buffers before being
transmitted. It has been addressed by the development of various Active Queue Management
(AQM) schemes that control how large the queue buffers are allowed to grow. Another
important aspect today is how the available bandwidth can be shared between applications
with different priorities. The Per-Packet Value (PPV) concept has been presented as a
solution for resource-sharing: packets are marked according to predefined marking policies,
and the packet value is taken into consideration in drop/mark decisions, so that higher
packet values are prioritized at bottleneck links.
In this thesis, a design of a packet value-based AQM on a programmable Barefoot
Tofino switch is presented. It uses a combination of the Proportional Integral controller
Enhanced (PIE) AQM scheme and the PPV concept to make drop decisions when queuing
delay is detected. Packet value statistics are collected in the P4 programmable data
plane to maintain knowledge of the distribution of packet values. With the dropping
probability calculated through the PIE AQM scheme, a decision can be made about which
packets should be dropped.
An evaluation shows that with the implemented PV AQM, a low queuing delay can
be achieved by dropping an appropriate number of packets. It also shows that the PV
AQM controls the resource-sharing between different traffic flows according to a predefined
marking policy.
Keywords— PPV, PIE, SDN, AQM, Resource-sharing
Sammanfattning
There is a rapidly growing number of advanced applications running over the internet
that require extremely low latency and high throughput. Bufferbloat is one of the
best-known problems, causing delay in the form of packets being placed in large buffers
before being forwarded. This has been addressed by the development of various Active
Queue Management (AQM) schemes to control how large the queue buffers may grow.
Another important aspect today is how the available bandwidth can be shared between
applications with different priorities. The Per-Packet Value (PPV) concept has been
presented as a solution for resource-sharing by marking packets according to predefined
marking policies. The packet value is taken into account when making drop/mark
decisions, which leads to higher packet values being prioritized at bottleneck links.
In this thesis, a design of a packet value-based AQM on a programmable Barefoot
Tofino switch is presented. It uses a combination of the Proportional Integral controller
Enhanced (PIE) AQM scheme and the PPV concept to make drop decisions when queuing
delay is detected. Packet value statistics are collected in the P4 programmable data
plane to maintain knowledge of the distribution of packet values. With the probability
calculated through the PIE AQM scheme, a decision can be made about which packets
should be dropped.
An evaluation shows that with this implemented AQM, a low queuing delay can be
achieved by dropping an appropriate number of packets. It also shows that the AQM
controls the resource-sharing between different traffic flows according to a predefined
marking policy.
Acknowledgement
I would like to thank Prof. Andreas Kassler, my supervisor at Karlstad University, for his
support, for the possibility to work on this project, and for his great feedback and ideas.
I would also like to thank Jonathan Langlet for providing initial guidance to get started
with the hardware setup; my colleague Maher Shaker for making this project more enjoy-
able and for his support; Szilveszter Nadas at Ericsson Research for his feedback and
knowledge; and, finally, my family for their never-ending support.
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Objectives and Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Ethics and Sustainability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 4
2.1 Software-Defined Networking . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 P4: Programming the data plane . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.1 Architecture Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.2 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.3 Control Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.4 Deparsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Related Work 9
3.1 PIE AQM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 PPV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.3 PVPIE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4 Design of the PV AQM 13
4.1 Design Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.2 Design Challenges and Decisions . . . . . . . . . . . . . . . . . . . . . . . . 15
4.3 PV AQM Design Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3.1 Uniform Packet Value Histograms . . . . . . . . . . . . . . . . . . . 19
4.3.2 Packet Value Distribution and ECDF . . . . . . . . . . . . . . . . . 21
5 Implementation 22
5.1 Data Plane Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.1.1 P4 Ingress Parser . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.1.2 P4 Ingress Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.1.3 P4 Ingress Deparser . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.1.4 P4 Egress Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.2 Control Plane Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.2.1 CTV Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.2.2 Inverse ECDF Update . . . . . . . . . . . . . . . . . . . . . . . . . 27
6 Tools 28
6.1 Control Plane Interaction API . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.2 Pipeline Traffic Manager API . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.3 Iperf3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.4 Flent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.5 Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
7 Control Plane Measurements 30
7.1 Histogram Registers vs. Counters . . . . . . . . . . . . . . . . . . . . . . . 31
7.2 CTV Calculations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
7.3 Packet Value Distribution Update . . . . . . . . . . . . . . . . . . . . . . . 38
8 Evaluation 39
8.1 Evaluation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
8.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
8.2.1 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
8.2.2 Queuing delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
8.2.3 Resource-sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
8.3 Evaluation of Traffic Without AQM . . . . . . . . . . . . . . . . . . . . . . 43
8.4 Evaluation of the PV AQM With Uniform Ranges . . . . . . . . . . . . . . 45
9 Conclusion 51
10 Future Work 51
References 52
A Throughput and delay with up to 40 flows per TVF 57
B Silver flow with 8 times less throughput 58
C Reading Register Instances 59
D Pseudo code 60
D.1 Linear search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
D.2 Binary search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
D.3 ECDF Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
List of Figures
2.1 P4 packet processing pipeline. Courtesy of Menth et al. [1]. . . . . . . . . 6
2.2 Parser example state diagram . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1 Throughput value functions. Courtesy of ELTE [2]. . . . . . . . . . . . . . 11
3.2 PVPIE scheme. Courtesy of Laki et al. [3]. . . . . . . . . . . . . . . . . . . 13
4.1 Abstract overview of the control plane loops. . . . . . . . . . . . . . . . . . 18
4.2 Data- and control plane interaction. Interval TA: update ECDF; interval TB: update CTV. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3 Uniform packet value ranges . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.4 ECDF diagram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.1 P4 Ingress processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
7.1 Reading 128 register values from hardware versus from software. . . . . . . 33
7.2 Reading 512 register values from hardware versus from software. . . . . . . 34
7.3 Difference between reading 512 values from a register versus counter. . . . 34
7.4 CTV with linear search. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
7.5 CTV with linear search (scaled). . . . . . . . . . . . . . . . . . . . . . . . 36
7.6 CTV calculations with binary search (scaled). . . . . . . . . . . . . . . . . 37
7.7 Total time to update CTV. . . . . . . . . . . . . . . . . . . . . . . . . . . 37
7.8 Time to get 256 histogram counters. . . . . . . . . . . . . . . . . . . . . . 38
7.9 Time to update the ECDF curve. . . . . . . . . . . . . . . . . . . . . . . . 39
8.1 Evaluation setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
8.2 Marker TVF functions. Courtesy of Maher Shaker [4]. . . . . . . . . . . . . 42
8.3 Multiplying the number of flows with two every 15 seconds without AQM. 44
8.4 Queue delay without AQM. . . . . . . . . . . . . . . . . . . . . . . . . . . 45
8.5 Throughput when multiplying the number of flows every 15 seconds. . . . . 47
8.6 Throughput per TVF when multiplying the number of flows every 15 seconds. 47
8.7 Queue delay when multiplying the number of flows every 15 seconds. . . . 48
8.8 CTVs when multiplying the number of flows every 15 seconds. . . . . . . . 49
8.9 ECDF for a different number of flows. . . . . . . . . . . . . . . . . . . . . . 50
A.1 Multiplying the number of flows with 2 every 15 seconds starting with 5
flows per TVF. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
B.1 2 gold, 2 silver flows using a silver TVF with 8 times less throughput than
the gold TVF. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
C.1 Reading register instances from switch hardware. Reading times grow linearly with the number of instances. . . . . . . . . . . . . . . . . . . . . . . 59
List of Tables
8.1 Client/server setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
8.2 Evaluation setup parameters. . . . . . . . . . . . . . . . . . . . . . . . . . 41
8.3 PV AQM parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
1 Introduction
Today's Internet is very complex, with many different applications running over it, and
these applications need different constraints. One major problem discussed for years has
been the bufferbloat problem [5], where buffer queues on the internet are allowed to grow,
which creates unnecessary delays. Applications like remote brain surgery and financial
trading require ultra-low delays, and networking devices such as switches handling regular
TCP without any effective AQM at bottlenecks can cause users considerable delays. Various
AQM schemes such as PI2 [6], RED [7], CoDel [8] and PIE [9] have been developed as
solutions to the bufferbloat problem, removing much of the queue delay added by large
queue buffers.
Quality of Service (QoS) management [10, 11] and resource-sharing are another important
research area. Different types of network traffic have different throughput and latency
requirements. Per-Packet Value (PPV) [12] is a concept created for QoS management and
resource-sharing, defining how the available bandwidth should be shared among traffic
flows. With PPV, packets are marked at an edge node with a value according to
operator-defined marking policies, which can assign different priorities to different traffic
types. This concept has been combined with the PIE AQM scheme into a packet-value
based PIE AQM (PVPIE) that provides both queue delay control and resource-sharing:
at the AQM node, the packet values are taken into account when making drop decisions.
Implementing the PVPIE concept on a fixed-function commodity switch would be difficult
due to the non-configurable packet processing pipeline. This thesis instead builds on the
emerging concept of Software-Defined Networking (SDN). With data plane programmable
switches, a programmer can modify how packets are processed through the packet
processing pipeline without any modification of the hardware [13].
These concepts are possible due to the programmable switches now being developed and
released. The programmability stems from the reconfigurable match-action pipeline, where
packet processing can be defined to perform different tasks depending on the application.
Previously, the core scheduling functionality was not modifiable; at most, a switch offered
a choice among a few scheduling algorithms. With programmable switches, however, new
algorithms can be developed from scratch, released publicly, and implemented by whoever
wants to apply them. Multiple recent papers [14, 15, 16, 17] present the possibilities
offered by the programmable Barefoot Tofino [18] switch. These papers cover different
networking concepts, which shows the flexibility of implementation opportunities for
programmable switching functionality.
1.1 Motivation
The motivation for this thesis relates to previous implementations of a PVPIE AQM,
which has so far only been implemented and tested in simulated [3] or emulated [19]
environments. In this thesis, a proposed design and implementation of a PVPIE AQM on
a programmable Barefoot Tofino switch is tested and evaluated. The finished implementation
shows the near-future possibilities of flexible and programmable networking devices,
where decisions about network functionality are transferred from the device manufacturers
to the network operators.
1.2 Objectives and Goals
The objective of this thesis is to implement a PV AQM, using the PVPIE concept, on
a programmable Barefoot Tofino switch. This entails studying the architecture of the
target switch, i.e., the Barefoot Tofino switch, to design a solution that provides
resource-sharing and low queuing delay. Programmable networking devices have architectural
limitations, such as the number of processing cycles available for each packet processed
by the device. When defining the data plane (the packet processing pipeline), the
possibilities are limited so that the device can sustain line rate, e.g., 100 Gb/s in the case
of the Barefoot Tofino switch. Consequently, complex operations such as statistical packet
analysis can be offloaded to the control plane (the local CPU on the Barefoot Tofino
switch) instead.
In this thesis the following goals will be accomplished:
• Introduce a design for the PV AQM that can be implemented on the Barefoot Tofino
switch, e.g., what functionalities (algorithms, memory accesses, etc) can be performed
in the data plane, and what needs to be performed in the control plane.
• Present how the PV AQM can be implemented on the Barefoot Tofino switch, e.g.,
the re-configurable data plane, and how functionalities are offloaded to the control
plane.
• Evaluate the precision of the implemented PV AQM on the Barefoot Tofino switch
with respect to resource-sharing and queuing delay.
1.3 Ethics and Sustainability
From an ethical perspective, it is important to notify users about the data that is collected
from packet headers in the data plane. Network operators managing and re-configuring
programmable devices such as the Barefoot Tofino switch need to comply with, for example,
data protection laws such as the GDPR [20].
Another important ethical dilemma is the concern about net neutrality [21] over the
internet. One of the goals of this thesis is to achieve resource-sharing by applying network
operator-defined marking policies that prioritize certain traffic flows. It is important to
understand that such resource-sharing policies can be seen as going against the concept
of net neutrality.
From a sustainability standpoint, the PV AQM implementation will provide one more
solution to the bufferbloat problem. It will also provide traffic scheduling that applies
resource-sharing through the per-packet value concept without losing any of the available
bandwidth.
The programmability of networking devices such as the Barefoot Tofino switch enables
re-configuration, which improves sustainability through the longevity of such devices.
If a networking device needs modification for a specific purpose, no new hardware has to
be bought: the network operator can re-configure the device by applying software changes
instead of buying a new application-specific device.
1.4 Thesis outline
In Section 2, shorter descriptions of the concepts needed to grasp the extent of the thesis
will be presented. In Section 3, the concepts of PPV and PIE will be presented. In Section
4, the design decisions will be introduced, with the related challenges. In Section 5, the
final implementation will be presented in detail. In Section 6, all of the essential tools
used during the thesis will be presented. In Section 7, time measurements are presented
for API executions (Read/Write) and algorithms. In Section 8, the evaluation of the PV
AQM implementation will be presented. The results of resource-sharing and queuing delay
will be shown for when the PV AQM actively controls the traffic flow. In Section 9,
conclusions are drawn from the results, together with proposed work that could further
develop and optimize the PV AQM.
2 Background
In this section, the background needed to understand the scope of the thesis is presented:
the concept of SDN, in which programmability is applied to networking devices to allow
flexible networking functionality, and the P4 programming language, which allows
networking functionality to be defined in an abstract way.
2.1 Software-Defined Networking
In traditional networking, devices are so-called fixed-function: hardware-based and
application-specific, designed and manufactured for a single purpose. In contrast, SDN [22]
is a network architecture that moves the control plane logic out of the forwarding devices
to a centralized location (the controller). By logically centralizing network state
management, many new applications and use-cases become possible. The data plane focuses
on forwarding packets, while the control plane configures the network, telling the data
plane how to handle traffic, e.g., by setting flow tables and data handling policies. In
classic networks, routing decisions are made in a decentralized fashion by the individual
devices; with SDN, these decisions can be centralized.
2.2 P4: Programming the data plane
P4 [23, 24] is a programming language created for the purpose of defining the data plane of
programmable networking devices. This language can be used to define packet processing
in, for example, switches, routers, and Network Interface Cards (NICs). The data plane of
a programmable device will be defined during initialization by the P4 language, in contrast
to traditional devices where the data plane will be fixed-function and not re-configurable.
Recently, researchers have designed ASICs with re-configurable hardware, based on the
concept of Reconfigurable Match Tables (RMTs) [25]. With these configurable ASICs
and the P4 language and compiler, the hardware logic can be modified to perform new
functionalities. The P4 compiler also generates the P4 runtime API through which the
control plane accesses the tables and other objects defined in the P4 code.
2.2.1 Architecture Model
An abstract view of the P4 packet processing pipeline is presented in Figure 2.1. The first
important object in the pipeline is the parser, where packet data is extracted from the
incoming packet. After the parser there is a match-action block called the ingress pipeline,
where packets can be modified and where forwarding rules are applied to decide to which
output buffer the packet should be sent. When the packet has been dequeued from the
buffer, it is processed by a second match-action block called the egress pipeline, where
further modifications can be applied if needed. The last object in the P4 pipeline is the
deparser, where packet data is inserted into the packet before it is sent.
Figure 2.1: P4 packet processing pipeline. Courtesy of Menth et al. [1].
The P4 architecture consists of multiple objects that are used to define the P4 packet
processing pipeline. Among the most important are the header structures, which hold
information about each packet's header fields and sizes. The architecture also provides
multiple extern objects, constructs that can be accessed through an API but are not
themselves programmable. Examples of such objects are counters (counting packets or
packet sizes), digests (structures for sending data from the data plane to the control plane),
etc. These objects are target dependent: an extern object that exists on the Barefoot
Tofino [18] switch may not exist on, for example, the Netronome SmartNIC [26], another
P4 programmable networking device. Another useful part of the P4 architecture is
user-defined metadata, i.e., data structures defined through the P4 language for each
packet. The architecture also provides intrinsic metadata, which carries information about
each packet, for example the time a packet has spent in a queue/buffer, or the time at
which the packet was enqueued or dequeued. Tables are another user-defined object, used
to match a key value to an output value. Tables can, for example, apply forwarding rules,
such as deciding on which output port a packet with a specific destination IP address
should be sent. Finally, the P4 architecture lets a control flow be defined for the packet
processing pipeline, i.e., parsing, ingress processing, egress processing, checksum
calculation, deparsing, etc.
2.2.2 Parsing
P4 uses a construct called a parser, which functions as a state machine that collects
data fields from incoming network packets. A parser begins in a state called "start" and
has two finishing states, "accept" and "reject", in which the packet is accepted or rejected,
respectively.
See Figure 2.2 for an example parser. As seen in the figure, the starting state of the
P4 parser checks the hdr.ethernet.etherType field to see if the next header type is 0x800,
which corresponds to the IPv4 header. The parser then checks whether the IPv4 header
length is 5; if not, it continues parsing additional IPv4 header option fields until it finally
ends up in the accept state. Once the parser has finished in the accept state, the header
fields and metadata are accessible from the packet processing pipeline.
Figure 2.2: Parser example state diagram
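The state machine in Figure 2.2 can be mimicked in ordinary code. The sketch below is a simplified Python illustration, not P4: only the etherType and IPv4 IHL checks from the figure are modeled, and option parsing is reduced to skipping the extra 32-bit words.

```python
import struct

ETHERTYPE_IPV4 = 0x0800

def parse(packet: bytes):
    """Mimic the parser state machine from Figure 2.2.

    Returns ("accept", headers) or ("reject", reason)."""
    if len(packet) < 14:
        return "reject", "truncated ethernet header"
    # The "start" state inspects hdr.ethernet.etherType (bytes 12-13).
    ethertype = struct.unpack("!H", packet[12:14])[0]
    if ethertype != ETHERTYPE_IPV4:
        return "reject", "not IPv4"
    if len(packet) < 15:
        return "reject", "truncated IPv4 header"
    ihl = packet[14] & 0x0F          # IPv4 header length in 32-bit words
    if ihl < 5 or len(packet) < 14 + ihl * 4:
        return "reject", "bad IPv4 header length"
    # ihl == 5: no options; ihl > 5: "parse" (skip) the option words,
    # after which the machine reaches the accept state.
    headers = {"etherType": ethertype, "ihl": ihl,
               "payload_offset": 14 + ihl * 4}
    return "accept", headers
```

As in the P4 parser, reaching "accept" makes the extracted header fields available to the rest of the processing, while any failed check short-circuits to "reject".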
2.2.3 Control Blocks
Within a control block, fields such as header or metadata fields can be used and
manipulated. Match-actions can be called to match values in tables to a specific output
value. It is common to instantiate an ingress and an egress control block for switching
functionality; in the ingress control block, for example, a forwarding match-action is
applied to send packets to the correct output port. Control blocks are instantiated with
a name, input/output parameters, constants, variables, match tables, and actions.
2.2.4 Deparsing
The deparser in the P4 programming language constructs the packet that is to be sent
out from the programmable networking device. Depending on what was extracted from
the packet during parsing, data fields and headers can either be emitted into the packet
or left out. During deparsing, some or all headers can be emitted into the packet again,
depending on the purpose of the data plane processing. For example, an additional header
on top of the Ethernet header can be extracted during parsing and then not emitted during
deparsing, thereby removing it from the outgoing packet.
3 Related Work
In this section, the work related to this thesis is presented. The PPV and PIE AQM
concepts are introduced in separate sections to explain the reasoning behind each. Finally,
the PVPIE concept is introduced, in which the lightweight PIE AQM is combined with
the PPV resource-sharing concept to maintain a low queuing delay while applying
resource-sharing policies that prioritize traffic flows.
3.1 PIE AQM
The Proportional Integral Controller Enhanced (PIE) AQM [9] computes a dropping
probability p, and packets are dropped at random with this probability during enqueuing.
The dropping is done to trigger TCP [27] congestion control [28]: when the TCP sender
detects a drop through lost acknowledgement messages, it slows down its sending rate
(the TCP congestion window). The main aim of the PIE algorithm is to maintain a certain
target queuing delay by observing whether the queue is growing or shrinking; if it is
growing, the dropping probability should intuitively be increased in order to maintain the
desired target queuing delay.

The dropping probability p is updated in three steps. First (1), the current queuing
delay is estimated using Little's law: cur_del = q_len / avg_d_r, where q_len is the
length of packets in the queue and avg_d_r is the average drain rate at which packets are
dequeued. Second (2), the dropping probability is updated through the following formula:
p = p + α(cur_del − tar_del) + β(cur_del − old_del), where α determines how strongly
the deviation of the current queuing delay (cur_del) from the target queuing delay
(tar_del) affects the dropping probability, and β similarly determines the effect of the
deviation of the current queuing delay from the old queuing delay (old_del). Third (3),
the old queuing delay (old_del) is updated to the newly calculated queuing delay
(cur_del).
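The three update steps above can be sketched in Python. This is a minimal illustration, not the full PIE specification: the α, β, and target-delay values are illustrative assumptions, not the tuned constants from the PIE proposal.

```python
# Minimal sketch of the PIE probability update described above.
# alpha, beta and tar_del are illustrative values, not PIE's defaults.

class PieController:
    def __init__(self, alpha=0.125, beta=1.25, tar_del=0.015):
        self.alpha = alpha        # weight of deviation from target delay
        self.beta = beta          # weight of delay trend (growing/shrinking)
        self.tar_del = tar_del    # target queuing delay in seconds
        self.old_del = 0.0        # queuing delay from the previous interval
        self.p = 0.0              # dropping probability

    def update(self, q_len, avg_d_r):
        """Run one PIE update: q_len in bytes, avg_d_r in bytes/s."""
        # (1) Estimate the current queuing delay with Little's law.
        cur_del = q_len / avg_d_r if avg_d_r > 0 else 0.0
        # (2) Update the dropping probability from the two deviations.
        self.p += self.alpha * (cur_del - self.tar_del) \
                + self.beta * (cur_del - self.old_del)
        self.p = min(max(self.p, 0.0), 1.0)  # clamp to a valid probability
        # (3) Remember the current delay for the next interval.
        self.old_del = cur_del
        return self.p
```

Note that starting from old_del = 0, the very first update can raise p through the β (trend) term even when the measured delay equals the target; afterwards, p only grows while the delay exceeds the target or keeps increasing.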
3.2 PPV
PPV [12] is the concept of applying resource-sharing policies by marking packets with
packet values. The value expresses the relative importance of one flow over another; for
example, different flows can have different throughput or delay requirements. Packet
values are considered at a resource node in the network where flows share a bottleneck,
providing resource-sharing by scheduling or dropping packets based on the packet value.
The packet value can be as simple as a representation of a user's subscription level: a
user with a gold membership gets higher throughput than a user with, for example, a
silver membership. Packets with higher packet values are let through and transmitted at
a resource node, while packets with lower packet values are dropped or delayed when the
resource node is fully utilized.
The PPV concept uses Throughput Value Functions (TVFs) to apply different
resource-sharing policies to different flows. Examples of such TVFs can be seen in
Figure 3.1, which shows four different TVFs. A TVF maps a throughput value to the
packet value that will be marked into the packet. The throughput is calculated
independently for each flow to create fairness between flows: if a single flow marked with
the gold TVF in the figure has a higher throughput than the other gold-marked flows, it
will receive lower packet values. By default, higher throughput maps to lower packet
values, which leads to fairness between flows when the packet values are taken into account
in dropping decisions at a resource node. Another important aspect of the PPV concept
is that the throughput used for selecting a packet value is not the measured value itself
but a random one: a value is drawn uniformly between 0 and the calculated throughput,
and this value is used to look up the packet value in the TVF. This creates fairness
between different TVFs. Packets marked with the gold TVF will not always get a higher
packet value, but they will always have a greater chance of being marked with a higher
packet value than packets marked with, for example, the silver TVF. Thanks to the
randomly drawn value, flows marked with the silver TVF cannot be starved at the resource
node by the gold-marked flows.
Figure 3.1: Throughput value functions. Courtesy of ELTE [2].
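The marking step can be sketched as follows. This is a hedged illustration: the two TVF shapes below are invented linear placeholders, not the actual curves from Figure 3.1, and the units are assumptions.

```python
import random

# Hypothetical TVFs mapping a throughput sample (Mbps) to a 16-bit
# packet value. Real deployments use operator-defined curves such as
# those in Figure 3.1; these linear shapes are placeholders only.
def gold_tvf(rate_mbps):
    return max(0, 60_000 - int(rate_mbps) * 400)

def silver_tvf(rate_mbps):
    return max(0, 30_000 - int(rate_mbps) * 400)

def mark_packet(flow_rate_mbps, tvf, rng=random):
    """Pick the packet value for one packet of a flow.

    A throughput value is drawn uniformly in [0, flow_rate_mbps) so
    that a high-rate flow sometimes still receives a high packet value,
    which prevents lower-priority flows from being starved."""
    sample = rng.uniform(0.0, flow_rate_mbps)
    return tvf(sample)
```

For the same throughput sample, a gold-marked packet never gets a lower value than a silver-marked one, yet both spread over a range of values because of the random draw, exactly the fairness property described above.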
3.3 PVPIE
In this section, the concept of combining the PPV resource-sharing with the PIE AQM is
explained under the combined acronym PVPIE [3]. PVPIE follows the original PIE
specification, in which packets are dropped or ECN-marked at random according to a
calculated dropping probability p. In addition, packets are prioritized by their packet
value during the enqueuing phase: a packet with a high packet value is less likely to be
dropped than a packet with a lower packet value. The PVPIE concept can be applied to
achieve resource-sharing between different flows marked with different TVFs (shown in
Figure 3.1), while at the same time maintaining a low queuing delay.
A Congestion Threshold Value (CTV) is calculated at a time t based on the packet value
distribution observed during a recent time interval. The CTV is calculated with the
following formula:

• CTV(t) = ECDF^(-1)_[t−γT, t)(p(t))

The formula applies the inverse of an Empirical Cumulative Distribution Function (ECDF)
to the calculated dropping probability; the ECDF is computed over the window [t − γT, t)
and updated regularly at the time interval γ. If the packet value of a received packet is
less than the calculated CTV, the packet is dropped; otherwise, it is let through. Note
also that if the number of packets received during a time interval is less than 1/p, the
CTV is set to 0, and thus no packets are dropped. Figure 3.2 shows the PVPIE scheme:
at each time interval γ, the dropping probability p is calculated by the PIE controller,
and a new ECDF is calculated to describe the current distribution of packet values.
Finally, the calculated dropping probability p is mapped to a packet value V through the
ECDF, which is used as the current CTV; packets with packet values lower than the CTV
are dropped.
Figure 3.2: PVPIE scheme. Courtesy of Laki et al. [3].
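The per-packet rule of PVPIE can be sketched in a few lines of Python. The ECDF is simplified here to a plain list of cumulative probabilities indexed by packet value, and all names are illustrative rather than taken from an actual implementation:

```python
def ctv_from_ecdf(ecdf, p):
    """Inverse-ECDF lookup: the smallest packet value whose
    cumulative probability reaches the dropping probability p."""
    for pv, cum in enumerate(ecdf):
        if cum >= p:
            return pv
    return len(ecdf) - 1

def should_drop(packet_value, ctv):
    # PVPIE drops packets whose value is below the threshold.
    return packet_value < ctv
```

Under a CTV of 1, a packet with value 0 would be dropped while a packet with value 2 would pass.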
4 Design of the PV AQM
In this section, the design of the PV AQM is presented, along with the challenges of designing a suitable PV AQM and the decisions taken to solve them. Ideally, the PV AQM would be implemented with P4 code in the data plane only. The design needs to be suitable for the Barefoot Tofino switch, where the P4 code must, at compile time, respect the limitations of the target real-time system. The Barefoot Tofino switch only allows P4 programs to be compiled if they can provide packet processing at high speeds (i.e., 100 Gbps).
4.1 Design Overview
The initial goals of the AQM were to follow and implement the PVPIE concept on a
Barefoot Tofino switch in the data plane, as described in Section 3.3. This would entail
the following functional blocks:
• For every packet processed in the data plane:
– Count the packet size in bytes in a memory slot for the marked packet value, to continuously maintain statistics of the distribution of received packet values. This means having a separate memory slot for each packet value, in which all packets with that packet value are counted, giving a history of how many bytes of each packet value have been processed in the data plane. These memory slots are, for the rest of the thesis, referred to as histograms. As defined by the packet value marker [4], the packet value is marked into a 16-bit header field and can support up to 65536 unique packet values, which makes it necessary to design the AQM to count 65536 unique values. Each histogram has to be large enough to count the packet sizes of all packets received during the time interval T, at which point the histograms are used to calculate a new ECDF and CTV, and finally reset to 0. On the Barefoot Tofino switch architecture, the possible histogram sizes are 8-bit, 16-bit, or 32-bit. To avoid overflow when counting the packet sizes, each histogram can be defined as 32-bit, which makes it possible to count at least 4 GB for each packet value, compared to 64 KB with 16-bit histograms.
• Every T ms interval:
– Collect all histograms stored in memory to calculate an ECDF that describes the distribution of the packet values received during the last T ms. Each packet value has to be correlated with a probability describing how large a percentage of the total number of bytes received is marked with that packet value or lower (i.e., the cumulative probability from 0 up to that packet value).
– Calculate a dropping probability with the PIE formula: p = p + α·(cur_del − ref_del) + β·(cur_del − old_del).
– To evaluate the PIE formula, values from the previous calculation have to be stored in memory for use the next time the calculation is computed (i.e., the previously calculated dropping probability and the previous queuing delay). The other variables, such as α, β, and the target queuing delay, do not need to be updated or changed, which is why they can be defined as constants.
– When the new dropping probability has been calculated, it is used to find a new threshold value below which the AQM should drop packet values. In the rest of the thesis, this threshold value is called the Congestion Threshold Value (CTV). To calculate and update a new CTV, the dropping probability should be matched to the point in the ECDF that allows the AQM to drop approximately p percent of the incoming packets.
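The PIE probability update described above can be sketched as follows. The constants α, β, and the reference delay are placeholder tuning values for illustration, not the values used in the thesis:

```python
# Illustrative PIE controller step, following the formula in the text.
ALPHA = 0.125         # assumed proportional gain
BETA = 1.25           # assumed integral gain
REF_DELAY_MS = 15.0   # assumed target queuing delay

def pie_update(prev_p, cur_delay_ms, old_delay_ms):
    """p = p + alpha*(cur_del - ref_del) + beta*(cur_del - old_del),
    clamped to the valid probability range [0, 1]."""
    p = prev_p + ALPHA * (cur_delay_ms - REF_DELAY_MS) \
               + BETA * (cur_delay_ms - old_delay_ms)
    return min(max(p, 0.0), 1.0)
```

The previous probability and the previous delay are the two pieces of state that, as noted above, must be kept between invocations.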
4.2 Design Challenges and Decisions
As seen in the design overview, different functionalities and packet processing operations need to interact in order to implement the PVPIE concept. These operations could all be implemented in the data plane itself; alternatively, several of them can be moved to the control plane. On programmable devices such as Tofino, the data plane has restricted functionality that limits the complexity of operations, due to the real-time requirements of the target platform. The data plane is designed to handle a defined packet processing pipeline at 100 Gbps, which limits the number of processing cycles that can be spent on a single packet. Consequently, complex operations such as the ECDF calculations could be outsourced to the control plane. On the other hand, it is most natural to keep functionality in the data plane where it is most effective. For example, maintaining packet and traffic statistics requires counter operations to be executed on every single packet; such operations should therefore ideally be implemented in the data plane itself.
For example, when supporting many different packet values (e.g., 65536), a naive design would maintain histogram counters per individual packet value. Consequently, updating packet statistics in the data plane would require 65536 stateful memory cells, such as registers or counters, to be maintained and updated per queue. Synchronizing such a large amount of stateful memory from the data plane to the control plane may lead to large latency if the processing of those traffic statistics is outsourced to the control plane. This is due to two reasons. First, control plane processing is significantly slower than the data plane. Second, transferring the content of the stateful memory to the control plane may take several milliseconds (ms), too long for the control plane loops required in the design. Therefore, packet values are not counted in individual memory cells; instead, packet values are grouped into fewer memory cells (see Section 4.3.1 for the concept of uniform packet value histograms).
The first major challenge was to determine which parts of the PVPIE concept could be implemented in the data plane on Tofino. Following the PVPIE paper [3] presented in Section 3.3, a number of operations, like reading/writing memory (CTV, histograms, etc.) and mathematical calculations (the PIE algorithm), have to be completed in the data plane. The least complex part of PVPIE, which has to be done for each processed packet, is to count the size of a packet marked with a packet value into the correct histogram, in order to keep a history of the current distribution of received packet values. This is not a demanding operation. The challenge comes when it is time to update the currently used threshold value. The goal is to calculate a new ECDF and a new CTV every T ms in the data plane. This had been shown to be possible in Mininet [29], where a simulated switch can be defined with P4 but there is no limitation on what can fit inside the packet processing pipeline. In contrast, on a real programmable switch, these challenges became apparent early in the project.
What was initially planned, a strictly data plane controlled PV AQM, would not be possible within the scope of the project. Instead, functionalities such as the ECDF and CTV calculations, and the associated memory writes and reads, had to be moved from the data plane to the control plane, where the limitations are not as strict. In the data plane, they would have to be implemented in P4, which has limited expressibility and functionality. In the control plane, by contrast, any functionality can be implemented in C or Python, which allows for flexibility with the complex mathematical operations needed to, for example, calculate a distribution function for the received packet values.
In Figure 4.1, the proposed design of the control plane operations is presented. In the initial PVPIE paper [3], the CTV is calculated only when a new ECDF has been calculated. For the PV AQM design, two separate control plane loops are instead executed at different time intervals. The ECDF is calculated at a larger time interval TA, because it entails reading values from the data plane through API calls and then calculating a cumulative probability for each packet value to produce the ECDF. At a shorter time interval TB, a new CTV is calculated by matching the calculated dropping probability to a packet value in the ECDF. This allows the PV AQM to react more quickly to changes in the queuing delay without having to recalculate a new ECDF, which would take much longer.
In Figure 4.2, an overview of the planned PV AQM design is presented. It is separated into two parts: the operations completed in the control plane and the operations completed in the data plane. When a packet is received by the switch, a condition checks whether it is time to update the CTV or the ECDF, at which point a digest with the port and queue ID is sent to the control plane. A digest is an extern object in P4 used as a mechanism to send a message from the data plane to the control plane. When the digest is received in the control plane, a condition is checked that either initiates an update of the CTV (i.e., interval TB) or an update of the ECDF (i.e., interval TA) together with an update of the CTV. This condition depends on how many digests have been received in the control plane for a specific port and queue. In the
Figure 4.1: Abstract overview of the control plane loops.
figure, the histogram counters are shown; they are read through an API function to get the current distribution of packet values needed for an update of the ECDF. For the CTV update, the figure also shows the queuing delay register, which is read to obtain the current queuing delay from the data plane. After the update, the calculated CTV is written with an API function and used as the current threshold value in the data plane.
Figure 4.2: Data- and control plane interaction. Interval TA: update ECDF; interval TB: update CTV.
4.3 PV AQM Design Concepts
In this section, the concepts used for the design of the PV AQM are presented. The first concept, called uniform packet value histograms, is used to keep packet value statistics in the data plane; its main purpose is to limit the time it takes to read 65536 unique packet value statistics from the data plane to the control plane. This section also presents in more detail what the ECDF is and how it is used to correlate a dropping probability with a CTV.
4.3.1 Uniform Packet Value Histograms
Packet value histograms is the name used for the counters that count packet sizes in bytes for all packet values. When an update of the packet value distribution is executed, the control plane reads all counters corresponding to a specific port and queue. Due to the added delay of reading one counter per allowed packet value (65536 of them), these histograms are divided over fewer histogram counters. Which histogram counter a packet value corresponds to is decided by range-match table rules on the packet value.
When the control plane is initialized, ranges are decided that determine which packet values are counted into which histogram counter. At initialization, the ranges are equally wide: if 65536 different packet values are divided into 256 equally wide ranges, there are 256 (i.e., 65536/256) packet values per histogram counter.
As an example, in Figure 4.3, assume there are 4 histograms (counters 1, 2, 3, and 4), each counting 64 packet values. With equally wide ranges, counter 1 counts packet values 0 through 63, counter 2 counts packet values 64 through 127, and so on.
Figure 4.3: Uniform packet value ranges
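The uniform range mapping of Figure 4.3 can be sketched in Python; the constants are illustrative (65536 packet values and 256 counters as in the example above), and on the switch this mapping is realized with range-match table rules rather than arithmetic:

```python
NUM_PACKET_VALUES = 65536   # 16-bit packet value field
NUM_HISTOGRAMS = 256        # illustrative number of histogram counters

def histogram_index(packet_value):
    """Map a packet value to its histogram counter, mimicking the
    equally wide range-match rules installed at initialization."""
    width = NUM_PACKET_VALUES // NUM_HISTOGRAMS
    return packet_value // width
```

With 256 counters, each counter covers 256 consecutive packet values: values 0 through 255 map to counter 0, values 256 through 511 to counter 1, and so on.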
4.3.2 Packet Value Distribution and ECDF
Within an interval that fits the time constraints of the control plane, the histogram counters of packet values for a specific queue are collected from the data plane. From all of these histograms, a new ECDF is calculated.
In Figure 4.4, a uniform packet value distribution is converted into an ECDF and plotted. This is not how it would usually look in reality; it serves only as a simple example. Suppose this is the ECDF after an update, and the dropping probability is calculated to be 50%. The CTV algorithm then looks up which packet value correlates with a dropping probability of 50%. In this example, with a uniform distribution, it would be around 65536 · 0.5, i.e., half of the maximum allowed packet value. In theory, this would lead to 50% of all received packets being dropped.
Figure 4.4: ECDF diagram.
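Constructing the ECDF from the histogram byte counts amounts to a normalized cumulative sum. A minimal sketch, with a made-up eight-range uniform distribution as in the example above:

```python
def compute_ecdf(histograms):
    """Cumulative share of bytes per packet value range."""
    total = sum(histograms)
    ecdf, cumulative = [], 0
    for count in histograms:
        cumulative += count
        ecdf.append(cumulative / total)
    return ecdf

# A uniform distribution over 8 ranges: a dropping probability of
# 50% lands at the middle of the value range, as in the example.
ecdf = compute_ecdf([100] * 8)
```

With the uniform input, the cumulative probabilities rise in equal steps of 0.125, so the 50% point falls at the fourth of the eight ranges.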
5 Implementation
In this section, the final implementation of the PV AQM is presented. As mentioned in the design section, because of the limitations found, the implementation is divided into two parts: a control plane and a data plane. P4 was used to access the programmability of the Tofino [18] switch and to define the packet processing pipeline, while Python was used for the control plane interaction due to the available API needed for accessing data (reading and writing registers, counters, etc.) in the data plane.
5.1 Data Plane Implementation
In this section, each part of the implemented P4 packet processing pipeline is presented in detail. The different parts, such as parsing, ingress and egress processing, and deparsing, are presented in the order in which they are executed in the data plane.
5.1.1 P4 Ingress Parser
The ingress parser extracts the data needed by the ingress processing pipeline. The PV AQM needs specific header fields from the packet to function correctly. The most important header that needs to be extracted is the IPv4 header, which holds essential fields for forwarding, like the source and destination IP addresses. The IPv4 header also contains a 16-bit identification field [30], which usually holds information about the group of fragments a packet belongs to. In the PV AQM implementation, this IPv4 identification field holds the 16-bit packet value, which has been marked by the marker at an earlier stage. By default, this is where the packet value marker inserts the packet value, but in a more practical implementation the packet value should probably be inserted into an additional header or field that does not interfere with already established header fields like the IPv4 identification field.
5.1.2 P4 Ingress Processing
The ingress processing control block is where most of the AQM implementation resides,
apart from the control plane functionalities.
The following packet processing has been done within the ingress processing control
block (see Figure 5.1):
1. An exact match table is applied, and a destination IP address is matched to an egress
port number. This is what applies the forwarding rules for the test-bed used during
testing and evaluation.
2. An exact match table is applied to check if the packet is sent from an IP address
that gets marked with a packet value. If it hits, the rest of the AQM functionalities
are activated. This is used to apply the PV AQM functionalities only to flows that
are marked with a packet value.
3. An exact match table is applied to match the egress port and queue ID to an identification number. This identification is used to read the correct CTV for the port and queue, which allows the implementation to use individual CTVs for unique ports.
4. A register action is called to check whether a specific time (e.g., 1 ms in this implementation) has passed since the previous update of the CTV. It returns either one (time to update) or zero (not enough time has passed). This value is stored in a metadata field that tells the P4 ingress deparser whether a digest should be sent to the control plane to initiate an update of the CTV.
5. A second register action is called to get the index of the counter object in which to count the packet size. The action returns 0, 1, 2, or 3, corresponding to one of the four defined counters. The four counters are alternated between to remove the added delay of waiting for a counter to be reset by the control plane before counting into it again.
6. A range match table is applied to match the received packet value to an index in
the counter. This table consists of multiple packet value ranges, with each range
corresponding to a unique counter index (e.g., a unique packet value histogram).
7. A counter action is called on the correct counter index. This action adds the packet's size in bytes to the current value in the counter, so that the counter holds the statistics needed to calculate an ECDF describing the distribution of received packet values in bytes.
8. The ingress global timestamp metadata field is stored into a header field. This makes it possible to emit the header during deparsing and to access the timestamp in the egress processing block, where it is used to calculate the queuing delay (presented in Section 5.1.4).
9. A register action is called on the register holding the current CTV. The action returns one if the packet value marked in the packet is less than the CTV, and zero otherwise.
10. If the returned value is one, the packet is marked to be dropped; otherwise, it is enqueued.
Figure 5.1: P4 Ingress processing.
5.1.3 P4 Ingress Deparser
The ingress deparser has only one purpose besides emitting the packet headers: it checks whether the previously mentioned metadata field is set to one, which indicates that it is time for a CTV update. If so, a digest is sent with the port and queue ID for which the update should be performed.
5.1.4 P4 Egress Processing
The purpose of the egress processing block is to store the current queuing delay in a register. It also writes the delay into a packet header field to enable end-host post-processing of experimental data for traffic analysis. The first part of the egress processing applies an exact match index table on the egress port to find the correct register index at which to update the queuing delay. The queuing delay is then calculated by subtracting the timestamp sent from the ingress processing block from the current timestamp in the egress processing, and is stored in a register by calling a register action. The last part of the egress processing writes the delay to the packet header. This is done by right-shifting the 32-bit queuing delay value by eight bits and overwriting the 16-bit IPv4 identification field with this value. The value corresponds to the current queuing delay in nanoseconds divided by 256 (i.e., 2^8 = 256). This allows for analysis of delays of up to about 17 ms, in contrast to 0.07 ms without the bit shift.
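The delay encoding can be illustrated with a short Python sketch (function names are made up for illustration):

```python
def encode_delay(delay_ns):
    """Right-shift the 32-bit nanosecond delay by 8 so it fits the
    16-bit IPv4 identification field (units of 256 ns)."""
    return (delay_ns >> 8) & 0xFFFF

def decode_delay(field_value):
    """Recover the approximate delay in nanoseconds at the end host."""
    return field_value * 256

# Largest representable delay: 0xFFFF * 256 ns, about 16.8 ms.
max_delay_ns = 0xFFFF * 256
```

The shift trades resolution (256 ns granularity) for range, which is what extends the measurable delay from about 0.07 ms to about 17 ms.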
5.2 Control Plane Implementation
In this section, the control plane functionalities are presented in more detail. The two main purposes of the control plane are to collect packet value statistics to calculate an ECDF, and to calculate a dropping probability that is matched to a CTV which is then written to the data plane. All of the following functionalities are programmed in Python in combination with the available bfrt Python API.
5.2.1 CTV Update
The CTV update is the action of updating the threshold value used in the data plane to decide whether a packet should be dropped. The action is triggered every time a digest message is received in the control plane, which is approximately every 1 ms due to the data plane update timer. When a digest is received, the port and queue ID are unpacked from the digest and used to read (with the API register read function) the current queuing delay from the data plane at the index correlating with the IDs. When the queuing delay has been collected and converted into ms by dividing it by 1000, the CTV update function is called. The called function performs the following:
1. Calculate a new dropping probability with the PIE formula: p = p + α·(cur_del − ref_del) + β·(cur_del − old_del).
2. Check whether the calculated dropping probability is out of bounds (i.e., p < 0 or p > 1), and reset it to the closest boundary if so.
3. Calculate the classic TCP dropping probability according to the PI2 [6] formula: p = (p/2)^2. This is used in the PV AQM to restrict the dropping probability to a maximum of 25%; during testing, it worked better not to allow large, aggressive changes in the dropping probability.
4. Store the current queuing delay and dropping probability in an array data structure, to be used during the next CTV update as the previous queuing delay and previous dropping probability.
5. Binary search (see Appendix D.2 for pseudocode) through the current ECDF to find
the suitable CTV for the current packet value distribution.
6. Write (API register write function) the new CTV to the data plane for use as the
new threshold value for the particular port and queue.
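The squaring of step 3 and the search of step 5 can be sketched as follows. This is a sketch consistent with the description above, not the thesis' actual code (its pseudocode is in Appendix D.2), and the ECDF is simplified to a list of cumulative probabilities indexed by packet value:

```python
def classic_drop_prob(p):
    """PI2-style squaring: with p in [0, 1], the result is
    capped at 25%."""
    return (p / 2) ** 2

def find_ctv(ecdf, p):
    """Binary search for the smallest packet value whose cumulative
    probability in the ECDF reaches the dropping probability p."""
    lo, hi = 0, len(ecdf) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if ecdf[mid] < p:
            lo = mid + 1
        else:
            hi = mid
    return lo
```

Since the ECDF is monotonically non-decreasing, the binary search runs in logarithmic time, which keeps the per-digest CTV update cheap compared to recomputing the ECDF.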
5.2.2 Inverse ECDF Update
The ECDF update is the action where a new cumulative probability function is generated from the current packet value distribution collected from the histogram counters in the data plane. This update is triggered in the control plane approximately every n digests, because a counting variable is incremented by 1 for every digest received. In a practical implementation, a more sensible approach would be a separate timer for the ECDF update, with the data plane sending different digest messages depending on whether the CTV or the ECDF should be updated; for the current test-bed and experiments conducted, however, this was not necessary.
The ECDF update performs the following:
1. Check which counter is currently used for counting packet values into histogram counters. This is possible due to an index (identifying the counter in use) stored in a Python data structure.
2. Synchronize (API counter synchronize function) the counter values from data plane
switch hardware to control plane local software.
3. Write (API register write function) a new index value to the data plane, which tells
it to start counting in a new counter.
4. Read (API counter read function) all packet value histograms from the currently
synchronized counter.
5. Calculate a new ECDF (see Appendix D.3 for pseudocode) with the collected packet
value histograms and store it in a python array structure for use during the CTV
update.
6. Reset (API counter write function) all of the counter values for the previously used
counter.
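The counter rotation implied by steps 1, 3, and 6 can be modeled in plain Python; the API calls are replaced by in-memory state, and the class name is made up for illustration:

```python
NUM_COUNTERS = 4  # the implementation rotates between four counters

class CounterRotation:
    """Tracks which counter the data plane is counting into, so the
    previous one can be synchronized, read, and reset without
    pausing the per-packet counting."""

    def __init__(self):
        self.active = 0

    def swap(self):
        """Advance to the next counter; return the one that is now
        safe to synchronize, read, and reset."""
        previous = self.active
        self.active = (self.active + 1) % NUM_COUNTERS
        return previous
```

Swapping before reading is what removes the delay of waiting for a reset: the data plane immediately counts into the new counter while the control plane processes the old one.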
6 Tools
In this section, the various tools used during the thesis are presented. Two main API tools have been used: one to manage the interaction between the control and data plane, and another to configure the traffic manager (e.g., port or queue configuration). This section also covers the traffic generators that were used, and the programming language used to script and analyze the captured traffic.
6.1 Control Plane Interaction API
Via a connection between the control- and data plane, the control plane can use API calls
to modify P4 extern objects and tables that are used by the data plane. These API calls
are accessed through a Python run-time client running on the switch. In this client, Python scripts can be run, which was used to update the CTV in the PV AQM implementation.
Some examples of how the API functions look:
• Register/Counter functions:
– program_name.control_block_name.register_name.mod(index, value)
– program_name.control_block_name.register_name.get(index)
– program_name.control_block_name.counter_name.operation_counter_sync()
• Table:
– program_name.control_block_name.table_name.add_with_hit(match_value, output_value)
– program_name.control_block_name.table_name.delete(match_value)
6.2 Pipeline Traffic Manager API
The pipeline traffic manager is a part of the control plane functionalities that can modify
and configure, for example, the number of egress queues, queue lengths, and queue priority.
For a multiple queue implementation, the traffic manager could be used to set up the
multiple queue capability. Two queues can be allocated, where one queue can manage the
low latency dependent L4S [31] flows, and the other manages all other traffic flows.
6.3 Iperf3
Iperf3 [32] is available both as a Python library and as a command-line tool on Linux. It can be used to start servers and clients to create network traffic, and allows for dynamic traffic with multiple transport protocols, parallel streams, and binding to specific ports or interfaces. In this project, the iperf tool has been used to add multiple TCP flows in order to analyze how the PV AQM reacts to a varying number of TCP flows.
6.4 Flent
Flent [33] is a flexible network tester that has been used throughout this project both to debug and to evaluate the PV AQM. With Flent, it is possible to generate many different traffic scenarios. When starting Flent, a test can be specified to run, for example, several TCP flows in a single direction or bi-directionally. If desired, UDP and HTTP traffic can also be added to test how different transport protocols react. Flent also has a GUI to analyze and display complex graphs of throughput and latency, with CDF curves, box diagrams, etc.
6.5 Python
Python [34] is an object-oriented programming language that has been used as the main programming language for the control plane interaction. Python was chosen because of its ease of use and because a control plane API already exists for the Tofino switch. Python is also easy to script with and has a large number of available libraries; for example, there are network analysis libraries that can split large PCAP (Packet Capture) files into multiple PCAP files per TCP flow or source IP address.
7 Control Plane Measurements
In this section, multiple measurements are presented. The main purpose of the section is to give the reasoning behind why specific objects (e.g., registers and counters) were used to keep packet value statistics. The section also introduces time measurements of control plane functionalities (e.g., the ECDF and CTV calculations) to approximate the time interval (see Figure 4.1 for the control plane update intervals) at which these functionalities can be executed.
7.1 Histogram Registers vs. Counters
One primary concern that became apparent later in the project is the limited rate at which histogram registers/counters can be read from the data plane to the control plane, and likewise how many bytes of information can be sent with digests from the data plane to the control plane during each packet processing pipeline. These limits restrict the number of unique packet values that can be used for the ECDF calculations. A possible solution is to read specific histograms from the control plane when the packet value distribution needs to be updated. This imposes extra delay, as reading individual values from the control plane is more time consuming than reading them in the data plane. The positive aspect of reading from the control plane is that it does not slow down the packet processing pipeline: reading the histograms from the data plane would entail recirculating packets multiple times through the packet processing pipeline to read a large number of values, whereas when reading from the control plane, packets can flow through the switch without interruption.
To store and count the number of bytes transferred with a specific packet value, there are two choices: the register extern and the counter extern. The counter extern can only be read from the control plane, while the register extern can be read from either the control or the data plane. The register is thus more flexible, since any data accessible from the data plane can be stored in a register, while the counter only counts packets or packet sizes. For this project, where the only necessary use case is to count packet sizes in bytes per packet value, either extern would suffice, so it is important to choose the better-fitted one. For the implementation, the most important criterion is the speed at which the histogram values can be read from the data plane to the control plane to update the ECDF.
Both the register and the counter have multiple API functions available through the bfrt Python API. For each extern, there are two ways to read the histograms: reading the values straight from the hardware, or synchronizing the values from the hardware to local software and then reading them. The difference is that during synchronization, all values stored in the corresponding extern are transferred to local control plane memory, where they can be read more quickly; in contrast, when reading from hardware, the API read function fetches a single instance/value per function call, which is less efficient.
In Figure 7.1, reading 128 register values from the hardware is compared with syncing the register and reading the 128 values from software. The experiment was conducted by reading 128 different indices of a register multiple times and calculating an average. As seen in the figure, there is a slight difference between the reading speeds: reading 128 register values from hardware is a couple of ms quicker than reading them from software, most likely due to the added delay of calling the API function to synchronize the values before reading them. It is important to mention that further evaluation is needed of how the synchronization API function works. The API function possibly has to be used with a callback function (i.e., a function defined by the programmer that is called after the synchronization has completed). In this project, a callback function was not used; instead, the synchronization function is called and the values are read right away. This could mean that the histograms read are previously cached data, and that with a callback function, additional delay would be added to the measurements.
Figure 7.1: Reading 128 register values from hardware versus from software.
In Figure 7.2, a more expected result can be seen: syncing and reading from software increases the speed at which register values can be read, and would be the most efficient way of reading a larger number of register values. For larger registers, the more efficient approach is thus to synchronize and then read the values. However, as shown in Figure 7.1, at some point the synchronization of values from hardware to software is actually not efficient when reading fewer values.
Returning to the criteria and purpose of the packet value histograms, it is important to compare whether counters can be used instead of registers to read packet value histograms from the data plane to the control plane. Figure 7.3 depicts two measurements: reading 512 values from a register versus reading 512 values from a counter. As seen, there is a significant difference in reading speed between a counter and a register. The counter can be instantiated to count packet sizes in bytes, which makes it the most efficient alternative for reading the packet value histograms.
Figure 7.2: Reading 512 register values from hardware versus from software.
Figure 7.3: Difference between reading 512 values from a register versus counter.
7.2 CTV Calculations
In this section, the speed at which the control plane can calculate a new CTV and write
it to the data plane, to be used as the current dropping threshold, is measured.
Each measurement covers the following operations:
1. The time starts when a digest with port- and queue ID is received in the control
plane.
2. The current queuing delay is read from the data plane for the specific port and queue.
The delay is read with an API function to get a single register instance, which takes
around 0.05 ms (see Appendix C for a register reading measurement).
3. A new dropping probability is calculated by using the PIE controller algorithm.
4. The dropping probability is used to match a new CTV in the ECDF by binary/linear
searching (see Appendix D.2 and D.1 for pseudocode).
5. The time stops when the API function to write the new CTV to the data plane has
been called.
To be clear, these time measurements are conducted specifically in the control plane.
In reality, two small delays are added: first, the time it takes for the digest to be sent
from the data plane to the control plane, and second, the time from when the API function
is called until the CTV register has actually been modified in the data plane. These
delays were not added to the measurement, as they do not add any delay to the control plane
functionality, but they do affect the actual time it takes to update the CTV.
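As a rough sketch, steps 2 to 5 can be expressed as follows. The PIE update rule and the lower-bound ECDF lookup follow the description above, but all names, the clamping of the probability, and the omitted register read are illustrative assumptions rather than the exact thesis implementation:

```python
# PIE parameters as in Table 8.3; the queuing-delay read (step 2) is assumed
# to happen before update_ctv is called.
TARGET_MS = 2.0            # target queuing delay
ALPHA, BETA = 0.3125, 3.125

drop_prob = 0.0            # PIE state kept between updates
old_delay_ms = 0.0

def find_ctv(ecdf, prob):
    """Binary search for the smallest packet value whose ECDF entry >= prob."""
    lo, hi = 0, len(ecdf)
    while lo < hi:
        mid = (lo + hi) // 2
        if ecdf[mid] < prob:
            lo = mid + 1
        else:
            hi = mid
    return lo if lo < len(ecdf) else -1

def update_ctv(cur_delay_ms, ecdf):
    """PIE probability update (step 3) followed by the ECDF lookup (step 4)."""
    global drop_prob, old_delay_ms
    drop_prob += ALPHA * (cur_delay_ms - TARGET_MS) \
               + BETA * (cur_delay_ms - old_delay_ms)
    drop_prob = min(max(drop_prob, 0.0), 1.0)   # clamp to [0, 1] (assumption)
    old_delay_ms = cur_delay_ms
    return find_ctv(ecdf, drop_prob)
```

The returned index is the new CTV written to the data plane; the data plane then drops packets whose value falls below it.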
Before measuring the total time it takes to update the CTV, the most time-consuming
part of the update is measured on its own. Figures 7.4 and 7.5 plot how fast the correct
CTV is found by matching a probability to a CTV in the ECDF. Measurements are taken
during real-time traffic to show how fast the calculations are when constant traffic is
running through the switch. As seen in Figure 7.4, there are many outliers, reaching
upwards of 40 ms at most. These outliers are likely due to how the CTV is found in
the ECDF, which is done through linear searching (see Appendix D.1 for pseudocode).
When the dropping probability reaches higher percentages, more steps through the ECDF
have to be taken to find the correct CTV. In Figure 7.5, the same results are plotted
with the y-axis scaled to show that most of the measured times are between 0 and 0.1 ms,
which most likely corresponds to the dropping probability being zero or very close to zero.
Figure 7.4: CTV with linear search. Figure 7.5: CTV with linear search (Scaled).
In Figure 7.6, the search algorithm was changed to binary search (see Appendix D.2
for pseudocode), in which each step through the ECDF halves the number of possible
values left to match. As seen, the calculation times are much more stable than with
linear searching, even during the fluctuation of TCP traffic building up the queue and
then slowing down again, which causes the calculated dropping probability to fluctuate
as well. The times are now stable enough that the CTV is typically found in around 0.25 ms.
Figure 7.6: CTV calculations with binary search (Scaled).
In Figure 7.7, the total time (the operations stated at the start of this section) of
updating the CTV is shown. As seen, the large majority of the times for updating the
CTV are below 1 ms, which makes it possible to update the CTV every millisecond without
overflowing the control plane with digests.
Figure 7.7: Total time to update CTV.
7.3 Packet Value Distribution Update
In this section, measurements are taken on different parts of the packet value distribution
update. The first part to measure is how fast the packet value histograms can be read
from the data plane. The histogram counter size will be set to 256, which would allow the
control plane to quickly fetch the packet value statistics needed to update the ECDF. In
Figure 7.8, the times for getting 256 counter instances are presented. As seen, the times
vary much more than in the earlier measurements comparing registers and counters. The
reason is either that the measurements are now taken during real traffic, which likely
slows down the syncing and reading of counters, or that the control plane in parallel
receives digests from the data plane for which the CTV has to be updated.
Figure 7.8: Time to get 256 histograms counters.
The second important part of the packet value distribution update is the calculation of
a new ECDF (see Appendix D.3 for pseudocode). In Figure 7.9, the times collected for
calculating an ECDF over 65536 packet values are presented. As seen, the times are similar
to reading 256 counter values from the data plane, between 20 and 30 ms. With these
results, an approximate update interval can be set to above 50 ms, because reading
the counter takes about 25 ms plus an additional 25 ms for updating the ECDF. This
update interval is just an approximation and can easily be modified by changing
a parameter in the control plane, which activates an update of the ECDF depending on
the number of digests that have been received (see Section 5.2.2).
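The digest-count trigger mentioned above can be sketched as a small counter in the control plane; the class and parameter names are illustrative:

```python
# Sketch of the digest-driven ECDF refresh trigger described in Section 5.2.2.
ECDF_UPDATE_EVERY = 50    # digests between ECDF refreshes (Table 8.3)

class UpdateScheduler:
    def __init__(self, update_every=ECDF_UPDATE_EVERY):
        self.update_every = update_every
        self.digests_seen = 0

    def on_digest(self):
        """Count a received digest; return True when the ECDF should be
        recomputed from the data plane counters."""
        self.digests_seen += 1
        if self.digests_seen >= self.update_every:
            self.digests_seen = 0
            return True
        return False
```

Every digest still triggers a CTV update; only every 50th additionally triggers the more expensive counter read and ECDF recalculation.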
Figure 7.9: Time to update the ECDF curve.
8 Evaluation
In this section, three primary questions have to be answered. The first question concerns
the actual purpose of the PIE scheme: to achieve lower latency by reducing queuing delay.
Can lower queuing delay be achieved without losing bandwidth compared to the throughput
achieved before adding the PV AQM? The second question concerns the PPV concept. Can
marking packets with a packet value be used together with the PV AQM to achieve resource
sharing between different users? The third question: Can the bottleneck link be fully
utilized while at the same time allowing a specific user to get either lower or higher
throughput compared to another user?
8.1 Evaluation Setup
To evaluate the PV AQM correctly, a few requirements have to be met. The major one is
that a bottleneck has to be created to build up a queuing delay. The queuing delay is
needed to test that a lower delay can be achieved by dropping packets when a high enough
queuing delay is present. A single FIFO queue is used for the bottleneck port on the
Tofino switch. The port is not configured with the traffic manager, i.e., the default
configuration is used, which allows the queue to grow long enough to create a queuing
delay (in this case, above 14 ms). The dropping of packets is also the central part that
makes the resource-sharing functionality work. The packet value marker marks the packets
following a TVF. For the evaluation, packets are sent from two clients to the marker (see
Figure 8.1 for an overview of the setup). The first client (IP 10.0.0.1) is marked
according to a gold TVF, while the other client (IP 10.0.0.2) is marked according to a
silver TVF. By default, this gives one of the clients a higher chance of getting its
packets marked with a higher packet value. Packets marked with a higher packet value have
a lower chance of being dropped at the AQM bottleneck when a queuing delay is present.
When TCP traffic is sent from both clients, the client marked with the silver curve starts
sending fewer packets when drops are noticed. Because of this, a lower throughput is seen
for the client marked with lower packet values, while the other client gets a higher
throughput.
Figure 8.1: Evaluation setup.
Each client/server has the following hardware setup:
Ethernet: Intel 2x10 Gbps NIC 7-series
CPU: Intel i7 3.4 GHz
TCP implementation: CUBIC
Table 8.1: Client/server setup.
The setup parameters that will be used in the evaluation:
Bottleneck on Tofino (Gbps): 1
Nr. of sent TCP flows (gold-silver): 1-1, 2-2, 4-4, 8-8
Propagation delay (ms): 0
Table 8.2: Evaluation setup parameters.
The marker TVFs used for the evaluation are shown in Figure 8.2. As seen in the figure,
when the TCP flows marked with silver have less than 10 Mbps, the silver flows get
2 times less throughput. If the throughput for silver flows goes above 10 Mbps, silver
flows get 4 times less throughput.
Figure 8.2: Marker TVF functions. Courtesy of Maher Shaker [4].
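The marking itself can be sketched as follows. In the PPV concept, each packet gets the TVF evaluated at a uniformly random rate below the flow's measured throughput, so a flow's packets carry a spread of values. The two curves below are rough illustrative stand-ins for the gold and silver TVFs of Figure 8.2, not the exact functions used by the marker:

```python
import random

# Illustrative TVFs (assumptions, not the curves of Figure 8.2). The silver
# curve is derived from the gold curve so that silver traffic is worth 2x
# less below 10 Mbps and 4x less above it.
def gold_tvf(r_mbps):
    return max(0, 37500 - int(500 * r_mbps))

def silver_tvf(r_mbps):
    if r_mbps <= 10.0:
        return gold_tvf(2.0 * r_mbps)
    return gold_tvf(20.0 + 4.0 * (r_mbps - 10.0))

def mark_packet(tvf, flow_rate_mbps):
    """PPV marking: evaluate the TVF at a uniformly random rate below the
    flow's measured throughput, so packets carry a spread of values."""
    return tvf(random.uniform(0.0, flow_rate_mbps))
```

At a bottleneck, dropping all packets below a single threshold value then automatically gives the gold flow the intended throughput advantage.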
8.2 Evaluation Metrics
In this section, the important metrics used for the evaluation of the PV AQM are
introduced. There are three main aspects that are important to analyze: throughput,
queuing delay, and resource-sharing.
8.2.1 Throughput
Throughput is a metric for the speed at which data is sent over the internet. Usually,
throughput is represented in one of two ways: either bits per second (bps) with an
appropriate prefix, or packets per second (pps). For this evaluation, it is essential
both to see the scale of resource-sharing and to make sure that the total throughput
corresponds to the bottleneck link. The latter is important to verify that the
implementation does not limit the speed at which the switch would be able to send
without the PV AQM.
8.2.2 Queuing delay
Queuing delay is one of several delays that may appear on the internet, together with
propagation, transmission, and processing delay. Queuing delay arises in the various
queues on the internet where packets wait before being sent towards their next destination.
In this evaluation, queuing delay is a significant part of what makes the implementation
work because of its importance in calculating a CTV. It is also an important metric to
evaluate due to its significance in reducing end-to-end latency. As described in
Section 5.1.4, the queuing delay is encoded into the IPv4 identification header field,
which is extracted from a PCAP file and used to evaluate the queuing delay.
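The encoding can be sketched as follows; the helper names, and the assumption that the delay is stored in microseconds, are illustrative (the thesis only states that the delay is encoded into the field):

```python
import struct

# Sketch of encoding/decoding the queuing delay in the 16-bit IPv4
# identification field, as used for the PCAP-based evaluation.
def encode_delay(delay_us):
    """Clamp the delay to 16 bits and pack it in network byte order."""
    return struct.pack('!H', min(delay_us, 0xFFFF))

def decode_delay(ipv4_header):
    """The identification field occupies bytes 4-5 of the IPv4 header."""
    (ident,) = struct.unpack_from('!H', ipv4_header, 4)
    return ident
```

The post-processing step walks the captured packets, applies `decode_delay` to each IPv4 header, and plots the resulting per-packet delays.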
8.2.3 Resource-sharing
Resource-sharing is the concept of sharing the total available bandwidth between, for
example, users, TCP flows, or subscription levels. In this evaluation, this metric is
shown as the throughput difference between a user whose packets are marked according to
the gold TVF and a user whose packets are marked according to the silver TVF.
8.3 Evaluation of Traffic Without AQM
To make a proper evaluation of the PV AQM, baseline graphs first need to be created to
which the results can be compared. To do this, tests were run with the Tofino switch only
forwarding packets, without any dropping of packets by the AQM. Instead, the only drops
that can occur are packets dropped by the switch due to a queue being fully utilized.
The Netronome marker still marks packets, but the PV AQM does not drop any packet.
In Figure 8.3, the throughput results are shown when multiplying the number of TCP
flows by two every 15 seconds, starting with one silver flow and one gold flow. Even
though there is no AQM interaction, the packets are still marked by the marker to show
that no resource-sharing occurs when the PV AQM is not active and dropping packets. As
seen, there is no priority for packets marked with the gold TVF compared to the silver
TVF.
Figure 8.3: Multiplying the number of flows with two every 15 seconds without AQM.
In Figure 8.4, the queuing delays are collected from a test without any AQM on the
switch. It shows the well-known saw-tooth behavior where the queue builds up to its
maximum capacity. At that point, packets are dropped, and the TCP traffic slows down by
reducing the sending window before speeding up again, according to TCP Cubic. The figure
displays the queue delay during a 10 s period when sending 16 TCP flows. The maximum
queue delay that may occur is approximately 14 ms.
Figure 8.4: Queue delay without AQM.
8.4 Evaluation of the PV AQM With Uniform Ranges
In this section, the PV AQM implementation with uniform histograms is evaluated. As
mentioned earlier, the essential part to evaluate is the resource-sharing, in which
differently marked traffic flows should share the available throughput. In the following
figures, a test has been run over 60 s. Two clients send traffic to a single server
through the AQM, where the only bottleneck link is situated. The first client is marked
according to the gold TVF shown in Figure 8.2, while the second client is marked
according to the silver TVF, which has lower priority. Every 15 s, the number of TCP
flows for each client is multiplied by two (i.e., at the start a single TCP flow is
started for each client, after 15 s another flow is started for each client, then two
new flows are started for each client, and so on).
The following parameters were used for the PV AQM throughout this evaluation:
Target delay (ms): 2
Alpha: 0.3125
Beta: 3.125
CTV update (ms): 1
ECDF update (received digests): 50
Table 8.3: PV AQM parameters.
In Figure 8.5, the throughput for all flows is shown in light blue and light red, while
the average throughput for each client's TCP flows is shown in dark blue and dark red.
As seen, the throughput is not stable when there is only one TCP flow per client. This
is expected, as drops make high-throughput TCP flows slow down more drastically than
low-throughput flows. At a later stage, when more flows have been added, the throughputs
of the two clients are more stable, and the resource-sharing aspect of the AQM can be
seen more clearly.

In Figure 8.5, it is hard to see how close the flows are to the desired throughput for
each TVF. In Figure 8.6, the throughput per TVF marking is shown when multiplying the
flows by two every 15 s. This is done by calculating the throughput for each client,
where, as mentioned earlier, one client is marked with the gold TVF while the other is
marked with the silver TVF. This figure shows how the number of flows running through
the PV AQM impacts the actual achieved throughput. When only one flow per TVF is sent,
a significant difference between the TVFs can be seen, but the result is still far from
the theoretical desired throughput. Between seconds 45 and 60, when sending eight flows
per TVF, the throughput is smoother and close to the desired value. As mentioned, this
is because dropping packets has limited control over two high-speed TCP flows compared
to 16 low-speed TCP flows.
Figure 8.5: Throughput when multiplying the number of flows every 15 seconds.
Figure 8.6: Throughput per TVF when multiplying the number of flows every 15 seconds.
As mentioned earlier, the target queuing delay for the PV AQM was set to 2 ms. In
Figure 8.7, the queuing delay is displayed over the 60 s test described earlier. At the
start, the queuing delay grows to its maximum of just above 14 ms, which initiates
tail-drop. After about a second or less, the PV AQM has collected enough packet values
and updated the ECDF, which can now be used to decide which packets to drop according
to the current distribution of packet values. One part that is important to note for all
evaluation scenarios is that the PV AQM has just been initialized. Ideally, when running
the PV AQM, statistics would already have been collected, and the initial burst in
queuing delay should disappear. This can be concluded by looking at the rest of the
test, disregarding the first second: even though more TCP flows are added, no large
burst in queuing delay is seen. From this point on, the queuing delay stabilizes at
around 2 ms. Small bursts can still be seen in the queuing delay throughout the test,
caused either by new flows being started or by the distribution of received packet
values not being close enough to the current ECDF. In the latter case, the CTV is set
too low and not enough packets are dropped.
Figure 8.7: Queue delay when multiplying the number of flows every 15 seconds.
In Figure 8.8, the evolution of CTVs is plotted for different numbers of TCP flows.
As seen when comparing the values, the most common threshold value gets larger when
more TCP flows are added. This is due to the marker having a higher probability of
marking packets with a higher packet value when the throughput rate of each flow is
lower. This is why, with two flows, the most common value is around 25000, while with
16 flows, a CTV around 30000 is more common. Note that the y-axis is scaled to show all
CTVs between 20000 and 40000. In reality, the most common CTV is 0 (i.e., forwarding
all packets), but the more interesting values are above 0, when the AQM actively drops
packets below the threshold value.
Figure 8.8: CTVs when multiplying the number of flows every 15 seconds.
The previous figures show how the threshold value that is set for the data plane
increases with the number of TCP flows that are sent. The reason is that the received
packet values are higher when each flow has a lower throughput, as defined by the TVF
and marking concept. In Figure 8.9, four ECDFs are plotted for different numbers of
flows sent through the switch. The data points for the different numbers of flows were
not collected during a single continuous traffic scenario. Instead, four separate tests
were run, each with a different number of flows, which shows more precisely how the
marked packet values change depending on the number of flows sent. The data used to
calculate the ECDFs are the packet value statistics accumulated over the whole run. The
results show that all packet values received from the marker are between 20000 and
37500, as defined by the gold and silver TVFs presented in Figure 8.2. These ECDFs
represent the distribution of packet values that have been collected from the histogram
counter in the data plane. The CTV update uses these stored ECDFs to match a newly
calculated dropping probability to a CTV. This allows the AQM to drop the correct packet
values, and the right amount of packets, by keeping a history of the current distribution
of packet values.
Figure 8.9: ECDF for a different number of flows.
9 Conclusion
This thesis has presented the design and implementation of a PV AQM for resource-sharing
and low latency on the programmable Barefoot Tofino switch. With knowledge about the
limitations of the programmable switch architecture, it is possible to configure and
manage the processing of packets to control the traffic flow. The PV AQM can be moved
onto a Tofino switch and run without any software beyond what is already present on the
switch. To achieve the resource-sharing shown previously, a packet value marker with
predefined TVFs is needed. With the achieved resource-sharing, different TCP connections
or flows can be prioritized over others, decided by the packet values marked on packets
according to a specific TVF.

One important aspect that has to be mentioned is the limitations of the current PV AQM.
In a practical scenario, the AQM would have to scale to handle many ports at the same
time. The current evaluation only focuses on a single port, but in reality the AQM would
have to keep packet value statistics from multiple ports while simultaneously calculating
new CTVs and ECDFs for all of them. This could be a problem because of the latency of
control and data plane interaction, and because of the limited processing capabilities
in the control plane, which cannot handle all ports at the same time.
10 Future Work
Due to the limited amount of time left in the project, there was not enough time to
implement the control plane interaction using one of the lower-level APIs. Using one of
these APIs instead of bfrt python or grpc python could reduce the processing time of the
algorithms. It could also reduce the DMA (Direct Memory Access) times for reading,
modifying, and resetting registers, counters, and tables. This would be interesting for
further increasing the performance of the PV AQM and making it more practical for
industrial use, where it can be applied to a large number of ports.
References
[1] M. Menth, H. Mostafaei, D. Merling, and M. Haberle, "Implementation and evaluation of activity-based congestion management using P4 (P4-ABC)," Future Internet, vol. 11, p. 159, 07 2019.
[2] ELTE, "PPV - Per Packet Value." http://ppv.elte.hu/, 2020.
[3] S. Laki, G. Gombos, S. Nadas, and Z. Turanyi, "Take your own share of the pie," in Proceedings of the Applied Networking Research Workshop, pp. 27–32, 2017.
[4] M. Shaker, "A dataplane programmable traffic marker using packet value concept," 2020.
[5] J. Gettys and K. Nichols, "Bufferbloat: Dark buffers in the internet," Queue, vol. 9, no. 11, pp. 40–54, 2011.
[6] K. De Schepper, O. Bondarenko, I.-J. Tsang, and B. Briscoe, "PI2: A linearized AQM for both classic and scalable TCP," in Proceedings of the 12th International Conference on emerging Networking EXperiments and Technologies, pp. 105–119, 2016.
[7] S. Floyd and V. Jacobson, "Random early detection gateways for congestion avoidance," IEEE/ACM Transactions on Networking, vol. 1, no. 4, pp. 397–413, 1993.
[8] K. Nichols and V. Jacobson, "Controlling queue delay," Communications of the ACM, vol. 55, no. 7, pp. 42–50, 2012.
[9] R. Pan, P. Natarajan, C. Piglione, M. S. Prabhu, V. Subramanian, F. Baker, and B. VerSteeg, "PIE: A lightweight control scheme to address the bufferbloat problem," in 2013 IEEE 14th International Conference on High Performance Switching and Routing (HPSR), pp. 148–155, IEEE, 2013.
[10] R. Rajkumar, C. Lee, J. Lehoczky, and D. Siewiorek, "A resource allocation model for QoS management," in Proceedings Real-Time Systems Symposium, pp. 298–307, IEEE, 1997.
[11] F. Agboma and A. Liotta, "QoE-aware QoS management," in Proceedings of the 6th International Conference on Advances in Mobile Computing and Multimedia, pp. 111–116, 2008.
[12] S. Nadas, Z. R. Turanyi, and S. Racz, "Per packet value: A practical concept for network resource sharing," in 2016 IEEE Global Communications Conference (GLOBECOM), pp. 1–7, IEEE, 2016.
[13] A. Sivaraman, S. Subramanian, M. Alizadeh, S. Chole, S.-T. Chuang, A. Agrawal, H. Balakrishnan, T. Edsall, S. Katti, and N. McKeown, "Programmable packet scheduling at line rate," in Proceedings of the 2016 ACM SIGCOMM Conference, pp. 44–57, 2016.
[14] X. Jin, X. Li, H. Zhang, N. Foster, J. Lee, R. Soule, C. Kim, and I. Stoica, "NetChain: Scale-free sub-RTT coordination," in 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), pp. 35–49, 2018.
[15] X. Jin, X. Li, H. Zhang, R. Soule, J. Lee, N. Foster, C. Kim, and I. Stoica, "NetCache: Balancing key-value stores with fast in-network caching," in Proceedings of the 26th Symposium on Operating Systems Principles, pp. 121–136, 2017.
[16] R. Joshi, T. Qu, M. C. Chan, B. Leong, and B. T. Loo, "BurstRadar: Practical real-time microburst monitoring for datacenter networks," in Proceedings of the 9th Asia-Pacific Workshop on Systems, pp. 1–8, 2018.
[17] N. Feamster and J. Rexford, "Why (and how) networks should run themselves," arXiv preprint arXiv:1710.11583, 2017.
[18] Barefoot Networks, "Barefoot Tofino." https://www.barefootnetworks.com/technology/#tofino, 2020.
[19] S. Nadas, G. Gombos, F. Fejes, and S. Laki, "A Congestion Control Independent L4S Scheduler," in Proceedings of the Applied Networking Research Workshop, pp. 45–51, 2020.
[20] P. Voigt and A. Von dem Bussche, "The EU General Data Protection Regulation (GDPR)," A Practical Guide, 1st Ed., Cham: Springer International Publishing, 2017.
[21] J. Kramer, L. Wiewiorra, and C. Weinhardt, "Net neutrality: A progress report," Telecommunications Policy, vol. 37, no. 9, pp. 794–813, 2013.
[22] K. Benzekki, A. El Fergougui, and A. Elbelrhiti Elalaoui, "Software-defined networking (SDN): a survey," Security and Communication Networks, vol. 9, no. 18, pp. 5803–5833, 2016.
[23] P4 Language Consortium et al., "P4_16 language specification," Version, vol. 1, no. 0, p. 16, 2017.
[24] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese, et al., "P4: Programming protocol-independent packet processors," ACM SIGCOMM Computer Communication Review, vol. 44, no. 3, pp. 87–95, 2014.
[25] P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McKeown, M. Izzard, F. Mujica, and M. Horowitz, "Forwarding metamorphosis: Fast programmable match-action processing in hardware for SDN," ACM SIGCOMM Computer Communication Review, vol. 43, no. 4, pp. 99–110, 2013.
[26] Netronome, "Agilio CX SmartNICs." https://www.netronome.com/products/agilio-cx/, 2020.
[27] J. Postel et al., "Transmission control protocol," 1981.
[28] M. Allman, V. Paxson, W. Stevens, et al., "TCP congestion control," 1999.
[29] Mininet, "Introduction to Mininet." https://github.com/mininet/mininet/wiki/Introduction-to-mininet, 2020.
[30] J. Postel et al., "Internet protocol," 1981.
[31] B. Briscoe, K. De Schepper, and M. Bagnulo, "Low latency, low loss, scalable throughput (L4S) internet service: Architecture," Internet Engineering Task Force, Internet-Draft draft-briscoe-tsvwg-l4s-arch-02, 2017.
[32] J. Dugan, S. Elliott, B. A. Mah, J. Poskanzer, and K. Prabhu, "iperf3, tool for active measurements of the maximum achievable bandwidth on IP networks," URL: https://github.com/esnet/iperf, 2014.
[33] T. Høiland-Jørgensen, C. A. Grazia, P. Hurtig, and A. Brunstrom, "Flent: The flexible network tester," in Proceedings of the 11th EAI International Conference on Performance Evaluation Methodologies and Tools, pp. 120–125, 2017.
[34] "Welcome to Python.org." https://www.python.org/doc/.
[34] “Welcome to python.org.” https://www.python.org/doc/.
Acronyms
API Application Programmable Interface
AQM Active Queue Management
CTV Congestion Threshold Value
ECN Explicit Congestion Notification
ECDF Empirical Cumulative Distribution Function
FIFO First In First Out
HTTP Hypertext Transfer Protocol
ID Identification
IP Internet Protocol
IPv4 Internet Protocol version 4
L4S Low Latency Low Loss Scalable Throughput
NIC Network Interface Card
P4 Programming Protocol-Independent Packet Processors
PPV Per-Packet Value
PV Packet Value
PIE Proportional Integral Controller Enhanced
QoS Quality of Service
SDN Software-Defined Networking
TCP Transmission Control Protocol
TVF Throughput Value Function
UDP User Datagram Protocol
A Throughput and delay with up to 40 flows per TVF
Figure A.1: Multiplying the number of flows with 2 every 15 seconds, starting with 5 flows per TVF.
B Silver flow with 8 times less throughput
Figure B.1: 2 gold, 2 silver flows using a silver TVF with 8 times less throughput than the gold TVF.
C Reading Register Instances
Figure C.1: Reading register instances from switch hardware. Reading times grow linearly with the number of instances.
D Pseudo code
D.1 Linear search
def linear_search(a, item):
    # Return the index of the first ECDF entry >= item, or -1 if none.
    for i in range(len(a)):
        if a[i] >= item:
            return i
    return -1
D.2 Binary search
def binary_search(a, item):
    # Lower-bound search: return the index of the first ECDF entry >= item,
    # or -1 if no entry is large enough. This matches the result of the
    # linear search on a sorted ECDF, since the dropping probability rarely
    # equals an ECDF entry exactly.
    beg, end = 0, len(a) - 1
    ret = -1
    while beg <= end:
        mid = (beg + end) // 2
        if a[mid] >= item:
            ret = mid
            end = mid - 1
        else:
            beg = mid + 1
    return ret
D.3 ECDF Calculation
def calculate_ecdf(a, bin_len):
    # a: packet value histogram counters, one per histogram range.
    # bin_len: the number of packet values covered by each range.
    # Returns the ECDF over all packet values, spreading each counter
    # evenly across the packet values in its range.
    s = sum(a)
    ret = [0.0] * (len(a) * bin_len)
    if s == 0:
        return ret
    idx = 0
    part_sum = 0.0
    for i in range(len(a)):
        pkts_per_pv = a[i] / bin_len
        for j in range(bin_len):
            part_sum += pkts_per_pv
            ret[idx] = part_sum / s
            idx += 1
    return ret