Mastering the Concurrency of Shared Path TCP Connections
Pedro de Almeida Braz
Thesis to obtain the Master of Science Degree in
Telecommunications and Computer Engineering
Supervisor: Prof. Ricardo Jorge Feliciano Lopes Pereira
Examination Committee
Chairperson: Prof. Luís Manuel Antunes Veiga
Supervisor: Prof. Ricardo Jorge Feliciano Lopes Pereira
Member of the Committee: Prof. Rui Jorge Morais Tomaz Valadas
November 2016
Acknowledgments
The author of this work was never alone. In the pursuit of knowledge, hardships arise, and the
process would have been harder if not for those who comfort you and those who turn your doubts
around.
As a student, I want to thank Professor Ricardo for thinking clearly when problems arose.
As a son, I want to thank my father, my mother and my sister for caring.
As a colleague and friend, I want to thank Pedro, Nuno, João, Rui, Artur and Karan for being there,
every step of the way.
Abstract
Parallelism is a necessity for the Internet, and most Transmission Control Protocol (TCP) based networking applications benefit from its use, as is the case of the Hypertext Transfer Protocol (HTTP). However, this introduces a worrying amount of concurrency, which has adverse effects on networks: too many parallel TCP connections have been shown to be overly aggressive, causing unnecessary congestion and throttling the throughput of all network users.
This work addresses the problem at the sender by grouping parallel connections that share the same path, reducing per-connection redundancies. We survey existing implementations and adapt them for our own protocol. Our solution enables TCP to: group connections from hosts in close proximity, obtain finer network state estimates, react quickly to congestion, skip slow start and achieve an increased average throughput.
Keywords: TCP, HTTP, Concurrency, Linux, Reno, Internet.
Resumo
O paralelismo de ligações TCP tornou-se numa necessidade da Internet e muitos protocolos beneficiam do seu uso, como é o caso das aplicações HTTP. Infelizmente, a concorrência tem consequências adversas na rede: ligações paralelas são agressivas e causam congestão desnecessariamente. Este trabalho foca-se nesse problema, agrupando no emissor ligações paralelas que partilham o mesmo caminho na rede, reduzindo redundâncias. Analisamos implementações que permitam isto e adaptamo-las ao nosso protocolo. A nossa solução permite ao TCP: agrupar ligações por recetores próximos, melhorar as estimativas do estado da rede, reagir mais depressa ao estado de congestão, saltar o slow start e aumentar o débito máximo na rede.
Palavras-chave: TCP, HTTP, Concorrência, Linux, Reno, Internet.
Contents
Acknowledgments
Abstract
Resumo
List of Figures
List of Tables
Acronyms
1 Introduction
1.1 Motivation
1.2 Problem Statement
1.3 Proposed Solution
1.4 Outline
2 Related Work
2.1 Transmission Control Protocol
2.1.1 TCP operation
2.2 Conservation in TCP
2.3 The Hypertext Transfer Protocol
2.4 Optimizations for Shared Path Parallel TCP Connections
2.5 Multiplexing Parallel TCP flows
2.6 Protocols Comparison and Analysis
2.7 Ensemble Sharing Considerations
3 Architecture
3.1 Requirements
3.2 Hydra: Connection Grouping
3.3 Heracles: Congestion Control
3.4 Chapter Summary
4 Implementation
4.1 Implementation Options
4.1.1 Linux Kernel
4.1.2 Linux Modules
4.1.3 Kernel Debugging
4.1.4 Scripting
4.1.5 Iperf
4.1.6 Netkit
4.1.7 Tc
4.2 Heracles Module
4.2.1 Fast Retransmit
4.2.2 Data Structures
5 Evaluation
5.1 Tests Objectives
5.2 Tests Scenarios
5.2.1 Long-Short
5.2.2 Parallel
5.2.3 Sequential
5.2.4 Packet
5.3 Methodology
5.4 Test Results
5.4.1 Long Short
5.4.2 Parallel
5.4.3 Sequential
5.4.4 Packet
5.5 Protocol Analysis
6 Conclusions
6.1 Summary
6.2 Achievements
6.3 Future Work
Bibliography
List of Figures
2.1 Representation of the three way handshake for hosts A and B. (1) and (2) contain the sequence numbers for hosts A and B.
2.2 Representation of a TCP connection closing.
2.3 State transitions for TCP Reno.
2.4 Per-host TCP-Int structures
3.1 Representation of the hydra structure, composed of an externally linked hash table and binary trees, where each leaf represents a hydra group, which the Heracles structure points to.
4.1 Diagram of the Heracles cong_avoid function.
4.2 Diagram of the Heracles pkts_acked function.
5.1 Test Network
5.2 Empirical Cumulative Distribution Function (CDF) plot for long/short throughput.
5.3 Empirical CDF graph for 2, 4 and 10 parallel connections, respectively.
5.4 Empirical CDF plot for sequential throughput.
5.5 Empirical CDF plot for packet test throughput.
5.6 Two connections partitioning into different groups with different throughput values.
List of Tables
2.1 Temporal Sharing TCB Initialization
2.2 Temporal Sharing Cache Updates
2.3 Ensemble Sharing TCB Initialization
2.4 Ensemble Sharing Cache Updates (rtt update indicates the operation of sampling the newest round trip time (rtt) value)
2.5 Part of the EBC structure definitions
2.6 TCP/DCA-C Congestion Window Update
3.1 Information stored in each hydra group.
5.1 Long/short test results.
5.2 Results for parallel tests with 2 connections.
5.3 Results for parallel tests with 4 connections.
5.4 Results for parallel tests with 10 connections.
5.5 Sequential test results.
5.6 Packet test results.
List of Acronyms
ack acknowledgment
CDF Cumulative Distribution Function
csv comma-separated values
cwnd congestion window
HTTP Hypertext Transfer Protocol
IP Internet Protocol
IPv4 Internet Protocol version 4
MSS Maximum Segment Size
NAT Network Address Translation
P-HTTP Persistent HTTP
rto retransmission timeout
rtt round trip time
rttvar round trip time variance
srtt smooth round trip time
ssthresh slow start threshold
syscall system call
tbf token bucket filter
TCB TCP Control Block
TCP Transmission Control Protocol
URI Uniform Resource Identifier
WWW World Wide Web
Chapter 1
Introduction
1.1 Motivation
TCP’s congestion avoidance and control algorithms play an important role in supporting the Internet infrastructure. Van Jacobson describes a period when these were poorly implemented and almost caused an Internet collapse [1]. The congestion mechanisms allow Internet hosts to detect and deal effectively with congestion; even so, they are not well suited for web traffic, which predominantly derives from HTTP¹.
Most traffic over the Internet has a high degree of parallelism. HTTP connections were previously characterized as having a short lifespan: in HTTP/1.0, as each TCP connection was used for a single request/response exchange, multiple connections were required for fetching a webpage. Modifications introduced in HTTP/1.1 allowed TCP connections to be reused, making them long lived, yet parallel connections continued to be a requirement for reducing client latency. This is because, in a long-lived connection, a request cannot be sent before the previous response is received, blocking request/response exchanges in the same connection. In addition, a change in the nature of web traffic, with the rise of Ajax and video streaming, shifted content from static to dynamic, which resulted in more traffic bursts, forcing browsers to raise their upper limit on parallel connections [3].
In the year 2000, parallel connections accounted for 44% of total connections to a web server [4].
From a 2010 data set, connections evolved to being mostly parallel with a median number of 6 to 7 [3].
In this work, we address the problem of parallel connections, which are a necessity for most web applications but create congestion in the network. Each individual connection is aggressive by nature: it constantly probes the network to discover its maximum available throughput, and this process requires the detection of losses to signal congestion in the network. This is done not only by each individual host in the network, but also for each of the parallel connections that every client uses. From the network’s perspective there is no difference: every connection is treated equally, independently of the originating user, whether it has one or more connections. Research has been published on this subject, attempting to reduce the impact of parallel TCP connections, but it has failed to gain adoption since publication [5, 6, 7].
The research targets TCP, aiming to make connections cooperate better in scenarios with high concurrency,
¹ A 2009 study was able to gather traffic from 20,000 residential clients and found HTTP traffic to account for almost 60% of total Internet traffic [2].
by using two different techniques: ensemble sharing and temporal sharing. These are based on grouping same-path connections and sharing network resources efficiently between them. They attempt to reduce the normal aggressiveness by making connections share:
• losses, so they can all react immediately, instead of each having to experience one.
• state, for finer-grained control of the network’s condition, by using estimates from different connections.
From these solutions we notice some problems, mainly that each single address is assumed to be a single host, which is not always the case given the use of Network Address Translation (NAT) interfaces that hide multiple hosts. Even assuming that latency is negligible, an untrusted host, or one that crashes, can deny the service to all hosts behind the NAT interface.
1.2 Problem Statement
Parallel connections play a major role in web traffic, but a fundamental TCP design choice makes it inappropriate for dealing with these connections. As a transport layer protocol, TCP applies congestion algorithms per connection, so every connection a server uses is independent from all others. But state is path specific: connections sharing the same path will have similar estimates for latency and available throughput, and these calculations are not cheap, due to TCP’s mechanisms for congestion control:
1. Slow start, which is required for finding an initial threshold for traffic throughput. Connections start with a reduced throughput.
2. Congestion avoidance algorithms are as aggressive as they need to be, reacting to losses, which signal congestion in the network. A loss lowers the connection’s throughput and increases latency because of packet retransmissions.
To minimize the negative effects of TCP’s concurrency on application protocols, some solutions have been proposed, targeting either the transport or the application protocol. To deal with the problem directly, we focus on transport level solutions, enabling us to group, at the sender, connections sharing the same path. This allows TCP to reduce the aggressiveness of parallel connections, yielding fewer losses and increased throughput.
1.3 Proposed Solution
Our initial goal is to design an ensemble sharing technique, adapted from existing research, where same
path connections can:
1. skip slow start on paths for which there is already a connection in congestion avoidance;
2. react to losses detected on other connections by decreasing their throughput, thus avoiding a loss of their own;
3. share network path information to provide finer TCP estimations.
By applying these optimizations to standard TCP, we can reduce unnecessary congestion, reduce losses, and increase throughput when possible, effectively mitigating the effects of concurrency. To improve further on this, we allow the protocol to group hosts by Internet Protocol (IP) address and deal accordingly with any group inconsistencies that arise, making it possible to benefit a larger number of hosts.
With this work, we contribute: a new protocol design, adapted from existing research; a Linux implementation; and a set of experimental tests that evaluate the protocol.
1.4 Outline
• Chapter 2 analyses previous research aimed at solving the problem of network concurrency, looking at solutions that propose either sharing information between parallel connections or multiplexing them into a single entity.
• Chapter 3 describes the architecture of the systems that compose the implemented solution and
their purpose.
• Chapter 4 describes the Linux Kernel and how it was used to build the proposed solution, along with all the tooling used to assert the protocol’s correct behavior.
• Chapter 5 describes the evaluation scenarios used to test the solution and their objectives, and analyzes the results.
• Chapter 6 summarizes the goals achieved by the protocol and its shortcomings, as well as prob-
lems to be solved in the future.
Chapter 2
Related Work
In this chapter we survey the literature on TCP optimizations for shared path connections. We start by describing TCP, over which the optimizations were designed, along with the mechanisms necessary to make it behave correctly on the Internet. We then describe basic HTTP and its evolution. Finally, we present the protocols designed specifically for optimizing TCP for parallel connections and compare their different aspects.
2.1 Transmission Control Protocol
TCP [8] was developed to provide reliable communication for the Internet. The protocol operates in a symmetric manner, providing basic data transfer over a duplex connection, with added mechanisms to assure reliability to its end hosts.
As a transport layer protocol, TCP must be capable of dealing with the network layer’s flaws. For this,
TCP provides the following mechanisms:
• basic data transfer: packaging streams of bytes into segments to be transmitted over IP;
• reliability: the ability to recover from data that is lost, damaged, duplicated or delivered out of order;
• flow control: restricting the maximum amount of data that the other host can accept.
2.1.1 TCP operation
For each sequence of bytes sent, an acknowledgment (ack) is expected from the receiver, asserting the correct delivery of the packet. For each client, packets are given sequence numbers that identify them and denote their ordering. These sequence numbers are assigned incrementally, based on the size of previous packets. A received packet with a higher sequence number than the expected one cannot be acknowledged, because data has to arrive in the correct sequence; this indicates that the expected packet was delayed or lost in the network. At the sender side, transmitting a packet starts a timer; if no acknowledgment is received during this interval, a timeout occurs, which forces the sender to retransmit
the same packet. Retransmitting packets may cause the reception of duplicate packets due to network congestion.
Each TCP connection is responsible for two buffers, one for sending and one for receiving bytes. The client inserts data into the former and waits for TCP to schedule the transmission. In the receiving buffer, data is stored until TCP can pass it to the client. These buffers limit the amount of outstanding data each client can have on the network. Every ack received informs the sender of how many bytes it can still fit into the other host’s receiving buffer.
The opening of a connection can be performed actively or passively, depending on whether the client knows the foreign host’s (socket) information or wants to wait for incoming connection requests. Independently of the type of connection, a TCP Control Block (TCB) is created, responsible for storing state.
The procedure used to initialize the connection between hosts is called the three way handshake, in which an active host sends a message with the synchronize flag (SYN) to a passive host, which acknowledges the SYN packet and, in the same packet, sends its own SYN (called a SYN-ACK); lastly, the active host acknowledges it. From this point onward, both hosts are capable of sending and receiving segments to and from one another (Figure 2.1).
Its purpose is for both hosts to agree on the sequence numbers in use for each side of the connection; these numbers identify packets in the stream and point to data in the buffers. On the originating connection, the sequence number points to the last bytes of data sent in the sending buffer; for the receiving connection, it points to the last bytes of data received in the receiving buffer.
For a connection to close, each user must signal the other that there are no more segments to send. When a user is ready to close the connection it sends a finish segment (FIN) to signal the remote host that it has nothing more to send. From this point the connection is still open, because it is assumed the user can still receive segments, until the other host also decides to end the connection by sending its own FIN packet and receiving the corresponding ack (Figure 2.2).
The unpredictability of the network makes it hard to decide with certainty when a packet has been lost. When waiting for an acknowledgment, a timeout needs to be calculated based on the smooth round trip time (srtt), which tries to account for the unpredictability of the network by smoothing the samples received. The initial expression used was later found to be inadequate [1]; this is discussed further in the following section.
2.2 Conservation in TCP
Tahoe was the first version of TCP using congestion avoidance and control algorithms, introduced in the late 1980s in a release of the BSD (Berkeley Unix) operating system [9]. Here we describe the conservation property of TCP and the initial algorithms for the purpose: slow start, accurate rtt estimation, congestion avoidance, fast retransmit and fast recovery.
TCP needs to obey the conservation of packets principle to work as intended [1]. This means that for
a connection to run stably it requires a conservative flow of packets, where no new packet is injected into
Figure 2.1: Representation of the three way handshake for hosts A and B. (1) and (2) contain the sequence numbers for hosts A and B.
Figure 2.2: Representation of a TCP connection closing.
the network until an older one has left (when the connection has the maximum amount of data in transit, i.e. a full window). This can fail under three possible conditions:
1. The connection cannot reach a stable state when starting; this is caused by having no information on the initial state of the network and overestimating the amount of data it can actually send, causing unnecessary losses and retransmissions.
2. With a full window, the connection starts injecting packets before those in the network leave. Having more packets inside the network than the window allows may cause congestion.
3. There are not enough resources to allow stability; the network buffers along the path are not prepared to deal with an increase in the rate of packets, causing losses.
The same problems still apply to this day, but TCP has mechanisms in place to reliably detect the network’s state and adapt accordingly.
• Slow Start was detailed as a new method for initializing TCP connections in a controlled manner, without injecting more packets than the network can handle. It also initializes the ack clocking mechanism, which allows a host to estimate the delay in the network path by receiving acks, and discovers the stable state of the network, at which the host can send data safely without causing congestion on the link. It is a requirement for correct TCP behavior, and its relatively slow action across all connections sharing the same link allows them to start without causing congestion and impacting the network performance for others.
A slow start threshold (ssthresh) is created for the connection, indicating the point at which it can stop using slow start. Initially, when nothing is known about the path, it is set to a high value to
cause a necessary packet loss that indicates the initial threshold for that path. The initial window (IW) for the connection is set, according to the Maximum Segment Size (MSS), to around 4 KB [10]. For each ack received after the three way handshake, the congestion window (cwnd) is increased by one MSS, causing it to double every rtt, until cwnd > ssthresh [11]. A timeout causes slow start to restart, with ssthresh set to half of the congestion window and a cwnd of a single segment.
• Round-Trip Timing, on which TCP depends for accurate estimations of a packet’s travel time, is important for loss detection. This work added dynamic rtt variation to the estimate calculations; the original used a fixed value, which was found to cause retransmission of delayed packets once load on the link reached 30%.
When a packet leaves the host, a timer starts counting the waiting time before retransmitting the same packet. The duration of the timer is the retransmission timeout (rto), and it depends on the srtt and the round trip time variance (rttvar). For a precise calculation of the timer, the following equations are used, based on the newest rtt sample:
rttvar ← (1 − β) · rttvar + β · |srtt − rtt|
srtt ← (1 − α) · srtt + α · rtt
rto ← srtt + 4 · rttvar
The standard suggests the use of β = 1/4 for the rttvar calculation and α = 1/8 for the srtt calculation; the rto value is then updated according to both equations [12].
• Congestion Avoidance requires two components to work: the sender must be able to detect congestion, and the endpoints must have policies in place for dealing with it.
In a network, there are buffer limits along the path that packets travel, and there is a chance they will get discarded. We can say almost certainly that a loss happens due to congestion, and it will be signaled to the sender through a timeout. In a congested system, queue lengths grow exponentially during congestion. For the system to stabilize, the traffic sources must reduce their outgoing traffic as quickly as the queues grow. For the sender, this is a multiplicative decrease of the packets sent (currently the congestion window is cut in half).
When a connection is using less than its fair share of the bandwidth, it should increase its utilization. This suffers from the same problem that slow start solves when detecting the available bandwidth for the connection. It uses the same method, increasing the amount of data it can send on each acknowledgment received, but instead of increasing exponentially it increases linearly (adding one segment to the congestion window per rtt).
• Fast Retransmit [13, 11] describes an algorithm for the early detection of losses, where a receiver generates a duplicate acknowledgment after receiving an out of order packet. This signals the sender that a packet still has not arrived and may be lost. At the sender, receiving 3 duplicate acks is an indication that the packet was indeed lost, causing an early retransmission.
Figure 2.3: State transitions for TCP Reno.
Reno was the name given to the next version of TCP, which improved the fast retransmit algorithm and added a fast recovery phase after retransmitting a dropped segment, during which TCP does not need to drop back to slow start.
When the first duplicate ack arrives on a connection, it uses the limited transmit algorithm [14]. This algorithm allows the sender to keep injecting new packets into the network while receiving duplicate acks, without having to go through slow start. When 3 duplicate acks arrive, and the client retransmits the missing packet, the fast recovery algorithm governs the transmission until a non-duplicate ack arrives. The congestion window is lowered to ssthresh + 3 MSS, and for every duplicate ack the window is inflated so that the sender can keep sending new data. A non-duplicate ack stops the algorithm and deflates the window to the ssthresh value [11]. This allows the connection to conserve the packets that were already buffered by the receiver and preserve the ack clocking. The different states and their respective transitions can be seen in Figure 2.3.
The algorithms presented above constitute a base TCP implementation. Their use is not a requirement, but it is imposed that any TCP implementation must not be more aggressive than these [9].
2.3 The Hypertext Transfer Protocol
HTTP allows creating client/server services targeted at the World Wide Web (WWW), hiding the implementation details of the services and presenting a uniform interface to the client for making requests, independently of the resources associated with the service.
An HTTP server waits for connections, servicing requests and sending responses [15]. It identifies available resources, and relationships between them, with the Uniform Resource Identifier (URI) standard. A request sends a URI to the server, identifying the target resource for the server to return. The protocol does not define limitations on the nature of resources, only an interface that can be used to interact with them [16], giving the server implementation flexibility.
An HTTP message is divided into a header and an optional body. In the header, a client indicates a semantic method and the URI it applies to, while a server indicates the result of the request as a status code. The body is only used when a request or response requires a payload to be exchanged. The semantic methods indicate what should be done with the identified resource; they are inserted into the message header in uppercase letters. Examples are:
• GET: fetches the current representation of the identified resource;
• POST: the identified resource processes the client’s request.
The full list of existing methods tries to cover all possible use cases of the protocol.
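For illustration, a minimal HTTP/1.1 exchange could look as follows; the host name, path and sizes are invented for the example:

```http
GET /index.html HTTP/1.1
Host: www.example.com

HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 17

<html>...</html>
```

The first block is the client request (method, URI, protocol version, headers); the second is the server response (status line, headers, body).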
Initially, in version 1.0, the protocol made each request/response pair use a single TCP connection: after the server sent its response, it would close the underlying connection [17].
Persistent HTTP (P-HTTP) [18] was a solution for improving web traffic performance that updated HTTP to version 1.1.
Network latency is the biggest bottleneck for web retrieval, and congestion causes a large increase of the rtt. To diminish this problem, unnecessary round trips must be avoided, which the initial version did not do, incurring a bigger delay than needed. To improve on the inherent limitations of the protocol, two alterations were proposed:
• Long-Lived Connections keep a single TCP connection open for multiplexing all the HTTP objects needed. Both client and server keep connections open for future requests, eliminating the need for TCP to go through slow start for each request and decreasing latency.
Within the connection, a new request needs to wait for the previous response before it can be sent. For resource intensive requests, the connection will be blocked for an extended period of time; this is known as the head-of-line blocking problem. In these cases, concurrency is needed.
• Request Pipelining expands on long-lived connections, eliminating the need to wait for the previous response before sending a new request on the same connection, allowing the host to send multiple concurrent requests and reducing latency between responses. The protocol requires responses to be sent in the same order that requests are received [15], so first in, first out (FIFO) ordering must be ensured. This is problematic when a more expensive request is issued: later requests will then suffer from head-of-line blocking, causing the connection to block and increasing the latency of all later requests.
HTTP/2 [19] is a new proposed standard for the HTTP protocol, currently being pushed as a way
of dealing with the increased requirements of the Internet and of mitigating congestion. It addresses
issues such as head-of-line blocking, as we have seen already, and verbose headers, which inflate the
HTTP message's size. The protocol's main change is to let HTTP fully benefit from the use of a single
connection, decreasing the impact of having multiple connections in the network, which can lead to
congestion.
The protocol facilitates pipelining, removing FIFO constraints on concurrent requests, by multiplexing
independent request/response exchanges inside the same TCP connection, using streams to identify
logically different sequences of messages. This way FIFO ordering does not have to be followed and
streams can be independent from each other, so blocking in one does not affect the others.
It also adds other features, as:
• Server Push: a server can send an unsolicited response to a client when it predicts the response
will be necessary. This is used when an HTTP object has multiple dependencies that the client
would otherwise have to request independently, as when a browser fetches an index page and must
then parse it and request every object inside it. This reduces latency by removing the delay of the
client's requests.
• Stream Priority: for managing resource allocation between concurrent streams. Each stream can
be assigned a stream dependency and a weight. The stream dependency defines the parent of the
stream, from which it receives its relative share of resources, but only when the parent is not using
them. The weight defines the share it receives from the parent.
A unit of communication is called a frame, comprised of a header and a variable sequence of bytes.
The frames are exchanged inside a stream. A single HTTP/2 connection can have multiple streams
inside, each one identifiable, and can be opened or closed by either client or server. This completely
removes the need for parallel connections.
Frames can be of different types to serve different purposes:
• DATA: for client request or response payloads;
• HEADERS: for opening new streams and carrying header fields;
• PRIORITY: for defining the stream's priority or dependency (for efficient multiplexing);
• RST_STREAM: for terminating a stream or indicating errors;
• SETTINGS: for declaring connection parameters;
• PUSH_PROMISE: for reserving a stream for a server push;
• PING: for testing connection availability and measuring the round trip time;
• GOAWAY: for stopping the connection gracefully;
• WINDOW_UPDATE: for flow control, limiting what a peer may send when it is constrained.
The operation of normal HTTP is mostly unchanged. A client who wishes to send a request uses a
new stream, which is then used by the server for sending the response. A normal HTTP message is now
divided into frames, with at least one HEADERS frame and optional DATA frames. A single stream behaves
as a TCP connection would in the initial version of HTTP, where each request/response exchange
consumed the entire stream.
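To make the framing concrete, here is a minimal sketch (not from the thesis) of parsing the 9-octet frame header the HTTP/2 standard defines: a 24-bit payload length, an 8-bit type, 8 bits of flags, and a 31-bit stream identifier.

```c
#include <stdint.h>

/* HTTP/2 frame header: 9 octets on the wire. */
struct h2_frame_hdr {
    uint32_t length;    /* 24-bit payload length */
    uint8_t  type;      /* DATA = 0x0, HEADERS = 0x1, ... */
    uint8_t  flags;
    uint32_t stream_id; /* 31 bits; 0 refers to the connection itself */
};

/* Parse the wire format into the struct above. */
struct h2_frame_hdr h2_parse_hdr(const uint8_t b[9])
{
    struct h2_frame_hdr h;
    h.length    = ((uint32_t)b[0] << 16) | ((uint32_t)b[1] << 8) | b[2];
    h.type      = b[3];
    h.flags     = b[4];
    /* The high bit of the stream id is reserved and masked off. */
    h.stream_id = (((uint32_t)b[5] & 0x7f) << 24) | ((uint32_t)b[6] << 16)
                | ((uint32_t)b[7] << 8) | b[8];
    return h;
}
```

Because the length field precedes the payload, a receiver can demultiplex frames from different streams without parsing their contents.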
Cached TCB    New TCB
old mss       old mss
old rtt       old rtt
old rttvar    old rttvar
old cwnd      old cwnd

Table 2.1: Temporal Sharing TCB Initialization
2.4 Optimizations for Shared Path Parallel TCP Connections
In this section we discuss several designs and protocols that were proposed to benefit TCP con-
nections with a high degree of parallelism. To simplify, we assume parallel connections are
connections which share the same network path.
The idea of dependence on shared path TCP connections started with Touch’s TCP Control Block
Interdependence draft [5]. In it, concerns were raised about TCP's per-connection state and its
negative influence on the performance of same-host connections.
For each connection, TCP allocates a TCP control block (TCB) to store its state, such as srtt , rttvar ,
ssthresh, cwnd and MSS. These are the most important for congestion control and the focus of the
draft.
They classified this state into host-pair dependent and aggregate host-pair dependent. The nuance is
that aggregate-dependent state must be divided among the connections, as is the case for the
congestion window, while host-pair dependent state is the same for every parallel connection,
independently of the number of connections in use: MSS, srtt, rttvar.
These dependencies make clear that a linking factor exists between connections, making most of
the per-connection state calculations redundant and introducing unneeded overhead. To minimize
this, Control Block Interdependence proposes two different tactics: temporal sharing
and ensemble sharing.
To build a model for each kind of sharing, they based their design on Transactional TCP [20, 21],
which, in some available implementations, uses a cache for storing state from older TCP connections.
Transactional TCP's purpose is to lower the latency of TCP connections by bypassing the three-way
handshake for previously used connections. This protocol failed to gain widespread deployment and is
now classified as obsolete [22, 23].
The type of state dependency plays an important role in determining how information can be used
by the sharing tactics.
• Temporal Sharing tries to reuse closed connections' state, when available, to initialize a TCB
faster (Table 2.1), where the values are simply copied. In temporal sharing, caching is done
whenever a connection closes or, in the case of MSS, whenever the value is updated (Table 2.2).
• Ensemble Sharing is similar to temporal sharing, but allows interactions between concurrent
parallel connections: a cache is updated often during each connection's lifetime, reflecting their
joint state. Newer connections can copy updated rtt information from other connections,
available in the cache. This benefits newer connections, which can start without a large
delay, and older connections, which are notified of changes in the network through the cache.
Tables 2.3 and 2.4 present a trivial solution for ensemble sharing; the difference between these
and Tables 2.1 and 2.2 is that the former have access to the most up-to-date information, because
the cached information comes from an open connection.

Current TCB       Cached TCB    when          New Cached TCB
current mss       old mss       mss update    current mss
current rtt       old rtt       conn close    old = old + (current - old) >> 2
current rttvar    old rttvar    conn close    old = old + (current - old) >> 2
current cwnd      old cwnd      conn close    current cwnd

Table 2.2: Temporal Sharing Cache Updates

Cached TCB    New TCB
old mss       old mss
old rtt       old rtt
old rttvar    old rttvar

Table 2.3: Ensemble Sharing TCB Initialization
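The rtt and rttvar rules in Table 2.2 amount to an exponentially weighted moving average with gain 1/4, the same form TCP uses for its srtt estimator. A minimal sketch of the close-time update, with struct and function names of our own choosing:

```c
/* Cached per-path state, as in the TCB Interdependence tables
 * (struct and field names are ours, not the draft's). */
struct tcb_cache {
    int mss;
    int rtt;     /* fixed-point units, as TCP keeps them internally */
    int rttvar;
    int cwnd;
};

/* Temporal sharing cache update on connection close (Table 2.2):
 * rtt and rttvar move a quarter of the way toward the closing
 * connection's values; cwnd is simply overwritten. mss is cached
 * separately, whenever the connection's value changes. */
void cache_on_close(struct tcb_cache *c,
                    int cur_rtt, int cur_rttvar, int cur_cwnd)
{
    c->rtt    += (cur_rtt    - c->rtt)    >> 2;  /* old + (current - old) >> 2 */
    c->rttvar += (cur_rttvar - c->rttvar) >> 2;
    c->cwnd    = cur_cwnd;
}
```

The shift by 2 keeps the update cheap in the kernel while still letting repeated closes pull the cache toward recent path conditions.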
The draft enumerates some advantages of implementing TCB interdependence in TCP. First, it
removes the need to multiplex logically different streams into a single connection, as P-HTTP
does to avoid the slow start penalty of opening a TCP connection for every request. TCB
interdependence still provides the same benefits as P-HTTP, but removes the coupling of connections
and, most importantly, moves the concerns to the transport layer.
An initial solution to this problem was better described in TCP Behavior of a Busy Internet Server
[6]. The article studied the performance of parallel connections in a web server, focusing on losses (how
a group of independent connections experiences losses and their combined congestion window after a
loss) and on the increase in bandwidth (examining the ratio between total throughput and the number of
parallel connections). They concluded that multiple parallel connections increase throughput, but make
the protocol more aggressive, and this aggressiveness leads to more congestion and losses.
They proposed changes in the form of TCP-Int, making parallel connections less aggressive, so they
behave similarly to a single connection. TCP-Int provides better loss recovery and start-up performance
for parallel connections, and it only requires changes at the sender to remain compatible with other TCP
versions.

Current TCB       Cached TCB    when          New Cached TCB
current mss       old mss       mss update    current mss
current rtt       old rtt       conn close    rtt_update(old, curr)
current rttvar    old rttvar    conn close    rtt_update(old, curr)
current cwnd      old cwnd      conn close    current cwnd

Table 2.4: Ensemble Sharing Cache Updates - rtt_update denotes the operation of sampling the newest rtt value
To improve loss recovery they devised an integrated congestion control for parallel connections.
This mechanism keeps a single congestion window for all parallel connections, and a loss affects the
shared window, mimicking the effects on a single TCP connection. The unified window also makes
slow start unnecessary for new connections. It can also speed up fast retransmit: a TCP
connection that suffers a loss can use packets received later, from any other parallel connection, as
duplicate acknowledgments.
They created two simplified structures for storing state per host, instead of per connection, shown in
Figure 2.4: each connection is linked to its host, and packets for the different connections are
maintained per host and sent according to a round-robin scheduler.
struct chost {                 /* per-host state shared by its connections */
    Address addr;              /* remote host address */
    int     cwnd;              /* shared congestion window */
    int     ownd;              /* outstanding (in-flight) window */
    int     ssthresh;          /* shared slow start threshold */
    int     count;             /* counter used by the window update logic */
    Time    decr_ts;           /* timestamp of the last window decrease */
    Packet  pkts[];            /* unacknowledged packets, all connections */
    TCPConn conn[];            /* connections grouped under this host */
};

struct packet {                /* per-packet state, kept per host */
    TCPConn *conn;             /* connection the packet belongs to */
    int     seqno;             /* sequence number */
    int     size;              /* payload size */
    Time    sent_ts;           /* time the packet was sent */
    int     later_acks;        /* acks received after this packet was sent */
};
Figure 2.4: Per-host TCP-Int structures
The integrated fast recovery mechanism stores, for each packet sent and unacknowledged, the
number of acknowledgments received after it (later_acks in Figure 2.4). These acknowledgments
are not all duplicates, as in standard fast retransmit, because they may come from any connection
sharing the same host. By doing this, the protocol decreases the number of false timeouts during the
fast retransmit phase, which were due to insufficient duplicate acknowledgments. From their server
analysis, they estimated that 25% of the retransmissions triggered by coarse timeouts could have been
avoided.
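A simplified sketch of this bookkeeping (our own simplification, not the paper's code: it ignores send-time ordering and treats every acknowledgment arriving for the host group as "later" for all other outstanding packets):

```c
#include <stddef.h>

#define DUP_ACK_THRESHOLD 3  /* same threshold standard fast retransmit uses */

/* Simplified view of TCP-Int's per-packet state (cf. Figure 2.4);
 * names are ours. */
struct pkt {
    int seqno;
    int acked;       /* nonzero once acknowledged */
    int later_acks;  /* acks seen, on any grouped connection, after this send */
};

/* Called for every ack arriving for the host group: bump later_acks on
 * all other outstanding packets and return the first one crossing the
 * threshold, which should be fast-retransmitted (NULL if none). */
struct pkt *on_group_ack(struct pkt *pkts, size_t n, int acked_seqno)
{
    struct pkt *retx = NULL;
    for (size_t i = 0; i < n; i++) {
        if (pkts[i].acked || pkts[i].seqno == acked_seqno)
            continue;
        if (++pkts[i].later_acks >= DUP_ACK_THRESHOLD && retx == NULL)
            retx = &pkts[i];
    }
    return retx;
}
```

The point is that acks from sibling connections substitute for the duplicate acks a lone connection might never accumulate, avoiding a coarse timeout.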
In their tests, they compared connections using TCP-Int and SACK [24], a TCP extension for
selective-acknowledgment-based loss recovery. They noticed that transfers using SACK had more
timeouts and worse bandwidth sharing. For TCP-Int, round-robin scheduling allows connections a
fairer share of the network under the single congestion window.
Even though TCP-Int performs better with a single congestion window in the tests performed, it is
hard to draw conclusions about its performance over the Internet (deployed on a server), which is
the primary target of the protocol changes. Their tests were purposely restrictive, designed to
experience constant losses rather than to mimic a large-scale network.
Addressing the Challenges of Web Data Transport [25], building on the previous research on TCB
Interdependence, developed two techniques for temporal and ensemble sharing: TCP fast start and
TCP sessions. They started by defining the needs of the WWW at the time: TCP is designed for long
bursty transfers to maximize throughput, while HTTP transfers are too short for that and latency matters
more than throughput. They came up with the following solutions:
The first solution, fast start, is based on temporal sharing: a cache is maintained to avoid the
slow-start penalty after an idle period. This can be especially important on higher-latency connections.
The cache also stores the congestion window, but they raise the issue that reusing an old congestion
window could be too aggressive on the network. Their answer is a new drop algorithm in routers that
gives fast start traffic lower priority, so it is dropped first if it causes congestion. But this solution would
require changes in all routers; an incompatible router would not discern that the marked packets should
be discarded first and would worsen congestion.
TCP Session is built over TCP-Int, aggregating the parallel connections and providing congestion
control and loss recovery mechanisms (most of these were already discussed), designed with a focus
on HTTP applications. These changes affect only the sending of data, not the receiving, so it is not
required for both endpoints to use the protocol, or to detect whether the other one does.
The changes described in TCP Session are mostly the same as those in TCP-Int, but it goes into finer
detail about packet scheduling across connections, implementing a weighted round-robin scheduler
to better distinguish between differently privileged connections. They claim that the connections'
weights can change dynamically, which could solve the scheduler's problems with differently sized
packets per connection, but this is not explained.
In their simulation tests, they compared their approach to persistent HTTP and to independent TCP
connections. They arrived at similar results between TCP Session and P-HTTP, but for independent
connections they noted an increase of 30% to 40% in packet loss under higher congestion.
For moderate loads, Session performs 20% to 25% better than P-HTTP because of its changes to fast
retransmit.
One fault we find in their work is that the slow start solution is not explained thoroughly; since TCP
Session should benefit short connections, slow start should play an important role in minimizing their
delay.
Effects of Ensemble-TCP [26] also pursued ways of adapting the initial design for TCB interde-
pendence. Their Ensemble-TCP architecture is capable of both temporal and ensemble sharing,
providing a shared structure for parallel TCP connections. The biggest divergence is that they do not
differentiate much between the two, which are tightly coupled in the architecture. To explain its design,
they go through the different components and the respective thought process.
The authors start by defining the state that should be cached, because of its initial performance
cost. A misestimated rtt is costly; by default, TCP connections use a conservative value. An initially
high value is important for unknown high-latency connections, but a high delay can also be caused by
packet loss, so real losses take time to be detected and dealt with. The same happens with the
congestion window: it starts at a low value and increases exponentially each rtt. This conservative
start hurts the throughput of connections, especially short ones, but it can be improved by caching.
For grouping connections into ensembles, they differentiate between hosts, but they suggest this
could be extended to grouping by subnet instead (in that case, the delay between links in the same
subnet would have to be negligible). As in TCB Interdependence, state is shared differently, and some
of it needs to be divided between the ensemble connections. In their case, for sharing congestion
control information, a priority scheduler with 4 different policies was chosen.

Variable        Description
r_rtt           round trip time
r_srtt          smoothed round trip time
t_rttvar        variance of the round trip time
snd_cwnd        congestion window
snd_ssthresh    slow start threshold
members         associated connections' TCBs
osegs           unacknowledged packets

Table 2.5: part of the ECB structure definitions
The use of temporal sharing depends on the stability of the stored values, because the network is
constantly changing, and using cached values can be too aggressive if network conditions deteriorate.
The same problem was raised previously for TCP fast start, whose method for dealing with it was
flawed. Ensemble-TCP proposes an aging mechanism to avoid it, even though the authors fail to
describe one in the paper. We can assume such a method would make cached values converge to
default ones over time.
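The paper leaves the aging mechanism unspecified; one plausible shape, purely our assumption, is a periodic decay toward the protocol defaults:

```c
/* Hypothetical aging step (the paper does not describe one): on each
 * tick, move a cached value one eighth of the way back toward its
 * conservative default, so stale measurements lose influence over time. */
int age_toward_default(int cached, int deflt)
{
    return cached + (deflt - cached) / 8;
}
```

Applied on a timer, this makes a long-idle cache indistinguishable from a fresh connection, which is exactly the safe fallback behavior one would want.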
The structure used for caching state is called the Ensemble Control Block (ECB) and it can be in two
different states, active or cached, depending on whether any connection is associated with it. In it, they
store the TCP state required by a single connection plus their own variables for supporting multiple
connections. A representation of this can be seen in Figure 2.5. A new connection will try to use a
cached ECB, or create a new one if none exists yet for that specific host. It then creates a modified TCB
that is stored inside the ECB; the new TCB references values directly from the ECB and has an added
priority field.
For congestion control it applies the same algorithms seen in TCP-Int: a shared congestion
window, where an ack increases the shared window and a loss decreases it, and shared fast recovery,
where other connections' packets can be used as duplicate acks after an initial timeout.
An Integrated Congestion Management Architecture for Internet Hosts [27] described changes
in Internet traffic patterns which could, in turn, threaten the long-term stability of the Internet. A novel
framework is introduced, capable of controlling network congestion from an end-to-end perspective, that
allows:
1. Multiplexing parallel flows to ensure proper congestion behavior;
2. Adaptation of application and transport protocols through an API.
The article has a wider focus, giving applications control over some transport layer concerns: the ability
to track and adapt to different bandwidths and to congestion, at the cost of removing abstraction between
protocols and increasing coupling. With this, they get a framework that is independent of transport and
application protocols.
For managing different flows of data with different needs, a Congestion Manager (CM) acts as a
central point, maintaining network statistics and scheduling outgoing transmissions according to formal
congestion control mechanisms, instead of having streams act independently of each other. Applications
use shared state learning to share network information along common paths.

Sender      cwnd = cwnd − α × cwnd
Receiver    cwnd = cwnd − α × cwnd / (N − 1)

Table 2.6: TCP/DCA-C Congestion Window Updates
Their CM implementation is divided into two modules, one for sending and another for receiving.
The sender side schedules data transmissions for all connections sharing the same links; the receiver
stores statistics about losses.
All network concerns are centered in the CM and state is stored in a centralized manner, including
TCP's congestion avoidance and control mechanisms, which are used by default for all communications.
The CM acts on receiver feedback, mainly packet losses, to estimate network capacity, independently
of the transport protocol.
Their design of a web server over the congestion manager is the most closely related to our work.
A client requests objects from the server, and the CM can control how bandwidth is divided between
them. It can also provide an adaptive solution, where the same object can be requested at a specified
quality, to increase context-based performance.
In Collaborative Congestion Control in Parallel TCP Flows [7], the authors propose TCP/DCA-C
as a different way of sharing state between parallel connections, using a delay-based congestion
avoidance scheme (DCA) where events are shared collaboratively (C) between flows. Again, the idea of
sharing per subnet is hinted at, but they keep host-specific groups in their protocol. For detecting
congestion, each flow calculates a threshold for the rtt: T = rtt_min + λ × (rtt_max − rtt_min), where λ
is a constant and the rtt values are taken from the connections' rtt history. An rtt higher than the
threshold (T) indicates impending congestion. This allows the first connection in a group that
experiences a larger delay to quickly share it with the others, providing more accurate congestion
windows for other group members.
The protocol behaves like an ensemble, grouping same-host parallel connections, but without any
kind of caching; the group is used only for direct communication between flows. When rtt > T in one
flow, an event is signaled to all other flows, and every other flow in the congestion avoidance phase
reduces its congestion window if the originating flow had a lower window. The congestion windows are
decreased by a factor of 0.125, given by the parameter α. This value is low, compared to other
congestion avoidance mechanisms where the decrease would be 0.5, because the event only signals
an increase in latency and not a loss, which would be worse, so the decrease does not need to be as
pronounced. For the flows on the receiving end of an event signal, the congestion window adjustment
is smaller than that of the event sender (Table 2.6); it is inversely proportional to the size of the group
(N) to compensate for cases where more flows experience imminent congestion and also signal delay
events.
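The threshold test and the asymmetric window updates of Table 2.6 translate directly into code; this sketch (function names are ours) uses the paper's α = 0.125:

```c
/* Impending-congestion threshold: T = rtt_min + lambda * (rtt_max - rtt_min). */
double dca_threshold(double rtt_min, double rtt_max, double lambda)
{
    return rtt_min + lambda * (rtt_max - rtt_min);
}

/* Flow that detected rtt > T (the event sender): cwnd -= alpha * cwnd. */
double dca_cwnd_sender(double cwnd, double alpha)
{
    return cwnd - alpha * cwnd;
}

/* Flows receiving the event: the decrease is scaled down by the group
 * size N, anticipating that several flows may signal near-simultaneously. */
double dca_cwnd_receiver(double cwnd, double alpha, int n)
{
    return cwnd - alpha * cwnd / (n - 1);
}
```

With α = 0.125 and a group of 5 flows, the sender cuts its window by 12.5% while each receiver cuts only about 3%, so the aggregate reduction stays close to a single flow's.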
2.5 Multiplexing Parallel TCP flows
A recurring topic of discussion is multiplexing parallel TCP flows. Multiplexing groups the different
streams into a single one, reducing the redundant aggressiveness associated with multiple
independent flows sharing the same path. This aggressiveness comes from each connection having
to do slow start and experience losses individually. It is an application-level solution: the underlying
protocol layers are unchanged, and it is implemented per application.
Implementing a multiplexing solution is constrained by the TCP layer, because TCP is unaware of
the multiplexing. All flows the application differentiates are treated equally by the single TCP
connection. If the application requires more from the transport protocol (using different ports on the
same host or increasing throughput), it must fall back to parallel connections.
From the literature, some drawbacks are enumerated related to the use of most multiplexing solutions
(as in the case of P-HTTP):
1. Protocol changes are done per application.
2. Each application requires an independent TCP connection, different applications can’t multiplex
into a single connection.
3. Multiplexing adds coupling between independent streams that share the same connection. A loss
or delay will affect all the objects in the connection.
4. The maximum throughput available to a multiplexed connection is the same as that of a single
TCP connection.
The biggest advantage of parallel connections is, as we have seen, that a client's throughput can be
increased by increasing the degree of parallelism, at the cost of increased aggressiveness [6]. Adding
an ensemble technique makes these parallel connections only as aggressive as a single one,
conforming to the TCP Reno standard, without compromising the higher throughput of parallelism.
2.6 Protocols Comparison and Analysis
For a detailed analysis of the different protocols, we focus only on the ensemble sharing part of the
reviewed work. Starting from the beginning, we compare the distinct mechanisms in use, giving a
subjective appreciation where necessary to reinforce flaws. The protocols are: TCB Interdepen-
dence, TCP-Int, TCP Session, E-TCP, CM and TCP/DCA-C.
The different protocols share similar tactics for introducing efficient ensemble sharing to TCP. TCB
Interdependence introduced the idea, but has less architectural detail and no working implementation.
It details storing the rtt, MSS and rttvar for a connection as soon as possible in a central cache, so that
they can be accessed by other connections that share the same path.
TCP-Int, TCP Session and E-TCP are the closest to TCB Interdependence, and the easiest to
differentiate. But, unlike TCB Interdependence, they choose to cache directly in the TCB, which then
acts as a common structure between same-path connections that can be updated directly whenever
any flow detects a change in the network. This works better because no extra logic is needed for
connections to decide when to cache and update their state, and the state is always kept up to date.
Because these designs use a common structure, the srtt, cwnd and ssthresh are also stored, in addition
to the state already seen. One distinction can be made: Ensemble-TCP uses a structure for each
TCP connection and then associates that state with the common structure. This allows connections
to have independent parameters, such as the connection priority, giving it more flexibility as to what
can be stored and future-proofing it for subsequent updates. We question the usefulness of this,
compared to the others' simpler design. Of all the protocols, TCP/DCA-C is a special case: it only
groups connections so that there can be signaling between them; no state is shared directly, only
when signals need to be sent.
As for the Congestion Manager, because the paper does not fully explore its internal architecture,
there is no information on the details of its congestion mechanisms for parallel connections, making it
impossible to dissect thoroughly. It does provide insight into the performance tests used to compare
the CM protocol to a TCP implementation, to check whether it achieves similar performance and can
compete fairly with other TCP implementations.
The main goal of most of these algorithms is a shared congestion window for same-path
connections, where changes to the window are reflected in all group connections. TCP-Int and TCP
Session act as a single connection for the same path, with a single congestion window and a single
queue where packets from the different connections are enqueued. Scheduling-wise, for the former we
only know it uses a round-robin variant; the latter uses weighted round-robin with dynamically set
weights. E-TCP does not stray from the formula, using a ticket-based approach that reflects the relative
share of the congestion window, according to the priority given to each connection. The CM also uses
a round-robin scheduler, but stresses that schedulers are interchangeable. Then there is the later-acks
mechanism introduced by TCP-Int, also used in E-TCP and TCP Session, which speeds up fast
retransmit.
Later acks and congestion window sharing prove that standard TCP algorithms can be reimple-
mented to benefit from ensemble sharing, at least those which are path dependent. It remains to be
seen whether more algorithms can be adapted the same way.
2.7 Ensemble Sharing Considerations
Ensemble sharing solutions have failed to gain adoption since they were first proposed. We were
unable to find any stated reasons, but there are a few disadvantages that we assume led to it:
1. At the time, the use of parallel connections was uncommon. A server in the year 2000 found only
44% of clients making parallel connections [4].
2. The implementations do not conform to the TCP/IP protocol stack, where the abstraction between
the network layer and the transport layer should be well defined. The protocols need to group
connections according to IP-specific concerns, and that breaks the abstraction. This makes it
harder to push for the protocol's implementation in operating systems, whose network code relies
on those abstractions.
3. NAT interfaces present a risk: a single client behind a NAT may completely deny connections to
the same server for other users in the same private network, by purposely delaying acknowledg-
ments or making its connection time out prematurely.
Chapter 3
Architecture
In this chapter we provide an overview of the chosen architecture, detailing its constituent parts
and how it addresses the stated objectives. In section 3.1, we describe our architecture's require-
ments. Sections 3.2 and 3.3 explain the mechanisms that compose the architecture, which we divide
into connection grouping and congestion control. Section 3.4 gives a short summary of the
entire chapter.
3.1 Requirements
As discussed previously, parallelism is often the preferred option for increasing the throughput of TCP
connections to a single host, but, as a consequence, multiple connections between the same hosts
add per-connection redundancies, resulting in increased aggressiveness inside the network. This, in
turn, increases packet losses, diminishing the maximum throughput available in the network. Even so,
as parallelism is a must for most HTTP applications, enabling TCP to deal with it effectively can be an
important factor in reducing the overhead present in modern traffic.
Not only are the solutions presented above, in chapter 2, relatively old, but web traffic has changed
completely in the last decade. With the rise of dynamic content, traffic bursts became more
characteristic, which forced browsers to increase their upper limit on parallel connections [3]. It remains
to be seen whether the kind of solutions described are, now more than ever, appropriate for the new
needs of the Internet.
The base idea is simple: group connections and share events inside the group. This is based on how
TCP/DCA-C used concurrency.
• To group connections we take into account the geographical location of the destination hosts
and their delay. We use the hierarchical nature of the IPv4 address to group connections in close
proximity, and the round trip time to deal with path and delay inconsistencies.
• Same group connections can share events, signaling the entry of a new connection, losses and
exits; and providing new estimates for the congestion window and slow start threshold.
The system is comprised of two parts: Hydra, a data structure that receives network information from
different connections and assigns them to groups; and Heracles, the congestion control mechanism
that shares events between connections of the same group.
3.2 Hydra: Connection Grouping
Hydra can aggregate multiple remote hosts by their Internet Protocol version 4 (IPv4) addresses using
an implementation-defined mask. In our approach we assume that 24-bit masks are used. There is no
certainty that any two addresses sharing the same first 24 bits are indeed topologically close, or that
hosts with completely different addresses are not close, but this is the most efficient way for the sender
to check for potential group matches.
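The /24 grouping check reduces to masking off the host byte of the address; a trivial sketch (the function name is ours), with the address in host byte order:

```c
#include <stdint.h>

/* Hydra group key: the first 24 bits of the destination IPv4 address,
 * matching the /24 mask assumed above. */
uint32_t hydra_subnet_key(uint32_t addr)
{
    return addr & 0xffffff00u;  /* zero the host byte */
}
```

Two destinations match the same hash-table slot exactly when their keys are equal, which is a single AND and compare per lookup.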
When a connection is inserted into a group based only on its IPv4 address, a problem arises: we
have to assume that all connections in the group have the same delay, which, on the Internet, is
impossible. As Jacobson described it, the Internet is a "network of unknown topology and with an
unknown, unknowable and constantly changing population of competing conversations" [1]. Basing our
work on assumptions can reduce system complexity and improve performance for most hosts, but
false positives (when a connection is wrongly inserted into a group with different network requirements),
even if rare, can be problematic. Suppose different rtt samples are taken from different connections:
a disparity in rtt estimations causes wrong rttvar and srtt calculations, which, in turn, yield wrong
timeouts, decreasing throughput. For groups of connections with the same IP address, we could have
the same problem: the address could be a NAT interface hiding multiple hosts, and we would need
to assume that the internal network latency is always negligible. It should be noted that, even if the
internal network latency were negligible, the grouping is still vulnerable to connections that time out, or
to ill-intentioned users who delay sending acks to inflate the timeout estimate.
This could be handled with prior knowledge of the network, using the protocol only
for specific trusted hosts known to have similar path delays. Instead, we include a fail-safe mechanism
that detects the problem above, making the protocol functional independently of the network topology
and minimizing the effects of false positives.
The hydra structure is managed by a simple interface for adding, removing and updating groups
from the congestion control module. An interval comparison function is used to make decisions for each
group, based only on the latest rtt sample. Whenever a group is to be picked, the hydra structure is
accessed (figure 3.1). A hash table separates different 24-bit subnets according to the connections'
IPv4 addresses. For each position a binary tree is stored, in which groups are sorted by the last rtt sample
taken from a connection. The group's information can then be accessed during congestion control
and be updated with new information. Initially a connection has no group; once enough information
is gathered from the connection's path, it tries to find a group or create a new one. A group
is determined according to the path's round trip time alone, for which the connection first collects a minimum of 3
acknowledgments before trying to find a group or creating one. There is no specific reason for the
choice of 3 as the lower limit of acknowledgments; it was chosen as a way to make the
Hydra group: Group size | Subnet | Group rtt | Total ssthresh | Total cwnd | Event timestamps (JOIN, LOSS, LEAVE)
Table 3.1: Information stored in each hydra group.
Figure 3.1: Representation of the hydra structure, composed of an externally linked hash table and binary trees, where each leaf represents a hydra group, which the Heracles structure points to.
connection increase its sending rate before joining a group, allowing the network to stabilize a bit before
making the assumption that the connection shares a path with other connections. Group information
is described in table 3.1, and is later accessed to calculate network latency estimations. The
subnet is stored so that connections changing groups don't need to traverse the initial hash table more
than once.
After each rtt sample is received, the connection checks whether it needs to change group, resetting group-specific
information and finding a new group. When the connection stops transmitting, it performs any
necessary group cleanup. The operation required to change group is heavily simplified in our implementation.
Our assumption is that if a connection is inside a group, then the rtt interval calculation doesn't
need to be strict and should be quick to verify. We only check whether the rtt sample received falls
between 0.5 and 1.5 times the group's last round trip time. The calculation is coarse, but enough to only
be affected by big changes in the network.
3.3 Heracles: Congestion Control
Heracles is the congestion control algorithm, based on Reno. Its base operation is closely
related to Reno's, with added complexity for dealing with the multiple connections in each group, so that
a connection inside a group can send and receive events. There are different possible execution
paths according to whether the connection is in slow start or congestion avoidance:
• During Slow Start a connection can skip the phase entirely if it has already found a group; otherwise it behaves the same
as Reno and increases the congestion window exponentially.
• During Congestion Avoidance a connection always increases linearly. Afterwards, it tries to find a
group if it hasn't already, and updates the group's total cwnd and ssthresh values.
In either of the previous states, connections check for the latest group events and deal with
them accordingly. There are 3 possible events that force cwnd and ssthresh changes for the entire group: losses,
joins and leaves. For losses, all connections update their cwnd and ssthresh values to the estimated
ssthresh value of the connection that transmitted the event (the one that suffered the loss), decreasing both values.
During leaves, the excess cwnd is split equally across the remaining connections. On reception of a
join event, connections only update the ssthresh; there is no congestion window decrease in this case,
even if the joining connection starts with an increased cwnd. There is the possibility of performing
some decrease, which could be beneficial, making it easier for newer connections to grow. Our
evaluated implementation doesn't do it, as a way of mimicking normal connections and behaving better
in concurrent environments.
The Reno cwnd increases remain unmodified in Heracles. For N connections
inside a group, the total window rises N times as fast as a single independent connection, as all
connections search for bandwidth concurrently, whether exponentially during slow start or linearly in
congestion avoidance.
When a loss happens in Reno, the single connection which perceived the loss halves its ssthresh.
In Heracles the same happens, but the reduction in that single connection is split equally between the
connections in the group: instead of a decrease of 1/2, there is a decrease of 1/(2N) (for N connections in
the group). From the group's viewpoint, only one connection suffered the loss, as with Reno, but the decrease is shared
equally, to promote fairness. This goes against Van Jacobson's stand on multiplicative decrease,
halving the cwnd during congestion [1], and doesn't agree with previous implementations,
which made losses halve all connections in a group [6, 26]. We argue that this approach is better for
the current state of the Internet, where most hosts using multiple parallel connections only decrease
the connection that suffered the loss, because no information is shared between the different connections. Against
competing protocols, the usual aggressive approach would put at a disadvantage a conservative Heracles
that reduced too quickly, as others would take the decrease in throughput as a chance to
steal more bandwidth. This can be seen in the interactions between Vegas and Reno, where Reno
is more aggressive and ends up using a higher percentage of the bandwidth: on the network, Reno tries to
overflow buffer queues, leading to losses, and Vegas sees this as a reason to reduce its own sending
rate [28].
24
A connection can leave a group for 2 reasons: when it has nothing more to send or when it receives
a round trip time sample outside of the group's interval. In both cases, it produces an event sharing
its window information with the group. When it changes group, its own window information remains
unaltered, so that the connection doesn't have to restart itself in the new group.
During a join, a connection fetches a ssthresh estimate, which it uses to update its own cwnd
and skip the slow start procedure entirely. The connection can then start sending without having to
find the initial slow start value, suffer a loss and recover from it. Skipping
slow start removes the round trips it requires, during which the connection has a
bottlenecked cwnd and purposefully limits its own throughput.
The mechanism used for sharing information is not perfect, because of the use of events. Events are
not consumed immediately after being emitted, only when execution control is passed to the recipient
connection. This is slightly problematic: in the time it takes for a connection to read an event, the
network state may have changed, making the event information useless. The delay until other
connections receive the event may be enough to provoke network congestion, especially with a high
degree of concurrency in bigger groups. This is a limitation of doing shared congestion control on top of
Linux's network stack, which we discuss further in chapter 4. Events are prioritized (from most
important to least important) as: loss, join and leave. Losses are the most important, because they always
force a window decrease. Joins only require the ssthresh estimate to be updated. A leave event is the
least important, because it increases the connection's window, and it is the only event where our protocol
forces a window rise. To simplify the event sharing platform and minimize the performance overhead,
connections only deal with the last highest-priority event received and all other events are ignored.
This solution is similar to what was already seen in TCP/DCA-C. Using events allows the congestion
control to be built separately from the transport protocol, without having to modify the kernel directly.
3.4 Chapter Summary
In this chapter we presented a general overview of our protocol's inner workings, detailing the two
mechanisms that compose it and explaining the reasoning behind the decisions taken in its design.
• Hydra, for grouping different connections into a single entity, based on a common path to a subnet
and a similar rtt. Shared information is calculated to be later read by other connections in the
same group, and is constantly updated to provide the best network estimates.
• Heracles, through which connections interact with the kernel: receiving network-specific information
and sharing it in their respective groups, then deciding on a network state for all connections in
each group.
We explained the way different connections communicate with each other, storing events in the group
to be accessed later by others. Events can come from a new connection, from a connection that left,
or from a connection that suffered a loss. We then presented how connections individually react to each
of these events, changing their network values to accurately represent the shared state. We also explained
how connections deal with the rtt samples they receive, adapting themselves to changes
in network congestion by partitioning into different groups. This way, connections can defend themselves
against abnormal network states, where connections in a group with supposedly close network
proximity provide disparate values.
Chapter 4
Implementation
In this chapter, we go through important aspects of the protocol implementation. In section 4.1, we
provide information on the tools used to build the working implementation. In section
4.2, we discuss details of the implemented protocol.
4.1 Implementation Options
This thesis' work targets the Linux kernel specifically. The possibility of using network simulators
was discarded in favor of building the protocol directly in the kernel, as a real use case
implementation. With a simulator we would have more control over the state of the network and could make
time-agnostic tests, simulating the connections' delay; this would allow tests to iterate over the protocol's congestion
control more times, facilitating the evaluation. Even so, coding a direct software implementation
for the Linux kernel gives us results pertaining to real-world use, in a real operating system,
without having to port code built specifically for a simulator, with different restrictions from which end
results may differ.
4.1.1 Linux Kernel
Linux is based on the Unix operating system. Created by Linus Torvalds and maintained by a team of
remote contributors across the world 1, the kernel is almost entirely coded in the C programming
language, with the code base openly available.
Kernel code is different from normal user space code, which uses specific user libraries and syscalls
(system calls) to communicate indirectly with the kernel. User space code is not as prioritized as kernel
code, is limited to the provided higher level libraries and has worse performance from having to change
the operating system context (between user and kernel) when using syscalls, but is much safer to run.
The sanitization of user space input when changing context prevents bad user space code or inputs
from stopping the operating system's execution. The same control is not entirely possible in kernel code,
where it is easier to have it panic and crash.
1 git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/README?id=refs/tags/v3.16.37
The kernel provides an IPv4 implementation for socket communication and, with it, a complex TCP
state machine for managing the sender's transmission, as defined by the TCP standard. TCP has loose
rules pertaining to the sender's desired flow control: the rates of sending and retransmitting are not defined
in the describing RFC. These rules are mostly implementation dependent 2.
As for our protocol, the original intention was to code it within the kernel. However, this quickly
became a problem due to the sheer complexity of the kernel network stack. A straight kernel implementation
would yield the best performance, but it would require modifying the TCP stack and possibly
the IP stack. Direct changes to the kernel code would most likely break the normal implementation
with the added functionality, adding side effects in the process and severely increasing the development
time spent debugging.
Some decisions must be stated: we chose to target version 3.16 of the kernel. Major
version 4 is the most recent, but version 3 still sees widespread use, and from it we chose minor version
16. The major version update brought small changes to the module interface, but it remains
backwards compatible; other than that, different versions have small differences in the code, for which
results may differ. We make no assumptions about the protocol's effectiveness for versions different from our
own.
All referenced documentation and code in our work is from that same version, taken from the Linux
code base, which is accessible from public repositories3.
4.1.2 Linux Modules
Linux allows for the creation of dynamically loaded code in the form of modules. As of kernel version
2.6.13, the TCP implementation allows pluggable congestion control modules.
Congestion Control Modules
The Linux kernel allows the congestion control mechanism to be linked from outside of the core kernel,
as a kernel module. Building the congestion control as a module reduces the interactions between the
network protocol itself and the congestion protocol, making it less error prone. All kernel congestion
control mechanisms are built as modules, except Reno, which is hard coded into the kernel and used
as a fallback in the absence of any other congestion control mechanism. By default, version 3.16 of the
kernel uses the Cubic algorithm4 [29], a congestion control protocol designed for fairness with competing
protocols, whether on short or long rtt paths.
An interface is defined for calling congestion control functions. At a minimum it requires only 2
functions, ssthresh and cong avoid. The first is called upon a loss and returns a new value for
the connection's ssthresh; Reno reads the congestion window and returns half its value. The latter is
called when any number of packets is acknowledged (the kernel doesn't call it for each individual
ack). Reno compares the cwnd with the ssthresh, determining the connection's state, and processes either
2 git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/Documentation/networking/tcp.txt?id=refs/tags/v3.16.37
3 git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree?id=refs/tags/v3.16.37
4 git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/Documentation/networking/tcp.txt?id=refs/tags/v3.16.37
the exponential or linear window increase. Implementing the protocol as a module is the easier option,
though it has some minor drawbacks:
• The scope of the provided interface only allows operations over a small number of settings that
influence congestion control, as is the purpose of the module. As an example, we don't
change the rtt calculations performed in the kernel directly, at the risk of breaking the TCP
protocol completely. Fast retransmit is also out of reach from the module.
• Path information can't be shared directly between connections. Connections in groups can only
perform Heracles-specific calculations from inside the congestion control module, increasing the
time between one connection modifying the group state and all others updating.
4.1.3 Kernel Debugging
Debugging at the kernel level is not easy, requiring some setup. Most options consist of running a
debugger against a virtualized kernel, though there are simpler options such as printing debugging
information and having errors logged.
Logging
The printk function allows kernel code to print formatted strings into the kernel ring buffer, from which
these messages can be read, with appropriate timestamps. It has the signature int printk(const
char *fmt, ...) and the first argument can receive a prepended keyword characterizing the message's
logging level5. Printing always adds performance overhead, which is noticeable in network code that
needs to be efficient. It can be used to log values directly from the congestion protocol, from which we
can graph the algorithm's variables, checking if it behaves as expected.
We added a printk line to log information on the state of each connection whenever TCP called
the cong avoid function to increase the congestion window. On each log line we stored: connection
and group identifiers, TCP information and a timestamp. For each entry, we trimmed unnecessary
information, plotting the cwnd and the ssthresh of the different connections. For the evaluation portion
of the work the logging was removed to improve performance.
Kernel Oops
A kernel oops is a kind of error that makes the current process die, but is not severe enough to crash the kernel
itself. It can be caused by a module trying to access an incorrect memory position, or by a call
to the BUG or BUG ON macros, which are used for making code assertions. Whenever a
kernel oops occurs, the kernel task being executed is killed and the error that led to the oops is logged
to the kernel ring buffer with a call stack, some disassembly and a register dump. After this, the
module stops working and can't be unloaded unless the system is rebooted. In the log, we find the
function where the oops happened and the machine code offset of the offending instruction. If the kernel
5 git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/include/linux/printk.h?id=refs/tags/v3.16.37
is compiled with the CONFIG DEBUG INFO flag, debugging symbols are available, allowing the kernel
oops to indicate the specific line in the source code that caused the error.
Kernel Panic
Kernel panics are errors from which the kernel can't recover, leading to a crash. Debugging a kernel
crash is harder, because information isn't logged as it is for a kernel oops, and memory is flushed from
the system, becoming unrecoverable. However, a tool such as kdump6 helps recover crash dump data,
from which we can debug the kernel problem that triggered the crash.
4.1.4 Scripting
Scripting allows us to reduce the time it takes to perform a set of actions repeatedly. In our case, we
resorted to scripting to reduce the overall time spent compiling, running the protocols and evaluating
them. We used make and python.
Python
Python is an interpreted dynamically-typed programming language, with an emphasis on readability and
ease of programming. We used python 2.77 for:
• Starting external processes with the subprocess module;
• Automating the creation of different TCP client processes: using the threading library we could
start concurrent threads opening connections to a TCP server and control their expected behavior;
• Regex pattern matching, which we used to automate the creation of incremental logging files;
• Parsing and trimming of logging files.
Most high-level languages offer the functionality listed above. Languages like C take a lot of
boilerplate code, and most of the default libraries are low level, targeting mostly systems programming.
On the other side of the spectrum, interpreted languages like ruby8, python and lua9 are all dynamically
typed, performing no type checking at compile time, and offer high level programming constructs.
They are easier to develop in and can cut a large chunk of development time, though at the cost of
lower performance when compared to compiled languages. We chose python over the others because
it has a low learning curve and comes packaged with most Linux distributions.
6 git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/Documentation/kdump/kdump.txt?id=refs/tags/v3.16.37
7 www.python.org/download/releases/2.7/
8 Ruby language: www.ruby-lang.org/en/
9 Lua language: www.lua.org/
Make
Make is used for program compilation and building; we used it for compiling Linux kernel modules. It's
useful because it takes care of dependencies between files and can check on its own which files require
recompilation, without compiling everything again.
4.1.5 Iperf
Iperf10 is a simple tool for creating TCP data streams between hosts. A client connects to a server and
can transmit a set amount of data, or transmit data over a number of seconds, pushing the limits of the
TCP window.
Initially we tried using our own implementation of a TCP server and client, but the connection constantly
failed to make use of the capacity of the network. The sender couldn't inject enough data into
the network, making the connection window stall during slow start, never losing a packet. In Linux,
the window stalls if the congestion window is more than twice the number of packets in flight, in which case the exponential
and linear increases aren't processed. This prevents the connection from growing the window
indefinitely without using it and then suddenly filling the window, injecting more packets than the
network can handle, which would cause network congestion. Iperf doesn't suffer from the same
problems. It can stress the network to force losses and a transition to congestion avoidance. It comes with
some other features too: it can transmit using a specific congestion control algorithm and output data
as a comma-separated values (csv) file, from which we can graph the throughput. Iperf is available in 2
different versions: there is a choice between version 2 and version 3.
Iperf3 adds some changes over the previous version; most importantly, it is able to create
parallel connections between the client and the server, reverse the data flow (which, by default, goes
from the client to the server) and log retransmission information. Unfortunately, its server can only handle
1 client at a time. This is a major disadvantage for our evaluation, which requires different clients to
connect to the server at any time, with precise control of each client's lifespan. For that reason
we chose version 2 of the tool.
4.1.6 Netkit
Netkit11 is a lightweight tool for creating multiple virtual machines to test network applications. It is not a
simulator; it only provides a virtual infrastructure for creating the closest alternative to a physical network.
One of this work's objectives was to implement a real use case protocol for the Linux kernel. It is not
possible to test it on a simulator, as it would have to be adapted, and even then the simulator would
need to use Linux's network stack to accurately portray its behavior. Using Netkit, we can run the tests
in a controlled virtual environment that is easily deployable. The biggest drawback to its use is that it is
only available on some older specific kernel versions, which makes it harder to access and download
the tools required to compile and run the module from the available repositories. To work around this, we created
10 Iperf website: iperf.fr/
11 Netkit homepage: wiki.netkit.org/
a network interface from the physical machine on which Netkit is running to Netkit itself. The virtual
machines only need to run the servers (which receive data), while the physical machine runs the clients,
where the congestion protocols are tested, acting as the data sender.
4.1.7 Tc
Tc is a complex tool for managing traffic control in the Linux kernel. It allows us to limit network interfaces
by imposing queuing disciplines, which take decisions on the packets scheduled for specific interfaces. In
our case we used a token bucket filter (tbf), a discipline for limiting traffic rate. With it we can
bottleneck the available bandwidth and add delay to the link.
4.2 Heracles Module
The subsystems that compose the module were already detailed in the previous chapter. Here we
present an in-depth explanation of how the module interfaces with the kernel and the Reno algorithm.
In total, five functions are defined to interface with the kernel through the tcp congestion ops structure12.
We've already discussed the ssthresh and cong avoid function pointers, which are the minimum
requirements for the TCP stack. We also used three optional functions:
• pkts acked - notification of a new round trip time calculation;
• init - startup behavior for new connections;
• release - cleanup behavior for connections.
The module isn't implemented as a universal bridge between the congestion protocol and the
tcp congestion ops interface; instead it encapsulates Reno, tying itself to Reno's interface with the kernel. The
downside is that our module can't easily bridge other congestion protocols: the implementation
would need to be adapted for each one. We could have implemented a middleware algorithm, agnostic
to the congestion control module, but it would introduce significant performance overhead, which
should be avoided when dealing with a network algorithm.
The ssthresh function is modified from Reno to check if the connection is inside a group. A connection
inside a group doesn't halve its congestion window; it only decreases it by 1/(2N) (for N
connections inside the group). We update the expected ssthresh value, store an event for the loss and
return the new ssthresh estimation.
The cong avoid is the main operating function of any congestion algorithm. In ours, as in Reno,
the cwnd is compared with the ssthresh to determine the congestion state: if the window is lower than
the threshold, the connection performs slow start; if equal or higher, it performs congestion avoidance. During slow
start, the connection tries to find a group once it has the minimum number of acknowledgments. If there is no
group, protocol execution is transferred to Reno, which manages the window individually. When in congestion
avoidance, the algorithm always performs Reno's window calculations for the connection itself. The connection
12 git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/include/net/tcp.h?id=refs/tags/v3.16.37
Figure 4.1: Diagram of the Heracles cong avoid function.
Figure 4.2: Diagram of the Heracles pkts acked function.
doesn't need to have a group during congestion avoidance, but it keeps trying to find one. If there is
a group, the connection starts receiving that group's events and updates its window information in the
group. Figure 4.1 displays a diagram with the main operations performed by the function.
The pkts acked function receives a new rtt sample and does a fast interval check to determine if
a group change is needed (Figure 4.2). A more detailed approach would use a mix of different
metrics, because this problem bears some similarity to the rto calculation problem: there, higher degrees of
congestion were not taken into account and the algorithm failed to approximate the timeout correctly
[1]. For our protocol, if the interval is too small, group churn will be higher, decreasing performance; if
the interval is too big, connections with diverging paths will share bad information in the group, which may
increase the injection of packets into the network, inducing congestion and, consequently, losses. We don't
focus on this problem, but are aware of what it entails.
The init function initializes the heracles control structure. This structure primarily stores the current
group and event timestamp counters.
The release function handles the heracles structure exit routine, checking for an existing group and
removing it, emitting a leave event.
As to the interaction between Heracles and Hydra, Hydra exposes an interface that abstracts its
internals from Heracles and allows control over the following operations:
34
• bool hydra_remains_in_group(struct heracles *heracles)
quickly calculates if the protocol is going to change group from the update;
• struct hydra_group *hydra_add_node(struct heracles *)
returns the group for the respective Heracles structure;
• struct hydra_group *hydra_update(struct heracles *)
changes the group of the Heracles structure;
• void hydra_remove_node(struct heracles*)
removes the connection from the group, performing group cleanup if required.
For each connection added to the hydra structure, we access the IP information to read the connection's
IPv4 address; this is available from the sock structure, as the sk daddr field, which is directly
passed as an argument to the congestion control functions13. When a subnet is to be picked for the new
connection, we perform a bitwise operation to get the key for the hash table, traversing the initial portion
of the hydra structure. The operation allocates memory for the hydra group structure, using the
kmalloc function, which is similar to malloc but receives a flag with specific instructions for
the memory allocator. For network operations the allocator cannot sleep, so the GFP NOWAIT flag is
chosen14.
When an update happens, we check whether the connection will change groups and whether
the old group will become empty, removing the group in that case. The connection is then added to a
new group, directly from the tree it was in.
Removing a node deletes information about the connection, and about the group if needed, freeing the
allocated memory in the process. Special care is taken to remove groups when no
connection remains inside. Failing to remove them would leak memory and increase group search times
in the tree.
4.2.1 Fast Retransmit
Fast retransmit was one of the mechanisms referred to in previous papers [6] that could be modified
to benefit from connection grouping, with connections in the same group using each other's acknowledgments
to speed up fast retransmit, recovering sooner. But this mechanism is not
accessible directly from the kernel module, and it would require more time than was available to
gain a deep enough understanding of the inner workings of the TCP portion of the kernel. For
those reasons it was left out of the implementation.
4.2.2 Data Structures
Hydra uses different data structures to fulfill its duties efficiently. A hash table is used for group insertion;
as a key we use 8 bits of the address, from index 16 to 23. This minimizes Hydra's reserved memory space in the kernel,
13 git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/include/net/sock.h?id=refs/tags/v3.16.37
14 git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/include/linux/slab.h?id=refs/tags/v3.16.37
needing only a table of 256 positions. The kernel provides a default hash table 15 implementation. This
table uses external chaining to deal with collisions, which are appended to a list, making the search
time linear for IPs sharing the same 8 key bits. This is not a big problem, since connections only go through
the hash table when created; even so, for bigger servers, a bigger key should be used, dependent on
the average number of different IP addresses seen by the server. A hash table seems like the better
option for high performance code, being only memory dependent, whereas trees require a higher search time
and constant rebalancing.
For the group search portion of the structure a self-balancing sorted binary tree is used. Being
able to access groups in sorted order is necessary for quickly picking a group whenever a new
connection arrives or an update causes a change of group. A tree also allows groups to use dynamic
amounts of memory, even though it has higher lookup times than a hash table. This is important since a
server has no guarantees about the number of connections, with the same 24-bit subnet, occupying the
same tree. The kernel provides a Red-Black Tree implementation16 [30], which serves the same purpose
as AVL trees [31]: both are sorted binary trees, differing in rebalancing cost, with red-black trees
rebalancing less often and AVL trees offering slightly lower lookup times.
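The sorted group lookup that the tree enables can be illustrated with a minimal Python sketch. The names and the RTT-interval semantics are ours, used only to show the ordering idea; the kernel implementation uses its red-black tree API instead of a sorted list:

```python
import bisect

class GroupIndex:
    """Keep groups sorted by the lower bound of their RTT interval so a
    new connection can quickly find the group whose interval contains
    its measured RTT. A sorted list with binary search stands in for
    the kernel's red-black tree."""

    def __init__(self):
        self._bounds = []   # sorted lower bounds of the RTT intervals (ms)
        self._groups = []   # (lo, hi, group_id), kept parallel to _bounds

    def insert(self, lo, hi, group_id):
        i = bisect.bisect_left(self._bounds, lo)
        self._bounds.insert(i, lo)
        self._groups.insert(i, (lo, hi, group_id))

    def find(self, rtt):
        """Return the id of the group whose [lo, hi) interval holds rtt."""
        i = bisect.bisect_right(self._bounds, rtt) - 1
        if i >= 0:
            lo, hi, gid = self._groups[i]
            if lo <= rtt < hi:
                return gid
        return None
```

A lookup is O(log n) in the number of groups, matching the red-black tree's cost, while insertion into the Python list is O(n); the kernel tree avoids that linear shift.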
15 git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/include/linux/hashtable.h?id=refs/tags/v3.16.37
16 git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/Documentation/rbtree.txt?id=refs/tags/v3.16.37
Chapter 5
Evaluation
In this chapter, we present the results achieved with the Heracles protocol, comparing it against Reno
and Cubic. Section 5.1 explains the goals of the evaluation scenarios, required to test the correct
behavior of the protocol. Section 5.2 describes the different tests used to evaluate the protocol in different
environments. Section 5.3 explains how the tests are performed and how results are obtained, handled
and shown. Section 5.4 presents and discusses the results. Section 5.5 concludes the chapter, providing
a deeper analysis of the entire set of results obtained, characterizing the protocol's advantages
and disadvantages in different scenarios.
5.1 Test Objectives
For the evaluation portion of our work, we compare 3 different protocols: Heracles, Reno and Cubic.
Of these, the comparison between Heracles and Reno is the most important, because the former
builds on the latter to achieve better performance. Cubic is also included in the tests, as a comparison
point against a more modern congestion control algorithm.
The evaluation focuses on different aspects of the Heracles protocol to test its throughput
against the alternatives. The objectives are the same as described in Chapter 1:
• skip slow start on paths for which the threshold is already known;
• react to losses on the group by decreasing throughput fairly;
• share common path information to provide better cwnd and ssthresh estimates.
For our evaluation, we present different test cases with specific characteristics where the Heracles
protocol should have an advantage, visible as increased throughput. It should be noted that the tools
used don't provide TCP-specific information, such as retransmission counts. An increased number of
retransmissions could indicate a more aggressive protocol, but the negative effect of retransmissions
ultimately impacts the protocol's throughput, even if the cause is harder to diagnose. As such, we use
only throughput as a performance metric and the mean deviation as a fairness metric.
The following test cases are not meant to be thorough or accurately describe normal behavior of
TCP based protocols, but to provide an insight into the use cases of the algorithm. The tests are:
• Long/Short - 1 long lived connection and short lived connections in constant intervals;
• Parallel - multiple connections over the same interval;
• Sequential - short lived connections are interleaved; before a connection ends another starts;
• Packet - 2 client streams constantly flood the network, each with data of different lengths.
The tests are described more in detail in each respective section.
5.2 Test Scenarios
5.2.1 Long-Short
This test has a max-throughput connection, which we call the long connection, that is the first to be
created and lasts until the end of the test. Short lived connections then start transmitting at constant
intervals; at any point in the test there is at most one short lived connection alongside the long lived one.
All connections are controlled by how long they transmit rather than by a byte limit, because time is
independent of the network's available throughput and is therefore easier to manage. This lets us
analyze how many bytes the short connections can transmit in short intervals of time while taking
advantage of the preexisting connection: specifically, how fast they join a group and converge in
the network, and how the whole group adapts.
For this test we used 20 short connections, each lasting 5 seconds with a 1 second interval
between them. The test lasts 3 minutes in total.
5.2.2 Parallel
This test comprises parallel connections from the server to different clients; connections start and
end at the same time. It should discern how fast connections converge and how consistent they
stay throughout their lifetime. The test won't factor in group churn for Heracles, because of
the consistent network state throughout the test; losses should be the main factor deciding the
throughput. The test was run for 2, 4 and 10 parallel connections, with each single test lasting 60
seconds.
5.2.3 Sequential
For this test, short connections are created sequentially: a connection only stops transmitting after the
following connection starts, so at most 2 connections transmit at the same time. This allows us to test
connection churn inside Heracles' groups for a low number of connections. It should also test how
fast the connections can share information and converge on the network, compared to other connections
Figure 5.1: Test Network
competing with each other. The server opens 50 connections, each lasting 5 seconds; 2 seconds before
the current connection stops transmitting, the next one starts.
5.2.4 Packet
The Packet test is the only one that is not time dependent. There are 2 main client streams, one sending
1 MB and the other sending 100 KB. The first opens 10 connections sequentially and the second opens
100 connections sequentially. Heracles should be able to reduce the throughput of the bigger connections
in favor of speeding up the smaller ones.
5.3 Methodology
The Netkit software was used to run the tests locally in a controlled environment. Tests are not simulated,
but use the Linux kernel to perform networking operations; this should give us accurate results with a
low degree of unpredictability.
The network consists only of a server and a client, directly linked, with a 100 ms delay and a 100 Mb/s
total throughput (Figure 5.1).
To make the connections' behavior more realistic we added a small number of background TCP connec-
tions between the 2 machines: for the tests we used 10 connections, transmitted using Reno. These
connections also try to send as much as possible and are refreshed every 10 seconds.
With no background traffic, the connections being tested would only steal throughput from each
other. Because we want to replicate the parallel behavior of connections on the Internet, with more
clients present, opening 2 connections should almost double the client's throughput, instead of it staying
the same as it would if those were the only connections using the network.
Almost all tests are time dependent, each taking from 1 minute up to 5 minutes. We repeated
each test 10 times for each of the 3 algorithms. Samples were gathered using IPerf's lowest
logging interval of half a second. Output generated by the IPerf client processes is redirected to
temporary files. This data is then read by our Python script, which splits the CSV entries into different
clients according to the source port. IPerf generates CSV logs with the following fields: timestamp,
source IPv4 address, source port, destination IPv4 address, destination port, total duration, log start
interval, log end interval, bytes transferred and throughput in bits per second. The logs from IPerf revealed some
problems:
• The timestamp is not accurate enough, with the smallest interval being one second;
• The output appears with a delay in the standard output, and the time field represents when the log
was flushed to the standard output, not when the respective event happened.
These are problematic when trying to divide the logs and align them correctly in time with
each other. To fix this, we added a counter to each log entry representing the sequence in which it
appeared (this works because logging intervals are constant).
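The counter-based splitting described above might look like the following Python sketch. The field layout follows the list given earlier; the function name and the dictionary-based representation are illustrative, not the script's actual code:

```python
import csv
from collections import defaultdict

# IPerf CSV field layout as described in the text
FIELDS = ["timestamp", "src_ip", "src_port", "dst_ip", "dst_port",
          "duration", "interval_start", "interval_end", "bytes", "bps"]

def split_by_port(lines):
    """Group IPerf CSV log lines per client (source port) and attach a
    per-connection sequence counter. Because the logging interval is
    constant, the counter restores the event ordering that the coarse,
    flush-delayed timestamps cannot provide."""
    per_port = defaultdict(list)
    for row in csv.reader(lines):
        entry = dict(zip(FIELDS, row))
        port = entry["src_port"]
        entry["seq"] = len(per_port[port])  # position within this client's log
        per_port[port].append(entry)
    return per_port
```

Time alignment between clients then uses `seq` (each step is half a second) instead of the unreliable timestamp field.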
From the different connections, we can plot graphs to observe the algorithms' behavior and process
the data to calculate average throughput and throughput deviation. To visualize the throughput distribu-
tion for each protocol we show the data as an empirical CDF graph. Each graph is built using the average
throughput samples taken directly from IPerf. IPerf showed limitations when outputting the average
throughput for each connection: most entries in the logs had the same throughput values, and we were
unable to pinpoint what caused this inaccurate behavior. Only in the packet tests were we able to get a
larger variety of throughput samples. In each graph of this type, the horizontal axis represents the con-
nections' throughput in Mb/s and the vertical axis the cumulative ratio of samples. These graphs are
created using the mathematical Python libraries numpy1, which handles the data, and matplotlib2, which
visualizes it by plotting it to a graph.
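The construction of each empirical CDF can be sketched as follows; this is a hedged example with numpy (the function name is ours, and the actual plotting call, a matplotlib step plot, is omitted):

```python
import numpy as np

def empirical_cdf(samples):
    """Return (x, y) points of the empirical CDF of throughput samples:
    x is the sorted sample values and y[i] is the fraction of samples
    less than or equal to x[i]."""
    x = np.sort(np.asarray(samples, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y
```

Reading the resulting curve, a protocol whose function sits further to the right delivers higher throughput with higher probability, which is how the figures below are interpreted.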
One of our examples shows the connections' throughput over time, to differentiate the connections
being monitored in our test cases. This graph is made from data taken by IPerf, which is then plotted
with gnuplot3. To smooth the visual data, we compute a 5-point smoothed moving average, using the
two previous values, the value itself and the next two values. The horizontal axis represents our sample
index (2 samples per second) and the vertical axis the connection's throughput in Mb/s.
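The 5-point smoothed moving average can be written in a few lines; the handling of the series' edges (clipping the window where fewer than five samples exist) is our assumption, as the text does not specify it:

```python
def smooth5(values):
    """5-point centred moving average: each point is averaged with the
    two previous and the two following samples. The window is clipped
    at the edges of the series, so the first and last points average
    over fewer samples."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - 2): i + 3]
        out.append(sum(window) / len(window))
    return out
```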
The result tables show average throughput and deviation. The average throughput is obtained by cal-
culating the mean of all entries in each test and then averaging over all test repetitions. The deviation
represents the average of the mean deviation of each test: in each test we calculate the mean, and for
each sample we calculate its deviation from that mean; we then average the mean deviation results
over all repetitions.
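The two-level averaging just described can be made precise with a short sketch (the function names are ours):

```python
def mean_deviation(samples):
    """Mean absolute deviation of one test's throughput samples."""
    m = sum(samples) / len(samples)
    return sum(abs(s - m) for s in samples) / len(samples)

def averaged_metrics(repetitions):
    """Average throughput and average mean deviation over repetitions.

    Each repetition is a list of per-interval throughput samples; the
    mean and the mean deviation are computed per repetition and then
    averaged over all repetitions, as described in the text."""
    means = [sum(r) / len(r) for r in repetitions]
    devs = [mean_deviation(r) for r in repetitions]
    return sum(means) / len(means), sum(devs) / len(devs)
```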
5.4 Test Results
5.4.1 Long Short
For the long/short test (Table 5.1), Cubic has the best overall performance, with the long connection
clearly stealing a big share of the available bandwidth. On the downside, it is the least fair algorithm,
leaving the short connections with the lowest bandwidth of the 3 protocols. Because
1 Numpy website: www.numpy.org/
2 Matplotlib website: matplotlib.org/
3 Gnuplot website: gnuplot.sourceforge.net/
                                        Reno           Cubic          Heracles
                                     long   short   long   short   long   short
Average Throughput (Mb/s)            9.57   7.28    14.19  6.2     11.45  7.61
Average Short/Long Deviation (Mb/s)     2.69           7.72           3.98
Total Throughput (Mb/s)                16.85          20.4           19.05

Table 5.1: Long/Short test results.
Figure 5.2: Empirical CDF plot for Long/short throughput.
the long connection has an average throughput of 14.62 Mb/s, less throughput is left for the rest of
the connections in the network.
Heracles, on the other hand, has a better long-connection throughput than Reno, and the best
performance for short connections. This is due to the way the protocol deals with group joins, making
both slow start and its initial loss unnecessary for grouped connections. Exits then allow the protocol to
achieve a higher performance than Reno, on which it is based. Comparing total throughput, Heracles
has 4.9% less than Cubic. As for the fairness metric, the average mean deviation between long and
short connections, Cubic has almost double the deviation.
From Figure 5.2, comparing the performance distribution of the total throughput, Heracles
had the lowest probability of poor performance and the highest probability of better throughput, up to
the 10 Mb/s mark, where Cubic takes over with the highest throughput ceiling.
5.4.2 Parallel
For 2 long lasting parallel connections (Table 5.2), Heracles clearly had the worst performance of the 3
protocols, with an average throughput 18.56% lower than Cubic's and 2.8% lower than Reno's. For a 100
Mb/s link, assuming perfect use, each connection should have an ideal throughput of about 8 Mb/s;
again Cubic is able to steal the highest percentage of throughput.
This test only uses long lasting connections, so joins and leaves are rare and group churn should
be non-existent, because of the constant throughput of all connections. The test therefore focuses only
on loss performance, from the congestion avoidance window rising too high and being lowered back
down. Even comparing only Reno and Heracles, our protocol's performance degrades. The two
protocols deal with losses in similar ways, so a big difference in throughput wasn't expected; the added
complexity of the Heracles protocol is probably to blame for the slightly lower performance.
                               Reno    Cubic   Heracles
Average Throughput (Mb/s)      16.07   19.18   15.62
Average Mean Deviation (Mb/s)  0.58    1.51    1.3

Table 5.2: Results for parallel tests with 2 connections.
With 4 connections (Table 5.3), the protocols start to take a considerable share of the available
throughput. All have similar performance, with a slight disadvantage for Heracles, while Cubic
loses its significant edge over the others.
                               Reno    Cubic   Heracles
Average Throughput (Mb/s)      30.25   30.40   29.84
Average Mean Deviation (Mb/s)  0.5     1.19    1.04

Table 5.3: Results for parallel tests with 4 connections.
Finally, for 10 connections (Table 5.4), the test controls half of the connections using the network
link. In this test, Reno on average takes more than half the throughput available in the network. Compared
to Reno, Cubic performs 3.86% worse and Heracles 12.18% worse. Interestingly, the connection
deviation is lower; this may be due to the higher number of connections being analyzed, which provides
a better estimate. Figure 5.3 shows the empirical CDF values for each of the parallel tests. It should be
noted that the throughput scale shrinks as the number of connections increases, because more parallel
connections leave less available throughput in the network, which in turn diminishes the benefit of
using parallel connections.
                               Reno    Cubic   Heracles
Average Throughput (Mb/s)      50.35   48.29   44.11
Average Mean Deviation (Mb/s)  0.3     0.6     0.15

Table 5.4: Results for parallel tests with 10 connections.
5.4.3 Sequential
For sequential connections (Table 5.5), Heracles shows the best performance, transmitting 11.7%
more than the other protocols. This can be attributed to the high connection churn, which benefits
Heracles, as events consist mainly of joins and leaves. Joins quickly increase the newer connections'
windows, while leaves allow the older connections to get a higher share of the network than was
previously available. In Figure 5.4 the values are similar for the 3 protocols, but, for this test,
the Heracles curve has the highest throughput probability, being the rightmost function.
Figure 5.3: Empirical CDF graph for 2, 4 and 10 parallel connections respectively.
                               Reno    Cubic   Heracles
Average Throughput (Mb/s)      6.80    6.80    7.60
Average Mean Deviation (Mb/s)  1.14    1.00    1.14
Bytes Transferred (GB)         0.433   0.432   0.483

Table 5.5: Sequential test results.
5.4.4 Packet
Connections during the Packet test achieve similar results (Table 5.6), with Reno having the worst
performance and Heracles the best, but also the highest deviation value of the protocols. Heracles'
throughput is only 2.78% higher than Cubic's, but its deviation is 31.94% higher. From Figure 5.5 we
can see that the highest deviation value translates into a higher throughput ceiling. For the lowest
throughputs, Heracles has probabilities similar to Cubic's. Reno has the leftmost curve, with the worst
throughput percentages and a clear margin between itself and the other protocols.
                               Reno    Cubic   Heracles
Average Throughput (Mb/s)      9.96    11.89   12.23
Average Mean Deviation (Mb/s)  1.86    1.91    2.52

Table 5.6: Packet test results.
Figure 5.4: Empirical CDF plot for sequential throughput.
Figure 5.5: Empirical CDF plot for packet test throughput.
5.5 Protocol Analysis
From the results it is hard to draw firm conclusions on the effectiveness of the Heracles protocol. On one
side, Heracles proved able to keep up with Cubic in the packet and long/short tests, with only slight
throughput differences between them. In the sequential test, Heracles was the most successful
and guaranteed the highest throughput by a noticeable margin. These are tests with bursty
connections lasting only a few seconds, and Heracles was able to take advantage of most of its informa-
tion sharing features. The scenario represents the expected behavior of downloading a webpage.
For parallel connections, the protocol was the worst performing, with a 12.18% lower throughput
when compared to the highest performing protocol. This is a problem, because it doesn't guarantee that
the protocol can achieve successful results in environments with a high degree of parallelism, which is
part of the specification of the protocol.
From the evaluation, we identified specific points that should be worked on to improve the results.
First, we detected two implementation faults. Connections inside groups can only change group after
receiving an rtt sample and comparing it with the current group's interval. This doesn't prevent same-
path connections from ending up in two different groups with similar intervals. Partitioning the groups
this way increases group lookup times and the path-specific redundancies we try to avoid; such groups
should be merged into a single one. The other implementation problem comes from a connection
changing its group. When a connection leaves a group and sends a leave event, the other connections
will try to inflate their windows quickly, while the connection that left keeps the same throughput. This
presents an issue: if the path shared by these connections is in fact similar, or the change of group was
provoked by an anomalous rtt, the number of packets in flight suddenly increases, leading to congestion.
Together, these two problems lead to unfair behavior of the protocol (as in Figure 5.6).
Figure 5.6: 2 connections partitioning into different groups with different throughput values.
Finally, the last problem is that some values could be tweaked with the help of a more detailed
analysis. This would require slightly different versions of the Heracles algorithm to be evaluated between
themselves. The aspects of the protocol which we consider more important to be tuned are the following:
• Group interval - the rtt interval calculation is too coarse and doesn't take the connection's rttvar
into consideration. It's important to factor in other variables instead of using just the rtt: a score
should be calculated as a mix of different sender variables, such as the srtt and rttvar, to predict
network anomalies. This would help reduce false group changes in our evaluation.
• Min Acks - sets the threshold for the minimum number of acknowledgments before a fresh
connection can join a group; we use 3. The need for a Min Acks requirement comes
from the lack of information about a new connection before it joins a group. We see two major
problems with our approach. First, the number used has no specific reasoning behind it: it was
chosen so that connections have enough packets in flight for the rtt to converge, which could then
be used to join a group. Second, the current round trip time alone does not indicate that the
connection belongs in a preexisting group; each connection should estimate the variability of
the network before inserting itself into a group, using the previously sampled rtts.
Chapter 6
Conclusions
6.1 Summary
In this document, we presented Heracles, a new congestion control protocol for TCP, designed to
improve network performance for same-path connections by decreasing losses and increasing the
maximum per-connection throughput.
The goals were to improve the performance of same-host parallel connections by sharing informa-
tion, mainly the slow start threshold and the congestion window, between connections with similar paths
to close receivers in the same subnet. For each individual connection, these are the means to
decide on a safe interval from which to control the rate of packet sending.
For hosts sharing the same network path, information gathered concurrently becomes redundant:
every connection reaches the same results separately, wasting time in the process. We proposed an
alternative: by sharing path-specific information, we enable connections to skip the slow start procedure,
giving them an estimate of the minimum number of outstanding packets they can have.
Connections can share congestion events, dividing window decreases evenly on losses to improve
fairness. Finally, the share of throughput that exiting connections leave behind can be reused by
other connections in the group. This allows individual connections to increase their maximum throughput
ceiling, stealing a higher share of the network than connections using other protocols.
To enable group creation, we presented Hydra, the data structure responsible for managing the different
groups of TCP connections, giving same-path connections access to information about each other.
Hydra is designed with performance in mind, for network operations that require it to be fast. The
congestion control algorithm is Heracles, which controls access to the Hydra structure and takes care of
interfacing with the kernel TCP stack through a module interface, from which it receives the information
used to manage the different groups and take sender-side decisions based on finer network estimates.
Our proposal was evaluated using a prototype implemented on the Linux kernel; we observed that it
performs better for short connections, allowing them to finish sooner.
6.2 Achievements
We implemented a protocol, inspired by the previous work in the field referred to in this document, that
encapsulates common-path connections into groups, allowing them to share path-specific information
with each other. This should help reduce slow start time for bursty connections and total network
losses. We then evaluated the protocol's performance, in terms of throughput gains, over different
network environments.
As an improvement over all previous proposals, we allow the sender to group connections by subnet,
increasing the number of hosts affected by the protocol, which is important for the server-to-client
communication that occurs on the Internet. We also made the protocol robust in environments where
same-subnet or same-address connections have different delay values, which the other protocols did
not, allowing network throughput to be compromised in a number of different scenarios.
The protocol is fully compatible with Linux 3.16 and could easily be adapted to older versions of
the kernel. Tests were performed using Linux, so results should be close to the performance of a real
use case.
6.3 Future Work
Some problems were left unsolved by this work and new ones arose. The Heracles protocol evalua-
tion showed decreased throughput when working with an increased number of parallel connections;
it should be tuned to fix this. In its current state, the protocol adapts poorly to highly parallel
environments, for which it is most suited, as is the case of server-to-clients connections where some
clients are in close network proximity to each other.
The memory usage of the protocol wasn't evaluated. For cases where memory can become a
bottleneck, it's important to know how much memory the protocol requires on average per connection,
compared to the normal network stack's memory usage.
The protocol doesn't deal with some problems that can arise after prolonged use, namely integer
overflows. We use machine-dependent integers as timestamps, but perform no overflow checks. During
normal behavior, an event increases the group's timestamp, and if a connection's timestamp is lower,
it accepts the event and increases its own timestamp. After an integer overflow on the group's
timestamp, connections will stop receiving events.
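A wraparound-safe comparison, in the style of the Linux kernel's time_after() macro, would avoid this failure mode. The following Python sketch models the idea for fixed-width unsigned timestamps (the width and function name are ours):

```python
BITS = 32
MASK = (1 << BITS) - 1  # timestamps live in [0, 2**BITS)

def ts_after(a: int, b: int) -> bool:
    """Wraparound-safe 'a is later than b' for BITS-bit unsigned
    timestamps: interpret the modular difference a - b as a signed
    BITS-bit value, so a comparison survives the counter overflowing
    back to zero (valid while the two values are less than half the
    counter range apart)."""
    diff = (a - b) & MASK
    return 0 < diff < (1 << (BITS - 1))
```

With this check, a connection would keep accepting group events across the overflow instead of silently falling behind forever.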
The evaluation should be extended to TCP-based applications, like HTTP, which can make
connections behave in many different ways. Results should then be analyzed from the point of view of
an HTTP server.
The Heracles protocol's performance should be evaluated with tweaked variables, in parameterized
tests. There are different aspects of the protocol that can influence fairness and losses, which were
discussed:
• cwnd decrease for join events;
• minimum amount of acks required to join a group;
• cwnd decrease for loss events;
• cwnd increase for leave events;
• group score calculation.
Tests should then be extended to the Internet. The protocol would have to be analyzed over a longer
period of time, compete against a larger set of congestion control protocols, and deal with
different types of traffic with higher delay variability.
Finally, some performance tuning should be done. The code should be profiled appropriately, then
tuned and cleaned. The complexity of the protocol is much higher than that of the others, increasing the
time spent processing congestion parameters, which influences the total delay of connections. These
operations are done at the network level and should be as simple as possible.
Bibliography
[1] Jacobson, V.: Congestion avoidance and control. SIGCOMM Comput. Commun. Rev. 18(4) (Au-
gust 1988) 314–329
[2] Maier, G., Feldmann, A., Paxson, V., Allman, M.: On dominant characteristics of residential broad-
band internet traffic. In: Proceedings of the 9th ACM SIGCOMM Conference on Internet Measure-
ment Conference. IMC ’09, New York, USA, ACM (2009) 90–102
[3] Ihm, S., Pai, V.S.: Towards understanding modern web traffic. In: Proceedings of the 2011 ACM
SIGCOMM Conference on Internet Measurement Conference. IMC ’11, New York, NY, USA, ACM
(2011) 295–312
[4] Allman, M.: A web server’s view of the transport layer. SIGCOMM Comput. Commun. Rev. 30(5)
(October 2000) 10–20
[5] Touch, J.: TCP Control Block Interdependence. RFC 2140 (June 1997)
[6] Balakrishnan, H., Padmanabhan, V., Seshan, S., Stemm, M., Katz, R.: TCP behavior of a busy
internet server: analysis and improvements. In: INFOCOM '98. Seventeenth Annual Joint Confer-
ence of the IEEE Computer and Communications Societies. Proceedings. IEEE. Volume 1. (Mar
1998) 252–262
[7] Cho, S., Bettati, R.: Collaborative congestion control in parallel tcp flows. In: Communications,
2005. ICC 2005. 2005 IEEE International Conference on. Volume 2. (May 2005) 1026–1030 Vol. 2
[8] Postel, J.: Transmission Control Protocol. RFC 793 (September 1981)
[9] Fall, K., Stevens, W.: TCP/IP Illustrated Volume 1: The Protocols. 2 edn. Addison-Wesley Profes-
sional (2011)
[10] Allman, M., Floyd, S., Partridge, C.: Increasing TCP's Initial Window. RFC 3390 (October 2002)
[11] Allman, M., Paxson, V., Blanton, E.: TCP Congestion Control. RFC 5681 (September 2009)
[12] Paxson, V., Allman, M., Chu, J., Sargent, M.: Computing TCP's Retransmission Timer. RFC 6298
(June 2011)
[13] Braden, R.: Requirements for Internet Hosts – Communication Layers. RFC 1122 (October 1989)
[14] Allman, M., Balakrishnan, H., Floyd, S.: Enhancing TCP's Loss Recovery Using Limited Transmit.
RFC 3042 (January 2001)
[15] Fielding, R., Reschke, J.: Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and
Routing. RFC 7230 (June 2014)
[16] Fielding, R., Reschke, J.: Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content. RFC
7231 (June 2014)
[17] Berners-Lee, T., Fielding, R., Frystyk, H.: Hypertext Transfer Protocol – HTTP/1.0. RFC 1945 (May 1996)
[18] Padmanabhan, V.N., Mogul, J.C.: Improving http latency. Comput. Netw. ISDN Syst. 28(1-2)
(December 1995) 25–35
[19] Belshe, M., Peon, R., Thomson, M.: Hypertext Transfer Protocol Version 2 (HTTP/2). RFC 7540
(May 2015)
[20] Braden, R.: Extending TCP for Transactions – Concepts. RFC 1379 (November 1992)
[21] Braden, R.: T/TCP – TCP Extensions for Transactions Functional Specification. RFC 1644 (July
1994)
[22] Duke, M., Braden, R., Eddy, W., Blanton, E.: A Roadmap for Transmission Control Protocol (TCP)
Specification Documents. RFC 4614 (September 2006)
[23] Eggert, L.: Moving the Undeployed TCP Extensions RFC 1072, RFC 1106, RFC 1110, RFC 1145,
RFC 1146, RFC 1379, RFC 1644, and RFC 1693 to Historic Status. RFC 6247 (May 2011)
[24] Mathis, M., Mahdavi, J., Floyd, S., Romanow, A.: TCP Selective Acknowledgment Options. RFC 2018
(October 1996)
[25] Padmanabhan, V.N.: Addressing the Challenges of Web Data Transport. PhD thesis (1998)
[26] Eggert, L., Heidemann, J., Touch, J.: Effects of ensemble-tcp. SIGCOMM Comput. Commun. Rev.
30(1) (January 2000) 15–29
[27] Balakrishnan, H., Rahul, H.S., Seshan, S.: An integrated congestion management architecture for
internet hosts. SIGCOMM Comput. Commun. Rev. 29(4) (August 1999) 175–187
[28] Mo, J., La, R.J., Anantharam, V., Walrand, J.: Analysis and comparison of tcp reno and vegas. In:
In Proceedings of IEEE Infocom. (1999) 1556–1563
[29] Ha, S., Rhee, I., Xu, L.: Cubic: A new tcp-friendly high-speed tcp variant. SIGOPS Oper. Syst.
Rev. 42(5) (July 2008) 64–74
[30] Sedgewick, R., Guibas, L.J.: A dichromatic framework for balanced trees. (1978) 8–21
[31] Andel’son-Vel’skii, G.M., Landis, E.M.: An algorithm for the organization of information. Doklady
Akademii Nauk USSR 146(2) (1962) 263–266