The Raymond and Beverly Sackler Faculty of Exact Sciences
The Blavatnik School of Computer Science
Heavy Hitters Extensions for
Advanced Traffic Anomalies
Detection
Thesis submitted for the degree of Doctor of Philosophy
by
Shir Landau Feibish
This work was carried out under the supervision of
Prof. Yehuda Afek
Submitted to the Senate of Tel Aviv University
August 2017
Abstract
In recent years, the explosion of network traffic, together with new attack vectors, has
created the need for new tools that can detect and pinpoint specific phenomena in
traffic, such as particular repetitions and patterns. In this dissertation we study some
of these new complex data repetitions and offer fundamental techniques for identifying
them, both in Software Defined Networks and in classical network settings. Additionally,
we examine the implications of such traffic for both network monitoring and security,
and offer mechanisms for attending to them.
A principal building block that we expand and generalize is the Heavy Hitters
problem. We consider three variations of Heavy Hitters, proposing new concepts and
problem definitions for each, as well as new efficient algorithms to detect and output
them. First, in Chapter 3, we suggest new definitions for Heavy Hitters based on time
locality. Second, in Chapter 4, we study the problem of varying length string heavy
hitters in a stream of strings (messages). Finally, in Chapter 6, we explore the problem
of distinct heavy hitters in a stream of ⟨key, subkey⟩ pairs, which is the problem of
detecting a key with many different subkeys.
Using these algorithms, we provide three applications, each utilizing one of the
above new techniques. First, based on our time locality definitions, we have developed
mechanisms for the detection of different types of large flows in Software Defined Net-
works (Chapter 3), for the purpose of network security, measurement and monitoring.
Second, using our algorithms for the detection of varying length string heavy hitters
we have developed a zero-day signature extraction system for mitigation of application
level DDoS attacks (Chapter 5). Finally, we present a system for mitigation of randomized
attacks on the Domain Name System (Chapter 7), which makes use of our
algorithms for finding distinct heavy hitters. Both of these attacks, and especially the
latter, have gained much interest recently, due to the impact they have had on
millions of Internet users, and our systems offer a new approach to their mitigation.
We evaluate our tools and prove their effectiveness both analytically and empirically.
To do so, we have implemented our systems and tools using various technologies,
and have performed testing on real traffic traces, including captures of actual attacks.
Acknowledgements
First I would like to express my deepest gratitude to my advisor Prof. Yehuda Afek, for
teaching me that research itself is a science, and that every problem and solution should
be explored from all angles. And, for pushing me to do my very best while reminding
me to keep focused on the important things in life. It has truly been a pleasure.
Second, I would very much like to thank my mentor Prof. Anat Bremler-Barr. An
outstanding woman researcher in a field dominated by men, and especially in Israel, you
are my role model. Thank you for always reminding me that research is a marathon, not
a sprint, and for endless conversations of encouragement and guidance about research,
academia, reviewers, students, kids, life and so much more.
Third, I would like to thank my collaborators, with whom I have had the great
pleasure to work over these years, Prof. Edith Cohen, Dr. Liron Schiff, Moshe Sulamy
and Michal Shagam. You have brought teamwork into my research, with enlightening
conversations and great insight. I am also grateful to the wonderful members of the
DEEPNESS Lab, and especially Prof. David Hay, Dr. Yaron Koral and Dr. Yotam
Harchol for their help, guidance and friendship over the years.
Last but certainly not least, I would like to thank my family. My parents, Gadi
and Orith, who have been there at every step of the way, making sure that I find my
path and that I endure it. Standing by my side through bright days and darker ones.
My amazing kids, Nadav, Michal and Noa who make everything in life so much more
interesting, and who always put both success and failure into clear perspective. And
finally, to my husband and closest friend Nir, who has known this day would come since
we met at age 15 and has done everything in his power to help me get here.
This work was partially supported by the Chief Scientist of the Israeli Ministry of
Industry, Trade and Labor and by the Ministry of Science and Technology, Israel.
Contents
1 Introduction 1
1.1 Overview of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.1 Time locality in Heavy Hitters and Detection of Heavy Flows in
Software Defined Networks . . . . . . . . . . . . . . . . . . . . . 4
1.1.2 Heavy Hitters in Textual Data: String Heavy Hitters . . . . . . . 6
1.1.3 Zero-Day Signature Extraction for High Volume Attacks Using
String Heavy Hitters . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.4 Heavy Hitters in a Stream of Pairs: Distinct and Combined Heavy
Hitters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.5 Random Subdomain DNS Attacks Mitigation using Distinct
Heavy Hitters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Published Material . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Background 13
2.1 Frequent Items in Data Streams . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 The Data Streaming Model . . . . . . . . . . . . . . . . . . . . . 13
2.1.2 The Heavy Hitters Problem . . . . . . . . . . . . . . . . . . . . . 13
2.1.3 Related Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 DDoS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Deep Packet Inspection . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Software Defined Networks . . . . . . . . . . . . . . . . . . . . . . . . . 22
3 Time Locality in Heavy Hitters and Detection of Heavy Flows in
Software Defined Networks Match and Action Model 23
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.1 Our Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.1 Network Measurement . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Time Locality Definitions for Heavy Hitters . . . . . . . . . . . . . . . . 27
3.4 Heavy Flows Detection in SDN . . . . . . . . . . . . . . . . . . . . . . . 28
3.4.1 Towards a Solution . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4.2 The Sample&Pick Algorithm . . . . . . . . . . . . . . . . . . . . 29
3.4.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5 Interval Heavy Flow and Bulky Flow Detection . . . . . . . . . . . . . . 38
3.6 Distributed Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4 Finding Heavy Hitters in a Stream of Strings 45
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1.1 Our Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 String Heavy Hitters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.4 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.6 The Double Heavy Hitters Algorithm . . . . . . . . . . . . . . . . . . . . 50
4.6.1 Algorithm Overview . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.6.2 Algorithm Details . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.6.3 Error Rate Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 55
5 Zero-Day Signature Extraction for High Volume Attacks 57
5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.1.1 Our Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2.1 Automated Signature Extraction . . . . . . . . . . . . . . . . . . 60
5.2.2 DDoS Defense Mechanisms . . . . . . . . . . . . . . . . . . . . . 61
5.3 The Zero-Day High-Volume Attack Detection System . . . . . . . . . . 62
5.3.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3.2 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3.3 System Requirements . . . . . . . . . . . . . . . . . . . . . . . . 63
5.3.4 System Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.3.5 Identifying Common Combinations of Signatures . . . . . . . . . 67
5.4 Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4.1 Test Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4.2 System Quality Test Results . . . . . . . . . . . . . . . . . . . . 73
5.4.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.4.4 Frequency Estimation . . . . . . . . . . . . . . . . . . . . . . . . 76
5.4.5 Threshold Testing . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.4.6 Testing Frequent Signature Combinations . . . . . . . . . . . . . 77
5.4.7 Signature Examples . . . . . . . . . . . . . . . . . . . . . . . . . 78
6 Heavy Hitters in a Stream of Pairs: Distinct and Combined Heavy
Hitters 80
6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.1.1 Our Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.2.1 Problem Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.2.2 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.3 Background - Approximate Distinct Counters . . . . . . . . . . . . . . . 83
6.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.5 The Distinct Weighted Sampling Algorithms . . . . . . . . . . . . . . . 85
6.5.1 Fixed-Threshold Distinct Heavy Hitters . . . . . . . . . . . . . . 86
6.5.2 Fixed-Size Distinct Weighted Sampling . . . . . . . . . . . . . . 86
6.5.3 Analysis and Estimates . . . . . . . . . . . . . . . . . . . . . . . 87
6.5.4 Estimate Quality and Confidence Interval . . . . . . . . . . . . . 89
6.5.5 Integrated dwsHH Design . . . . . . . . . . . . . . . . . . . . . . 90
6.6 The Combined Weighted Sampling Algorithm . . . . . . . . . . . . . . . 92
6.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.7.1 Theoretical Comparison . . . . . . . . . . . . . . . . . . . . . . . 94
6.7.2 Practical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 94
7 Mitigating DNS Random Subdomain DDoS Attacks Using Distinct
Heavy Hitters 100
7.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
7.1.1 Our Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.2 Attack Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.2.1 Current Detection Techniques . . . . . . . . . . . . . . . . . . . . 104
7.3 Random Subdomain Attack Mitigation System . . . . . . . . . . . . . . 105
7.3.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.3.2 System Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.4.1 University Network Captures . . . . . . . . . . . . . . . . . . . . 115
7.4.2 ISP Attack Captures . . . . . . . . . . . . . . . . . . . . . . . . . 116
8 Discussion and Conclusion 118
8.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Bibliography 123
List of Figures
1.1 DDoS mitigation systems overview . . . . . . . . . . . . . . . . . . . . . 4
3.1 Sample&Pick overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Resource consumption and accuracy comparison . . . . . . . . . . . . . 35
3.3 Effect of varying t values . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Effect of varying T values . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5 Effect of varying v values . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.6 The modified heavy hitters data structure using counter arrays. In this
example the active counter is currently c1. . . . . . . . . . . . . . . . . . 41
3.7 Marking sampled packets in the distributed setting. . . . . . . . . . . . 44
4.1 An example of the process of creating varying length strings from con-
secutive k-grams. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2 Non-consecutive heavy hitters . . . . . . . . . . . . . . . . . . . . . . . . 56
5.1 Signatures requirement overview . . . . . . . . . . . . . . . . . . . . . . 64
5.2 The process of extracting attack content signatures. . . . . . . . . . . . 66
5.3 Extracting attack signatures with the additional minimization process. . 69
5.4 An example of different sets of signatures found in different packet types. 70
5.5 Signature frequency: algorithm estimation vs. the actual frequency . . . 77
5.6 Comparing peace-high values. . . . . . . . . . . . . . . . . . . . . . . . . 78
5.7 Testing the algorithm for minimizing the number of signatures . . . . . 79
6.1 Distinct Weighted Sampling (dWS): Modified cache size . . . . . . . . . 95
6.2 Distinct Weighted Sampling (dWS): Modified Number of Buckets . . . . 96
6.3 Distinct Weighted Sampling (dWS): 32 Buckets, 1000 Items . . . . . . . 97
6.4 Combined Weighted Sampling (cWSHH) Modified rho: accuracy . . . . 98
6.5 Combined Weighted Sampling (cWSHH) Modified rho: combined weight 98
6.6 Distinct Weighted Sampling (dWS): Modified cache size . . . . . . . . . 99
7.1 DNS Random Subdomain attack overview . . . . . . . . . . . . . . . . . 101
7.2 DNS Random Subdomain mitigation High-level approach . . . . . . . . 105
7.3 DNS Random Subdomain mitigation system overview . . . . . . . . . . 106
7.4 Hierarchy of heavy distinct domains. Bold-edged nodes are in the cover;
dashed-edge nodes do not surpass the minimum cardinality. . . . . . . . 109
7.5 Heavy Distinct Domain Hierarchy (HDDH) Extractor . . . . . . . . . . 110
7.6 Distribution of the number of distinct subdomains per domain level . . 112
7.7 Attack time signature extraction . . . . . . . . . . . . . . . . . . . . . . 113
7.8 Distinct queries for campus authoritative server per hour, over 1 day. . . 116
List of Tables
3.1 Matrix definitions for large flows . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Comparison of the heavy flow detection techniques presented in this
work; t denotes the threshold for candidate heavy hitters in Sample&Pick. 30
3.3 Illustration of switch flow table configuration. Rule priority decreases
from top to bottom. Actions: 1- increment counter; 2 - apply sampling
technique (goto sampling tables / apply group) . . . . . . . . . . . . . . 30
3.4 Resource consumption test results . . . . . . . . . . . . . . . . . . . . . 34
4.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2 Summary of the statistics of the tests performed. Note that the captures
are samples of the traffic. . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.2 Theoretic Comparison between methods . . . . . . . . . . . . . . . . . . 94
7.1 System Parameters and Notations . . . . . . . . . . . . . . . . . . . . . 108
7.2 Results on Real DNS Attack Captures . . . . . . . . . . . . . . . . . . . 117
Chapter 1
Introduction
The amount of data that traverses the global communication networks has increased
tremendously over the past twenty years. As the amount of traffic grows, the huge
amounts of packets going through the networks create new risks to network function-
ality, and the networking community is in a constant battle to keep legitimate data
flowing.
First, the sheer volume of the traffic makes it difficult to keep the networks up
and running. Flash events or large bursts of traffic need to be promptly detected and
carefully handled, using load balancing and other mechanisms in order to maintain
quality of service. Second, malicious entities around the world perform countless attacks
daily, which are becoming increasingly sophisticated. Attackers are creating new kinds
of attacks, for which no previous knowledge exists (zero-day exploits). Furthermore,
attackers are using large groups of compromised machines called botnets, causing a
continuous rise in the volume and intensity of the attacks.
These risks create an ongoing need for new big-data solutions which can handle
tens of millions of packets per second. While the vast majority of packets works its
way around the network harmlessly, even a small percentage of abnormal packets
may have a tremendous impact on the network, and in order to mitigate the risks, these
packets need to be skillfully handled.
These unusual packet phenomena take many different shapes. Packets may be
oddly formed, having, for instance, a larger-than-usual payload or a header field that
is rarely used. In other cases, a sequence or group of
packets may have some special characteristics. For example, many packets headed to a
single destination from many different sources, or unusually many requests for a single
site.
My dissertation presents advanced techniques for characterizing and identifying
some of the extraordinary network phenomena observed recently. We provide new
insight into the world of big-data, providing fundamental tools and algorithms for
identifying data repetitions in network traffic for a variety of network applications. Specifically,
we focus on traffic abnormalities that are security related, and devise mechanisms for
the mitigation of different types of zero day attacks, including recent attacks on the
Domain Name System (DNS) which threaten the very core of the Internet's functionality [65].
One of our main challenges is the identification of large amounts of traffic that
share some similar properties, often referred to as a large or heavy flow of traffic (See
definition in Chapter 3). In the classic sense, a flow has often been characterized by
packets sent from a single source to a single destination. In today’s networks, this
definition has been expanded to a sequence of packets sharing some common header
fields. Heavy flow detection in traffic is one of the fundamental capabilities required in a
network. It is a key capability in providing Quality of Service (QoS), capacity planning
and efficient traffic engineering. Furthermore, heavy flow detection is crucial for the
detection of Distributed Denial of Service (DDoS) attacks in the network.
Traditional heavy flow detection was based on flow measurements [4, 33], yet these
suffer from a significant lack of scalability [49]. Therefore, with the rising amounts
of traffic, more sophisticated techniques are being developed, such as those presented
in [28, 49, 124]. We continue the research of efficient methods for heavy flow detection
and propose various solutions, which are based on a family of streaming algorithms
devised for the Heavy Hitters problem.
The Heavy Hitters problem is the well studied problem of finding the popular items
in a data stream. In the classic definition, given a stream of N items, a heavy hitter
is an item which appears at least θN times, for some given 0 < θ < 1 [83]. Solutions
such as the Space-Saving algorithm of Metwally et al. [81] or the Sample and Hold
algorithm of Estan et al. [49] detect the heavy hitters with a certain probability and
provide an estimate of the amount of times they appeared in the stream.
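For concreteness, the counter-based approach of Space-Saving [81] can be sketched in a few lines of Python. This is a simplified reference (exact when the number of distinct items fits in the m counters), not the optimized data structures developed later in this dissertation:

```python
class SpaceSaving:
    """Sketch of the Space-Saving algorithm [81]: tracks at most m items,
    overestimating each tracked item's count by at most N/m."""

    def __init__(self, m):
        self.m = m
        self.counts = {}  # item -> (count, error)

    def update(self, item):
        if item in self.counts:
            c, e = self.counts[item]
            self.counts[item] = (c + 1, e)
        elif len(self.counts) < self.m:
            self.counts[item] = (1, 0)
        else:
            # evict the minimum-count item; the newcomer inherits its
            # count, which is an upper bound on the true frequency
            victim = min(self.counts, key=lambda k: self.counts[k][0])
            c_min, _ = self.counts.pop(victim)
            self.counts[item] = (c_min + 1, c_min)

    def heavy_hitters(self, theta, n):
        # report every tracked item whose (over)estimate reaches theta*n
        return [x for x, (c, e) in self.counts.items() if c >= theta * n]
```

For example, feeding a stream of 100 items in which 'a' appears 60 times reports 'a' as a heavy hitter for θ = 0.5.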
We broaden the traditional scope of the problem, and examine heavy hitters in
different types of traffic or network architectures. As we exhibit throughout this disser-
tation, identifying different types of heavy flows requires specially crafted algorithms.
First, we expand the classic definition of heavy hitters to introduce time locality,
providing different problem definitions. We propose methods for identifying heavy traffic
in the context of recent Software Defined Networking (SDN) technology and provide
algorithms for the detection of different types of heavy flows within a software defined
architecture, for both a single switch and a distributed setting (Chapter 3).
Furthermore, while heavy hitters algorithms have traditionally been developed for
streams of numbers, we study the problem of identifying heavy hitters in other forms
of data, namely streams of strings or pairs. We explore the concept of heavy hitters in
textual data (i.e., in a stream of strings), and present efficient algorithms for identifying
frequent substrings of varying length or String Heavy Hitters in a large stream of strings
(Chapter 4). Additionally, we propose new algorithms for finding Distinct Heavy Hitters
in a stream of 〈key, subkey〉 pairs (Chapter 6). Our approach for heavy hitters in pairs
makes use of algorithms for the approximate distinct counting problem, which is the
problem of estimating the number of distinct items seen; that is, given a stream of
items, how many unique items have been encountered up to a given point in the stream.
Various sketch-based solutions have been proposed for this problem such as [25, 37, 47,
58, 66, 96].
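To illustrate the flavor of such sketches, the following is a simplified k-minimum-values style distinct counter in Python. It is one possible sketch among those cited above, not the specific construction used in Chapter 6, and the parameter k = 64 is illustrative:

```python
import hashlib
import heapq

class KMVDistinctCounter:
    """k-minimum-values sketch: keep the k smallest hash values seen and
    estimate the number of distinct items as (k-1)/kth_smallest_hash."""

    def __init__(self, k=64):
        self.k = k
        self.heap = []     # max-heap (negated values) of the k smallest hashes
        self.seen = set()  # hash values currently held in the heap

    def _hash(self, item):
        h = hashlib.sha1(str(item).encode()).digest()
        return int.from_bytes(h[:8], 'big') / 2**64  # uniform in [0, 1)

    def add(self, item):
        v = self._hash(item)
        if v in self.seen:
            return                     # duplicate of a tracked item
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, -v)
            self.seen.add(v)
        elif v < -self.heap[0]:        # smaller than the kth smallest so far
            evicted = -heapq.heappushpop(self.heap, -v)
            self.seen.discard(evicted)
            self.seen.add(v)

    def estimate(self):
        if len(self.heap) < self.k:
            return len(self.heap)      # exact while under capacity
        return (self.k - 1) / (-self.heap[0])
```

The estimate is exact while fewer than k distinct items have been seen, and afterwards has relative error roughly 1/√k.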
We exhibit the usefulness of the above algorithms in the identification of unusual
sequences of packets and data patterns and show they are instrumental in the detection
and mitigation of new types of DDoS attacks witnessed in recent years. Our general
approach, depicted in Figure 1.1, is a two-stage process. First, peacetime traffic is
analyzed to create a baseline of patterns which are found in the traffic on a normal basis.
Second, during an attack, the traffic is analyzed to detect repetitions. The peacetime
baseline is used to identify patterns which are exhibited during the attack but are not
part of normal traffic. These repetitions form the attack signatures, which
can then be used to mitigate the attack.
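The two-stage process can be sketched as follows, using exact k-gram counts in place of the streaming algorithms developed in later chapters; the k-gram length and both thresholds are illustrative values, not the ones used in the thesis:

```python
from collections import Counter

def extract_signatures(peace_pkts, attack_pkts, k=8,
                       attack_thresh=0.5, peace_thresh=0.01):
    """Toy version of the two-stage approach: a k-gram is a signature if it
    is frequent in attack traffic but rare or absent in peacetime traffic."""
    def kgrams(pkts):
        c = Counter()
        for p in pkts:
            # count each k-gram once per packet (presence, not multiplicity)
            c.update({p[i:i + k] for i in range(len(p) - k + 1)})
        return c

    peace, attack = kgrams(peace_pkts), kgrams(attack_pkts)
    return {g for g, n in attack.items()
            if n >= attack_thresh * len(attack_pkts)
            and peace.get(g, 0) < peace_thresh * max(len(peace_pkts), 1)}
```

Given attack packets that all carry a tool footprint absent from peacetime traffic, the footprint is reported while common benign content (shared by both samples) is filtered out by the baseline.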
In Chapter 5, we use our algorithm for identification of String Heavy Hitters to build
a system for the mitigation of application level DDoS attacks [11, 12]. The packets which
comprise these attacks often contain a small footprint caused by the attack generation
tools. This footprint can be as small as an extra carriage return (newline) not normally
found in such packets. Our algorithms are able to find this footprint within the context
of the attack packets to allow the detection of consequent attack packets and therefore
mitigate the attack.
[Figure 1.1 depicts the two-stage flow: peacetime traffic is used to generate a whitelist/baseline; repetitions are found in attack-time traffic; only signatures found in the attack and not in peacetime are kept, and these are passed on to a NIDS, firewall, mitigation unit, etc.]
Figure 1.1: DDoS mitigation systems overview
In Chapter 7, we utilize our distinct heavy hitters algorithms for finding heavy
hitters in pairs to build a system for the mitigation of randomized attacks on the Domain
Name System [50]. In these attacks many unique requests containing pseudo-random
subdomains are sent for a specific domain. Our algorithms detect pseudo-randomly
generated traffic by identifying keys which appear in the stream with a large number
of different subkeys, and can therefore be used to detect such attacks.
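As a non-streaming point of reference for this idea, the sketch below reports keys whose number of distinct subkeys is at least a θ fraction of all distinct pairs; the precise problem definitions and the space-efficient algorithms appear in Chapter 6:

```python
def distinct_heavy_keys(pairs, theta):
    """Exact reference for the distinct-heavy-hitter idea: report a key
    when its number of *distinct* subkeys reaches a theta fraction of the
    total number of distinct (key, subkey) pairs (simplified threshold)."""
    subkeys = {}
    for key, sub in pairs:
        subkeys.setdefault(key, set()).add(sub)
    total = sum(len(s) for s in subkeys.values())  # distinct pairs overall
    return {k for k, s in subkeys.items() if len(s) >= theta * total}
```

In the DNS setting, a domain queried with a thousand pseudo-random subdomains stands out against a domain queried repeatedly with the same subdomain, regardless of raw query volume.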
1.1 Overview of Results
1.1.1 Time locality in Heavy Hitters and Detection of Heavy Flows
in Software Defined Networks
Software Defined Networks (SDN) have emerged in recent years as a framework for
creating configurable networks with improved network management abilities. While
SDN is not limited to OpenFlow [80], OpenFlow is currently the de facto SDN standard
in both industry and academia.
OpenFlow is based on the notion of match-action rules. The OpenFlow switch
maintains rule tables which are mostly TCAM based, called flow tables. Each rule in
these tables contains a match and an action. If a packet matches a certain rule, the
rule’s action will be applied to it. An action can modify parts of the packet, specify a
forwarding port, assign the packet to a group, etc. The controller, which manages the
switch flow tables, defines and installs these flow table rules.
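The match-action abstraction can be illustrated with the minimal sketch below; the field names and the install/process API are invented for this illustration and do not correspond to the OpenFlow specification:

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Rule:
    priority: int
    match: Dict[str, str]              # header field -> required value
    action: Callable[[dict], None]     # applied on a matching packet

class FlowTable:
    """Minimal illustration of priority-ordered match-action lookup."""

    def __init__(self):
        self.rules = []

    def install(self, rule):
        self.rules.append(rule)
        self.rules.sort(key=lambda r: -r.priority)  # highest priority first

    def process(self, packet):
        for rule in self.rules:
            # a rule matches when all of its fields agree with the packet
            if all(packet.get(f) == v for f, v in rule.match.items()):
                rule.action(packet)
                return rule
        return None  # table miss: a real switch would consult the controller
```

An empty match acts as a wildcard, so a low-priority catch-all rule plays the role of the default (table-miss) entry.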
While SDN switches are very efficient and considerably simpler to manage than
existing routers and switches, they do not offer direct means for sampling and detection
of large flows, which are both important for various basic network applications. These
applications include: traffic monitoring and QoS, security (DDoS detection and other
high volume attacks), anomaly detection, Deep Packet Inspection (DPI) and billing.
Naive use of OpenFlow for sampling and traffic measurements results in excessive
use of two important resources: the number of flow entries in the flow-tables, and the
amount of traffic between the switch and the controller and/or other monitoring devices.
Placing a flow-table entry for every flow does not scale and may even be infeasible,
and therefore more efficient solutions are required.
In Chapter 3 we propose Sample&Pick, an efficient algorithm which detects large
or heavy flows going through an SDN switch. The Sample&Pick algorithm divides the
detection and monitoring work between the switch and the controller, coordinating
between them to efficiently identify large flows. Our constructions are based on the
paradigm of the Sample and Hold [49] algorithm along with other classic heavy hitters
algorithms such as the Space-Saving algorithm [81]. In our solution, sampled packets at
the switch are sent to the controller where the suspected heavy flows are detected. For
each suspected heavy flow, an exact count or hold is placed in the switch. Subsequent
packets of this flow are not sampled, and therefore sampled packets from this flow are no
longer sent to the controller. Counters accumulated in the switch are periodically sent to
the controller, which integrates these counters into the general heavy hitters structure.
In this manner, our algorithm minimizes both the switch-controller communication
and the number of entries in the switch flow table.
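This division of labor can be sketched as a single-process simulation; the sampling probability and candidate threshold t below are illustrative, and the real algorithm splits this state between the switch and the controller:

```python
import random
from collections import Counter

class SamplePickSketch:
    """Schematic Sample&Pick simulation: sampled packets reach the
    'controller'; once a flow accumulates t samples, an exact counter
    ('hold') is installed at the 'switch' and the flow stops being sampled."""

    def __init__(self, sample_prob=0.01, t=5, seed=1):
        self.rng = random.Random(seed)
        self.sample_prob = sample_prob
        self.t = t
        self.controller_samples = Counter()  # controller: per-flow sample counts
        self.switch_holds = Counter()        # switch: exact counters for suspects

    def packet(self, flow_id):
        if flow_id in self.switch_holds:
            self.switch_holds[flow_id] += 1  # held flows counted exactly
        elif self.rng.random() < self.sample_prob:
            self.controller_samples[flow_id] += 1
            if self.controller_samples[flow_id] >= self.t:
                # controller suspects a heavy flow: install a hold
                self.switch_holds[flow_id] = 0

    def heavy_flows(self, min_count):
        return {f for f, c in self.switch_holds.items() if c >= min_count}
```

A flow of 100,000 packets is sampled a few hundred times before a hold is installed, after which it is counted exactly, while small flows rarely trigger a hold and never accumulate a large exact count.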
Based on different parameters, we differentiate between long lasting and short lived
flows and accordingly define heavy flows, elephant flows and bulky flows and present
innovative algorithms to detect the different types of flows in an SDN switch. Addition-
ally, we consider a distributed model with multiple switches and propose initial methods
for scaling out our techniques, to support both sampling and large flow detection in
the distributed setting.
Our methods rely on standard and optional features of OpenFlow 1.3 and can also
be implemented in the P4 language. Additionally, the techniques presented are efficient
both in terms of flow-table size and switch-controller communication.
We evaluate the performance of our Sample&Pick algorithm by measuring its inaccuracy
rates and resource consumption. Our evaluations demonstrate that our algorithm
is able to identify the heavy hitters while providing a good trade-off between the
amount of switch-controller communication and the amount of space required in the
switch.
1.1.2 Heavy Hitters in Textual Data: String Heavy Hitters
In Chapter 4, we consider the problem of finding popular substrings in a stream of
strings. That is, given a stream of strings of different lengths, we would like to find
substrings that appear in some fraction of the strings in the stream.
We define the String Heavy Hitters problem. The input is a sequence S =
⟨S1, . . . , SN⟩ of N strings, a constant k > 0 and θ s.t. 0 < θ < 1. The strings in S
may be of different lengths. A string s of length at least k is referred to as a string
heavy hitter when it is a substring of at least θN strings in S.
Define the weight b_s of a string s as b_s = Σ_{y=1..N} (1 if s ⊆ S_y, else 0), that is,
the number of strings in S of which s is a substring. Given this definition, a string s
is a string heavy hitter if b_s ≥ θN.
We present the Double Heavy Hitters algorithm for efficiently solving the String
Heavy Hitters problem. This algorithm finds popular strings of variable length in a set
of messages, using classic algorithms for heavy hitters detection (e.g. the Space-Saving
algorithm [81]) as a building block. Our algorithm uses a construction of two separate
instances of the classic heavy hitters algorithm. The first instance is used to identify
popular strings of a fixed length k and the second is used to identify popular strings
of varying length. This algorithm runs in a single pass over the input and its space
depends only on the predefined heavy hitters threshold, as in [81].
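The two-instance construction can be sketched as follows, with exact Counters standing in for the two Space-Saving instances; the rule of stitching maximal runs of consecutive heavy k-grams is a plausible simplification of the algorithm detailed in Chapter 4:

```python
from collections import Counter

def double_heavy_hitters(messages, k, theta1, theta2):
    """Sketch of the two-instance construction: instance 1 finds heavy
    k-grams; instance 2 counts maximal runs of consecutive heavy k-grams,
    stitched into variable-length strings."""
    # Instance 1: frequency of every k-gram (exact stand-in for Space-Saving)
    hh1 = Counter()
    for m in messages:
        hh1.update(m[i:i + k] for i in range(len(m) - k + 1))
    heavy_k = {g for g, c in hh1.items() if c >= theta1 * len(messages)}

    # Instance 2: stitch consecutive heavy k-grams into one candidate string
    hh2 = Counter()
    for m in messages:
        run_start, i = None, 0
        while i <= len(m) - k:
            if m[i:i + k] in heavy_k:
                if run_start is None:
                    run_start = i
            elif run_start is not None:
                hh2[m[run_start:i + k - 1]] += 1  # close the maximal run
                run_start = None
            i += 1
        if run_start is not None:
            hh2[m[run_start:]] += 1               # run reached message end
    return {s for s, c in hh2.items() if c >= theta2 * len(messages)}
```

For messages that share the embedded substring "HELLOWORLD" inside otherwise unique content, the stitched run recovers the full variable-length string even though only 4-grams are counted by the first instance.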
1.1.3 Zero-Day Signature Extraction for High Volume Attacks Using
String Heavy Hitters
Content signatures are a widely used tool in computer networks. Attack signatures
are one or more precise strings or regular expressions that are common to packets in
an attack. They are usually generated a priori and then kept in large databases in
order to identify the attack in future traffic. Traditional intrusion detection systems
(IDS) such as Snort [9] and Bro [94] maintain a large database of signatures of attacks
and malware. Traffic that goes through the IDS is compared to the known signatures
and traffic containing one or more signatures is dropped, thus preventing attacks that
are similar to previous ones. This mechanism is very effective in identifying future
recurrences of past attacks. However, in order to prevent yet-unknown attacks, new
signatures must be created and inserted into the database.
Two basic techniques are traditionally used to identify DDoS attacks: flow authen-
tication based on challenge response and flow behavioral analysis based on statistics
and various machine learning methods. Recent attacks, with millions of zombies
generating seemingly legitimate flows, fly under both radars. In these types of attacks,
behavioral analysis fails to detect the malicious traffic, as each zombie generates
little traffic, which in itself may appear benign. Furthermore, the huge
amount of attack sources makes it infeasible to stop the attack at the source. This
therefore leaves a loophole in the defense mechanisms and creates the demand for zero
day DDoS attack signature extraction.
Identifying signatures for unknown DDoS attacks is extremely difficult due to the
seemingly legitimate content found in the packets which comprise the attack. Most
traditional signatures are based on the malicious code that is expected in the attack
packets, which may not be the case with DDoS attacks. Leading industry experts
confirm that the signatures found in recent zero-day application-level DDoS attacks
are usually a by-product of the attack tools which the attackers use. These tools often
leave some footprint caused unintentionally by the program, such as a short string or
some (protocol-complying) anomaly in the packet content structure. Extracting such
signatures allows fine grained identification of attack packets during an attack with
minimal false positives or negatives.
These subtle signatures are not identified by the current automated defense mech-
anisms, but rather by a manual process which may take hours or days. Clearly, in
order to stop such unknown attacks while they are occurring, such signatures must be
extracted quickly and automatically.
In Chapter 5 we present an innovative system for automatic extraction of signatures
for high volume attacks, using a single pass over the input, and space dependent only
on the predetermined size of the data structures used by the heavy hitters detection
algorithm. This system is based on our Double Heavy Hitters algorithm.
Our system takes as input two streams (or stream samples) of traffic collected
during an attack and during peacetime. A peacetime traffic sample may be collected
as a routine scheduled procedure. The attack traffic sample can be collected once the
attack has been detected. We note that for DDoS attacks there are existing mechanisms
(for example in [93]) for identifying when an attack has started and for differentiating
between Flash events and DDoS attacks. The system then analyzes both traffic samples
to identify content that is frequent in the attack traffic sample yet appears rarely or
not at all in the peacetime traffic (as illustrated in Figure 5.1).
Our system makes no assumptions on traffic characteristics such as client behaviour,
address dispersion, URL statistics and so forth. Therefore, it is generic in that it can
be easily adapted to solving other network problems with similar characteristics.
We test our system on traffic from real attacks that have occurred recently. We show
that our solution performs well in real life, with an average recall of 99.95% and
an average precision of 98%.
In [12] we improve this solution by minimizing the number of signatures required
to identify malicious packets.
1.1.4 Heavy Hitters in a Stream of Pairs: Distinct and Combined
Heavy Hitters
Formally, our input is modeled as a stream of elements, where each element has a
primary key x from a domain X and a subkey y from a domain Dx. For each key, the
(classic) weight hx is the number of elements with key x, the distinct weight wx is the
number of different subkeys in elements with key x, and, for a parameter ρ ≥ 1, the
combined weight is b(ρ)x ≡ ρ·hx + wx. Combined weights are interesting as they can be a
more accurate measure of the load due to key x than either hx or wx in isolation: all
hx requests are processed, but the wx distinct ones are costlier.
A key x with weight that is at least an ε fraction of the (respective) total is referred
to as a heavy hitter: when hx ≥ ε·∑y hy, x is a (classic) heavy hitter (HH); when
wx ≥ ε·∑y wy, x is a distinct heavy hitter (dHH) or superspreader [116]; and when
b(ρ)x ≥ ε·∑y b(ρ)y, x is a combined heavy hitter (cHH).
The Distinct Heavy Hitters problem (also known as the Superspreaders problem
[116]) was formulated by Venkataraman et al. [116] and studied further [24, 76] but
existing algorithms do not match those of classic heavy hitters detection in performance
and practicality.
Our algorithms, presented in Chapter 6, are novel and efficient sampling-based
structures for dHH and cHH detection which are able to track only O(1/ε) keys. Our
dHH design significantly improves over existing work. We demonstrate, via experimen-
tal evaluations, the effectiveness of each of our algorithms.
1.1.5 Random Subdomain DNS Attacks Mitigation using Distinct
Heavy Hitters
The Domain Name System (DNS) service is one of the core services in internet func-
tionality. Attacks on the DNS service typically consist of many queries coming from a
large botnet. These queries are sent to the root name server or an authoritative name
server along the domain chain. The targeted name server will receive a high volume of
requests, which can degrade its performance or disable it completely. Such attacks may
also contain spoofed source addresses which would cause a reflection of the attack or
may send requests that generate large responses (such as an ANY request) to use the
DNS for amplification.
In randomized attacks on the DNS service, queries for many different non-existent
subdomains (subkeys) of the same primary domain (key) are issued [74]. Since the re-
sult of a query to a new subdomain is not cached at the DNS resolver, these queries are
propagated to the authoritative server for the domain, overloading both these servers
and the resolvers of the internet service provider (ISP). Such attacks have recently
become increasingly common and pose a challenge to Internet service providers.
In the described attacks, and other anomalies such as flash crowds, the impacted
keys are characterized by a large number of requests, number of distinct subkeys, or a
combination of the two. Fast and automated detection and mitigation of such attacks
or anomalies is important for maintaining robustness of the network service. Efficient
detection requires streaming algorithms which maintain a state (memory) that is much
smaller than the number of distinct keys and/or subkeys, which can grow rapidly during
an attack.
In Chapter 7, we design a system to detect Random Subdomain attacks on the
DNS service. The design makes use of our structures for dHH and cHH detection to
identify domains that have many different subdomains in DNS queries for that domain.
Our system generates a baseline during times of normal traffic load which allows it to
differentiate between domains that have many different subdomains on a regular basis
and those that are possibly being attacked.
We demonstrate the effectiveness of our DNS Random Subdomain attack detection
system as an application-specific tool, which we test on actual attack traces captured
by a large ISP, as well as attack traces captured at our university. Our evaluation shows
that an attack may be identified by our system with high accuracy after processing only
a small number of attack packets.
1.2 Methods
For our research we have used several types of methodologies from a variety of disci-
plines.
Algorithms design: One of our main focuses has been algorithm design, using tools
from various fields. Our algorithms are based on methods from the field of Stream-
ing and big data such as techniques for identifying frequent items, distinct counters
and sampling techniques. Additionally, all of our algorithms are designed to function
in communication networks and are therefore designed with consideration for the
different limitations imposed by these networks, such as traffic volume and speed, and
the resources available in them such as memory and computational capability. Our
algorithms for the String Heavy Hitters problem (Chapter 4) and Randomized Subdo-
main Attacks (Chapter 7) are inspired by techniques from the field of Stringology, and
our mechanisms for identifying heavy hitters in software defined networks (Chapter 3)
make use of tools offered by Programmable data planes, as well as ideas from the field
of Distributed Computing.
Algorithms analysis: We have analyzed our algorithms theoretically in terms of
correctness and quality, using standard measures such as false positive rate, false neg-
ative rate, recall and precision. Define the universe of items as U , the set of items that
need to be in the output of our system or algorithm as S (hence the set of items that
do not need to be in the output is U \ S), and the actual set of items output by our
system or algorithm as S′.
We define the following measures:
1. False positive rate: an item j is a false positive if j ∈ S′ and j /∈ S. The false
positive rate is defined as |{j : j is a false positive}| / |U \ S|.
2. False negative rate: an item j is a false negative if j ∈ S and j /∈ S′. The false
negative rate is defined as |{j : j is a false negative}| / |S|.
3. Recall: Defined as |S′ ∩ S| / |S|. Intuitively, it measures how many of the relevant items
have been selected.
4. Precision: Defined as |S′ ∩ S| / |S′|. Intuitively, it measures how many of the selected items
are relevant.
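For concreteness, a small Python example computing the four measures on hypothetical sets U, S, and S′ (called S_out below; the set contents are illustrative only):

```python
# Toy illustration of the four measures; U, S, S_out are hypothetical sets.
U = set(range(10))              # universe of items
S = {0, 1, 2, 3}                # items that should be output
S_out = {0, 1, 2, 8, 9}         # items actually output

false_pos = S_out - S           # output but should not be
false_neg = S - S_out           # should be output but missed

fpr = len(false_pos) / len(U - S)        # false positive rate
fnr = len(false_neg) / len(S)            # false negative rate
recall = len(S_out & S) / len(S)
precision = len(S_out & S) / len(S_out)
```

In this toy case the false positive rate is 2/6, the false negative rate 1/4, recall 3/4, and precision 3/5.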
Our analysis has been done on worst-case settings as well as an analysis of expected
behavior using reasonable assumptions on the traffic. Performance has been analyzed
using complexity measures of time and space as well as the amount of traffic overhead
generated by our systems.
Implementation and evaluation: We have put great emphasis on testing the qual-
ity and performance of our algorithms using experimentation and simulations. We have
implemented our algorithms using various programming languages including: C, C++
and Python. An implementation of our system for automatic signature extraction can
be found in [13].
We have used many different technologies for our implementations and simulations
including: Wireshark, WinPcap, Python 2.7, Python 3.3, Microsoft Visual Studio and
a large number of open source libraries.
We have performed testing on both simulated attacks and on real ones, using both
real and synthetic traffic. Our data sources include:
1. CAIDA traces [1–3].
2. Traces captured on campus at Tel-Aviv University.
3. Traces captured on campus at the Interdisciplinary Center, Herzeliya.
4. Traces captured locally using tools such as Wireshark.
5. Traces from various companies in the industry including top security companies
and a large ISP.
6. Traces containing DDoS attacks captured by UCLA’s DWARD lab [7].
1.3 Published Material
This thesis is based on the following published works:
• Yehuda Afek, Anat Bremler-Barr, and Shir Landau Feibish. Automated Signature
Extraction For High Volume Attacks. In Symposium on Architecture for Network-
ing and Communications Systems, ANCS ’13, San Jose, CA, USA, October 21-22,
2013, pages 147-156. IEEE Computer Society, 2013.
• Yehuda Afek, Anat Bremler-Barr, and Shir Landau Feibish. Zero-Day Signature
Extraction For High Volume Attacks. In IEEE/ACM Transactions on Networking
(TON), Submitted.
• Yehuda Afek, Anat Bremler-Barr, Edith Cohen, Shir Landau Feibish, and Michal
Shagam. Mitigating DNS random subdomain DDoS attacks by
distinct heavy hitters sketches. In Proceedings of the fifth ACM/IEEE Workshop
on Hot Topics in Web Systems and Technologies, HotWeb 2017, San Jose, CA,
USA, October 12 - 14, 2017, pages 8:1-8:6. ACM/IEEE Computer Society, 2017.
• Yehuda Afek, Anat Bremler-Barr, Edith Cohen, Shir Landau Feibish, Michal
Shagam: Efficient Distinct Heavy Hitters for DNS DDoS Attack Detection. In
CoRR abs/1612.02636, 2016.
• Yehuda Afek, Anat Bremler-Barr, Shir Landau Feibish, and Liron Schiff. Sampling
and large flow detection in SDN. In Proceedings of the 2015 ACM Conference
on Special Interest Group on Data Communication, SIGCOMM 2015, London,
United Kingdom, August 17-21, 2015, pages 345-346. ACM, 2015.
• Yehuda Afek, Anat Bremler-Barr, Shir Landau Feibish, and Liron Schiff. Detect-
ing Heavy Flows in the SDN Match and Action Model. In CoRR abs/1702.08037,
2017.
• Yehuda Afek, Anat Bremler-Barr, Shir Landau Feibish, and Liron Schiff. Detect-
ing Heavy Flows in the SDN Match and Action Model. In Computer Networks
Journal (ComNet): Special Issue on Security and Performance of Software-defined
Networks and Functions Virtualization, Submitted.
Chapter 2
Background
2.1 Frequent Items in Data Streams
2.1.1 The Data Streaming Model
Network traffic is often referred to as a data stream. The data streaming model, as
defined in [21], is a model in which the input data is not available for access locally,
but rather arrives online as a continuous data stream. The order of the elements or
the size of the stream is not known a-priori and can not be determined by the system.
Furthermore, the data usually can not be stored in its entirety, and therefore once the system
has completed processing an element it is discarded, such that often only a single pass
over the input is possible.
2.1.2 The Heavy Hitters Problem
The problem of finding the frequent items in a stream of data evolved from the problem
of finding the majority value in a stream, which was first introduced by Moore and
Boyer [31, 84].
2.1.2.1 The Majority Problem
The Majority problem is defined as follows: given a sequence of N values from universe
U , using a constant amount of space, and in one pass over the values decide if a single
value appears more than N/2 times.
Formally, given a sequence of N values, the algorithm should output:
• If ∃j : fj > N/2: output j
• Else: output null
In [83] Misra and Gries presented a novel algorithm for this problem. Their algorithm,
which is detailed in Procedure Misra and Gries Majority, maintains a single
counter initialized to zero and a VAL variable. Upon the first value that is seen, VAL
is set to that value and the counter is increased by 1. For each value in the stream, if
the value is equal to VAL, the counter is incremented. Otherwise, if the counter equals
zero, VAL is set to be the new value and the counter is incremented. Otherwise, the
counter is decremented.
Procedure Misra and Gries Majority
Data: 〈α1, ..., αN〉
Result: the majority value, if it exists
VAL = NULL; count = 0
VAL = α1; count = 1
for i = 2 → N do
    if count == 0 then VAL = αi; count = 1
    else if VAL == αi then count++
    else count−−
if count > 0 then return VAL
else return NULL
If after passing through all of the values the counter is greater than zero, then the
value of VAL is the only candidate for being the majority. To verify that it is indeed
the majority, a second pass is needed to count the number of times it appears.
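The two passes can be sketched in Python (an illustrative reimplementation, not the thesis code):

```python
def majority(values):
    # Pass 1 (Misra-Gries): maintain a single candidate and a counter.
    val, count = None, 0
    for v in values:
        if count == 0:
            val, count = v, 1
        elif val == v:
            count += 1
        else:
            count -= 1
    # Pass 2: verify that the surviving candidate really is a majority.
    if val is not None and sum(1 for v in values if v == val) > len(values) / 2:
        return val
    return None
```

For example, majority([1, 2, 1, 1, 3, 1, 1]) returns 1, while majority([1, 2, 3]) returns None since no value appears more than N/2 times.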
2.1.2.2 Heavy Hitters Algorithms
We present the following definitions which form the basis of our discussion.
Definition 1. Item frequency: Given a sequence 〈α1, ..., αN〉 of N items from universe
U , the frequency of item x is denoted fx = |{j : αj = x}|.
Definition 2. Heavy Hitter: Given a sequence of N values α = 〈α1, .....αN 〉 from
universe U and a threshold 0 ≤ θ ≤ 1, x is a heavy hitter if fx > θN .
Finding all of the heavy hitters in a stream with an exact solution would require
maintaining knowledge about the frequency of every item in the stream, which would
take up O(|U|) space [34, 45] and is impractical for applications in which U is sufficiently
large.
Following [40], we provide the following approximation problem definition:
Definition 3. The Heavy Hitters Problem (also known as the ε-Approximate Frequent
Items Problem): Given a sequence S of N values α = 〈α1, .....αN 〉 from universe U , a
threshold 0 ≤ θ ≤ 1 and an error value ε, find a set of items F such that for any item
x ∈ F , fx > (θ − ε)N and for any item j such that j ∈ S and j /∈ F , fj < θN .
Note that a streaming algorithm for the Heavy Hitters problem can use only a
constant amount of space and make a single pass over the input.
Heavy hitter detection in streams was widely studied and deployed. Many solutions
have been proposed for the classical Heavy Hitters problem, for example, the solutions
suggested in [18, 43, 56, 79, 81, 83]. We provide a detailed explanation of two of these
algorithms, specifically those presented in [83] and [81]. Most of the algorithms for
the Heavy Hitters problem can be categorized as counter-based, such as [81, 83] or
sketch-based. A well known example of a sketch-based algorithm is the Count-Min
algorithm of Cormode et al. which maintains a sketch of the stream. A sketch in this
sense is a data structure which at any time allows us to quickly compute the estimated
frequency of any item. A description of some of the substantial algorithms for heavy
hitters, as well as other significant results regarding the heavy hitters problem can be
found in [41].
The Misra-Gries Algorithm One of the first solutions for this problem was proposed
by Misra and Gries in [83]. Given θ such that θN = N/(k + 1), the algorithm maintains a structure T
of k items, each item consists of a VAL and a counter. The algorithm works as follows:
Upon the first stream element v, the VAL of the first item in T is set to v and its
counter is set to 1. For each subsequent element v′ in the stream:
1. If v′ is already the VAL of an item in T , its counter is incremented by 1.
2. Otherwise, that is, if v′ is not the VAL of one of the k items:
(a) If one of the items in T has a counter equal to zero, the VAL of that item is
set to be v′ and its counter is set to 1.
(b) Otherwise decrement all k counters.
After completing the pass over all of the elements, the VALs found in the k items of
T are the candidates for being the heavy hitters. The time complexity of the algorithm
is O(1) for the dictionary operations plus the cost of decrementing all of the counters.
Misra and Gries propose an amortized O(1) solution using a balanced search tree.
Demaine et al. propose a worst case O(1) solution using linked lists [45].
The intuition behind this procedure is that if an element occurs at least N/(k + 1) times
then it must get inserted at some point and there are not enough elements to erase
it completely. It should be noted that the algorithm potentially produces many false
positives. Some or even many of the candidates found in T may not be real heavy
hitters. Suppose for example, that the stream is made up of some k− 1 distinct values,
followed by a stream of a single value. The output in this case will include k − 1
candidates which appeared only one time in the stream and another candidate which
is the real heavy hitter and appeared N − k+ 1 times. To deal with this we perform an
additional procedure on the results to identify the real heavy hitters. In addition, it is
possible to maintain meta-data while performing the above algorithm to provide some
intermediate indication of who the heavy hitters actually are.
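The k-counter procedure just described can be sketched in Python (an illustrative variant that evicts counters reaching zero rather than reusing fixed slots; as noted, the survivors are only candidates and still require a verification pass):

```python
def misra_gries(stream, k):
    """Return candidate heavy hitters: any value occurring more than
    N/(k+1) times is guaranteed to be among the survivors."""
    counters = {}
    for v in stream:
        if v in counters:
            counters[v] += 1
        elif len(counters) < k:
            counters[v] = 1
        else:
            # No free slot: decrement every counter, dropping zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)
```

For example, with k = 2 and the stream [1, 1, 1, 1, 1, 2, 3, 4, 5] (N = 9, N/(k+1) = 3), the value 1 is guaranteed to survive.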
The Space Saving Algorithm In [81] an additional counter-based algorithm for
the Heavy Hitters problem, called the Space-Saving algorithm, is proposed; it is
detailed in Procedure Space-Saving Heavy Hitters.
As in the above algorithm, a structure T of k items is maintained, each item consists
of a VAL and a counter. For each element v in the stream, if v is already the VAL of
an item in T , its counter is incremented. Otherwise, the item having the lowest count
in T is replaced by v and its counter is incremented.
The error rate of this algorithm is ε = N/nv [81], meaning that each counter in the
output of the algorithm is at most ε higher than the actual number of times that the
value appeared in the stream. The algorithm requires O(1) time per stream item; it
makes only a single pass over the input, therefore running in O(N) time, and requires
constant space.
For our systems' implementations (see Sections 5.4 and 7.3.2), we chose to implement
the Space-Saving algorithm of Metwally et al. [81], since it provides quite accurate
counter estimations for values seen early in the stream [41].
Procedure Space-Saving Heavy Hitters
Data: 〈α1, ..., αN〉, constant nv << N
Result: nv heavy hitter candidates
// Maintain nv heavy hitter candidates.
Frequent[1...nv]: item = NULL and count = 0
for i = 1 → N do
    // If αi is in Frequent, increment its count.
    if ∃j s.t. Frequent[j].item == αi then Frequent[j].count++
    else
        // Find the item with the smallest count, and replace it.
        find j s.t. ∀h: Frequent[j].count ≤ Frequent[h].count
        Frequent[j].item := αi
        Frequent[j].count++
return Frequent
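The same procedure in Python (a minimal dictionary-based sketch; a production version would use the stream-summary structure of [81] for constant-time minimum lookup):

```python
def space_saving(stream, nv):
    """Space-Saving with nv counters: each reported count overestimates
    the true frequency by at most N/nv."""
    counts = {}
    for v in stream:
        if v in counts:
            counts[v] += 1
        elif len(counts) < nv:
            counts[v] = 1
        else:
            # Evict the item with the smallest count; the newcomer
            # inherits that count plus one (hence the overestimate).
            loser = min(counts, key=counts.get)
            counts[v] = counts.pop(loser) + 1
    return counts
```

For example, space_saving(["a"] * 6 + ["b", "c", "b"], 2) keeps "a" with its exact count 6, while "b" (true count 2) is reported as 3, within the N/nv error bound.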
Frequent Items using Sampling: Sample and Hold The Sample and Hold family
of streaming algorithms [39, 48, 53] consists of sampling-based solutions for frequent item
detection in streams.
We provide an overview of these algorithms. Given a stream of elements, a set
of cached elements or keys is maintained (the cached elements make up the sample).
A counter cx is maintained for each cached key which tracks the number of times it
occurred in the stream since it entered the cache. When an element with key x that is
not cached is processed, a biased coin flip is used to determine whether to add it to the
cache.
Two basic designs are the fixed threshold and fixed-size paradigms. The fixed thresh-
old design is specified for a threshold τ . The algorithm maintains a cache S of keys,
which is initially empty, and a counter cx for each cached key x. A new element with
key x is processed as follows: If x ∈ S is in the cache, the counter cx is incremented.
Otherwise, a counter cx ← 1 is initialized with probability τ . In the fixed-threshold
paradigm, the bias of the coin is specified, yet it has the disadvantage that the memory
usage (sample size) can increase.
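A minimal Python sketch of the fixed-threshold design (τ, the seed, and the streams are illustrative; the fixed-size variant would additionally evict keys and re-bias the coin as the stream is processed):

```python
import random

def sample_and_hold(stream, tau, seed=0):
    """Fixed-threshold Sample and Hold: a cached key counts every later
    occurrence; an uncached key enters the cache with probability tau."""
    rng = random.Random(seed)
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1          # already held: count deterministically
        elif rng.random() < tau:
            counters[x] = 1           # biased coin flip to start caching x
    return counters
```

With tau = 1.0 every key is cached on first sight, so the counters equal the exact frequencies; smaller tau trades accuracy for fewer cached keys.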
The fixed-size design is specified for a fixed sample (cache) size k and works by
effectively lowering the threshold τ to the value that would have resulted in k cached
keys. In this case the bias is modified as the stream is processed.
An important property of Sample and Hold is that the set of sampled keys is a
probability proportional to size without replacement (ppswor) sample of keys according
to weights hx [102].
2.1.3 Related Problems
Data stream analysis and particularly, item frequency in streams, has received much
attention in both the research community and in the industry. We mention a few
problems which offer a variety of different flavors of the heavy hitters problem. These
problems are sometimes confused with the Heavy Hitters Problem and we define them
here to disambiguate them from the problems we deal with.
The Top-k Problem Following [34], we define the problem as follows: Given a se-
quence S of N values α = 〈α1, ..., αN〉 from universe U and values k and ε, assume that
value vi occurs ni times and that n1 ≥ n2 ≥ n3 ≥ .... The Top-k Problem is to find a set
T of k values from S such that for every value vj ∈ T , nj > (1 − ε)nk. This problem has been
widely studied and we refer the interested reader to solutions proposed, for example,
in [22, 34, 53, 81].
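As an offline baseline (not a streaming solution), the exact top-k by frequency can be computed in a few lines:

```python
from collections import Counter
import heapq

def top_k(stream, k):
    # Count exact frequencies, then take the k most frequent values.
    freq = Counter(stream)
    return heapq.nlargest(k, freq.items(), key=lambda kv: kv[1])
```

For example, top_k([1, 1, 1, 2, 2, 3], 2) returns [(1, 3), (2, 2)]. The streaming solutions cited above approximate this output without counting every distinct value.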
The Item Frequency Problem Given a sequence S of N values α = 〈α1, .....αN 〉
from universe U and a value ε, the Item Frequency Problem is: given any j, return
f′j such that f′j ≤ fj ≤ f′j + εN [40]. This problem requires a different processing of
the stream than the solutions proposed for the Heavy Hitters problem, and its solutions may
be useful for different types of applications. A discussion of this problem in various
streaming models can be found in [45].
Hierarchical Heavy Hitters The last variant which we discuss offers a variation on
the type of data which is handled, which creates the need for a different definition of the
Heavy Hitters problem as well as different algorithms for its solution. The Hierarchical
Heavy Hitters problem (HHH) [42, 111] seeks to find the heavy hitters for data that
has a well defined hierarchical structure such as IP addresses. For example, IP addresses
are formed in a way that 123.*.*.* includes 123.45.*.*, therefore forming a hierarchy
of the data.
Following the definition of [42], the problem is defined as follows: Given a set S of
N items from a hierarchical domain D of height h. For a set P of prefixes from D,
define elements(P ) to be the union of items that are descendants of P in the hierarchy.
Given a threshold θ, the set of Hierarchical Heavy Hitters is defined inductively:
• Level 0: HHH0 is the set of Hierarchical Heavy Hitters at level 0. This is simply
the heavy hitters of S by Definition 3.
• Level i: HHHi is the set of Hierarchical Heavy Hitters at level i. Given a prefix p in
level i of the hierarchy, define Fp = ∑{f(e) : e ∈ elements(p) ∧ e /∈ ∪l=0..i−1 HHHl}.
Then HHHi is the set {p : Fp ≥ θN}.
The HHH of S is the set HHH0 ∪ HHH1 ∪ ... ∪ HHHh.
An elegant and efficient solution for this problem can be found in [111].
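Under one plausible reading of the inductive definition (items already covered by a lower-level HHH are excluded from higher-level sums), an exact non-streaming computation for dotted addresses can be sketched as follows; this is illustrative only, not the algorithm of [111]:

```python
from collections import Counter

def hhh(items, theta, height=4):
    """Exact HHH over a dotted hierarchy (level 0 = full address,
    level height-1 = first label)."""
    n = len(items)
    freq = Counter(items)
    covered = set()          # items already claimed by a lower level
    levels = {}
    for level in range(height):
        keep = height - level                 # leading labels kept
        agg = Counter()
        for item, f in freq.items():
            if item not in covered:
                agg[".".join(item.split(".")[:keep])] += f
        levels[level] = {p for p, f in agg.items() if f >= theta * n}
        for item in freq:
            if ".".join(item.split(".")[:keep]) in levels[level]:
                covered.add(item)
    return levels
```

For example, with five copies of "1.2.3.4" among eight items and θ = 0.5, only level 0 contains a heavy hitter; the remaining items are too spread out for any prefix to reach θN at higher levels.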
2.2 DDoS
Denial of Service (DoS) attacks have been threatening the security of Internet users and
services for approximately two decades. One third of service downtime in the Internet
is caused by Distributed Denial of Service (DDoS) attacks and over 2000 DDoS attacks
are witnessed globally each day [20].
In networks, denial of service occurs when a network entity can not reach another
node in the network or can not get a legitimate service from that node [64]. In the early
’90s, such attacks were used in online gaming and Internet relay chat communities
[63]. Since then, the Internet and the attackers have evolved significantly, and in recent
years, DDoS attacks have been posing a significant risk to the Internet and its users.
A DDoS attack occurs when the attacker uses many computers to launch the denial of
service attack. Another important difference between the two types of attacks, is that
while in DoS attacks the attacker sends a relatively small number of packets targeting
a bug in the victim’s program or application, in DDoS attacks, the attacker sends a
huge amount of traffic to a valid, seemingly unexposed victim [29].
DDoS flooding attacks can generally be classified into Network/Transport level attacks
and application level attacks [127]. Network/Transport level attacks generally
refer to attacks which consume a large portion of the bandwidth or exploit some fea-
ture or bug of a protocol to consume resources such as in a TCP SYN flood. The
TCP SYN attack [8] is a well known DDoS attack in which the attacker floods the
server with TCP/SYN packets from forged senders, causing the server to keep half-
open connections for responses that will never arrive, using up the available connection
resources.
Application level attacks refer to attempts to consume the server resources such as
CPU, ports, sockets, bandwidth, memory etc. In these attacks, the attacker floods the
network with seemingly legitimate traffic aimed at a server’s incoming link, in order
to consume as much of the server’s resources as possible. This prevents the server
from handling traffic from legitimate users, or at least significantly impairs its ability to do so. This can
be done for example, by using an army of zombies in which each zombie sends traffic
to the server, and in itself, acts as a legitimate user. More sophisticated attackers can
send more complicated requests to the server, thereby causing it to use up computation
resources as well. Attacks against the Domain Name System (DNS) service may also
contain spoofed source addresses which would cause a reflection of the attack or may
send requests that generate large responses (such as an ANY request) to use the DNS
for amplification.
There has been a great deal of work done on mitigation of different types of DDoS
attacks. Recent advances include solutions for mitigation of DDoS attacks in Software
Defined Networks or cloud environments (For example [119, 122]).
In existing DDoS defense mechanisms, the system has several layers of detection and
defense [10]. Detection layers may look for different types of anomalies and suspicious
findings in the traffic. The defense layers may include several levels of escalation. If
some traffic is considered suspicious, it can be escalated and the source of the traffic
is then presented with different types of challenges, based on the level of escalation. It
is important to understand that escalated traffic is not rejected but rather challenged.
One type of challenge is known as a CAPTCHA (Completely Automated Public Turing
test to tell Computers and Humans Apart [118]), though there are many other types
as well. During a DDoS attack, the attacked resources become unavailable to most if
not all legitimate users. The system then needs to quickly identify the attack traffic so
that it can be escalated and challenged appropriately. Using the signatures extracted
by our algorithm, traffic is filtered. Content that contains some minimal number (1 or
more) of signatures is considered malicious and the rest is considered legitimate.
DDoS attacks have been widely studied in the literature. Following [127], defense
mechanisms may be classified by location as source based, destination based, or network
based.
For network level attacks, source based solutions include traffic filtering or moni-
toring at the source’s edge routers or networks, such as in [51, 82]. However, with the
increasing use of botnets these mechanisms are becoming less useful for today’s DDoS
attacks and therefore network and destination based defenses are more effective [127].
Network based solutions include route based packet filtering (e.g. [72]) and detection of
malicious routers (e.g. [55]). Destination based solutions include packet marking (e.g.
[35]) and filtering (e.g. [95]).
Our work focuses on defenses against application level attacks. Destination based
solutions include various mechanisms which protect a server or group of servers against
potential threats. Examples of such mechanisms include [100, 101]. Hybrid based so-
lutions are also common for application level DDoS attacks, and can be a combined
mechanism in different locations, such as detection at the destination and mitigation
at the network. Examples of such solutions are traffic anomaly detection (e.g. [77, 92]),
admission control (e.g. [107]) and methods for differentiating bots from humans using
mechanisms such as CAPTCHA. Recent SDN and cloud technologies have also brought
the development of various cloud based solutions such as [87, 119].
DDoS attacks are constantly growing in both number and strength; therefore,
DDoS detection and mitigation continue to be extensively researched.
2.3 Deep Packet Inspection
Deep packet inspection (DPI) is one of the core techniques used by security tools such
as Web Application Firewalls (WAF), Network Intrusion Detection/Prevention Systems
(NIDS/IPS) (e.g. Snort [9] and Bro [32]) and others, to detect malicious traffic.
DPI refers to the process of examining the payload and the header of packets, as
they go through different components of the communication network, and indicating
when traffic may contain a malicious signature. To do so, packets are searched for
signatures of malicious traffic using pattern matching techniques.
Signatures can be either precise strings or regular expressions. DPI makes use of
algorithms which can efficiently match multiple signatures. For exact strings, signatures
are detected using classic pattern matching algorithms, usually based on deterministic
finite automata (DFA). Commonly used algorithms include Aho-Corasick [15] and
Wu-Manber [78]. Regular expression matching can be performed using deterministic
finite automata (DFA) or non-deterministic finite automata (NFA) [26, 71].
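To illustrate multi-pattern exact matching of the Aho-Corasick kind, the following Python sketch builds a goto trie with failure links and scans a text in a single pass; it is a didactic version, far from an optimized DPI engine:

```python
from collections import deque

def build_ac(patterns):
    # Trie nodes: goto transitions, failure link, and matched patterns.
    trie = [{"next": {}, "fail": 0, "out": []}]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in trie[node]["next"]:
                trie[node]["next"][ch] = len(trie)
                trie.append({"next": {}, "fail": 0, "out": []})
            node = trie[node]["next"][ch]
        trie[node]["out"].append(pat)
    # BFS to set failure links (longest proper suffix present in the trie).
    queue = deque(trie[0]["next"].values())
    while queue:
        u = queue.popleft()
        for ch, v in trie[u]["next"].items():
            queue.append(v)
            f = trie[u]["fail"]
            while f and ch not in trie[f]["next"]:
                f = trie[f]["fail"]
            trie[v]["fail"] = trie[f]["next"].get(ch, 0)
            trie[v]["out"] += trie[trie[v]["fail"]]["out"]
    return trie

def search(trie, text):
    # One pass over the text, reporting (start_index, pattern) matches.
    hits, node = [], 0
    for i, ch in enumerate(text):
        while node and ch not in trie[node]["next"]:
            node = trie[node]["fail"]
        node = trie[node]["next"].get(ch, 0)
        for pat in trie[node]["out"]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

For the classic example patterns {"he", "she", "his", "hers"}, scanning "ushers" reports "she" at offset 1 and both "he" and "hers" at offset 2, all in a single pass over the text.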
Signatures can be generated offline and then inserted into the DPI engine for match-
ing against future traffic. As explained in Chapter 5, signatures generated by our system
can be used in such a manner to detect and mitigate zero-day application level DDoS
attacks in any middlebox that contains a DPI component.
It should be noted that DPI is one of the most resource- and time-consuming
processes within different network security components. It is usually the string manipulation
or pattern matching procedures which account for much of the time and resource
demands of DPI. There is ongoing research being done to improve the efficiency of these
processes, such as [71].
2.4 Software Defined Networks
SDN has emerged in recent years as a framework for creating configurable networks
with improved network management abilities. While SDN is not limited to OpenFlow [80],
OpenFlow is currently the de-facto SDN standard both in industry and academia.
OpenFlow switches operate flow tables, mostly TCAM based, that are used to match
packet header fields, with a limited set of actions such as set field and add label.
In general, the OpenFlow protocol is based on a match-action concept: OpenFlow
switches store rules (installed by the controller) consisting of a match and an action
part. A packet matched by a certain rule will be subject to the associated action. For
example, an action can define a port to which the matched packet should be forwarded.
An action can also add or change a tag of a packet (a certain part in the packet header).
The controller, which manages the switch flow tables, defines and installs these flow
table rules and uses them to manage the traffic going through the switch.
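The match-action concept described above can be sketched as follows. The rule encoding, field names and `lookup` helper are illustrative only, not the OpenFlow wire protocol; a real switch performs this lookup in hardware.

```python
# Hypothetical sketch of OpenFlow-style match-action lookup.

def make_rule(match, action, priority):
    """A rule matches a packet if every specified field agrees.
    `match` maps field name -> required value; omitted fields are wildcards."""
    return {"match": match, "action": action, "priority": priority}

def lookup(flow_table, packet):
    """Return the action of the highest-priority matching rule, or None."""
    for rule in sorted(flow_table, key=lambda r: -r["priority"]):
        if all(packet.get(f) == v for f, v in rule["match"].items()):
            return rule["action"]
    return None

table = [
    make_rule({"dst_ip": "10.0.0.5"}, ("output", 3), priority=10),
    make_rule({}, ("output", "controller"), priority=0),  # table-miss rule
]
```

A packet to 10.0.0.5 hits the specific rule and is forwarded to port 3; any other packet falls through to the table-miss rule and is sent to the controller.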
We will assume that each switch has a group table: a forwarding table whose rules
include an ordered list of action buckets. Each action bucket contains a set of actions to
execute, and the buckets provide the ability to define multiple forwarding behaviors. Each
bucket in a fast-failover type table is associated with a parameter that determines whether
the bucket is live; a switch will always forward traffic to the first live bucket. As the
parameter that determines liveness, the programmer specifies either an output port or a
group number (to allow several groups to be chained together).
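The fast-failover selection described above can be sketched as follows. This is a toy model with illustrative names; in a real switch, liveness is tracked by the data plane and the bucket is selected per packet.

```python
# Hypothetical sketch of fast-failover bucket selection: forward to the
# first bucket whose watched port is live.

def select_bucket(buckets, live_ports):
    """buckets: ordered list of (watch_port, actions).
    Return the actions of the first live bucket, or None (drop)."""
    for watch_port, actions in buckets:
        if watch_port in live_ports:
            return actions
    return None

# Primary path via port 1, backup via port 2.
ff_group = [(1, ["output:1"]), (2, ["output:2"])]
```

When port 1 goes down, traffic shifts to the port-2 bucket without any controller involvement, which is the point of the fast-failover group type.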
Chapter 3
Time Locality in Heavy Hitters
and Detection of Heavy Flows in
Software Defined Networks
Match and Action Model
3.1 Overview
We present techniques for detecting large flows in traffic that passes through an SDN
switch running OpenFlow. While SDN switches are very efficient and considerably simpler
to manage than existing routers and switches, they do not offer direct means for the
detection of large flows.
Existing network monitoring tools for classic IP networks have been available for
over 20 years, with one of the earliest tools being Cisco NetFlow [4]. Over the years,
traffic visibility, and specifically measurement and monitoring in IP networks, has become
an increasingly difficult task due to the overwhelming amounts of traffic and
flows [128]. While existing tools may be very useful for classic networks, monitoring
in SDN networks requires new tools and technology. The SDN network architecture
places the controller at the focal point of the network. Therefore, using existing tools
would require extensive communication between the controller and the monitoring
tools, which would place significant overhead on the controller. It is therefore necessary
to provide new monitoring methods for SDN networks based on the SDN architecture.
We design ways to implement monitoring methods with the widespread OpenFlow
standard and the recent P4 [30] standard for SDN switches. OpenFlow switches provide
counters that measure the number of bytes and packets per flow entry, yet traffic
measurement remains a difficult task in SDN for two reasons. First, the hardware
constraints (usually of Ternary Content Addressable Memories (TCAMs)) limit the
number of flows which the switch can maintain and follow. Second, the switch can
process only a limited number of updates per second [108], which limits the number of
updates that the controller can make to the flow table. The algorithms provided herein
overcome these limitations by providing efficient building blocks for large flow detection
and sampling which may be used by various monitoring applications.
3.1.1 Our Contribution
First, we propose our Sample&Pick algorithm, which is an efficient method to detect
large or heavy flows going through an SDN switch. The Sample&Pick algorithm is
designed for protocols which are based on the match and action model (e.g., OpenFlow,
P4, etc.), and performs a division of labour between the switch and the controller,
coordinating between them to identify the large flows. Sample&Pick achieves very high
accuracy using a fixed number of rules in the switch while requiring little communication
between the switch and the controller.
Second, we consider a distributed model with multiple switches and propose
solutions for efficient scaling of our techniques, to support large flow detection in the
distributed setting.
Finally, we have implemented and evaluated our Sample&Pick algorithm, comparing
it with OpenSketch [124]. The sampling methods rely on standard and optional
features of OpenFlow 1.3 (or the P4 language) and are implemented with the NoviKit
(hardware) switch [5] (operated with NoviWare switching software [6]). The heavy flow
detection also relies on a standard OpenFlow controller and was evaluated as a
whole using a dedicated virtual-time simulation for both the data and control planes.
Additionally, the techniques presented are efficient in both flow-table size and
switch-controller communication.
3.2 Related Work
3.2.1 Network Measurement
Network measurement tools are a key component in creating quality networks and
are crucial for providing advanced network abilities such as QoS and security. Cisco
NetFlow [4] was one of the earliest network monitoring tools. It provided a variety
of monitoring capabilities allowing the collection of IP flow level statistics. NetFlow
provided the ability to gather information from the router about every IP flow, including
byte and packet counts, yet suffered from high processing and collection overheads. In
the variant Sampled NetFlow, sampling was used to partially decrease these overheads,
yet Sampled NetFlow provided reduced accuracy caused by the straightforward use of
sampling [49]. In [49], Estan and Varghese significantly improve the accuracy of the
sampling process by introducing the Sample and Hold algorithm, which provides better
accuracy while reducing the processing and collection overhead. The Sample and Hold
algorithm is essentially sampling with a "twist". As in regular sampling, each packet is
sampled with some probability, and if there is no entry for the packet's flow, an entry
is created. Once an entry for a flow exists, it is updated for every packet thereafter in
that flow.
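The Sample and Hold logic of [49] can be sketched as follows. This is an illustrative in-memory simulation; in a router (or, in Chapter 3, an SDN switch), the per-flow entries are hardware counters rather than a dictionary.

```python
import random

def sample_and_hold(packets, p, seed=0):
    """Sketch of Estan & Varghese's Sample and Hold [49]: sample each
    packet with probability p; once a flow has an entry, every subsequent
    packet of that flow is counted exactly.
    `packets` is an iterable of flow identifiers (e.g. 5-tuples)."""
    rng = random.Random(seed)   # seeded for reproducibility
    counters = {}
    for flow in packets:
        if flow in counters:
            counters[flow] += 1          # held flow: counted exactly
        elif rng.random() < p:
            counters[flow] = 1           # sampled: create an entry
    return counters
```

A heavy flow is sampled early with high probability and is then counted almost exactly, which is why Sample and Hold is far more accurate than plain sampling for the same sampling rate.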
In a usual setup, monitoring devices are placed in central locations in the network
(such as Arbor’s Peekflow [60], or other security detection devices) and samples of traffic
are being sent to the monitoring devices for various additional processing for which the
switch/router are not suitable, such as heavy hitters analysis, DPI, and behavioral
analysis. These monitoring devices usually cannot absorb and process all the traffic.
Therefore, traffic must be sampled, and only the samples or relevant flows should be
forwarded to these devices.
As networks evolved, network monitoring tools with more advanced capabilities
were developed. In [104], for example, a flow monitoring tool was presented, which
adds flow sampling abilities as an inherent capability of the routers. They provide a
framework for distributing the monitoring across routers, allowing for network-wide
monitoring. By using uniform hash functions, flow sampling is not duplicated across
different routers which route the same flow.
In OpenFlow, the flow table allows us to define rules which support counting of
bytes and packets per flow. However, this is not sufficient for more advanced
measurements. Recently there have been several works that discuss or suggest
enhancements to network measurement capabilities for both OpenFlow and for SDN in
general. FleXam, a sampling infrastructure for OpenFlow proposed in [105], adds
sampling capabilities using random number generation. OpenSketch [124] provides a
simple approach to collect and use measurement data, separating the measurement data
plane from the control plane. The paper suggests a new architecture, where in the data
plane a pipeline of three essential building blocks is provided: hashing, filtering and
counting, and in the control plane a wide library of measurement tasks is provided. The
above works suggest an alternative to the OpenFlow architecture, while our work relies
on features that already appear in the current OpenFlow standard as required or optional
features, in addition to common extensions such as matching on an extra field in
the packet. These extensions follow the concepts described in [44], which suggests that
the OpenFlow standard should allow the user to configure the headers that the switch
can examine. All our modifications are in the spirit of the OpenFlow architecture. We
note that there are works that do not require changes to the OpenFlow standard. For
instance, OpenNetMon, described in [115], is a controller module for monitoring
flow-level metrics, such as packet loss, delay and throughput in OpenFlow networks.
A recent work [125], proposes a method for distributing the monitoring tasks be-
tween different switches in order to reduce the number of rules needed in each switch.
This method is orthogonal to our distributed solution (see Section 3.6), and can be
combined to further reduce the number of switch entries.
Another recent work, [85], proposes DREAM, a framework for identifying heavy
hitters (see Section 2.1) in traffic using TCAM based hardware. As shown in [85], the
algorithm they use for heavy hitters detection may require more TCAM entries than
a commodity switch may have available. Therefore DREAM performs efficient resource
allocation between multiple switches to achieve the desired accuracy rates. The
Sample&Pick algorithm we propose (Section 3.4.2) requires significantly fewer counters
in the switch and can be used by DREAM to reduce the overall number of switch
entries used.
3.3 Time Locality Definitions for Heavy Hitters
In the past, a flow was defined as a sequence of packets considered to be logically
equivalent to a call [33]. A slightly broader definition of a flow is a sequence of packets
from a specific source to a specific unicast, anycast or multicast destination [99]. A more
robust definition may be found in [49], where a flow is considered to be a sequence of
packets defined by a set of header field values, which act as the flow identifier and
identify the flow as well as an optional pattern which identifies the packets that make
up the flow. Using different combinations of identifiers and patterns many different flow
types can be defined.
We follow [80] where a flow is defined to be any sequence of packets which can
be matched to rules in the flow table, such as, for example, those defined by a set
of header field values. Note that our algorithms can be used for any flow definition,
including those which pertain to matches in the payload or any of the headers as long
as it is supported by the controller and switch implementation.
A flow entry in an OpenFlow flow table can be defined to match packets according
to (almost) any selection of header field bits thereby allowing various flow definitions.
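One concrete flow definition, matching the flow-table entries used later in Table 3.3, maps each packet to the 4-tuple of its addressing fields. The packet representation and field names here are illustrative.

```python
# Minimal sketch of a flow identifier as a tuple of header fields.
# Any subset of header fields defines a valid flow type.

def flow_key(pkt):
    """Map a packet (dict of header fields, illustrative) to its flow id."""
    return (pkt["src_ip"], pkt["src_port"], pkt["dst_ip"], pkt["dst_port"])
```

Two packets belong to the same flow exactly when their keys are equal; coarser flows (e.g., all traffic to one destination) are obtained by keeping fewer fields.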
A large flow is usually defined as a flow that takes up more than a certain
percentage of the link traffic during a given time interval [49]. For some applications other
definitions of large flows are required; for instance, network analysis tools may need to
identify flows that consist of a certain number of packets regardless of link capacity.
Therefore we refine the large flow definition, considering both the time aspect as well
as the type of measurement performed.
We consider the following definitions of large flows, which are summarized in
Table 3.1:
Definition 4. Heavy flow: Given a stream of packets S, a heavy flow is a flow which
includes more than T percent of the packets since the beginning of the measurement.
Considering the definition of flow provided above, this can be useful for identifying
flows which remain heavy over a significant period of time, for example in Distributed
Denial of Service (DDoS) attacks. On the other hand, this will miss large flows if the
measurement continues for a very long period of time.
Definition 5. Interval Heavy flow (Elephants): Given a stream of packets S, and
a length of time m, an interval heavy flow is a flow that includes more than T percent
of the packets seen in the previous m time units.
This can be used for standard traffic management and resource allocation.
Definition 6. Bulky flow at a point of time: Given a stream of packets S, and a
length of time m, a bulky flow is a flow that contains at least B packets in the previous
m time units.
                       Limited time interval    Unlimited time
Percent of traffic     Interval Heavy flow      Heavy flow
Amount of traffic      Bulky flow               —

Table 3.1: Matrix definitions for large flows (rows: count type; columns: time).
The algorithms we present for large flows follow the above definitions which consider
traffic volume measurements in terms of packets. Nevertheless, we note that certain
traffic management capabilities require volume, i.e., byte size, analysis. For instance, if
we wish to identify the flow which takes up the most bandwidth, then we are required
to count the number of bytes in the flow rather than the number of packets. The
algorithms presented here work well for both definitions.
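The three definitions above can be stated as simple reference predicates over per-flow packet timestamps. This is an offline check for exposition, not a streaming algorithm; the function and parameter names are illustrative, with T a fraction, B a packet count, and m the window length as in the text.

```python
# Illustrative reference checks for Definitions 4-6.

def is_heavy(flow_ts, all_ts, T):
    """Heavy flow: more than fraction T of all packets since the start."""
    return len(flow_ts) > T * len(all_ts)

def is_interval_heavy(flow_ts, all_ts, T, m, now):
    """Interval heavy flow (elephant): more than fraction T of the
    packets seen in the previous m time units."""
    recent_flow = [t for t in flow_ts if now - m <= t <= now]
    recent_all = [t for t in all_ts if now - m <= t <= now]
    return len(recent_flow) > T * len(recent_all)

def is_bulky(flow_ts, B, m, now):
    """Bulky flow: at least B packets in the previous m time units."""
    return sum(1 for t in flow_ts if now - m <= t <= now) >= B
```

Counting bytes instead of packets only changes what is summed per timestamp, which is why the algorithms in this chapter handle both variants.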
3.4 Heavy Flows Detection in SDN
3.4.1 Towards a Solution
Fundamental counter based algorithms for finding heavy hitters (or flows) such as the
Space-Saving algorithm [81], cannot be directly implemented in the SDN framework
since in the worst case they would require rule changes for every packet that traverses
the switch. A different approach is therefore needed.
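For reference, the Space-Saving structure [81] that the controller-side heavy hitters module builds on can be sketched as follows. This is a dictionary-based toy version (a real implementation uses a linked "stream-summary" structure for O(1) minimum eviction); the `weight` parameter is our addition, anticipating the batched counter updates of Section 3.4.2.

```python
class SpaceSaving:
    """Toy sketch of the Space-Saving heavy hitters algorithm [81].
    With v counters, any flow with frequency above total/v is retained."""
    def __init__(self, v):
        self.v = v            # maximum number of monitored items
        self.counters = {}    # item -> estimated count (overestimate)

    def update(self, item, weight=1):
        if item in self.counters:
            self.counters[item] += weight
        elif len(self.counters) < self.v:
            self.counters[item] = weight
        else:
            # evict the minimum item; the newcomer inherits its count
            victim = min(self.counters, key=self.counters.get)
            self.counters[item] = self.counters.pop(victim) + weight

    def heavy_hitters(self, threshold, total):
        """Items whose estimated count exceeds threshold * total."""
        return {k: c for k, c in self.counters.items() if c > threshold * total}
```

The point made in the text is visible here: every update may evict and replace an entry, so running this per packet inside a switch would require a rule change per packet, which the SDN architecture cannot sustain.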
First we consider a naive solution, which we name Sample&HH, that samples packets
in the switch and then sends all sampled packets to the controller. The controller
computes the heavy flows using a heavy hitters algorithm. However, as can be seen in
Figure 3.2a (and in other works [49]), relying solely on the samples is not accurate enough.
Next, we consider a solution based on the Sample&Hold paradigm of [49], which was
devised for identifying elephant flows in traffic of classic IP networks. In Sample&Hold,
sampled packets are sent to the controller, which installs a counter rule for each new
flow that is sampled. Every subsequent packet from that flow will be counted by the rule
and will not be sampled. By using sampling together with accurate in-band counters
for sampled flows, Sample&Hold achieves very accurate results, yet the large number of
counters and the rate at which they are installed make Sample&Hold incompatible with
the SDN switch architecture. Therefore we only consider it as a reference point to
evaluate our algorithm.
To deal with the problems of the above solutions, we present our Sample&Pick
algorithm. Sample&Pick uses sampling to identify flows that are suspected of being heavy.
For these suspected flows a special rule is placed in the switch flow table, providing
exact counters for them. The Sample&Pick algorithm considers both the bounded rule
space in the switch as well as the time it takes for the controller to install a rule in the
switch. Therefore we use two separate thresholds: the first, T, for determining which
flows are heavy, and a second, lower threshold, t, for detecting potentially large flows.
This lower threshold allows us to install rules in the switch early enough to get an
accurate count of the large flows, yet we do not install rules for too many flows that
will remain small. The Sample&Pick algorithm is described in detail in Section 3.4.2.
Table 3.2 depicts the conceptual differences and the resource consumption overhead
of the Sample&Pick algorithm, the SDN Sample&Hold algorithm and the Sample&HH
algorithm.
3.4.2 The Sample&Pick Algorithm
3.4.2.1 Algorithm Overview
Our algorithm operates as follows: in the first step we sample the flows going through
the switch. Note that sampling can be achieved using OpenFlow weighted groups, as
explained in [14]. As can be seen in Fig. 3.1, these samples are sent to the controller,
which feeds them as input to a heavy hitters computation module in order to identify
the suspected heavy flows (steps 2 and 3). Once a flow's counter in the heavy hitters
module has passed some predefined threshold t, a rule is inserted in the switch to
maintain an exact packet counter for that flow (steps 4 and 5). This counter is polled
by the controller at fixed intervals and stored in the controller (steps 6 and 7). Finally,
Sample&Pick
  Switch memory usage:          sampling rules + at most 1/t count rules
  Controller functionality:     heavy hitters computation + counter aggregation
  Controller-to-switch traffic: every interval, at most 1/t new count rules
  Switch-to-controller traffic: sample of all non-hold packets + counters each interval

Sample&Hold (OpenFlow variant)
  Switch memory usage:          sampling rules + unlimited count rules
  Controller functionality:     counter aggregation
  Controller-to-switch traffic: every new sample creates a message with a new count rule
  Switch-to-controller traffic: sample of all non-hold packets + final counters

Sample&HH
  Switch memory usage:          sampling rules
  Controller functionality:     heavy hitters computation
  Controller-to-switch traffic: none
  Switch-to-controller traffic: sample of all packets

Table 3.2: Comparison of the heavy flow detection techniques presented in this work. Denote by t the threshold for candidate heavy hitters in Sample&Pick.
name          match                                           actions
Count_flow1   (src ip, src port, dst ip, dst port) = flow1    1
...           ...                                             ...
Count_flowm   (src ip, src port, dst ip, dst port) = flowm    1
Sample        (src ip, src port, dst ip, dst port) = *        2

Table 3.3: Illustration of switch flow table configuration. Rule priority decreases from top to bottom. Actions: 1 - increment counter; 2 - apply sampling technique (go to sampling tables / apply group).
the last step increments the counters that are processed by the Heavy Hitters module
to maintain correct counters of non-sampled flows.
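The control loop above (steps 1 through 8) can be sketched end to end as follows. The switch is mocked, a plain dict stands in for the Space-Saving module, and all class and method names are illustrative; a real controller would drive this with OpenFlow messages and a periodic polling timer.

```python
class DictHH:
    """Stand-in for the approximate heavy hitters module (Space-Saving [81])."""
    def __init__(self):
        self.counters = {}
    def update(self, flow, weight=1):
        self.counters[flow] = self.counters.get(flow, 0) + weight

class MockSwitch:
    """Mocked switch: exact counter rules plus a poll interface."""
    def __init__(self):
        self.exact = {}
    def install_counter_rule(self, flow):
        self.exact.setdefault(flow, 0)
    def count_packet(self, flow):
        if flow in self.exact:
            self.exact[flow] += 1
            return True               # matched a counter rule: not sampled
        return False
    def poll_counters(self):
        return dict(self.exact)

class SamplePickController:
    def __init__(self, switch, t, p):
        self.hh, self.switch = DictHH(), switch
        self.t, self.p = t, p         # candidate threshold, sampling ratio
        self.total = 0.0              # total weight fed to the HH module
        self.last = {}                # counter values at the previous poll
    def on_sample(self, flow):
        """Steps 2-5: update the HH module; once a flow passes t,
        install an exact counter rule for it in the switch."""
        self.hh.update(flow)
        self.total += 1
        if self.hh.counters[flow] > self.t * self.total and flow not in self.last:
            self.switch.install_counter_rule(flow)
            self.last[flow] = self.switch.poll_counters()[flow]
    def on_poll(self):
        """Steps 6-8: feed counter deltas, scaled by the sampling
        ratio p, back into the HH module (simulated sampling)."""
        for flow, c in self.switch.poll_counters().items():
            delta = c - self.last[flow]
            self.hh.update(flow, weight=delta * self.p)
            self.total += delta * self.p
            self.last[flow] = c
    def heavy_flows(self, T):
        return [f for f, c in self.hh.counters.items() if c > T * self.total]
```

The scaling by p in `on_poll` is what keeps the exactly-counted flows comparable to the sampled ones inside the heavy hitters module.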
3.4.2.2 Switch Components Design
As seen in Fig. 3.1, two kinds of rules are used in the switch flow tables: the
sampling rules, which are created as needed by the sampling algorithm, and the counter
rules, used for precisely counting packets of potentially heavy flows. An example of this
configuration can be seen in Table 3.3.
First, each packet is matched against the counter rules. In case of a successful match,
the relevant counter is increased. Only if the packet does not match any counter rule
is it matched against the sampling rules, and if the packet is selected by the sampling
rules, it (or only its headers) is sent to the controller. Counters of the counter rules are
only sent to the controller when polled by the controller.

Figure 3.1: Sample&Pick overview
3.4.2.3 Controller Components Design
As seen in Fig. 3.1, the controller maintains the heavy hitters computation module and
a collection of accumulated exact counters.
The heavy hitters computation module: maintains the data structure used for
detection of heavy hitters according to the Space-Saving algorithm [81] (described in
Section 2.1).
Since the heavy hitters module only receives the sampled data which is sent to the
controller from the switch, the traffic of the heavy flows which are not sampled is not
inserted into the heavy hitters module at all, and therefore it may seem as though these
flows are no longer heavy. To simulate the sampling of these heavy flows, when the
controller polls the switch for the updated counters, it uses those counters to update
the heavy hitters module accordingly. That is, we simulate a sampling of the heavy
flows by updating the heavy hitters module with the number of new packets that have
been counted since the previous polling, multiplied by the sampling ratio p. As noted,
this mechanism saves a substantial amount of sample traffic from the switch to the
controller.
The exact count data structure: the accumulated counters of the flows that are
suspected to be heavy are maintained in a simple ordered data structure. It is used to
compute the delta from the previous time the counters were polled. This delta is then
fed (with a factor) into the heavy hitters module.
An additional counter is maintained in the controller to count the total number of
items inserted into the heavy hitters module, which is necessary to calculate the rates
from the individual counters inside the heavy hitters module. At any point the heavy
flows may be identified as the flows in the heavy hitters module that have passed the
threshold T, relative to the total counter.
3.4.2.4 Analysis
Here we discuss how to choose the parameters t and v of the Sample&Pick algorithm
for given problem parameters: the threshold T for heavy flows and the sampling
probability p.
By definition, if a total of N packets have passed so far, each heavy hitter flow
contains at least TN packets. Our controller receives each packet with probability p.
The number of samples is then on average (or exactly, depending on the sampling
method) n := Np. The number of packets sampled out of x original packets is a
binomial random variable with average xp and variance xp(1 − p). When x is high
this converges to a normal distribution with similar parameters. For a normal distribution,
w.h.p. the random variable is within a distance of 3 standard deviations from
the average. Therefore the number of packets sampled from x packets is w.h.p. greater
than xp − 3√(xp(1 − p)).
Our scheme uses a threshold t < T, in order to detect possible heavy flows that
might be missed due to sampling errors. For a heavy flow (with at least T·N packets),
w.h.p. at least TNp − 3√(TNp(1 − p)) packets are sampled. We need to set t to ensure
that the above expression is higher than t·n. Thus,

    t < T − 3√(T(1 − p)) / √(Np)        (3.1)

Since t must be a positive number, we get the following constraint on the flow weight
(ratio) our scheme is expected to detect: T² − 9T(1 − p)/(Np) > 0, which is valid when

    T > 9(1 − p)/(Np)        (3.2)
For example, assuming a line rate of 6 · 10^5 packets per second and a controller
throughput of only a few thousand messages per second, we need a sampling rate of
at most 1:100, i.e., p < 10^-2. Assuming that the tested interval is at least 10 seconds
long, more than six million packets pass through the switch during the interval, i.e.,
N > 10^6. From Equation 3.2 we get that the threshold, T, can then be roughly 10^-3
or more.
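The example can be checked numerically with a small helper script (the function names are ours):

```python
# Numeric check of Equations 3.1 and 3.2 for the example parameters
# in the text: p = 10^-2 and N = 10^6 give a minimal detectable
# heavy-flow ratio T of roughly 10^-3.

def min_detectable_T(N, p):
    """Smallest heavy-flow ratio T for which a positive candidate
    threshold t exists (Equation 3.2): T > 9(1 - p) / (Np)."""
    return 9 * (1 - p) / (N * p)

def candidate_threshold(T, N, p):
    """Upper bound on the candidate threshold t (Equation 3.1)."""
    return T - 3 * (T * (1 - p)) ** 0.5 / (N * p) ** 0.5

bound = min_detectable_T(N=1e6, p=1e-2)   # ~8.9e-4, i.e. roughly 10^-3
```

Note that for T = 5 · 10^-3 and these N and p, the Equation 3.1 bound on t comes out above 2 · 10^-3, consistent with the t value used in the evaluation below.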
Next we consider the fact that the flows that are monitored by exact counters are
updated in batches (when reading the switch flow entry counters). To make sure that
their counters in the approximate HH structure are not evicted between updates, we set
the number of entries, v, to be high enough considering the threshold, t, for monitored
flows.
Next we show that by choosing v = 2/t, the number of samples that would cause
the eviction of one of the monitored flows, that is, a flow that is located at the top part
of the approximate heavy hitters structure, is very high.
Assume we have k monitored flows; the sum of their counters is at least k·n·t. The
number of other values in the table is v − k, and their sum is at most n − knt. In
order for the minimal monitored flow to be evicted, all lower values in the table should
exceed it, i.e., all smaller counts need to become higher than nt. Their sum should thus
be at least (v − k)nt, increasing by at least (v − k)·nt − (n − knt) = vnt − n. Since the
counts change by the number of incoming samples, if we set v = 2/t then the number of
new samples received between batch updates should be as large as the number of all
samples received so far (n), which is highly unlikely.
3.4.3 Evaluation
3.4.3.1 Comparison of Algorithms
We compare our Sample&Pick algorithm to the two additional solutions described
above, Sample&Hold and Sample&HH (see the algorithms overview in Table 3.2). We
analyze the resource consumption and accuracy of each of the algorithms in fixed time
intervals. We use 10 intervals of 5 seconds each, and we collect the counters of each
algorithm at the end of each interval. In addition, we compare the results of these
algorithms to that of the OpenSketch heavy hitters detection mechanism [124]. For our
analysis, we use a one-hour packet trace collected at a backbone link of a Tier-1 ISP in
San Jose, CA, at 12pm on September 17, 2009 [1].
We chose the following simulation parameters: T = 5 · 10^-3, p = 1/(1024 · 10^2) per byte,
t = 2 · 10^-3, v = 2000.
Figure 3.2a shows a comparison of the three algorithms based on accuracy criteria.
The counter error refers to the ratio between the real count of the heavy hitters and the
algorithm's estimates. The false negative and false positive errors are the ratio of
heavy hitter (HH) flows missed to the total number of HH flows, and of the HH flows
wrongly detected to the total number of HH flows, respectively. Figure 3.2b shows a
comparison of the three algorithms based on the amount of traffic they generate and the
amount of memory they use in the switch. As can be seen, while Sample&Hold provides
the best accuracy results, it requires an increasing amount of counters and therefore its
switch memory consumption is significantly higher than that of the other algorithms. In
contrast, Sample&HH requires the least amount of switch memory, since all of the heavy
hitters computation is performed in the controller, yet it relies on sampling alone and
provides significantly lower accuracy results. Our testing shows that Sample&Pick
provides accuracy results only slightly inferior to those of Sample&Hold yet requires
significantly less switch memory.
Technique          OpenFlow compatibility   Error rate    Switch memory usage   Controller↔Switch traffic
Sample&Pick        Yes                      3.3%          2KB                   220KB/s
Sample&Hold        Yes                      1.15%         400KB                 140KB/s
Sample&HH          Yes                      11.3%         ≤ 1KB                 270KB/s
OpenSketch [124]   No                       0.05−10%      94KB − 600KB          NA

Table 3.4: Resource consumption test results
As can be seen in Table 3.4, Sample&Hold gives the smallest error rate, since it
performs an actual count of all flows that it samples, yet it uses significantly more
switch memory. Sample&HH uses only samples for the counter estimates without using
any counters in the switch yet incurs significantly higher error rates. Sample&Pick has
relatively small error rates due to the actual counting of potentially heavy flows, yet due
to the careful selection of which counters to place in the switch, the switch memory usage
in Sample&Pick is very low. According to our testing, the error rate of Sample&Pick
(a) Comparison of algorithms by counter error, false negative errors and false positive errors.
(b) Comparison of algorithms by overall traffic (between switch and controller) and switch memory usage.
Figure 3.2: Resource consumption and accuracy comparison
may be further reduced with an increased sampling rate or counter polling rate, yet the
switch memory requirement remains steady at 2KB, as determined by our parameters.
The controller↔switch traffic (sum of traffic in both directions) of each of the presented
algorithms is directly influenced by the sampling rate (recall that in this case
p = 1/(1024 · 10^2) per byte) and the counter polling rate of the controller. In the case
of Sample&Pick the polling rate is set to every 0.1 seconds in these tests, while in
Sample&Hold the controller only polls for the counters once at the end of the interval.
As can be seen, Sample&HH produces a larger traffic overhead since all sampled messages
are sent to the controller, whereas in the other two algorithms the counters in the switch
perform the aggregation locally.
Additionally, we compare our results to testing done on the OpenSketch heavy hitters
detection mechanism [124]. OpenSketch is a very efficient measurement architecture,
yet it is not compliant with the OpenFlow standard. Our Sample&Pick algorithm
was designed with the current OpenFlow and P4 abilities in mind and it can therefore
be implemented using the current standards. We base our comparison on the evaluation
results shown in [124]. Note that while we perform our test on the same data as used
in [124], we provide an average of 10 intervals of 5 seconds each, as opposed to the 120
intervals used in the OpenSketch evaluation. As can be seen in Table 3.4, Sample&Pick
requires very little switch memory while achieving counter errors similar to those
achieved by OpenSketch, which uses significantly more switch memory. The traffic
overhead for OpenSketch is not provided in [124] and therefore we do not indicate it.
3.4.3.2 Parameters Evaluation
We evaluate the affects of different parameters on our system. For our analysis, we use
a one-hour packet trace collected at a backbone link of a Tier-1 ISP in San Jose, CA,
at 12pm on March 20, 2014 [3].
The base setting of our system for all of the following tests used the following
parameters: T = 0.01, p = 1256 Packets, t = 0.005, v = 400.
The first parameter we examine is t, which is the threshold for detecting potentially
large flows. The output line marked as Sample&Pick1 uses the parameters as indicated
above, hence t = 0.005. The output line marked as Sample&Pick2 uses t = 0.0025, and
the output line marked as Sample&Pick3 uses t = 0.00125.
As can be seen in Figure 3.3a, the smaller the value of t is, the lower the error rate.
(a) Comparison of error rate and convergence (Error Rate (%) vs. #Packets (K)).
(b) Comparison of PacketIn messages (samples) from switch to controller (PacketIn Rate (msg/sec) vs. #Packets (K)).
Figure 3.3: Effect of varying t values
This is due to the fact that exact counters are placed for smaller flows, therefore allowing
them to be counted exactly earlier in the stream, hence increasing accuracy. Figure 3.3b
shows the PacketIn messages generated by the system when using different values of
t. As can be seen, a smaller value of t causes more flows to have exact counters and
therefore not be sampled. This causes a decrease in the number of PacketIn messages
as t decreases.
We now examine the effect of different T values. Recall that T is the threshold
for determining which flows are heavy. Using the base parameters indicated above, the
output line marked as Sample&Pick1 uses T = 0.01, with t = 0.005. The output line
marked as Sample&Pick2 uses T = 0.005 and t = 0.0025, and the output line marked
as Sample&Pick3 uses T = 0.0025 and t = 0.00125.
As can be seen in Figure 3.4a, a smaller T incurs a larger error, even with a smaller
value of t. This is due to the fact that a smaller T calls for the detection of smaller
flows. Figure 3.4b shows the PacketIn message rate. The rates achieved by each test
are similar to those in Figure 3.3b due to the values of t, which significantly influence
the PacketIn message rate.
Additionally, we look at different values of v (the number of items maintained by
the heavy hitters module in the controller) and its effect on the system. Using the base
parameters indicated above, the output line marked as Sample&Pick1 uses v = 400.
The output line marked as Sample&Pick2 uses v = 800, and the output line marked as
Sample&Pick3 uses v = 1600.
As can be seen in Figure 3.5a, the more space allocated in the controller, meaning
the higher v is, the lower the error rate. Figure 3.5b shows the PacketIn messages
generated by the system when using different values of v. As can be seen, the value of
v alone does not affect the PacketIn message rate.
3.5 Interval Heavy Flow and Bulky Flow Detection
Recall that an interval heavy flow is a flow whose volume is more than T percent of
the traffic seen in the last time interval of length m. While the problem is defined in a
continuous manner, that is, an interval can begin at any point in time, considering the
inherent subtle delays caused by the OpenFlow architecture, an approximate solution
is sufficient.
(a) Comparison of error rate and convergence (Error Rate (%) vs. #Packets (K)).
(b) Comparison of PacketIn messages (samples) from switch to controller (PacketIn Rate (msg/sec) vs. #Packets (K)).
Figure 3.4: Effect of varying T values
(a) Comparison of error rate and convergence (Error Rate (%) vs. #Packets (K)).
(b) Comparison of PacketIn messages (samples) from switch to controller (PacketIn Rate (msg/sec) vs. #Packets (K)).
Figure 3.5: Effect of varying v values
Figure 3.6: The modified heavy hitters data structure using counter arrays. In this example the active counter is currently c1.
Our solution builds on the Sample&Pick algorithm; specifically, we take the array
of counters in the heavy hitters module in the controller as the starting point. We
modify this structure so that instead of maintaining one counter per item (flow), an
array of counters is maintained for each flow that is kept in the heavy hitters module.
In addition, for each flow we maintain an additional accumulative counter. The updated
counter structure is depicted in Fig. 3.6.
The array of counters for each flow maintains the history of the flow's counter values
in fixed intervals of time. The flow's accumulative counter is the sum of all the counters
in the flow's array. Let m seconds be the selected time interval, and let there be r history
counters maintained for each flow; we thus get sub-intervals that are m/r seconds long.
The basic idea is that in each sub-interval a different counter in the array is updated by
the HH module, in addition to updating the accumulative counter. Thereby, (cyclically)
consecutive counters in the array can be used to calculate the number of times the value
appeared in the entire interval. At the beginning of each sub-interval, for each flow, the
value of the new active counter is subtracted from the accumulative counter, and then
the active counters of all flows are reset to zero. In this manner, at the end of each
sub-interval, for any flow, the active counter equals the number of times the flow was
sampled during that sub-interval, and the value of the accumulative counter equals the
number of times the flow was sampled in the last interval of length m. It follows that if
the index of the active counter is a, such that 0 ≤ a ≤ r − 1, then for any r′ ≤ r − 1
the sum of the cyclically consecutive counters between index (a − r′) mod r and a equals
the number of times the item was seen during the r′ previous sub-intervals.
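The bookkeeping described above can be sketched as follows; this is an illustrative Python model of a single flow's counter array (the class and method names are ours, not the thesis implementation):

```python
class IntervalCounter:
    """Sliding-window counter sketch: r history counters plus an
    accumulative counter that always equals the sum of the array,
    i.e. the number of samples in the last r sub-intervals."""

    def __init__(self, r):
        self.r = r
        self.counters = [0] * r
        self.accumulative = 0
        self.active = 0  # index of the counter for the current sub-interval

    def sample(self):
        """Record one sampled packet in the current sub-interval."""
        self.counters[self.active] += 1
        self.accumulative += 1

    def advance_sub_interval(self):
        """Called every m/r seconds: move (cyclically) to the next counter,
        subtract its stale value from the accumulative counter, then reset it."""
        self.active = (self.active + 1) % self.r
        self.accumulative -= self.counters[self.active]
        self.counters[self.active] = 0

    def last_sub_intervals(self, r_prime):
        """Samples seen during the r' most recent sub-intervals
        (cyclically consecutive counters ending at the active index)."""
        return sum(self.counters[(self.active - i) % self.r]
                   for i in range(r_prime))
```

Note that the subtraction in `advance_sub_interval` is exactly what keeps the accumulative counter equal to the window sum without ever rescanning the array.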
42 CHAPTER 3. TIME LOCALITY IN HH AND HEAVY FLOWS IN SDN
Note that if an interval does not begin exactly at the start of a sub-interval, we
consider it to begin at the start of either the current or the subsequent sub-interval.
The accumulative counter has two additional important uses: 1) it is used to main-
tain the threshold ratio; 2) it is used by the heavy hitters algorithm as the de-facto
counter for deciding which flow has the minimum counter and should be evicted.
Using the accumulative counter in this manner is the basis for the correctness of our
algorithm, which we now briefly show. Given an interval i of length m, denote by N
the number of items seen in i. If i is made up only of whole sub-intervals, it is easy to
see that at the end of interval i the accumulative counter of each flow in the structure
is equal to what its counter would be had we reset all of the counters at the beginning
of the interval. Therefore, using the accumulative counters as described above provides
us with a heavy hitters mechanism which supports the same counter error rate (i.e., N/v)
as that of [81]. If, however, i begins in the middle of a sub-interval, the counter error
rate is slightly higher. In this case, i contains some number of complete sub-intervals
and at most 2 partial sub-intervals. The additional error consists of appearances of the
flow which occurred in the partial sub-intervals, and is at most N/v, since otherwise the
flow would be heavy for an interval comprised of only complete sub-intervals as well;
the overall error rate in this case is therefore 2N/v.
Notice that bulky flows can be detected using the above mechanism as well: instead
of dividing a flow's counter sum by the relevant total of the counters, the absolute
counter values are compared directly against the threshold.
3.6 Distributed Setting
In many cases, in order to achieve a comprehensive view of the network, traffic must
be monitored distributively at multiple switches. There are two main challenges when
detecting large flows in this distributed setting: false negatives due to split flows, and
false positives due to sequential flows. Split flows are large flows whose traffic is split
into small sub-flows, each going through a different monitoring switch, and therefore
monitored in parallel. Sequential flows are small flows whose packets each traverse
multiple monitoring switches, and are therefore over-sampled or over-counted.
In this section we extend our Sample&Pick solution to support this distributed
setting. We describe the changes that need to be made to the sampling and to the
large flow detection scheme. We note that our solution easily scales with the number
of monitoring switches. To support multiple controllers, a hierarchy of controllers
needs to be defined, with data collected by the controllers and forwarded up the
hierarchy.
Sampling: In order to handle over-sampling of sequential flows (flows whose packets
each traverse multiple switches), we need to prevent each packet from being sampled
more than once. We suggest doing so by marking packets after they are sampled
(whether selected or not) and by applying sampling only to unmarked packets. Marking
of packets can easily be managed in SDNs (with OpenFlow and especially with P4), for
example by utilizing one bit in the VLAN tag. Matching the VLAN tag of each packet
is easy and allows sampled packets to be skipped. Note that the marks should be
removed at egress ports so that they do not affect the traffic leaving the network.
Heavy Flow Detection: As described in Section 3.4.2, our Sample&Pick algorithm
makes use of both sampling and exact counter rules in the switch. To support the
distributed setting, and to handle split flows, whose sub-flows each go through a
different monitoring switch, all of the samples and counter values from all monitoring
switches should be aggregated centrally by the controller. The controller receives the
samples and counter values from the different switches and treats them as if they were
generated by a single monitoring switch. One implication of this is that when a flow
becomes suspected of being large, exact counter rules should be installed on all
monitoring switches, to ensure that all subsequent packets going through the network
are counted.
Similarly to sampling, in the case of sequential flows that traverse multiple switches,
exact counters (on different switches) should not count the same packet more than
once. The same packet-marking technique we suggest for avoiding over-sampling can be
used to prevent multiple counting (see Figure 3.7): marked packets are neither matched
against exact counter rules nor sampled. Moreover, packets which match exact count
rules are marked even if they have not been sampled.
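The mark-once rule for both sampling and exact counting can be sketched as follows. This is a minimal simulation, not switch code: a packet is modeled as a dict with an assumed 'marked' bit (in practice, e.g., a bit in the VLAN tag, as suggested above), and the function name and sampling probability are illustrative.

```python
import random

def process_at_switch(packet, exact_count_rules, sample_prob=0.01):
    """Per-switch handling sketch for the distributed setting: a packet is
    matched against exact counter rules or sampled only if unmarked, and is
    then marked so downstream monitoring switches on its path skip it.
    Returns the flow ID when the packet is reported to the controller."""
    if packet['marked']:
        return None  # already handled at an upstream monitoring switch
    packet['marked'] = True  # mark whether sampled, selected, or counted
    if packet['flow'] in exact_count_rules:
        exact_count_rules[packet['flow']] += 1  # counted exactly, not sampled
        return None
    if random.random() < sample_prob:
        return packet['flow']  # report sample (PacketIn) to the controller
    return None
```

Because the mark is set unconditionally on first processing, a sequential flow's packet contributes at most one sample or one counter increment network-wide, which is exactly the property the text requires.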
Figure 3.7: Marking sampled packets in the distributed setting.
Chapter 4
Finding Heavy Hitters in a
Stream of Strings
4.1 Overview
Often times in network management and security applications, identifying recurring
content is important. While in some cases, working with a predefined content length is
sufficient, the length of the recurring content is not always known. Developing streaming
algorithms for identifying recurring varying-length strings poses inherent difficulties
especially given the tight space and time requirements. We tackle the problem of finding
popular strings of varying lengths in textual data streams using counter-based heavy
hitters algorithms.
4.1.1 Our Contribution
We propose the String Heavy Hitters problem. Additionally, we propose an efficient
algorithm for solving this problem. This algorithm finds popular strings of variable
length in a set of messages, building upon the classic algorithms for heavy hitters
detection. The algorithm runs in linear time, requiring one pass over the input, and
uses a constant amount of memory. In addition to the detection of various types of
DDoS attacks, such an algorithm can be very useful for additional applications both
in networking and in other fields. For example, identifying recurring content in emails
can assist in identification of spam messages. Furthermore, it can be used in detecting
common parts of worm code or even in DNA sequence analysis.
4.2 String Heavy Hitters
Heavy hitters algorithms are usually performed on numeric data, whereas our work
focuses on textual values. We present the following definitions which form the basis of
our discussion.
Definition 7. String frequency: Given a sequence S = 〈S1, ..., SN〉 of N strings, and
some string s, the frequency of s, denoted fs, is the number of strings in S in which s
appears. That is, fs = ∑y=1..N [s ⊆ Sy], where [s ⊆ Sy] is 1 if s is a substring of Sy
and 0 otherwise. Note that the frequency of a substring s could alternatively be defined
as the total number of times s appears in S; for our purposes we use the number of
strings in S in which s appears.
Definition 8. String Heavy Hitter: Given a sequence S = 〈S1, ..., SN〉 of N strings
and constants k and θ, a string s is a string heavy hitter if the following hold:
1. s is a substring of one or more strings in S.
2. The length of s is at least k (|s| ≥ k).
3. s has a frequency above the threshold: fs ≥ θN.
Definition 9. The String Heavy Hitters Problem: Given a sequence S = 〈S1, ..., SN〉
of N strings and constants k, θ and v, using a constant amount of space and a single
pass over the sequence S, find at most v string heavy hitters, such that no output string
is contained in another output string. We denote these output strings as signatures.
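For intuition, Definition 9 can be checked against a brute-force, non-streaming reference solution written directly from the definitions. This sketch is illustrative only; it uses space linear (indeed super-linear) in the input, unlike the streaming algorithm developed below, and its function name and tie-breaking by length are our choices.

```python
from collections import Counter

def exact_string_heavy_hitters(S, k, theta, v):
    """Brute-force reference for the String Heavy Hitters problem: count,
    for every substring of length >= k, the number of strings of S that
    contain it, then keep maximal strings whose frequency is >= theta*|S|."""
    freq = Counter()
    for s in S:
        # collect each qualifying substring once, so a string contributes
        # at most 1 to any substring's frequency (Definition 7)
        subs = {s[i:j] for i in range(len(s))
                for j in range(i + k, len(s) + 1)}
        for sub in subs:
            freq[sub] += 1
    hh = [s for s, f in freq.items() if f >= theta * len(S)]
    # drop any heavy hitter contained in another heavy hitter (Definition 9)
    maximal = [s for s in hh if not any(s != t and s in t for t in hh)]
    maximal.sort(key=len, reverse=True)  # prefer longer signatures
    return maximal[:v]
```

The quadratic substring enumeration per string is exactly the cost that the streaming algorithm of Section 4.6 avoids.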
4.3 Challenges
The String Heavy Hitters problem is closely related to the Heavy Hitters problem
defined in Section 2.1; however, applying the known heavy hitters algorithms to textual
data is not at all trivial.
The following problems arise and must be resolved:
1. The substring pollution problem: The textual data needs to be somehow converted
into a sequence of values. One approach might be to consider every substring in
the text as a value in the sequence. This would, however, make the size of the
sequence quadratic in the size of the input, causing a significant decrease in the
time efficiency of the algorithm. Another approach, which we use in our algorithm,
is to consider the constant-length k-gram at each position in the text as a value
in the sequence, where a k-gram is a string of length exactly k. If a string
s, |s| > k, appears many times in the input text, then all the k-grams which
are substrings of s show up as heavy hitters and are output by the heavy hitters
algorithm. We name this problem the substring pollution problem. The following
is an example of the problem: suppose the string is abcabc and k = 4; then all the
4-grams which make up the string, i.e., abca, bcab and cabc, will be heavy hitters
and will therefore pollute the data structure. As explained in Section 4.6.1, our
Double Heavy Hitters algorithm deals with this problem by combining k-grams
that have repeatedly appeared in a sequence, thereby creating varying-length
grams. The process of creating a string from consecutive k-grams is a key factor
in substantially reducing the substring pollution in the output. For each such
consecutive sequence, the process creates a single input of varying length to HH2,
which has been naturally filtered by a preceding heavy hitters procedure, HH1.
2. The frequency estimation problem: Another problem which arises when creating
values from textual data is that heavy hitters may be substrings of one another.
This can occur, for example, if both the strings ABCDEF and BCDE recur
frequently in separate locations in the text. The counter of BCDE provided
by the algorithm would not reflect the times that BCDE appeared as part of
ABCDEF. In order to provide a better estimation of the frequency of each
string, the algorithm described in Section 4.6.1 must be modified accordingly.
We treat this issue using an additional procedure, which we describe in
Section 4.6.2.1.
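The substring pollution described in item 1 can be reproduced in a few lines of Python; this toy count is purely illustrative and is not part of the thesis system.

```python
def kgrams(text, k):
    """All k-grams of a text, one per starting index."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]

# If the string "abcabc" recurs often, every one of its k-grams recurs
# just as often, so each surfaces as a separate heavy hitter and
# pollutes the data structure.
stream = "abcabc" * 100
counts = {}
for g in kgrams(stream, 4):
    counts[g] = counts.get(g, 0) + 1
```

Running this yields exactly the 4-grams abca, bcab and cabc from the example, each with a near-identical high count, which is what the Double Heavy Hitters algorithm collapses into a single varying-length string.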
4.4 Notations
The notations we use throughout this section are summarized in Table 4.1.
k — minimal signature length (gram length)
r — ratio between the frequencies of consecutive k-grams
m — desired number of signatures
HHj — heavy hitters module j (j ∈ {1, 2})
nHHj — number of items in the HHj data structure (j ∈ {1, 2})
Table 4.1: Notations
4.5 Related Work
The String Heavy Hitters problem which we present here has roots in a variety of
problems in the field of Stringology. Searching for a common substring (or suffix or
prefix) in two texts is a classic Stringology problem [19]. This problem has been
extended to multiple texts in the well-studied Longest Common Substring problem. In
this problem, given m documents of total length n, we wish to find the longest substring
common to at least d ≥ 2 documents. Classical solutions such as those in [121] and [59]
require O(n) time and O(n) space. In [69] a time-space tradeoff is provided, giving a
solution requiring O(τ) space and O(n²/τ) time for any 1 ≤ τ ≤ n. The Stringology
solutions are well suited for biological, data mining and other applications.
Most of the previous works which focus on identifying recurring strings in a network
setting were done for signatures of a fixed length [57, 67, 106]. Finding varying-length
popular strings poses inherent difficulties. We note two works which generate varying-
length strings. The first is Honeycomb [70], presented by Kreibich and Crowcroft, in
which signatures are created for suspicious traffic using pattern matching techniques,
specifically searches for longest common substrings within packet payloads using suffix
trees. While this method allows creating signatures composed of varying-length strings,
and the suffix tree can be created in linear time using Ukkonen's online suffix tree
construction algorithm [114], the space complexity of the suffix tree is at least linear in
the size of the input, and therefore not scalable when dealing with large amounts of
data. This is perhaps the most substantial difference from our solution, which uses a
configurable fixed amount of space while still maintaining a time complexity which is
linear in the size of the input.
Another work in which signatures composed of varying-length strings are generated
is Autograph [68], presented by Kim et al. To generate varying-length signatures, the
payload of suspicious traffic is divided into variable-length content blocks based on the
Content-based Payload Partitioning method first presented in [86]. Content blocks are
chosen as signatures based on their prevalence in the traffic flows. While the signatures
produced are indeed of varying length, the Content-based Payload Partitioning is
performed using a predetermined breakmark which is used to partition the payload
into blocks whose size is bounded between predefined minimum and maximum values.
Additionally, the average block size is also predetermined. Evaluation done in [68]
shows that a larger minimum content block, such as 32 or 64 bytes, is needed to avoid
a high false positive rate. Signature structure is therefore based on predefined
parameters which determine the breakmark and the signature length. The system
presented in our work allows shorter signatures to be generated, and more importantly,
does not use a predefined breakmark for content partition, so that signatures can vary
significantly from one another.
Another variation on heavy hitters that has been discussed is Hierarchical Heavy
Hitters [42, 111]. Notice that although the Hierarchical Heavy Hitters algorithms (see
Chapter 2) may seem suited for textual data, they work well on data which forms a
well-defined hierarchical structure, such as a sequence of IP addresses. Since our
algorithm searches for recurring strings in the traffic, and the context of the strings is
not relevant for our purposes, identical strings need to be grouped together regardless
of what comes before or after them in the content. Inserting a string into a hierarchical
structure would not account for common substrings that appear in different places in
different strings. For example, a common substring found both at index 10 of string s1
and at index 100 of string s2 would need to be inserted in the same place in the
hierarchy in order for both appearances to be counted by the same counter. Furthermore,
identical substrings which appear multiple times in the same string would also need to
be grouped together in the hierarchy. Additionally, inserting a string into a hierarchical
structure would require inserting each character into a separate hierarchical level,
making the hierarchy both very wide and very deep. Alternatively, inserting more than
one character into each level may cause the algorithm to miss common substrings,
thereby causing errors in the counter estimations of the algorithm.
Another related problem is that of compressed sensing. Many interesting works
have been done in this field, such as [54, 90, 97]. It has yet to be seen whether the
solutions presented for the compressed sensing problem can be adapted to outperform
the above heavy hitters algorithms for the frequent items problem.
4.6 The Double Heavy Hitters Algorithm
We propose the Double Heavy Hitters algorithm. The purpose of this algorithm is to
identify frequent substrings of varying lengths in the given packets.
4.6.0.1 Packet String Heavy Hitters
Based on Definition 9, we define the Packet String Heavy Hitters problem as follows:
given a sequence P = 〈P1, ..., PN〉 of N packets and constants k, θ and v, using a
constant amount of space and a single pass over the sequence P, find at most v string
heavy hitters, each of which has a packet frequency (the number of packets in which
it appears) over the threshold θN, that is, pfs = ∑y=1..N [s ⊆ Py] ≥ θN, such that
no output string is contained in another output string.
Solutions which perform an exact count of the strings would use at least a linear
amount of space [41]; a more efficient solution must therefore be found.
4.6.1 Algorithm Overview
Our algorithm makes use of the Heavy Hitters algorithm as a building block. Denote as
HH a component which performs the heavy hitters algorithm. The Double Heavy Hit-
ters algorithm, denoted DHH, makes use of two independent heavy hitters components,
HH1 and HH2, as follows:
1. HH1 finds k-grams that appear frequently, i.e., that are heavy hitters.
2. HH2 finds varying length strings that occur frequently in the input (which are
combinations of the strings found in step 1).
Define a k-gram to be a string of characters of length exactly k. The input to the DHH
algorithm is a sequence of np packets, a constant k which determines the size of the
k-grams used, and a constant r which is the ratio between the frequencies of consecutive
k-grams, explained shortly. Conceptually, the process works as follows: the algorithm
traverses the packets one by one. For each index in the packet, a k-gram is formed
by taking the k characters starting from that index. These k-grams are given as an
input to HH1. To form the varying length strings which are the input to HH2, while
HH1 processes the k-grams, the algorithm seeks to find the longest run of consecutive
k-grams such that:
1. They are all already in HH1 (i.e., at this stage they are heavy hitters).
2. They have similar counters. The objective is that combining two k-grams should
occur only if they should be part of the same signature. Without the ratio r, if
some k-gram appears very frequently but the character that usually follows it is
inconsistent, the preferred signature should not combine this k-gram with the one
that follows it. Specifically, the counters of two consecutive k-grams must maintain
a ratio of at least r. In our experiments we tested values of r from 0 to 1. Since for
our purposes a longer signature was preferable, we use a ratio of 0.1 in our testing;
testing with a ratio of 0.5 or higher produced significantly shorter signatures. An
example of this process can be seen in Figure 4.1.
Once the entire input has been traversed, the algorithm outputs the items found in
HH2.
[Figure 4.1: An example of the process of creating varying length strings from consecutive k-grams. For the input abcabcabcd with k = 4, each k-gram is checked for membership in HH1; the consecutive k-grams abca, bcab and cabc, which are already in HH1 and pass the ratio check, are combined into the varying-length string abcabc, while abcd is not in HH1.]
4.6.2 Algorithm Details
The pseudo code of the DHH algorithm is found in Procedure DoubleHeavyHitters,
and makes use of the support functions Init(nv) and Update(α), which are derived
from the algorithm in [81], and the sub-procedure InputToHH2. The output of the
DoubleHeavyHitters procedure is the list of heavy hitter values found in HH2 at the
end of the procedure.
The input provided to the algorithm is a sequence of np packets and constants k and
r as explained above, and nHH1 and nHH2, which indicate the number of items HH1
and HH2 are configured to hold, respectively.
The algorithm works as follows: the packets are traversed one by one. For each
index in the packet, a k-gram is formed by taking the k characters starting from that
index. The k-gram is given as an input to HH1, which in turn returns the k-gram's
counter in HH1 (a return value of zero indicates that this is a new k-gram).
In order to account for varying length strings, while performing the above traversal,
an additional string stemp is maintained. For any location in the packet, stemp is the
last longest heavy hitter string found until that location. stemp is maintained in the
following manner: at the beginning of each packet, the string stemp is empty. For each
k-gram that is inserted to HH1, we check its returned value:
1. If stemp is empty and the returned value is greater than zero, stemp is set to be
this k-gram.
2. Otherwise, if stemp is not empty, one of the following two occur:
(a) If the returned value is equal to zero, stemp, which is the longest "heavy"
string found up to this point, is given as an input to HH2, and stemp is
reset to empty.
(b) Otherwise, the returned value is greater than zero. In this case, this value is
compared with the counter value of the previous k-gram. If the ratio between
the two values is over some predefined ratio r, stemp is concatenated with
the last character in the current k-gram. Else, stemp is given as an input to
HH2, and stemp is set to be this k-gram.
The algorithm then proceeds to treat the next index. When all of the packets have
been traversed, the algorithm outputs the items in HH2.
We note that the algorithm also maintains a set of all the treated strings in each
packet so that each string is counted only once. This allows us to find strings that
appeared frequently in different packets rather than strings that have a high overall
frequency.
The strings are checked for uniqueness before being inserted into HH2 to ensure
that each signature is only counted once per packet.
Function Init(V)
    Items[V];
    for i = 1 → V do
        Items[i].count = 0; Items[i].ID = null;
    end

Function Update(α)
    if ∃j : Items[j].ID == α then
        Items[j].count++; output = Items[j].count;
    else
        find j s.t. ∀h : Items[j].count ≤ Items[h].count;
        Items[j].ID = α; Items[j].count++; output = 0;
    end
    return output;

Procedure DoubleHeavyHitters
    Data: sequence of np packets, constants k, nHH1, nHH2, and ratio r
    Result: the nHH2 candidates for being the heavy hitters
    stemp = empty; temp_counter = 0;
    HH1.Init(nHH1); HH2.Init(nHH2);
    for i = 1 → np do
        // Denote α1, ..., αh the bytes of packet pi
        for j = 1 → h − k + 1 do
            counter = HH1.Update(αj ... αj+k−1);
            if counter > 0 then
                if stemp == empty then
                    stemp = (αj ... αj+k−1); temp_counter = counter;
                else if counter > r · temp_counter then
                    stemp = stemp || αj+k−1; temp_counter = counter;
                else
                    InputToHH2; stemp = (αj ... αj+k−1); temp_counter = counter;
            else
                InputToHH2;
    FixSubstringFrequencyInHH2;

Procedure InputToHH2
    temp_counter = 0;
    if stemp != empty then
        HH2.Update(stemp); stemp = empty;

Procedure FixSubstringFrequencyInHH2
    for i = 1 → nHH2 do
        for j = 1 → nHH2 do
            if i != j and item[i].ID is a substring of item[j].ID then
                item[i].count += item[j].count;
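The pseudocode above can be rendered as executable Python. This is a sketch under the assumptions stated in the text (stemp is reset at each packet boundary, each distinct string is inserted into HH2 at most once per packet, and Update follows the min-eviction Init/Update functions); the class and variable names are ours.

```python
class HH:
    """Counter-based heavy hitters module (the Init/Update pseudocode):
    on a miss with a full table, the minimum-counter item is evicted and
    the new item inherits that counter plus one."""

    def __init__(self, n):
        self.n = n
        self.items = {}  # ID -> count

    def update(self, key):
        if key in self.items:
            self.items[key] += 1
            return self.items[key]
        if len(self.items) < self.n:
            self.items[key] = 1
        else:
            victim = min(self.items, key=self.items.get)
            count = self.items.pop(victim)
            self.items[key] = count + 1
        return 0  # zero signals a previously untracked item

def double_heavy_hitters(packets, k, n_hh1, n_hh2, r):
    hh1, hh2 = HH(n_hh1), HH(n_hh2)

    def flush(state, seen):
        # InputToHH2: emit stemp, counting each string once per packet
        s = state[0]
        if s and s not in seen:
            hh2.update(s)
            seen.add(s)
        state[0], state[1] = "", 0

    for packet in packets:
        state = ["", 0]  # [stemp, temp_counter], reset per packet
        seen = set()
        for j in range(len(packet) - k + 1):
            gram = packet[j:j + k]
            counter = hh1.update(gram)
            if counter > 0:
                if not state[0]:
                    state[0], state[1] = gram, counter
                elif counter > r * state[1]:
                    state[0] += gram[-1]  # extend by the gram's last char
                    state[1] = counter
                else:
                    flush(state, seen)
                    state[0], state[1] = gram, counter
            else:
                flush(state, seen)
        flush(state, seen)  # end of packet
    # FixSubstringFrequencyInHH2: fold containing strings' counters
    # into the counters of their substrings
    fixed = dict(hh2.items)
    for s in hh2.items:
        for t in hh2.items:
            if s != t and s in t:
                fixed[s] += hh2.items[t]
    return fixed
```

With r = 0.1 (the value used in the evaluation), consecutive grams are merged as long as the new gram's counter is at least a tenth of the previous one's.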
4.6.2.1 Improving the Frequency Estimation
Due to the frequency estimation problem, as explained in Section 4.6.0.1, it is possible
that a string t in HH2 may contain a substring t′ which is also a string in HH2. However,
when processing t in HH2, the counter of t′ is not incremented. The reason for this is
that only disjoint strings from each packet may be inserted into HH2. Therefore, if, for
example, in packet pi at index j the string "example" appears and the algorithm decides
to insert this string into HH2, the string "exam", which is a substring of "example" and
also appears in packet pi at index j, is not inserted into HH2. If the string "exam" is an
item in HH2 (i.e., it has been inserted into HH2 when processing a different appearance
of the string in a different packet or at another index in the same packet), its counter
does not account for its appearance in packet pi at index j.
The goal of our algorithm is to provide an estimate of the actual number of times
that a string was encountered. In order to achieve a better estimation, we perform an
additional procedure on the strings found in HH2 at the end of the above algorithm,
to find which items in HH2 are substrings of other items in HH2. The counter of the
contained item is incremented by all of the counters of the items that contain it. In this
manner, our final counters provide a better estimation of the number of packets in which
each string was encountered. Note that since only disjoint strings from each packet may
be inserted into HH2, this procedure does not result in an additional overestimation
of the counters.
4.6.3 Error Rate Analysis
The heavy hitters algorithm that we use is an approximation algorithm, and therefore
the DHH algorithm is also an approximation. As the analysis below shows, the error
rate of our algorithm is only a factor of 3 higher than that of the heavy hitters
algorithm that we use as a building block. In fact, as can be seen in the experimental
results in Section 5.4, the error rate of our algorithm is significantly smaller in practice.
Theorem 10. Bounds of the Double Heavy Hitters Algorithm: The final counters
provided by the algorithm may incur an error of at most 3nk/nHH, where nHH =
min{nHH1, nHH2} and nk denotes the total number of k-grams processed by the
algorithm.
Proof. In order to analyze the error rate of our algorithm, we must first analyze the
error rate of each of its components. As described in Section 2.1, the error rate of
each of the HH items is ε = N/nHH, where nHH is the number of items maintained by
the HH, and N is the number of values in the input. We have defined the number of
items maintained by HH1 and HH2 to be nHH1 and nHH2 respectively. Given an input
sequence of packets, the size of the input is calculated as follows:
1. For HH1: Define the total number of k-grams in all the packets in the sequence
to be nk, which is the bound on the size of the input to HH1.
2. For HH2: The input to HH2 is made up of the strings which are a sequence
of consecutive k-grams. Denote by nc the number of such strings. nc is maximized
when the inputs to HH2 are all a single k-gram. To understand how these strings
can be formed, let us look at the example in Fig. 4.2. Suppose the k-gram abcd is a
heavy hitter. In order for the string beginning with this occurrence of abcd to be
made up of a single k-gram, the following character e must be of high variability
in this context throughout the input. Otherwise, the k-gram bcde would also be a
heavy hitter, and therefore abcd would be merged with bcde, meaning the string
would be longer than a single k-gram. One can see that this would be true for all
the following k-grams which contain the character e, and therefore they too cannot
be heavy hitters. The closest following k-gram that can be a valid candidate
for being a heavy hitter is the k-gram following the character e. It follows that
nc ≤ nk/(k+1).
[Figure 4.2: Non-consecutive heavy hitters. The k-gram abcd is a heavy hitter followed by a high-variability character e (otherwise we would get a longer consecutive run); the next possible heavy hitter is the k-gram that begins after e.]
It follows from the above calculation that the error rate of HH1 is nk/nHH1, and the
error rate of HH2 is nc/nHH2 ≤ nk/((k+1) · nHH2).
In order to complete the analysis, it remains to account for occurrences of strings
that are not produced as part of the input to HH2. Generally, a string s is produced
as an input to HH2 if the k-grams that comprise it are already found in HH1. Let us
look at the sequence of k-grams processed by HH1. For some index j, the jth
k-gram will be found in HH1 only if its frequency is over j/nHH1. Since this must be true
for all k-grams that comprise s, it follows that there can be at most nk/nHH1 appearances
of s that are not produced as part of the input to HH2.
It follows that the overall error rate of our algorithm is 2nk/nHH1 + nk/((k+1) · nHH2).
Taking nHH = min{nHH1, nHH2}, we get that the error rate of the algorithm is bounded
by 3nk/nHH.
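The three error contributions derived in the proof can be collected into a single display (notation as in Table 4.1; nk is the number of k-grams processed and nc the number of inputs to HH2):

```latex
\varepsilon_{\mathrm{DHH}}
  \;\le\; \underbrace{\frac{n_k}{n_{HH_1}}}_{HH_1\text{ counter error}}
  \;+\; \underbrace{\frac{n_k}{n_{HH_1}}}_{\substack{\text{occurrences missed}\\\text{as inputs to }HH_2}}
  \;+\; \underbrace{\frac{n_c}{n_{HH_2}} \le \frac{n_k}{(k+1)\,n_{HH_2}}}_{HH_2\text{ counter error}}
  \;\le\; \frac{3\,n_k}{n_{HH}},
\qquad n_{HH} = \min\{n_{HH_1}, n_{HH_2}\}.
```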
Chapter 5
Zero-Day Signature Extraction
for High Volume Attacks
5.1 Overview
Signature extraction is an important tool in several network security problems. In
Distributed Denial of Service (DDoS) mitigation, for example, there has recently been
a growing demand for zero-day attack signature extraction solutions.
Two basic techniques are traditionally used to identify DDoS attacks: flow authentication
based on challenge-response, and flow behavioural analysis based on statistics and
learning (further details are provided in Chapter 2). Recent attacks, with millions of
zombies generating seemingly legitimate flows, fly under the behavioural radar. In
these types of attacks, behavioural analysis fails to detect the malicious traffic, as each
zombie generates little traffic, which in itself may appear benign. Furthermore, the
huge number of attack sources makes it infeasible to stop the attack at the source.
The recent use of Internet-of-Things (IoT) devices in botnets has further increased the
number of compromised machines which may take part in an attack [46]. This leaves
a loophole in the defense mechanisms and creates the demand for a DDoS zero-day
attack signature extraction solution.
Identifying signatures for unknown DDoS attacks is extremely difficult due to the
seemingly legitimate content found in the packets which comprise the attack. Most
traditional signatures are based on the malicious code that is expected in the attack
packets, which may not be present in modern DDoS attacks. Leading industry experts
confirm that the signatures found in recent zero-day application-level DDoS attacks
are usually a by-product of the attack tools which the attackers use. These tools
often leave some footprint caused unintentionally by the program, such as a short string
or some (protocol-complying) anomaly in the packet content structure. Such signatures
allow fine-grained identification of attack packets during an attack with minimal false
positives or negatives.
These subtle signatures are not identified by the current automated defense mech-
anisms, but rather by a manual process which may take hours or days.
Generally speaking, leading security companies provide systems which offer several
layers of defense against high-volume attacks. When all layers of defense fail, the at-
tacked customer contacts the security company’s support team to alert them and get
their assistance in stopping the attack. This manual assistance may be composed of
a number of procedures, including the identification of attack signatures. The attack
mitigation process is therefore long and may take hours to days; in addition, it is
labor-intensive. Moreover, in many cases the human eye misses the identifying string,
which could be an extra space, a line-feed, etc.
Clearly, in order to stop such unknown attacks while they are occurring, such sig-
natures must be extracted quickly and automatically.
5.1.1 Our Contribution
We present a system for automatic extraction of signatures for high volume attacks,
using a single pass over the input, and space dependent only on the predetermined
size of the heavy hitters data structure. Our system takes as input two streams (or
stream samples): one of traffic collected during an attack and a second collected during
peacetime. A peacetime traffic sample may be collected a priori as a routine scheduled
procedure. The attack traffic sample can be collected in real attack time, once the attack
has been identified. We note that for DDoS attacks there are existing mechanisms for
identifying when an attack has started and for differentiating between Flash events
and DDoS attacks, for instance that of Park et al. [93]. The system then analyzes
both traffic samples to identify content that is frequent in the attack traffic sample yet
appears rarely or not at all in the peacetime traffic.
Our system makes no assumptions on traffic characteristics such as client behaviour,
address dispersion, URL statistics and so forth. Therefore, it is generic in that it can
be easily adapted to solving other network problems with similar characteristics. That
said, while our algorithms can generically work on different data types, our evaluation
focuses on application-level DDoS attacks.
The following are the basic requirements of our system:
1. Signatures should not be found frequently in legitimate traffic. One of the main
difficulties in differentiating between malicious traffic and traffic from legitimate
sources lies in the fact that malicious requests may have legitimate payloads.
Identifying these malicious requests therefore becomes a significant challenge.
2. Allow signatures of varying lengths. The signatures produced by the algorithm
must be of varying length. Setting a predefined constant length for signatures
would create very problematic outcomes, as described in Section 4.6.0.1.
3. Find a minimal set of signatures. Since filtering devices may have a limited ca-
pacity, the algorithm must aim to produce a small number of signatures.
4. Minimize space and time usage. Our solution must maintain a high level of effi-
ciency, such that the attack can be stopped quickly with minimal space usage.
More specifically, given some constant k we wish to find all strings s1, ..., sm, s.t.
∀i, 1 ≤ i ≤ m:
1. |si| ≥ k
2. si appears frequently enough in the attack traffic.
3. Either one of the following holds:
(a) The frequency of si in peacetime is very low.
(b) The frequency of si in peacetime is moderate, yet in the attack traffic its
frequency is significantly higher.
4. In order to have a minimal set, no string si is contained in another string sj .
These requirements are formally explained in Section 5.3.3.
In Section 5.4, we test our system on real-life attack and peacetime traffic logs
from real attacks that have occurred in recent years. We show that our solution
performs well in practice, with an average recall rate of 99.95% and an average
precision rate of 98%.
Additionally, our system makes use of an algorithm we have devised for finding
heavy hitters in textual data which is described in Chapter 4.
An implementation of our solution is publicly available and may be used for sig-
nature extraction from user uploaded files. It is found on our website [13]. Users are
advised to prepare a peacetime pcap file and an attack time pcap file which they may
upload to the website for immediate signature extraction.
5.2 Related Work
5.2.1 Automated Signature Extraction
In the past, automated signature extraction has been mostly used as a tool for identify-
ing computer malware such as worms and viruses. As such, most algorithms presented
for this problem generally consist of two stages:
1) Identifying suspicious traffic which contains malware with high probability. This is
done using methods such as honeypots [70], behavioural traffic analysis [106], etc.
2) Generating signatures for the suspicious content.
Therefore, the signature generation process of the previous works [57, 67, 68, 70, 98,
106] designed for malware identification, was based on the use of traffic that is known
to be malicious. Contrary to the scenario where the suspicious traffic is identified
beforehand, our work deals with the case in which the suspicious traffic cannot be
detected a priori; rather, the suspicious traffic contains some unique prevalent content
which needs to be identified. Our solution does require a sample of peacetime traffic to be
collected prior to the attack, which can be collected by the system on a routine basis
when it is experiencing regular load.
The attack-time traffic that is analyzed may contain both malicious and legitimate
parts. Therefore, it is crucial to identify which prevalent content is found only in
malicious packets and create signatures for that content alone. Furthermore, our
methods allow us to identify malicious content that not only seems legitimate, but
may in fact be legitimate in other traffic. For example, in HTTP-level attacks, an attacker
can make use of a legitimate yet uncommonly used HTTP header field. Use of this
field can, in this case, be an identifier of malicious traffic, yet in a different context be
completely legitimate.
An interesting variation of the above problem is that of signature extraction solu-
tions with the ability to support morphisms in malware. This problem was addressed
in various works [68, 73, 88, 109], where different algorithms for automatic signature
generation for polymorphic worms are presented. We are currently in the process of
expanding our solution so that it may deal with such variations as well.
In [112], the authors present an automated system for detection of new application
signatures for the purpose of traffic classification. In this work, the authors present
a system for automatically identifying keywords of unknown applications. The key
difference between the solution presented in [112] and the solution we present here, is
that in [112] it is assumed that flows of the same application can be identified a priori
and therefore the analysis can look for the common strings in the specified flows. In our
solution, one of the main difficulties is that we do not know which packets belong
to the attack and are therefore malicious; hence we cannot process
these packets alone to find the attack signature.
In [126], a mechanism is presented for botnet C&C signature extraction. The mech-
anism identifies frequent strings in the traffic and then ranks the frequent strings based
on traffic clustering methods. While in [126] it is not assumed that the C&C connections
can be identified a priori, their analysis is based on characteristics of the connection
and the traffic. Our solution makes no such assumptions and is therefore more robust
for dealing with specially crafted packets and attacks.
5.2.2 DDoS Defense Mechanisms
In order to place our solution on the map of available DDoS solutions, we follow the clas-
sification of DDoS defense mechanisms according to place and time presented in [127].
The solutions we present are generally destination based solutions used during an at-
tack (with a preparation stage to be performed before an attack) targeting application
level attacks. Our solution is a content based packet filtering method and is not based
on the packet route or parameters.
It may seem natural to compare our solution to solutions based on traffic anomaly
detection. While our method does look for changes in content from peace time to attack
time that exceed some predefined threshold, traffic anomaly detection methods for DDoS
attacks are usually network or destination based solutions searching for abnormal traffic
patterns. Unusual traffic patterns may be detected using techniques such as machine
learning [77, 120] or entropy [87, 92]. Our solution is not based on traffic behaviour
and makes no assumptions on normal patterns of traffic. Solutions which use traffic
behavioural analysis, may fail to detect large-scale DDoS attacks that simulate normal
traffic behaviour. Since our solution makes no assumptions on the traffic behaviour it
may be used to detect such attacks.
5.3 The Zero-Day High-Volume Attack Detection System
The main purpose of our system is to efficiently extract a minimal set of signatures
that distinguish malicious packets from legitimate ones. Therefore, a major factor in
producing signatures which achieve both a low false negative rate (i.e., a high detection
rate) and a low false positive rate (i.e., a low rate of legitimate traffic that is wrongly
identified as malicious), is the algorithm’s ability to identify strings which appear very
frequently in malicious traffic and which are hardly found in legitimate traffic.
5.3.1 Notations
The notations we use throughout this section are summarized in Table 5.1.
k       minimal signature length (gram length)
r       ratio between the frequencies of consecutive k-grams
m       desired number of signatures
HHj     heavy hitters module j (j ∈ {1, 2, 3})
nHHj    number of items in the HHj data structure (j ∈ {1, 2, 3})

Table 5.1: Notations
5.3.2 System Overview
Given a sample of peacetime traffic and a sample of the attack traffic, the following
three stages are performed:
1. Analyzing peacetime traffic: the peacetime traffic is analyzed to identify strings
which appear frequently during peacetime.
2. Analyzing attack traffic: the attack traffic is analyzed to identify strings that are
very frequent in the attack traffic yet seldom or not found at all during peacetime.
3. Filtering the signature candidates: the strings found in the above step are fil-
tered according to predefined frequency and containment requirements as will be
explained in the following sections.
Note that for DDoS mitigation for example, the traffic that will be analyzed by our
system can either be captured in the DDoS mitigation apparatus or in the cloud by
sampling the traffic from several collectors. The signatures produced by our algorithm
can be used by the anti-DDoS devices and firewalls to stop the attack. Using our
algorithm, mitigation can be achieved in minutes, allowing proper defense against such
attacks. Also, since DDoS attacks are usually high-volume attacks, a sample of the
traffic is sufficient.
5.3.3 System Requirements
The system classifies strings into a white-list, a maybe-white-list and a not-white-list using the following thresholds:
1. Attack-high: a string s can only be an attack signature if its frequency in the
attack traffic is greater than attack-high.
2. Peace-high: a string s with a peacetime frequency over peace-high cannot be a
signature for the malicious traffic, and will enter the white-list.
3. Peace-low: a peacetime frequency below peace-low is deemed irrelevant for the
attack signature selection process, and the string will be placed in the not-white-
list.
4. Delta: a string s with a peacetime frequency between peace-low and peace-high
can be considered as a possible signature for the malicious traffic only if its fre-
quency in the attack traffic is at least delta higher than its peacetime frequency;
in this case it will enter the maybe-white-list.
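These threshold rules can be sketched as a small classification routine. This is an illustrative sketch, not the thesis implementation: the function name is hypothetical and per-string peacetime frequencies are assumed to be precomputed.

```python
def classify_peacetime(freqs, peace_high, peace_low):
    """Split peacetime strings into white-list, maybe-white-list and
    not-white-list according to the thresholds described above."""
    white, maybe_white, not_white = set(), {}, set()
    for s, f in freqs.items():
        if f > peace_high:
            white.add(s)          # too frequent in peacetime to be a signature
        elif f < peace_low:
            not_white.add(s)      # irrelevant for the signature selection
        else:
            maybe_white[s] = f    # kept with its frequency for the delta test
    return white, maybe_white, not_white
```

For example, with peace_high = 0.5 and peace_low = 0.01, a string seen in 90% of peacetime packets enters the white-list, one seen in 5% enters the maybe-white-list (together with its frequency, needed later for the delta test), and one seen in 0.1% enters the not-white-list.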
As illustrated in Figure 5.1, given a sequence of packets P of traffic captured during
peace time and a sequence of packets A of traffic captured during an attack, and
given the thresholds peace-high, peace-low, delta and attack-high, and some constant
gram size k, the problem is formally defined as follows: Find all strings s1, ..., sm, s.t.
∀i, 1 ≤ i ≤ m:
Figure 5.1: Signatures requirement overview
1. |si| ≥ k
2. The frequency of si in the attack traffic is at least attack-high.
3. One of the following holds:
(a) The frequency of si in peace time is less than peace-low.
(b) Both of the following hold: 1) The frequency of si in peace time is between
peace-low and peace-high. 2) The difference between the frequencies of si in
the attack traffic and in the peacetime traffic is at least delta.
4. To avoid redundancy, no string is contained in another (i.e., ∄j ≠ i such that
sj ⊆ si or si ⊆ sj).
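Conditions 1-3 can be checked per string once the frequencies are counted, and condition 4 is a separate pass over the surviving candidates. The following is a minimal sketch under the assumption that the frequencies are given; the names are illustrative, and keeping the longer string of a containment pair is only one of the possible policies.

```python
def is_signature(s, attack_freq, peace_freq, k,
                 attack_high, peace_high, peace_low, delta):
    """Conditions 1-3 of the formal definition for a single string."""
    if len(s) < k:                     # condition 1: minimal length
        return False
    if attack_freq < attack_high:      # condition 2: frequent in the attack
        return False
    if peace_freq < peace_low:         # condition 3a: rare in peacetime
        return True
    # condition 3b: moderate in peacetime, much more frequent in the attack
    return peace_freq < peace_high and attack_freq - peace_freq >= delta

def drop_contained(candidates):
    """Condition 4: keep only strings not contained in another candidate
    (here the containing, i.e. longer, string survives)."""
    return [s for s in candidates
            if not any(s != t and s in t for t in candidates)]
```

For instance, drop_contained(["bad", "badguy"]) keeps only "badguy", since "bad" is contained in it.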
5.3.4 System Details
Our zero-day high-volume attack detection system makes use of our DHH algorithm,
to analyze both the peace-time traffic and the attack traffic.
5.3.4.1 Analyzing Peacetime Traffic
The DHH algorithm is performed with the peacetime traffic as input. The strings in
the output are categorized into three lists of strings, white-list, maybe-white-list and
not-white-list, as explained above.
Note that to speed up mitigation, the peacetime traffic can be analyzed in advance
to produce these lists. Additionally, we note that in some cases it is difficult to get a
capture of peacetime traffic in advance since the mitigation device only receives attack
time traffic. As can be seen in our evaluation (Section 5.4), those cases can be handled
by other means.
5.3.4.2 Analyzing Attack Traffic
The DHH algorithm is performed with the attack traffic as input, with the modification
that the algorithm omits potential output strings if they are equal to or contained in
a string in the white-list, to reduce false-positives. The other way around is allowed
(i.e., www.facebook.com may appear frequently in the legitimate traffic, yet the string
www.facebook.com/BadPerson could appear frequently in the malicious traffic). We
name this property the one-way containment property. Due to this property, we cannot
filter out strings which appear frequently in legitimate traffic a priori; rather,
a more intricate solution is needed. Intuitively, the algorithm performs as follows: it
receives as an input the sequence of packets captured during an attack, and a list of
white-list strings. In order to avoid creating a signature for the attack traffic which
appears as a string or a substring of a string in the white-list, the algorithm will only
add a string to the input of HH2 if it is not contained in a white-list string.
The main difference, therefore, between the DHH algorithm and the Attack-DHH
algorithm, is that the Attack-DHH is provided with the white-list. Therefore, HH2 is
now updated with a string stemp only if stemp is not found (as a whole white-list string or
as part of one) in the white-list (see Fig.5.2). The only change therefore is in the sub-
procedure InputToHH2. The pseudo-code of the modified sub-procedure can be seen
in Procedure Modified-InputToHH2. We note that there can be numerous options for
creating a data structure to support the search in the white-list. In our implementation
we chose to maintain a hash table of all of the substrings in the white-list of length
greater than k. This implementation is very good in terms of time complexity, though
there is a tradeoff in that it takes a bit more space than other possible solutions.
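This indexing approach can be sketched as follows: every substring of length at least k of every white-list string is inserted into a set once, after which each containment test is a single average-O(1) lookup, since a candidate stemp (whose length is always at least k) is contained in a white-list string exactly when it was indexed. Function names are illustrative.

```python
def build_whitelist_index(white_list, k):
    """Index all substrings of length >= k of the white-list strings;
    space grows quadratically in the string lengths, which is the
    tradeoff mentioned above."""
    index = set()
    for w in white_list:
        for length in range(k, len(w) + 1):
            for i in range(len(w) - length + 1):
                index.add(w[i:i + length])
    return index

def in_whitelist(index, stemp):
    # stemp is equal to or contained in a white-list string iff indexed
    return stemp in index
```

With the earlier example, an index built from "www.facebook.com" contains "facebook" but not "BadPerson", so the malicious URL suffix is not discarded.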
The strings output by the attack traffic analysis will be referred to as the signature
candidates. A graphical depiction of the attack traffic analysis process and the filtering
process described in the following step can be seen in Fig.5.2.
Procedure Modified-InputToHH2
    temp_counter = 0;
    if stemp != empty then
        if stemp is not a string or part of a string in the white-list then
            HH2.Update(stemp);
        stemp = empty;
Figure 5.2: The process of extracting attack content signatures.
5.3.4.3 Filtering the Signature Candidates
Notice that all signature candidates in the output of the attack traffic analysis have a
frequency below peace-high in the peacetime traffic. The strings in the output of the
above step are narrowed down as follows:
1. Discard strings with a frequency in the attack traffic that is below the threshold
attack-high.
2. Check if any of the strings are equal to or contained in a string in the maybe-
white-list. For such strings, calculate the difference between the frequency of the
string during the attack and the frequency during peacetime of the relevant string
in the maybe-white-list. If this difference is greater than the threshold delta, the
string is kept, otherwise, it is discarded. We note that strings not found in the
maybe-white-list must have a frequency below peace-low in the peacetime traffic.
3. Once the final signature candidates are acquired by the above process, they are
checked for containment. If a signature candidate is contained in another signa-
ture candidate, the algorithm will only choose one signature based on user policy
(i.e., the longest, the shortest, the one that produces the smaller number of false
positives, etc.). Furthermore, the algorithm may further reduce the number of
signatures by finding which signatures usually appear together in the same pack-
ets, therefore removing the redundant signatures. Selecting which signatures to
discard can also be done based on user policy as described above.
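The three filtering steps can be sketched as below; the data layout and names are illustrative, and the longest-string rule stands in for the user policy of step 3.

```python
def filter_candidates(cands, attack_freq, maybe_white, attack_high, delta):
    """Apply the three filtering steps to the signature candidates."""
    # Step 1: drop candidates that are not frequent enough in the attack
    kept = [s for s in cands if attack_freq[s] >= attack_high]
    # Step 2: the delta test against the maybe-white-list
    survivors = []
    for s in kept:
        peace = next((f for w, f in maybe_white.items() if s in w), None)
        if peace is None or attack_freq[s] - peace >= delta:
            survivors.append(s)
    # Step 3: containment filtering (policy here: keep the longest string)
    final = []
    for s in survivors:
        group = [t for t in survivors if s in t or t in s]
        if s == max(group, key=len):
            final.append(s)
    return final
```

For example, with candidates "bad" and "badguy" both above attack-high and an empty maybe-white-list, only "badguy" survives step 3 under the longest-string policy.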
5.3.5 Identifying Common Combinations of Signatures
In many cases it is interesting to identify signature combinations which are often found
together in the same packets. These combinations can be of great use in attack detec-
tion mechanisms. First, they can be used to minimize the number of signatures which
are needed to identify the attack (see subsection 5.3.5.2). Second, signature combina-
tions can be used to increase the confidence level in the detection of the attack (see
subsection 5.3.5.3).
5.3.5.1 The Triple Heavy Hitters Algorithm
In order to identify the frequent signature combinations, we propose the Triple Heavy
Hitters algorithm denoted THH. This algorithm makes use of three heavy hitter mod-
ules. Two modules will be used as in the Double Heavy Hitters algorithm (denoted
DHH), and the third module will be used to find heavy hitters of signature com-
binations. While performing the DHH algorithm, for each packet treated, the THH
algorithm maintains the set of strings which were identified as potential signatures, and
therefore inserted into HH2 while processing the packet. Once the entire packet has
been processed, this set contains all of the signatures found in the packet and it will be
inserted into the third heavy hitters calculation unit HH3. To do so, each string in the
set is concatenated with a special end-of-string delimiter and the delimited strings are
concatenated together in lexicographical order to form a single string which is inserted
into HH3. Once all of the packets have been traversed, HH3 contains the heavy hitter
sets of signatures. This procedure is illustrated in Figure 5.3. The pseudo code can
be seen in Procedure TripleHeavyHitters, which makes use of a sub-procedure called
Procedure InputToHH3.
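The canonical encoding of a packet's signature set takes only a few lines; the NUL delimiter below is an illustrative choice and must not occur inside a signature.

```python
def encode_signature_set(signature_set, delim="\x00"):
    """Map a set of signatures to a single HH3 key: each signature is
    suffixed with the delimiter and the delimited strings are
    concatenated in lexicographical order, so the same set always
    yields the same key regardless of discovery order."""
    return "".join(s + delim for s in sorted(signature_set))
```

Note that encode_signature_set({"guy", "bad"}) and encode_signature_set({"bad", "guy"}) produce the same key, so HH3 counts the two occurrences together.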
The THH algorithm has the same time complexity as the DHH algorithm, since
the input to HH3 is created as the strings are inserted into HH2. The space complexity
of the THH algorithm is dependent on the number of items in each of the HH modules.

Procedure TripleHeavyHitters
    Data: sequence of np packets, constants k, nHH1, nHH2, nHH3, and ratio r
    Result: the nHH2 candidates for being the heavy hitters, and the nHH3 frequent signature sets
    stemp = empty, temp_counter = 0, strings_counter = 0, signature_set = empty;
    HH1.Init(nHH1), HH2.Init(nHH2), HH3.Init(nHH3);
    for i = 1 → np do
        signature_set = empty;
        for j = 1 → h − k + 1 do
            counter1 = HH1.Update(αj...αj+k−1);
            if counter1 > 0 then
                if stemp == empty then
                    stemp = (αj...αj+k−1); temp_counter = counter1;
                else if counter1 > r · temp_counter then
                    stemp = stemp || αj+k−1; temp_counter = counter1;
                else
                    counter2 = Procedure InputToHH2;
                    if counter2 > r · strings_counter then
                        signature_set.Add(stemp); strings_counter = counter2;
                    stemp = (αj...αj+k−1); temp_counter = counter1;
            else
                counter2 = Procedure InputToHH2;
                if counter2 > r · strings_counter then
                    signature_set.Add(stemp); strings_counter = counter2;
        if signature_set.Size > 0 then Procedure InputToHH3;
    Procedure FixSubstringFrequencyInHH2;

Procedure InputToHH3
    Data: set of ns signatures, delimiter string s′
    stemp = empty;
    for i = 1 → ns do stemp = stemp || sigi || s′;
    HH3.Update(stemp);
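Any counter-based heavy-hitters module exposing an Update(item) → count interface fits this pseudo code. The sketch below uses the textbook Space-Saving scheme as one possible stand-in, not necessarily the module used in this thesis: when the table is full, the minimum-count entry is evicted and its count inherited, so reported counts over-estimate by at most the evicted value.

```python
class SpaceSaving:
    """Illustrative heavy-hitters module with the Update -> count
    interface assumed by the THH pseudo code."""

    def __init__(self, n):
        self.capacity = n     # the nHHj parameter: max tracked items
        self.counts = {}

    def Update(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
        else:
            # evict the current minimum and inherit its count
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1
        return self.counts[item]
```

With capacity 2, after updates a, a, b, updating a new item c evicts b (count 1) and reports count 2 for c, illustrating the over-estimation.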
Figure 5.3: Extracting attack signatures with the additional minimization process.
5.3.5.2 Minimizing the Number of Signatures
Minimizing the number of signatures can be very significant, as some of the filtering
mechanisms have a limited capacity. In addition, having fewer signatures can reduce
the false positive rate of the signatures.
The ability to minimize the number of signatures is depicted in the example presented
in Figure 5.4, which shows a scenario with six different types of attack packets. In this
example, four signatures have been extracted. However, since every attack packet type
contains either the signature "bad" or the signature "guy", these two alone suffice;
hence the number of signatures can be minimized without creating false negatives.
Once the signatures and the frequent signature sets are extracted by the system,
we would like to check if the number of signatures can be minimized. To do so, we
propose a greedy process. The pseudo code for this process is shown in Procedure
MinimizeSignatures. Each such set represents a packet type that had been found in the
traffic sample. Intuitively, if some group of signatures appears together in some packet
Packet type                          Frequency   Signatures contained
1: … bad … guy …                        10%      bad, guy
2: … really … bad … guy …               20%      really, bad, guy
3: … mean … guy …                       20%      mean, guy
4: … really … bad …                     25%      really, bad
5: … bad … mean … guy …                 15%      bad, mean, guy
6: … bad …                               1%      bad

Signature frequencies: bad 71%, guy 65%, really 45%, mean 35%.

Figure 5.4: An example of different sets of signatures found in different packet types.
type, then only the signature with the highest frequency is needed to cover this packet
type. To identify these highest-frequency signatures, the process sorts all of
the signatures in order of decreasing frequency. The signatures are then traversed
one by one, and each is checked to see how many "un-covered" packet types it covers.
We call a packet type "covered" if it contains at least one signature that has been
chosen as a final signature. Looking at the example shown in Figure 5.4, the process
would work as follows. The signatures are sorted in decreasing frequency. The most
frequent signature is "bad", therefore we start with it. Since "bad" is the first
signature we deal with, no packet types have been covered yet. Denote the cover rate
of a signature to be the percent of the packets that it covers in this calculation.
"bad" covers packet types 1, 2, 4, 5, 6, and therefore we indicate its cover rate
to be 71%. The next signature we traverse is "guy". The only packet type that remains
un-covered is number 3. The signature "guy" covers packet type number 3, and
therefore we indicate its cover rate to be 20%. Since all of the packet types have
now been covered, the cover rate of the remaining signatures is 0%. Therefore, the
only signatures needed to cover all of the packet types are "bad" and "guy".
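This walkthrough can be reproduced with a short sketch over the data of Figure 5.4 (packet-type frequencies and signature sets as transcribed from the figure; names are illustrative).

```python
# Each packet type from Figure 5.4: its signature set and its share of traffic.
packet_types = [
    ({"bad", "guy"}, 10), ({"really", "bad", "guy"}, 20),
    ({"mean", "guy"}, 20), ({"really", "bad"}, 25),
    ({"bad", "mean", "guy"}, 15), ({"bad"}, 1),
]
# Signatures already sorted by decreasing overall frequency.
signatures = ["bad", "guy", "really", "mean"]

def minimize_signatures(signatures, packet_types):
    """Greedy cover: keep a signature only if it covers a still
    un-covered packet type, and record its cover rate."""
    remaining = list(packet_types)
    chosen = []
    for sig in signatures:
        covered = [pt for pt in remaining if sig in pt[0]]
        if covered:
            chosen.append((sig, sum(freq for _, freq in covered)))
            remaining = [pt for pt in remaining if sig not in pt[0]]
    return chosen

# Reproduces the walkthrough: "bad" has cover rate 71%, "guy" covers the
# remaining packet type 3 with cover rate 20%, and the rest cover 0%.
```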
The worst-case time complexity of this procedure is O(number of signatures
× number of sets) = O(nHH2 · nHH3), which is therefore dependent only on the
predefined size of each HH module. The space requirements are linear in nHH1, nHH2
and nHH3, which are configurable parameters. Since this procedure is only done one
time it only adds a constant overhead to the time complexity of the THH algorithm.
Procedure MinimizeSignatures
    Data: list Lsigs of nHH2 signatures, list Lsets of nHH3 sets of signatures
    Result: the final list of signatures
    // Initialize the cover rate of all signatures to be zero
    list Lfinal = empty;
    for i = 1 → nHH2 do Lsigs[i].cover_rate = 0;
    Sort Lsigs by decreasing frequency;
    i = 0;
    while i < nHH2 and Lsets not empty do
        for j = 0 → Lsets.size() do
            if Lsets[j] contains Lsigs[i] then
                Lsigs[i].cover_rate += Lsets[j].frequency;
                remove set Lsets[j] from Lsets;
        i = i + 1;
    for i = 0 → nHH2 do
        if Lsigs[i].cover_rate > 0 then Lfinal.insert(Lsigs[i]);
    return Lfinal;
5.3.5.3 Reducing the False Positives
Signature combinations can be used to increase the confidence level in the detection of
the attack. This can be done by creating rules which are meant to identify specific attack
content. Such specific rules reduce the chance of falsely identifying benign content as
malicious, therefore making the identification of the attack traffic more certain.
In the example presented in Figure 5.4, suppose packet type 6 does not contain
malicious content. The signature "bad" is found in packet types 1, 2, 4, 5, 6; therefore,
if we create a rule which simply searches for the signature "bad", it will catch packets
of type 6 as well, creating false positives. These false positives can be eliminated using
detection rules which combine signatures with "AND" and can therefore be used to
catch specific types of attack packets. For example, we can create a rule which catches
packets that contain "bad" AND "really". Such a rule will catch only packets of types 2 and 4.
If we specify an additional rule that catches "bad" AND "guy", the packet types which
will be caught are 1, 2, 4, 5, whereas packet type 6 will not be caught. Using such rules
therefore reduces the likelihood of false positives and increases the confidence level of
the detection.
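Such AND rules reduce to subset tests over the signatures found in a packet; the sketch below works over the packet types of Figure 5.4, with illustrative names.

```python
def matches_rule(packet_signatures, rule):
    """A rule is a set of signatures combined with AND: a packet matches
    only if it contains every signature in the rule."""
    return rule <= packet_signatures

# Packet types 1-6 of Figure 5.4, by the signatures they contain.
types = {1: {"bad", "guy"}, 2: {"really", "bad", "guy"}, 3: {"mean", "guy"},
         4: {"really", "bad"}, 5: {"bad", "mean", "guy"}, 6: {"bad"}}

caught = {t for t, sigs in types.items()
          if matches_rule(sigs, {"bad", "guy"}) or matches_rule(sigs, {"bad", "really"})}
# Types 1, 2, 4 and 5 are caught; the benign type 6 is not.
```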
5.4 Evaluations
In our evaluation, we focus on high volume DDoS attacks, and specifically on unknown
application layer attacks in HTTP requests, commonly known as HTTP-GET flooding
attacks.
5.4.1 Test Setup
In our evaluations we used real captures from a top security company. Each test
included a real HTTP-GET flooding attack-time capture and a peacetime capture
containing either real or synthetically generated traffic. In some cases peacetime
traffic was not available; in those cases we synthetically generated a peacetime
capture by crawling the victim site, i.e., by sending requests to the attacked server
and capturing the traffic we created. Our evaluation included 11 different attacks,
as follows:
1. We tested 3 attacks for which both the peace time capture and the attack time
capture were recorded on the same server during a time of normal functioning
and then later during an actual DDoS attack. We name these tests real-real.
2. We tested 6 attacks for which the attack time capture was recorded during an
actual DDoS attack, and the peace time capture was created after the attack
by recording traffic created by crawling the victim’s site. We name these tests
real-synthetic.
3. We tested a single attack which included textual log files of the HTTP GET
requests during an actual DDoS attack, and a log file of HTTP GET requests
which were identified as being legitimate during the time of the attack, which was
used as the peace time traffic. We name these tests log.
4. We tested a single synthetic attack which was made up of peace time traffic which
was created by us and then a synthetic attack was merged into the peacetime
traffic. We name these tests synthetic-synthetic.
For each of the above tests, the zero-day high-volume attack detection system was
used to extract attack signatures. In order to evaluate the system’s results, for each of
the above scenarios, we performed three tests:
1. System quality testing: Performed by evaluating both the recall and precision
rates of the signatures extracted by the system. Recall and precision, defined in
Chapter 1, are standard measures of relevance in fields such as pattern recognition
and information retrieval.
2. Frequency estimation accuracy test of the DHH algorithm: Performed by count-
ing the number of packets in the attack traffic in which each of the attack sig-
natures appears, and comparing the counters with the counters of the DHH
algorithm.
3. Threshold testing: Several threshold value sets were tested.
A summary of the test statistics can be found in Table 5.2, which is explained
in the next section. In addition, we performed separate testing of the use of the
Triple Heavy Hitters algorithm (explained in Section 5.3.5) for identifying frequent
signature combinations to minimize the number of signatures needed, as described in
Section 5.4.6.
5.4.2 System Quality Test Results
A summary of the test statistics is presented in Table 5.2. All of the attacks analyzed
are attacks that were not detected by any automated defense mechanism; these
attack samples were therefore analyzed manually by a human expert. The columns in
the results section of the table are as follows:
1. Manual attack rate estimation: the estimated percent of the packets in the attack
traffic capture, that were identified as attack packets by the manual analysis.
2. System attack rate estimation: the percent of the packets in the attack traffic
capture, that contain one or more of the signatures extracted by the system.
3. Recall rate estimation: the percent of packets identified as attack packets by the
manual analysis which were identified by the signatures extracted by our system.
The aim is to have a recall of 100%, since the recall is an indication of how many
of the relevant results were identified.
4. Precision rate estimation: we estimate the precision rate of our system by two
methods, for both of which the aim is to have a precision of 100%, as precision
is an indication of how many relevant results were returned as opposed to
non-relevant results:
(a) Peacetime based precision: the percent of peacetime traffic packets that were
not identified by the signatures extracted by our system either.
(b) Attack based precision: the percent of attack traffic packets which were not
identified by the manual analysis that were not identified by the signatures
extracted by our system either.
Test Statistics

Test | Target category | Attack time | Test type (attack-peace) | Packets in sample (attack) | Packets in sample (peace) | Manual attack rate est. | System attack rate est. | Recall rate est. | Peacetime based precision | Attack based precision
1 | Telephony | Nov 2011 | Real-Real | 407 | 2347 | 59% | 59% | 100% | 100% | 100%
2 | eGaming | Jul 2012 | Real-Real | 157560 | 2468 | 98% | 98% | 99.8% | 100% | 100%
3 | eGaming | May 2012 | Real-Real | 191192 | 47168 | 75% | 75% | 99.8% | 100% | 100%
4 | National bank | Jan 2012 | Real-Syn. | 7050 | 369 | 78% | 99% | 100% | 100% | 79%
5 | News | Mar 2012 | Real-Syn. | 47569 | 216 | 99.9% | 100% | 100% | 100% | 99.9%
6 | eCommerce | Jan 2013 | Real-Syn. | 35014 | 253 | NA | 98% | NA | 100% | NA
7 | Mobile | May 2013 | Real-Syn. | 608 | 497 | 93% | 94% | 100% | 100% | 99%
8 | Government | Mar 2012 | Real-Syn. | 6875 | 318 | 69.5% | 90% | 100% | 100% | 79.5%
9 | Government | Mar 2012 | Real-Syn. | 5867 | 77 | NA | 92% | NA | 100% | NA
10 | News | May 2013 | Log | 34721 | 70322 | 47% | 47% | 100% | 100% | 100%
11 | Synthetic | NA | Syn-Syn | 57112 | 9016 | 84% | 84% | 100% | 100% | 100%

Table 5.2: Summary of the statistics of the tests performed. Note that the captures are samples of the traffic.
We note several comments and conclusions regarding the results: 1) For each test,
the system identified the signatures that were found by the human expert in addition
to other signatures which were not identified by the expert.
2) For all of the attacks tested, one or more signatures were found that create a
false positive rate of 0%, meaning they do not appear in the peacetime traffic at all. As
explained in Section 5.3.4.3, the final signature candidates may be filtered according to
user policy. We chose to select the candidates with the lowest frequency in peacetime
traffic, i.e., those with the lowest false-positive rate. The final filtering process of the
signature candidates selected these signatures alone to achieve the results shown in the
table; it was done by searching through the peacetime traffic for the final signature
candidates and selecting those with the lowest false positive rate. Another option is to
minimize the signatures based on frequent signature combinations, as shown in Section
5.4.6, which also gives good results.
3) If both the attack and the peacetime captures are real, the system's attack
detection rate is most likely to be very close to, or equal to, the estimated detection rate
of the manual analysis. On the other hand, as can be seen in tests 4 and 8 for example,
a synthetic peacetime capture may cause a system detection rate which is higher than
the manual estimation. The difference between them could indicate the false positive
rate caused by the system’s signatures.
4) All tests were performed with the thresholds attack-high = 50%, peace-high = 3%,
peace-low = 2%, delta = 90%, except for test 10, which was done with attack-high =
10%, peace-high = 3%, peace-low = 2%, delta = 90%. The value of attack-high was
selected based on the characteristics of the attacks themselves and can be selected based
on, for example, performance variations in the attacked site and so forth. The rest of
the thresholds were selected based on the testing presented in Section 5.4.5, where it
is shown that a peace-high value of 3% should be selected; determining the other two
thresholds follows from setting this value.
Our testing included a preliminary phase for determining the settings and parame-
ters of the DHH algorithm. These include the values of k, nHH1 , nHH2 , r, attack-high,
peace-high, peace-low and delta. The value k indicates the length of the k-grams, and
nHH1 and nHH2 indicate the number of items each of the HH modules is configured to
hold. The value of k was set to 8, since testing showed that longer signatures are likely
to increase the rate of false negatives, while shorter signatures are often not substantial
enough, thereby increasing the possibility of false positives. The values of nHH1 and
nHH2 were both set to 3000. Our tests included values ranging from 1000 to 10000,
and it was found that 3000 was sufficient for our purposes.
Note that, as a rule of thumb, the size of nHH1 can be determined according to the
expected frequency of the signature that the system should identify and the average
length of each packet. In general, in order to extract a signature which is found in a
fraction x of the packets (0 ≤ x ≤ 1), with the average packet length being len, we would
need to set nHH1 to be no more than len/x. Furthermore, the sizes of nHH2 and nHH3
are bounded by the size of nHH1.
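To make the rule of thumb concrete, here is a small illustrative calculation (the numbers are hypothetical and are not taken from the experiments above):

```python
def max_nhh1(avg_packet_len, signature_fraction):
    """Upper bound on nHH1 from the rule of thumb above: to catch a
    signature present in a fraction x of the packets, with average
    packet length len, nHH1 need be no more than len/x."""
    return avg_packet_len / signature_fraction

# Hypothetical numbers: average packet length 400 bytes, signature
# present in 2% of the packets.
print(max_nhh1(400, 0.02))  # 20000.0
```

With a more frequent signature the bound shrinks accordingly, which is why 3000 sufficed in practice.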
The above values of these parameters were kept unchanged throughout the testing
of the detection system. An additional parameter used by the DHH algorithm is the
ratio r explained in Section 4.6.1. This value was tested within the detection system
with values ranging from 0 to 1. It was found that values closer to 1 yielded the
extraction of shorter signatures. This value should therefore be chosen based on the
desired characteristics of the output. The thresholds which are used to determine the
white-lists and the chosen signatures are configurable in the system and we discuss
some tested values of these thresholds in Section 5.4.5.
5.4.3 Performance
Our implementation was done in C++, making use of the implementation provided
in [41] of the heavy hitters algorithm presented in [81]. The code was compiled using
g++. We ran experiments on a 4-core Intel(R) Core i7(R) 2.7 GHz machine with 16 GB
of RAM running Mac OS X 10.9 (Mavericks). Running on a variety of real traffic
captures, our algorithm was able to process between 144 and 232 Mbps.
When running our algorithm on synthetically generated traffic with skewed data
frequencies, the performance reaches approximately 1.1 Gbps. The space re-
quired by our system is linear in nHH1 , nHH2 and nHH3 , which were set to nHH1 = 3000,
nHH2 = 200 and nHH3 = 100.
5.4.4 Frequency Estimation
Recall that to test the accuracy of the frequency estimation provided by the algorithm,
the estimated frequency of each signature was compared to an actual count of the
signature in the attack traffic. Figure 5.5 shows this comparison for the signatures of
a single test. We also note that the average difference exhibited in this test between
the estimated frequency and the actual frequency was under 1% over all of the 3000
signature candidates that were produced. This is much better than the analytical error
bounds of the algorithm, which is probably due to the fact that the number of strings
in the input to HH2 is significantly smaller than the worst-case bound provided in the
analysis in Section 4.6.3. The results of the comparison in the other tests were similar.
Figure 5.5: Signature frequency (percent), per signature: algorithm estimation vs. the actual frequency
5.4.5 Threshold Testing
Both the false positive and false negative rate achieved by our system are influenced by
the values of the thresholds discussed in Section 5.3.3. As part of our testing, a range of
thresholds was tested. While intuitively it may seem reasonable to take a peace-high
threshold that is relatively high (i.e., at least 50%), testing showed that this would lead
to a very high false positive rate. An example of this can be seen in Fig. 5.6. This
graph shows testing of different peace-high values, on a single set of files. The graph
shows the false positive rates caused by the different peace-high values when all other
values remain unchanged. The false positive rate shown by the dotted line measures the
percent of peacetime packets identified by the generated signatures. The false positive
rate shown by the solid line measures the percent of attack traffic packets identified by
the generated signatures which are not malicious. As can be seen, a peace-high value
of 3% is the highest value that minimizes both false positive rates, and this is therefore
the value that was chosen for our tests.
5.4.6 Testing Frequent Signature Combinations
We have performed testing on our enhanced system which makes use of the Triple
Heavy Hitters algorithm (explained in Section 5.3.5) for identifying frequent signature
Figure 5.6: Comparing peace-high values: the false positive rate in peacetime (dotted line) and the false positive attack time detection rate (solid line), as a function of the peace-high threshold (percent).
combinations. The graphs in Figure 5.7 depict the results of tests performed on two
different attacks, for which we have both real attack traffic captures and real peacetime
traffic captures as described above. The system identified the frequent signature com-
binations and then performed the algorithm for minimizing the number of signatures
presented in Section 5.3.5.2. The results presented show the tradeoff between the pre-
cision and recall rates when selecting an increasing number of signatures. The results
shown indicate that for the tested samples the number of signatures can be decreased
substantially, thereby increasing precision significantly with almost no reduction of the
recall rates.
5.4.7 Signature Examples
An interesting aspect of testing real attacks is to see the actual signatures for these
attacks. Some examples of signatures include: An extra carriage-return (i.e., newline)
somewhere in the packet payload where it was not usually found; Use of upper-case
characters in a field which is normally found in legitimate traffic with lower-case charac-
ters; Use of an HTTP field that is rarely used; Use of a rare user agent. These signatures
are a clear indication of the importance of analyzing the peacetime traffic.
(a) Test 2: best recall-precision tradeoff achieved for 3 signatures
(b) Test 3: best recall-precision tradeoff achieved for 1 signature
Figure 5.7: Testing the algorithm for minimizing the number of signatures: recall and precision (percent) as a function of the number of selected signatures
Chapter 6
Heavy Hitters in a Stream of
Pairs: Distinct and Combined
Heavy Hitters
6.1 Overview
Consider a stream of IP packet headers going through some upstream point. A des-
tination IP address (the key) which receives a large number of packets constitutes a
"classic" heavy hitter (HH) of the request stream. We can also consider the associated
source IP address of each packet as an associated subkey. Destination addresses with
many different source IP addresses are then distinct heavy hitters (dHH). Destination
addresses which both receive many packets and have many different source IP addresses
are combined heavy hitters (cHH); formal definitions are given in Section 6.2.
Exact detection of dHH and cHH would require large amounts of resources, and
therefore approximate solutions are needed. Generally, approximate distinct or com-
bined heavy hitters algorithms exhibit a tradeoff between detection accuracy and the
amount of space they require. Cardinality estimation accuracy is even more difficult to
achieve with a fixed-size structure, since a key may be evicted from the cache and
later re-enter it, which introduces uncertainty with regard to its cardinality.
We provide solutions for approximate detection of dHH and cHH using a fixed-size
structure which outperforms known solutions both in terms of cardinality accuracy
and practicality.
6.1.1 Our Contribution
Our main contributions are novel, practical, sampling-based structures for distinct
heavy hitter (dHH) and combined heavy hitter (cHH) detection, which track only
O(1/ε) keys. Our dHH design significantly improves over existing work. Our cHH
structure is substantially better than the naive approach of maintaining separate
structures for HH and dHH: the latter has overhead due to the overlap between the
sets of cached keys in the two structures, and also requires much larger structures.
Our algorithm uses the basic principle of Sample and Hold (S&H) used in algorithms
applied to streams of elements with keys [39, 48, 53]. The streaming algorithm maintains
a set of cached keys, which constitutes a sample. With each cached key, we maintain
a counter that tracks the number of occurrences of the key in the stream since it has
entered the cache. When an element with key x that is not cached is processed, a biased
coin flip is used to determine whether to add it to the cache. S&H is better known in
its fixed-threshold form, where we specify the bias of the coin; this has the disadvantage
that the memory usage (sample size) can grow. There is also a fixed-size scheme, where
we specify the sample size (and hence the memory usage of the algorithm) and instead
modify the bias.
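As a point of reference, the classic fixed-threshold S&H scheme can be sketched in a few lines (an illustrative sketch, not the implementation used in this work):

```python
import random

def sample_and_hold(stream, p):
    """Classic fixed-threshold Sample & Hold: an uncached key is admitted
    by a biased coin flip with probability p; once cached, every later
    occurrence of the key is counted exactly."""
    cache = {}
    for key in stream:
        if key in cache:
            cache[key] += 1          # occurrences since entering the cache
        elif random.random() < p:
            cache[key] = 1           # key enters the cache (the sample)
    return cache

random.seed(1)
# A heavy key "a" is almost certain to be sampled; rare keys mostly are not.
counts = sample_and_hold(["a"] * 1000 + ["b", "c"] * 5, p=0.05)
```

Note that the memory usage here is not fixed: the more coin flips succeed, the larger the cache grows, which is exactly the drawback the fixed-size variant removes.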
S&H sampling was originally proposed for domain sum queries and our application
here of a S&H based scheme for heavy hitter detection is novel, even in the context of
classic heavy hitters. An important property of S&H, which makes it suitable for HH
detection, is that the set of sampled keys realizes a weighted sample taken according to
hx (defined in Section 6.2) [39]. In this weighted sample, the heavier keys, in particular
the heavy hitters, are much more likely to be included than other keys.
For the purpose of distinct HH detection, we would like to obtain a weighted sample
with respect to the distinct weights wx (defined in Section 6.2). For this purpose,
S&H (or other classic HH algorithms) can not be used out of the box. Our proposed
distinct S&H design replaces the random coin flips by a random hash function applied
to the key and subkey pair. This ensures that repeated occurrences do not affect the
sample. Moreover, instead of the simple counters we use approximate distinct counters
[25, 37, 38, 52, 96], which use space that is only logarithmic or double logarithmic in
the number of distinct elements. We also propose a combined S&H algorithm, designed
for cHH detection, which maintains both a basic counter and an approximate distinct
counter for each cached key.
We show that our distinct S&H and combined S&H schemes have the property that
the set of cached keys realizes a respective weighted sample. More precisely, distinct
S&H computes a sample with respect to the distinct weights wx, whereas combined S&H
does so with respect to the combined weights bx^(ρ) (defined in Section 6.2). Therefore, a
sample of size c/ε (for a given constant c) will include each heavy hitter with probability
at least 1− exp(−c). Note that this is a worst case lower bound on the probability. The
detection probability is higher for heavier keys and more critically, thanks to without-
replacement sampling, also increases for the more skewed distributions that prevail in
practice. If the goal is only to return a short list of candidates which includes heavy
keys, then we do not need to maintain the approximate counting structures and the
total size of our structure is only O(c/ε) (dominated by the storage of key IDs of the c/ε
sampled keys). The distinct counting structures are needed when we are also interested
in estimates on the weight of included keys. In this case, for each key x, with distinct
counters of size c² + log log n, where n is the sum of the weights of all keys, we can
estimate the weight of x within a well-concentrated absolute error of εn/c.
We demonstrate, via experimental evaluations, the effectiveness of our distinct and
combined S&H algorithms.
Our proposed fixed-size dHH algorithm, named Distinct Weighted Sampling (dwsHH),
requires a constant amount of memory, as opposed to the well known Superspreaders
solution [116], which uses memory linear in the length of the input stream. Moreover,
our use of sampling-based distinct counters is a significant practical improvement over
Locher's relatively new fixed-size solution [76], which utilizes linear-sketch based distinct
counters that are much less efficient in practice. In addition, our dHH algorithm pro-
duces a cardinality estimate for each key. This estimate is of much higher accuracy than
the estimate produced by Locher, while the Superspreaders algorithm does not provide
comparable estimates.
6.2 Preliminaries
6.2.1 Problem Definitions
As mentioned in Section 1.1.4, our input is modeled as a stream of elements, where
each element is a pair 〈x, y〉. The primary key x is from a domain X and the subkey y
is from a domain Dx.
We differentiate between three types of weights for each key x:
1. The (classic) weight hx: the number of elements in the stream with key x.
2. The distinct weight wx: the number of different subkeys over all elements in the
stream having the same key x.
3. The combined weight bx^(ρ): a combination of the classic and the distinct weight.
Given a parameter ρ ≥ 1, bx^(ρ) ≡ ρ·hx + wx.
Accordingly, we define a key x as being heavy in one of three forms:
1. x is a heavy hitter when hx ≥ ε·Σy hy.
2. x is a distinct heavy hitter when wx ≥ ε·Σy wy.
3. x is a combined heavy hitter when bx^(ρ) ≥ ε·Σy by^(ρ).
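The three weights, and the corresponding heavy hitter sets, can be computed exactly (with unbounded memory) in a direct pass over the stream; the following illustrative sketch mirrors the definitions above:

```python
from collections import Counter

def exact_weights(stream, rho=1.0):
    """Exact h_x, w_x and b_x^(rho) for a stream of (key, subkey) pairs."""
    h = Counter(x for x, _ in stream)   # classic weight: occurrences per key
    w = Counter()                       # distinct weight: distinct subkeys per key
    seen = set()
    for x, y in stream:
        if (x, y) not in seen:
            seen.add((x, y))
            w[x] += 1
    b = {x: rho * h[x] + w[x] for x in h}  # combined weight
    return h, w, b

# Toy stream: destination "D" is contacted by many distinct sources, while
# destination "E" receives many packets from a single source.
stream = [("D", "s%d" % i) for i in range(8)] + [("E", "s0")] * 8
h, w, b = exact_weights(stream)

eps = 0.4
hh = [x for x in h if h[x] >= eps * sum(h.values())]    # classic heavy hitters
dhh = [x for x in w if w[x] >= eps * sum(w.values())]   # distinct heavy hitters
```

Here both keys are classic heavy hitters, but only "D" is a distinct heavy hitter, which is exactly the distinction that matters for DDoS-style detection.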
6.2.2 Notations
The notations used throughout this section are summarized in Table 6.1.
Symbol   Meaning
x        key
y        subkey
hx       number of elements with key x
wx       number of different subkeys in elements with key x
m        max_x wx
τ        detection threshold
k        cache size
ℓ        number of buckets
ρ        combined weight parameter
Table 6.1: Notations
6.3 Background - Approximate Distinct Counters
A distinct counter is an algorithm that maintains the number of different keys in a
stream of elements. An exact distinct counter requires state that is proportional to the
number of different keys in the stream. Fortunately, there are many existing designs and
implementations of approximate distinct counters that have a small relative error but
use state size that is only logarithmic or double logarithmic in the number of distinct
elements [25, 37, 38, 52, 96]. The basic idea is elegant and simple: We apply a random
hash function to each element, and retain the smallest hash value. We can see that
this value, in expectation, would be smaller when there are more distinct elements,
and thus can be used to estimate this number. The different proposed structures have
different ways of enhancing this approach to control the error. The tradeoff between
structure size and error is controlled by a parameter ℓ: a structure of size proportional
to ℓ has a normalized root mean square error (NRMSE) of 1/√ℓ. In Section 6.5 we use distinct
counters as a black box in our dHH structures, abstracted as a class of objects that
support the following operations:
• Init: Initializes a sketch of an empty set
• Merge(x): merge the string x into the set (x could already be a member of the
set or a new string).
• CardEst: return an estimate on the cardinality of the set (with a confidence
interval)
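As an illustration of this interface, the following sketch implements an approximate distinct counter using the related k-minimum-values (bottom-k) idea: retain the k smallest hash values, and if the k-th smallest is v, estimate the cardinality as (k − 1)/v. This is a stand-in chosen for brevity, not one of the specific counters of [25, 37, 38, 52, 96]:

```python
import hashlib
import heapq

class KMVCounter:
    """Approximate distinct counter exposing the Init/Merge/CardEst
    interface, via k-minimum-values: keep the k smallest hashes seen."""

    def __init__(self, k=64):                      # Init
        self.k = k
        self.neg_heap = []                         # min-heap of negated hashes
        self.kept = set()                          # the k smallest hash values

    @staticmethod
    def _hash01(s):
        d = hashlib.sha1(s.encode()).digest()[:8]
        return int.from_bytes(d, "big") / 2.0**64  # deterministic, ~U[0,1)

    def merge(self, s):                            # Merge(x)
        v = self._hash01(s)
        if v in self.kept:
            return                                 # repetitions have no effect
        if len(self.neg_heap) < self.k:
            heapq.heappush(self.neg_heap, -v)
            self.kept.add(v)
        elif v < -self.neg_heap[0]:                # below current k-th minimum
            evicted = -heapq.heappushpop(self.neg_heap, -v)
            self.kept.discard(evicted)
            self.kept.add(v)

    def card_est(self):                            # CardEst (point estimate)
        if len(self.neg_heap) < self.k:
            return float(len(self.neg_heap))       # exact while underfull
        return (self.k - 1) / (-self.neg_heap[0])

c = KMVCounter(k=64)
for i in range(10000):
    c.merge("key-%d" % i)
    c.merge("key-%d" % i)   # duplicate insertion: must not change the state
```

The relative error behaves like 1/√k, matching the 1/√ℓ NRMSE tradeoff described above.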
In Section 6.5.5, we also propose a design where a particular algorithm for approximate
distinct counting is integrated in the dHH detection structure.
6.4 Related Work
The concept of distinct heavy hitters, together with the motivation for DDoS attack
detection, was introduced in a seminal paper of Venkataraman et al. [116]. Their algo-
rithm, aimed at detection of fixed-threshold heavy hitters, returns as candidate heavy
hitters the keys with an (initialized) Bloom filter that is filled beyond some thresh-
old. Keys with a high count in the sample are likely to be heavy hitters and almost
saturate their Bloom filter. A related work adapts dHH schemes to TCAMs [24]. Our
fixed-threshold scheme is conceptually related to [116]. Some key differences are the
better tradeoffs we obtain by using approximate distinct counters instead of Bloom
filters, and our simpler structure with analysis that ties it directly to classic analysis of
weighted sampling, which also simplifies the use of parameters. More importantly, we
provide a solution to the fixed-size problem and also address the estimation problem.
The estimates on the weight of the heavy keys that can be obtained from the Bloom
filters in [116] are much weaker, since once the filter is saturated, it can not distinguish
between heavy and very heavy keys.
Locher [76] recently presented two designs for dHH detection which make use of
approximate distinct counters. The first design is sampling-based and builds on the
distinct pair sampling approach of [116]. This design also applies only to the fixed-
threshold problem. The other design uses linear sketches and applies to the fixed-size
problem. Locher’s designs are weaker than ours both in terms of practicality and in
terms of theoretical bounds. The linear-sketch based design utilizes linear-sketch based
distinct counters, which are much less efficient in practice than the sampling-based ones.
The designs have a quadratically worse dependence of structure size on the detection
threshold τ, which is Ω(1/τ²) instead of our O(1/τ). Finally, multiple copies of the same
structure are maintained to boost up confidence, which results in a large overhead, since
heavy hitters are accounted for in most copies. Locher’s code was not available for a
direct comparison.
Another conceivable approach is to convert classic fixed-size deterministic HH
streaming algorithms, such as Misra-Gries [83] or the Space Saving algorithm [81], to
dHH by replacing their counters with approximate distinct counters. The difficulty that
arises is that the same distinct element may affect the structure multiple times when
the same key re-enters the cache, resulting in much weaker guarantees on the quality of
the results.
6.5 The Distinct Weighted Sampling Algorithms
We now present our distinct weighted sampling schemes, which take as input elements
that are key and subkey pairs. We build on the fixed-threshold and fixed-size classic
S&H schemes but make some critical adjustments: First, we apply hashing so that
we can sample the distinct stream instead of the classic stream. Second, instead of
using simple counters cx for cached keys as in classic S&H, we use approximate distinct
counters applied to subkeys. Third, we maintain state per key that is suitable for
estimating the weight of heavy cached keys (whereas classic S&H was designed for
unbiased domain queries).
Our algorithms, in essence, compute heavy hitters using weighted sampling. A sam-
ple set of the keys is maintained during the execution of each of the algorithms (HH,
dHH, or cHH). The sample set constitutes a weighted sample according to the respective
counts so that the heavier keys, in particular the heavy hitters, are much more likely to
be included than other keys. The counts in each of the algorithms are different; number
of repetitions, measure of distinctness, and a combined measure, respectively. The al-
gorithms maintain counts with each cached key which allow to produce the cardinality
estimate for each output key.
6.5.1 Fixed-Threshold Distinct Heavy Hitters
Our fixed-threshold distinct heavy hitters algorithm is applied with respect to a spec-
ified threshold parameter τ . We make use of a random hash function Hash ∼ U [0, 1].
An element (x, y) is processed as follows. If the key x is not cached, then if Hash(x, y)
(applied to the key and subkey pair (x, y)) is below τ , we initialize a dCounters[x ] object
(and say that now x is cached) and insert the string (x, y). If the key x is already in
the cache, we merge the string (x, y) into the distinct counter dCounters[x]. The
pseudocode is provided as Algorithm 1.
Algorithm 1: Fixed-threshold Distinct Heavy Hitters
Data: threshold τ; stream of elements of the form (key, subkey), where keys are from domain X
Output: set of pairs (x, cx) where x ∈ X
dCounters ← ∅                                        // Initialize a cache of distinct counters
foreach stream element with key x and subkey y do    // Process a stream element
    if x is in dCounters then
        dCounters[x].Merge(x, y)
    else if Hash(x, y) < τ then                      // Create dCounters[x]
        dCounters[x].Init
        dCounters[x].Merge(x, y)
return (for x ∈ dCounters: (x, dCounters[x].CardEst))
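A direct transcription of Algorithm 1 can be sketched as follows (illustrative only; for clarity, an exact set of pair hashes stands in for the approximate distinct counter):

```python
import hashlib

def hash01(x, y):
    """Shared random-looking hash of the (key, subkey) pair into [0, 1)."""
    d = hashlib.sha1(("%s|%s" % (x, y)).encode()).digest()[:8]
    return int.from_bytes(d, "big") / 2.0**64

def fixed_threshold_dhh(stream, tau):
    """Algorithm 1 sketch: a key enters the cache only when the hash of
    one of its (key, subkey) pairs falls below tau, so repeated pairs
    cannot bias the sample."""
    dcounters = {}
    for x, y in stream:
        v = hash01(x, y)
        if x in dcounters:
            dcounters[x].add(v)        # Merge: a repeated pair is a no-op
        elif v < tau:                  # Create dCounters[x]
            dcounters[x] = {v}         # Init + Merge
    return {x: len(s) for x, s in dcounters.items()}

# Key "D" has 2000 distinct subkeys; key "E" repeats a single subkey.
stream = [("D", "s%d" % i) for i in range(2000)] + [("E", "s0")] * 2000
cands = fixed_threshold_dhh(stream, tau=0.01)
```

With τ = 0.01, "D" almost surely enters the cache early (it contributes 2000 independent chances), while "E" only ever gets one chance, so the many repetitions of its single pair do not help it.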
6.5.2 Fixed-Size Distinct Weighted Sampling
The fixed-size Distinct Weighted Sampling (dwsHH) algorithm is specified for a cache
size k. Compared with the fixed-threshold algorithm, we keep some additional state for
each cached key:
• The threshold τx when x entered the cache (represented in the pseudocode as
dCounters[x].τ). The purpose of maintaining τx is to derive confidence intervals on
wx. Intuitively, τx captures the prefix of elements with key x which were seen before
the distinct structure for x was initialized, and is used to estimate the number of
distinct subkeys in this prefix.
• A value seed(x) ≡ min_{(x,y) ∈ stream} Hash(x, y), which is the minimum Hash(x, y)
over all elements with key x (in the pseudocode, dCounters[x].seed represents
seed(x)). Note that it suffices to track seed(x) only after the key x is inserted
into the cache, since all elements that occurred before the key entered the cache
necessarily had Hash(x,y) > τx, as the entry threshold τ can only decrease over
time.
The fixed-size dwsHH algorithm retains in the cache only the k keys with lowest
seeds. The effective threshold value τ that we work with is the seed of the most recently
evicted key. The effective threshold has the same role as the fixed threshold since it
determines the (conditional) probability on inclusion in the sample for a key with
certain wx. Pseudocode is provided as Algorithm 2.
Algorithm 2: Fixed-size streaming Distinct Weighted Sampling (dwsHH)
Data: cache size k; stream of elements of the form (key, subkey), where keys are from domain X
Output: set of (x, cx, τx) where x ∈ X
dCounters ← ∅; τ ← 1                                 // Initialize a cache of distinct counters
foreach stream element with key x and subkey y do    // Process a stream element
    if x is in dCounters then
        dCounters[x].Merge(x, y)
        dCounters[x].seed ← min(dCounters[x].seed, Hash(x, y))
    else if Hash(x, y) < τ then                      // Create dCounters[x]
        dCounters[x].Init
        dCounters[x].Merge(x, y)
        dCounters[x].seed ← Hash(x, y)
        dCounters[x].τ ← τ
        if |dCounters| > k then                      // Evict the key with the largest seed
            z ← arg max_{y ∈ dCounters} dCounters[y].seed
            τ ← dCounters[z].seed
            delete dCounters[z]
return (for x in dCounters: (x, dCounters[x].CardEst, dCounters[x].τ))
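The fixed-size variant can be sketched in the same way (illustrative; exact sets again stand in for the approximate distinct counters):

```python
import hashlib

def hash01(x, y):
    """Hash of the (key, subkey) pair into [0, 1)."""
    d = hashlib.sha1(("%s|%s" % (x, y)).encode()).digest()[:8]
    return int.from_bytes(d, "big") / 2.0**64

def dwshh(stream, k):
    """Algorithm 2 sketch (fixed-size dwsHH): keep the k keys with the
    smallest seeds; the effective threshold tau is the seed of the most
    recently evicted key and can only decrease over time."""
    cache = {}   # key -> {"hashes": set, "seed": float, "tau": float}
    tau = 1.0
    for x, y in stream:
        v = hash01(x, y)
        if x in cache:
            e = cache[x]
            e["hashes"].add(v)                       # Merge
            e["seed"] = min(e["seed"], v)
        elif v < tau:
            cache[x] = {"hashes": {v}, "seed": v, "tau": tau}
            if len(cache) > k:                       # evict the largest seed
                worst = max(cache, key=lambda z: cache[z]["seed"])
                tau = cache[worst]["seed"]
                del cache[worst]
    return {x: (len(e["hashes"]), e["tau"]) for x, e in cache.items()}

# Key "D" has 500 distinct subkeys; 50 other keys have one subkey each.
stream = [("D", "s%d" % i) for i in range(500)]
stream += [("f%d" % j, "s0") for j in range(50)]
out = dwshh(stream, k=10)
```

The seed of "D" is the minimum of 500 hashes, so "D" survives in the fixed-size cache while most of the light keys are evicted.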
6.5.3 Analysis and Estimates
We first consider the sample distribution S of dwsHH. As we mentioned (in Chapter
2), it is known that classic S&H applied with weights hx has the property that the set
of sampled keys is a ppswor sample according to hx [39]. A ppswor sampling scheme
with respect to weights hx can be described as an iterative process, where in each
step a key x ∉ S is selected with probability hx/W, where W = Σ_{x∉S} hx is the
total weight of the keys that are not yet in the sample S. After selection, the key is
added to the sample S. Surprisingly, the sample distribution properties of S&H carry
over from being with respect to hx (classic S&H) to being with respect to wx (distinct S&H):
Theorem 11. The set of keys cached by dwsHH is a ppswor sample taken according
to the weights wx.
Proof. First note that repeated (key, subkey) pairs can not affect the structure, so the
structure only depends on the distinct stream of (key, subkey) pairs with all repetitions
omitted.
The set of cached keys can be fully characterized in terms of the set of seed(x)
values, as it contains a prefix of keys with smallest seeds: In the fixed τ scheme, a key
x is cached if and only if seed(x) < τ . In the fixed k scheme, the set of cached keys
corresponds to the k keys that have smallest seed values.
We now consider the distribution of seed(x), which is the minimum of wx inde-
pendent random variables selected uniformly from U [0, 1]. If we transform each hash u
to − ln(1− u), we obtain that each is exponentially distributed with parameter 1. The
minimum of wx such selections is exponentially distributed with parameter wx. This
transformation is monotone, so we can work with the uniform hashes and then trans-
form the seed, obtaining that − ln(1−seed(x)) ∼ Exp[wx] is exponentially distributed
with parameter wx.
Now note that the seed(x) values for different keys are independent. We now apply a
classic result of Rosén, which shows that ppswor can be realized by associating with
each key an independent exponential random variable, with parameter equal to the
weight of the key, as its seed, and taking as our sample the keys with the smallest seed
values [102]. Note that since the transformation of the seeds is monotone, the order
according to seed(x) is the same as the order according to − ln(1 − seed(x)).
A ppswor sample with respect to weights wx provides the following guarantees on
inclusion probabilities of keys:
Lemma 12. When working with a fixed k, a key with weight wx is selected with prob-
ability ≥ 1 − (1 − wx/m)^k, where m = Σx wx is the sum of the weights of all keys. If the
threshold is τ, a key with weight wx is selected with probability 1 − exp(−τ·wx).
Proof. From the definition of ppswor, the probability that a key is selected at each step
is at least wx/m. Therefore, the probability that it is not selected in any of k steps is
at most (1 − wx/m)^k, and hence it is selected with probability at least 1 − (1 − wx/m)^k.
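For intuition, the fixed-threshold inclusion probability of Lemma 12 is easy to evaluate numerically (illustrative sketch):

```python
from math import exp

def detection_prob(w, tau):
    """Lemma 12, fixed-threshold case: a key with distinct weight w is
    cached with probability 1 - exp(-tau * w), i.e., the probability that
    the minimum of w uniform hashes falls below tau (after the
    exponential transform of the seeds)."""
    return 1.0 - exp(-tau * w)

# With tau = 0.01, a key with 1000 distinct subkeys is all but certain
# to be cached, while a key with a single subkey almost never is.
heavy = detection_prob(1000, 0.01)
light = detection_prob(1, 0.01)
```

The detection probability is monotone in the weight, which is what makes the sampled set a useful candidate list for the heavy hitters.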
It follows that a key x is likely to be sampled when wx ≫ m/k. We can tighten this
bound when there are keys with weight much larger than m/k. We obtain that key x
is very likely to be sampled when:

wx ≫ max_{i ∈ [0, k−1]} (m − Σ_{x′ ∈ top_i} wx′) / (k − i)     (6.1)

where top_i is the set of the i heaviest keys.
6.5.4 Estimate Quality and Confidence Interval
The set of sampled keys can be viewed as dHH candidates. Note that the sample can
be computed by only maintaining seed values for keys, without including the distinct
counters. The candidates include the heavy hitters but may also include keys with small
weight: With the fixed-threshold scheme, we expect the sample size to include τ∑
y wy
keys even when all keys have wx = 1. With the fixed-size scheme, we expect the cache
to include keys with wx ∑
y wy/k but it may also include some keys with small
weight.
For many applications, including the detection of DDoS attacks which we discussed
in the introduction, it is important to identify the actual distinct heavy hitters in
our candidate list by returning an estimate on their weight wx. We compute an esti-
mate with a confidence interval on wx for each cached key x, using the entry thresh-
old τ (or dCounters[x ].τ in the fixed-size scheme) and the approximate distinct count
dCounters[x ].CardEst.
The count dCounters[x ].CardEst estimates the number of distinct subkeys processed
after x entered the cache. This component is subject to the solution quality provided
by our approximate distinct counter. The variance of this estimate, σ2², depends on
the specific distinct counter implementation. The implementation we worked with has
σ2² = n²/(2(ℓ − 1)), where ℓ is the distinct counter parameter and n is the estimated
cardinality.
The other component is bounding or estimating the number of distinct subkeys
processed before x entered the cache. We obtain this bound using the entry threshold
τ : In expectation, τ−1 distinct subkeys are processed before x enters the cache. As
with classic S&H, but considering distinct subkeys this time, the actual distribution is
geometric with parameter τ, and its variance is σ1² = (1 − τ)/τ².
These two estimates are well concentrated and we can apply the normal approximation to obtain confidence intervals. Now we observe that the set of subkeys seen before x enters the cache can be disjoint from, or can overlap with, the subkeys processed after x entered the cache. Because of this, we have uncertainty in our estimate and also cannot provide an unbiased estimate. Combining it all, we have the confidence interval

    [dCounters[x].CardEst − aδ·σ2,  dCounters[x].CardEst − 1 + 1/τ + aδ·√(σ1² + σ2²)],    (6.2)

where aδ is the coefficient for confidence 1 − δ according to the normal approximation; e.g., for 95% confidence we can use aδ = 2.
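The interval of Eq. (6.2) is straightforward to compute from the cached quantities. The sketch below assumes the ℓ-partition counter error σ2² = CardEst²/(2(ℓ − 1)) given above; the function name is illustrative:

```python
import math

def dhh_confidence_interval(card_est, tau, ell, a_delta=2.0):
    """Confidence interval of Eq. (6.2) for the distinct weight wx of a
    cached key (sketch; `ell` is the distinct-counter parameter)."""
    sigma1_sq = (1.0 - tau) / tau ** 2              # geometric: prefix subkeys
    sigma2_sq = card_est ** 2 / (2.0 * (ell - 1))   # approximate-counter error
    lo = card_est - a_delta * math.sqrt(sigma2_sq)
    hi = card_est - 1.0 + 1.0 / tau + a_delta * math.sqrt(sigma1_sq + sigma2_sq)
    return lo, hi

lo, hi = dhh_confidence_interval(card_est=1000.0, tau=0.01, ell=50)
# for these values the interval is roughly [798, 1383]
```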
We note that while the set of cached keys does not depend on the stream arrangement (it is a ppswor sample by wx), the confidence intervals are tighter (and thus better) for keys that are presented earlier and thus have τx > τ.
6.5.5 Integrated dwsHH Design
We propose a seamless design (Integrated dwsHH) which integrates the hashing performed for the weighted-sampling component with the hashing performed for the approximate distinct counters. We use a particular type of distinct counter based on stochastic averaging (ℓ-partition) [52, 96] (see [38] for an overview). This design hashes strings to ℓ buckets and maintains the minimum hash in each bucket. These counters are the industry's choice as they use fewer hash computations. We estimate the distinct counts using the tighter HIP estimators [38]. Pseudocode for the fixed-size Integrated dwsHH is provided as Algorithm 3. The parameter k is the sample size and the parameter ℓ is the number of buckets. Note that we use two independent random hash functions applied to strings: BucketOf returns an integer selected uniformly at random from [0, ℓ − 1], and Hash returns a value ∼ U[0, 1] (O(log m) bits suffice).
As in the generic Algorithm 2, we maintain an object dCounters[x ] for each cached
key x. The object includes the entry threshold dCounters[x ].τ and dCounters[x ].seed,
6.5. THE DISTINCT WEIGHTED SAMPLING ALGORITHMS 91
which is the minimum Hash(x, y) of all elements (x, y) with key x. The object also maintains ℓ values c[i], for i = 0, …, ℓ − 1, from the range of Hash, where c[i] is the minimum Hash over all elements (x, y) such that the element was processed after x was cached and BucketOf(x, y) equals i (c[i] = 1 when this set is empty). Note that dCounters[x].seed ≡ min_{i ∈ [0,ℓ−1]} c[i]. The object also maintains a HIP estimate CardEst of the number of distinct subkeys since the counter was created.
Algorithm 3: Integrated dwsHH
Data: cache size k, distinct-structure parameter ℓ, stream of (key, subkey) pairs
Output: set of (x, cx, τx) where x ∈ X
dCounters ← ∅; τ ← 1                                  // Initialize a cache of distinct counters
foreach stream element with key x and subkey y do      // Process a stream element
    if x is in dCounters then
        if Hash(x, y) < dCounters[x].c[BucketOf(x, y)] then
            dCounters[x].CardEst += ℓ / ∑_{i=0}^{ℓ−1} dCounters[x].c[i]
            dCounters[x].c[BucketOf(x, y)] ← Hash(x, y)
            dCounters[x].seed ← min{dCounters[x].seed, Hash(x, y)}
    else if Hash(x, y) < τ then                        // Initialize dCounters[x]
        for i = 0, …, ℓ − 1 do dCounters[x].c[i] ← 1
        dCounters[x].CardEst ← 0
        dCounters[x].c[BucketOf(x, y)] ← Hash(x, y)
        dCounters[x].seed ← Hash(x, y)
        dCounters[x].τ ← τ
        if |dCounters| > k then                        // Evict the key with maximum seed
            x′ ← arg max_{y ∈ dCounters} dCounters[y].seed
            τ ← dCounters[x′].seed
            Delete dCounters[x′]
return (for each x ∈ dCounters: (x, dCounters[x].CardEst, dCounters[x].τ))
For a sampled x, we can obtain a confidence interval on wx using the lower end point dCounters[x].CardEst + 1, with error controlled by the distinct counter, and the upper end point dCounters[x].CardEst + 1/dCounters[x].τ, with error controlled by both the distinct counter and the entry threshold. The errors are combined as explained in Section 6.5.4, using the HIP error of σ2 ≈ (2ℓ)^{−1/2} · dCounters[x].CardEst.
The size of our structure is O(kℓ log m) plus the representation of the k cached keys. Note that the parameter ℓ can be a constant for DDoS applications: a choice of ℓ = 50 gives an NRMSE of 10%.
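For concreteness, here is a minimal Python sketch of the fixed-size Integrated dwsHH of Algorithm 3. Salted SHA-256 hashes stand in for the idealized Hash and BucketOf functions, and the class and helper names are ours:

```python
import hashlib

def _h(salt, s):
    """Deterministic stand-in for the idealized hash: string -> (0, 1]."""
    d = hashlib.sha256((salt + s).encode()).digest()
    return (int.from_bytes(d[:8], "big") + 1) / 2.0 ** 64

class IntegratedDwsHH:
    """Sketch of fixed-size Integrated dwsHH (Algorithm 3)."""

    def __init__(self, k, ell):
        self.k, self.ell = k, ell
        self.tau = 1.0                      # entry threshold
        self.cache = {}                     # key -> {"c", "card", "seed", "tau"}

    def process(self, x, y):
        h = _h("hash:", x + "|" + y)                       # Hash(x, y)
        b = min(int(_h("bucket:", x + "|" + y) * self.ell),
                self.ell - 1)                              # BucketOf(x, y)
        if x in self.cache:
            c = self.cache[x]
            if h < c["c"][b]:
                c["card"] += self.ell / sum(c["c"])        # HIP increment
                c["c"][b] = h
                c["seed"] = min(c["seed"], h)
        elif h < self.tau:                                 # initialize dCounters[x]
            c = {"c": [1.0] * self.ell, "card": 0.0, "seed": h, "tau": self.tau}
            c["c"][b] = h
            self.cache[x] = c
            if len(self.cache) > self.k:                   # evict max-seed key
                worst = max(self.cache, key=lambda z: self.cache[z]["seed"])
                self.tau = self.cache[worst]["seed"]
                del self.cache[worst]

    def result(self):
        return {x: (c["card"], c["tau"]) for x, c in self.cache.items()}

s = IntegratedDwsHH(k=2, ell=32)
for i in range(2000):
    s.process("heavy", f"sub{i}")   # one key with 2000 distinct subkeys
for i in range(20):
    s.process(f"light{i}", "a")     # many keys with a single subkey each
# "heavy" survives in the cache; its CardEst is close to 2000
```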
We can further optimize this design according to the most constrained resource in our application, be it processing time, memory, or maximum processing time per element. For example, to control element processing time we can evict more keys (a fraction of the cache) when it is full. When memory is highly constrained we can instead use the exponent representation (round c[i] to an integral power of 2), as done with HyperLogLog [52], and apply an appropriate HIP estimate as described in [38]. This reduces the structure size to O(k log m + kℓ log log m).
6.6 The Combined Weighted Sampling Algorithm
We now present our cwsHH algorithm for combined heavy hitters detection. The pseudocode, which builds on our Integrated dwsHH design (Algorithm 3), is presented in Algorithm 4 and works with a specified parameter ρ. For each cached key x, the combined weighted sampling (cwsHH) algorithm also includes a classic counter dCounters[x].f of the number of elements with key x processed after x entered the cache.
Theorem 13. The sample computed by Algorithm 4 is a ppswor sample with respect to the combined weights b_x^{(ρ)} ≡ ρ·hx + wx.

Proof. We will show that the seed value (transformed appropriately) is exponentially distributed with parameter b_x^{(ρ)}:

    − ln(1 − seed(x)) ∼ Exp[b_x^{(ρ)}]
This will conclude the proof using [102], as in the proof of Theorem 11.

The value seed(x) can be expressed as the minimum of two components: seed_w(x), which is the minimum over elements with key x of Hash(x, y), and seed_h(x), which is the minimum over elements with key x of independent draws of erand ← 1 − (1 − rand())^{1/ρ}.

It follows from the proof of Theorem 11 that − ln(1 − seed_w(x)) ∼ Exp[wx]. We will now show that

    − ln(1 − seed_h(x)) ∼ Exp[ρ·hx].

Since the minimum of two exponential random variables is exponential with the sum of the parameters, this will conclude our proof.
For each element we can draw an exponentially distributed random variable with parameter ρ using z = − ln(1 − rand())/ρ. But since the algorithm takes the minimum with uniform random variables, we apply the corresponding inverse transformation 1 − exp(−z), obtaining that we need to draw the random variables

    erand = 1 − exp(ln(1 − rand())/ρ) = 1 − (1 − rand())^{1/ρ},

as used by the algorithm.
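The inverse-transform identity above can be checked numerically: applying − ln(1 − ·) to erand recovers exactly the Exp(ρ) draw − ln(1 − u)/ρ. A small sketch:

```python
import math

rho = 0.9
for u in [0.1, 0.5, 0.93]:
    erand = 1.0 - (1.0 - u) ** (1.0 / rho)    # the draw used by Algorithm 4
    z = -math.log(1.0 - u) / rho              # Exp(rho) via inverse CDF
    # mapping z back through 1 - exp(-z) recovers erand, so the two agree
    assert math.isclose(-math.log(1.0 - erand), z)
```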
Algorithm 4: Streaming cwsHH
Data: cache size k, distinct-structure parameter ℓ, parameter ρ, stream of (key, subkey) pairs
Output: set of (x, cx, fx, τx) where x ∈ X
dCounters ← ∅; τ ← 1                                  // Initialize a cache of distinct counters
foreach stream element with key x and subkey y do      // Process a stream element
    erand ← 1 − (1 − rand())^{1/ρ}                     // Randomization for hx count
    if x is in dCounters then
        dCounters[x].f += 1                            // Increment count
        if Hash(x, y) < dCounters[x].c[BucketOf(x, y)] then
            dCounters[x].CardEst += ℓ / ∑_{i=0}^{ℓ−1} dCounters[x].c[i]
            dCounters[x].c[BucketOf(x, y)] ← Hash(x, y)
        dCounters[x].seed ← min{dCounters[x].seed, Hash(x, y), erand}
    else if min{erand, Hash(x, y)} < τ then            // Initialize dCounters[x]
        for i = 0, …, ℓ − 1 do dCounters[x].c[i] ← 1
        dCounters[x].CardEst ← 0
        dCounters[x].f ← 1
        dCounters[x].c[BucketOf(x, y)] ← Hash(x, y)
        dCounters[x].seed ← min{Hash(x, y), erand}
        dCounters[x].τ ← τ
        if |dCounters| > k then                        // Evict the key with maximum seed
            x′ ← arg max_{y ∈ dCounters} dCounters[y].seed
            τ ← dCounters[x′].seed
            Delete dCounters[x′]
return (for each x ∈ dCounters: (x, dCounters[x].CardEst, dCounters[x].f, dCounters[x].τ))
Similarly to dwsHH, if we are only interested in the set of sampled keys (cHH candidates), it suffices to maintain the seed values of cached keys without the counting and distinct-counting structures. The counters are useful for obtaining estimates and confidence intervals on the combined weights of cached keys, for a desired confidence level 1 − δ: the lower end of the interval is dCounters[x].CardEst + ρ·dCounters[x].f − aδ·σ1, where σ1 is the standard error of the distinct count. For the higher end, we bound the contribution of the prefix, which has expectation bounded by 1/τ − 1 and is subject both to the S&H error and to the approximate-distinct-counter error, obtaining

    dCounters[x].CardEst + ρ·dCounters[x].f − 1 + 1/τ + aδ·√(σ1² + σ2²).
6.7 Evaluation
6.7.1 Theoretical Comparison
In Table 6.2 we show a theoretical memory-usage comparison of our distinct weighted sampling algorithms, the Superspreaders algorithms [116] and Locher's algorithm [76], assuming all algorithms use the same distinct-count primitive. We use the notations in Table 6.1, with δ as the probability that a given source becomes a false negative or a false positive, N as the number of distinct pairs, r as the number of estimates, s as the number of pairs of distinct-counting primitives used to compute each estimate, and c for a c-superspreader (i.e., we want to find keys with more than c distinct elements), choosing c = τ^{−1}. As can also be seen from the table, the cache size affects the distinct-weight estimation error for the keys. Note that the Superspreaders algorithm does not provide an estimate of the distinct weight of the keys, but rather only reports which keys have high enough weights. Locher's algorithm provides an estimation error which is theoretically incomparable to ours, and significantly higher in practice.
Algorithm                                 Memory usage                      Keys' distinct weight estimation error
Fixed-threshold distinct WS               O(τ·∑y wy · ℓ log m) (exp.)       τ^{−1} + wy/√(2ℓ)
Fixed-size dwsHH                          O(kℓ log m)                       (1/k)·∑y wy + wy/√(2ℓ)
Superspreaders 1-Level Filtering [116]    O(N/c)                            N/A
Superspreaders 2-Level Filtering [116]    O((N/c)·ln(1/δ))                  N/A
Locher [76]                               O(rs·2ℓ + |k|)                    N/A

Table 6.2: Theoretical comparison between methods
6.7.2 Practical Evaluation
6.7.2.1 Accuracy and Parameters
The following tests were done using a 4GB trace of 40M DNS queries captured at our campus network. For each DNS query q = ...p6.p5.p4.p3.p2.p1, we sliced the query at most 5 times to produce the < key, subkey > pairs < p1, ...p6.p5.p4.p3.p2 >, < p2.p1, ...p6.p5.p4.p3 >, …, < p5.p4.p3.p2.p1, ...p6 >. This process gave us a total of over 120M pairs, comprising nearly 1M distinct pairs.
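The slicing step can be sketched as follows (the helper name is ours; it reproduces the at-most-5 < key, subkey > pairs described above):

```python
def slice_query(q, max_len=5):
    """Slice a DNS query into (key, subkey) pairs: the key is a suffix of
    up to max_len labels, the subkey is the remaining prefix.
    (Illustrative helper, matching the evaluation setup.)"""
    parts = q.split(".")
    pairs = []
    for i in range(1, min(max_len, len(parts) - 1) + 1):
        key = ".".join(parts[-i:])        # suffix of i labels
        subkey = ".".join(parts[:-i])     # everything before it
        pairs.append((key, subkey))
    return pairs

pairs = slice_query("p6.p5.p4.p3.p2.p1")
# first pair: ("p1", "p6.p5.p4.p3.p2"); last pair: ("p5.p4.p3.p2.p1", "p6")
```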
Figure 6.1: Distinct Weighted Sampling (dWS): Modified cache size
We compare the effect of different cache sizes (k) on the output of our dwsHH algorithm. As shown in Fig. 6.1, we set the number of buckets to 32 and use cache sizes of 100, 500, 1000 and 10000. Using a cache size of 100, our algorithm reports keys with cardinality at least 0.005 of the total number of distinct items, with a false negative rate under 5%. A false negative rate under 5% is also achieved with a cache size of 500 for cardinality over 0.0008 of the total number of distinct items. Using caches of 1000 and 10000, our algorithm reports keys with cardinality at least 0.0004 of the total, with false negative rates of 2% and 0% respectively. Furthermore, for the reported keys, a cache of 100 gave an average distinct-weight estimation error of 24% over all keys, and caches of 500, 1000 and 10000 gave errors of 22%, 17% and 15% respectively.
Additionally, we compare the effect of different numbers of buckets (ℓ) on the output of our dwsHH algorithm. As shown in Fig. 6.2, we set the cache size to 1000 and use 4, 8, 16, 32 and 64 buckets. For the reported keys, using 4 buckets gave an average distinct-weight estimation error of 77% over all reported keys, and 8, 16, 32 and 64 buckets gave average errors of 37%, 23%, 17% and 15% respectively. The median error of the estimated distinct weights of the structure using 4, 8, 16, 32 and 64 buckets is 49%, 33%, 18%, 13% and 9% respectively.
Figure 6.2: Distinct Weighted Sampling (dWS): Modified Number of Buckets
To report, for example, all keys which have a weight of at least 0.001% of the total number of distinct pairs using the dwsHH algorithm, we use a cache size of 1000, providing a false negative rate of 0 and a false positive rate of 0. Using 32 buckets, the weight estimates provided by the algorithm have a median error of less than 0.1% of the item cardinality for the reported keys. This test is shown in Figure 6.3.
Figures 6.4 and 6.5 compare results of our cwsHH algorithm on the above data, using a cache of 1000 and 32 buckets, with ρ = 0.1 in test 1 and ρ = 0.9 in test 2. Fig. 6.4 shows the cardinality estimates of both tests for the 50 most frequent elements in the data. Test 2, with ρ = 0.9, reported all of the top 50 elements, whereas test 1 had a 2% false negative rate (enlarged icons indicate items reported only by test 2). Fig. 6.5 shows the combined weight per item in each test, compared to both the frequency and the cardinality of the items in the data. The smaller ρ is, the closer the combined weights are to the cardinality.
6.7.2.2 Memory Usage
In Fig. 6.6, we compare dwsHH to a simple and highly inefficient algorithm which counts the number of distinct values associated with each key, as well as to the One-Filter Superspreaders algorithm [116]. Our algorithm consumes a constant amount of space, while the simple algorithm consumes space that is linear in the number of distinct pairs seen. The Superspreaders algorithm does slightly better than the simple algorithm, yet consumes significantly more space than ours. We note that the two-filter variant of the Superspreaders algorithm achieves a better asymptotic memory-usage model, yet its memory usage still grows linearly with the stream length. It is also far more complicated, and its memory usage is more susceptible to implementation factors.

Figure 6.3: Distinct Weighted Sampling (dWS): 32 Buckets, 1000 Items
Figure 6.4: Combined Weighted Sampling (cWSHH) Modified rho: accuracy
Figure 6.5: Combined Weighted Sampling (cWSHH) Modified rho: combined weight
Figure 6.6: Memory usage [bytes] as a function of the number of distinct pairs, for Our Algorithm, One Filter Superspreaders and Simple Counting.
Chapter 7
Mitigating DNS Random
Subdomain DDoS Attacks Using
Distinct Heavy Hitters
7.1 Overview
The Domain Name System (DNS) service is a critical element of Internet functionality. Distributed Denial of Service (DDoS) attacks on the DNS service typically consist of many queries coming from a large botnet and sent to the root name servers or to an authoritative name server along the domain chain. According to Akamai's State of the Internet report [16], nearly 20% of DDoS attacks in Q1 of 2016 involved the DNS service, some of them targeting the root name servers [117].
One type of particularly hard to mitigate DDoS attack is the randomized attack on the DNS service called the Random Subdomain Attack [91] (also known as the Authoritative Exhaustion Attack [23], Nonsense Name Attack [74], or Pseudo-random Subdomain Attack [17]). In this attack, queries for many different pseudorandom non-existent subdomains (subkeys) of the same primary domain (key) are issued [17]. Since the response to a query for a new subdomain is not cached at the DNS resolver, these queries are propagated to the domain's authoritative server, overloading those servers and collaterally impacting the recursive resolvers of the Internet service provider.
Figure 7.1: DNS Random Subdomain attack overview

Random Subdomain attacks were first witnessed in China in 2009 [75], yet they remained sporadic for several years. In 2014 they started to make a significant ongoing impact on the network. In [91] it is shown that in the beginning of 2014, the number of distinct domains seen per day at ISP resolvers worldwide began to rise significantly, with substantial peaks witnessed later that year. While these attacks have been witnessed constantly since then, in October 2016 they made headlines, when hundreds of sites were drastically affected by the Mirai IoT botnet attack on domains delegated to the Dyn DNS resolvers [23]. Mitigation of these attacks took hours. Following the Mirai attack in 2016, Forrester Research discussed the crippling effect this attack could have on critical Internet infrastructure [65], claiming it is one way in which the Internet could die.

Currently, Random Subdomain attacks have become very common and continue to baffle administrators of both recursive and authoritative DNS resolvers. Indeed, data collected in our own campus network revealed such an attack on an authoritative DNS resolver on our campus in January 2017.
Top companies involved in DNS security have addressed these attacks, clearly stating the need for efficient mitigation of such attacks and discussing some of their current solutions. Dyn has suggested that customers obtain additional secondary DNS providers [113]. Security specialists at Akamai Technologies, which operates an authoritative DNS service similar to Dyn Managed DNS, discuss the need to protect the DNS infrastructure and how these attacks can indirectly affect organizations relying on an attacked vendor. They claim that their segregated and distributed DNS architecture is a major factor in protecting against such large-scale DDoS attacks, yet they do not provide insight as to how they may provide a solution specially crafted for such attacks [110]. Cloudflare discuss how their solution can easily scale as a form of defense against these high-volume attacks [61], yet again, no specialized solution is discussed. Additional companies such as Secure64 [103] and Infoblox [62] indicate that they offer solutions for these attacks, yet little detail is provided as to how the attacks are detected or mitigated.
Mitigation of Random Subdomain attacks is difficult since the packets in the attack are correctly formed DNS requests. Furthermore, the queries are normally received from legitimate ISP clients, and therefore source-based filtering cannot be used. The solution of Internet providers so far has been to identify the targeted zone manually by analyzing query logs, which can take a significant amount of time, and to temporarily prevent the name server from handling queries for this zone [17, 74] (or, alternatively, to reduce the number of queries handled using rate limiting).
7.1.1 Our Contribution
We present a system for the mitigation of this attack. Our system is based on the
observation that the number of distinct subdomains in queries for targeted domains
significantly increases during attack time due to the random part of the query. Our
system detects this sudden rise in the number of distinct subdomains, and therefore
identifies the targeted domain automatically. Depending on the rate of the attack, our
system can detect an attack within seconds of attack start time.
During normal network operation, the number of distinct subdomains for each domain is usually relatively constant and typically small. One exception to this is the increasing usage of disposable domains. These are large volumes of automatically generated domains, legitimately created by top sites and services (e.g., social networks and search engines), to give some signal to their servers [36]. By analyzing traffic during normal server load (i.e., "peacetime"), our system creates a baseline of the normal number of distinct subdomains, so that it can detect the abnormal rise during an attack. Using this baseline, our system can identify attacks while significantly reducing the false positives which may be caused by the use of disposable domains.
Furthermore, by analyzing the peacetime traffic we are able to automatically identify most of the legitimate requests for the targeted domain. For example, suppose the query mail.targetsite.com is often found during peacetime. During an attack composed of queries of the form < Randomstring >.targetsite.com, our system identifies queries for legitimate subdomains of targetsite.com, therefore allowing queries such as mail.targetsite.com to be handled and eliminating many false positives. Attack signatures extracted by our system can be matched against subsequent queries so that attacks can be mitigated quickly and accurately.
We make use of our distinct heavy hitter algorithm (Chapter 6), using the suffix of the domain as the key and the prefix as the subkey, as we shortly explain. Consider a stream of DNS queries, with the top-level domain serving as the key. A key that appears a large number of times in the query stream constitutes a "classic" heavy hitter (e.g., google.com, cnn.com, etc.). If each query's subdomain serves as the subkey (e.g., mail., home., game1., etc.), a key with many different subkeys is then a distinct heavy hitter (dHH).
At the core of our system is a mechanism that, given a stream of DNS queries, extracts the hierarchy of heavy distinct domains. There are two main challenges in this hierarchy extraction. The first is that there are very many different queried domains. Naively identifying the number of distinct subdomains for each domain requires maintaining the set of distinct subdomains for each queried domain, which would take far too much space. The second is that constructing this hierarchy exactly requires placing all of the queried domains in a structure which would quickly become very large. To solve the first issue, our system identifies the heavily distinct domains using our distinct heavy hitter algorithm (Chapter 6). To solve the second issue, we have devised a structure which builds an approximate hierarchy efficiently, using only the heavily distinct domains detected by our algorithm (Section 7.3).
7.2 Attack Overview
As depicted in Figure 7.1, the attack works in the following manner: the initiator of the attack (the attacker) utilizes botnets, causing the compromised machines to send many different (unique) queries for the same target domain. For example, attack queries may be of the form < Randomstring >.targetdomain.com. These queries are sent by the clients directly, or through the clients' open resolvers, to their Internet Service Provider's (ISP) resolvers. Since each request is unique and non-existent, the ISP's resolvers recursively query the target domain's authoritative server.
Initially, the authoritative server is able to respond, and typically answers with an "NXDOMAIN" response, indicating that the domain cannot be found. At some point, when the authoritative server becomes overwhelmed, it will either crash or implement a response rate-limiting mechanism. Either way, no response will be received from the authoritative server and it will appear unresponsive to the ISP.

Once this occurs, the ISP servers, which store each recursive request until a response is received, will exhaust all available storage space and also become debilitated. In this state, the ISP resolvers can no longer handle legitimate requests from non-compromised clients, severely degrading their service capabilities.
7.2.1 Current Detection Techniques
Detection mechanisms for this attack mostly consist of manual identification of the targeted domain through anomalies in the server's resource consumption and in the backlog of recursive client queries.

Another possible detection technique is to detect the rise in the number of "NXDOMAIN" responses. While this can help in some cases, in many cases the attack rate is extremely high, causing the authoritative server to crash very quickly. As we mentioned above, once the server crashes, no response will be received for queries for the targeted domain, nor for other domains hosted by that server. At this point no more "NXDOMAIN" responses are received from the server. This temporary rise needs to be detected very quickly and can easily be missed. Furthermore, this kind of detection mechanism needs to be placed at a location in the network which allows it to view both the queries and the responses. Due to routing constraints, this is not always possible.
Authoritative servers have little ability to defend against these attacks. In light of recent attacks, industry specialists have advised companies to have their DNS authoritative server hosted by more than one DNS provider, in hopes of withstanding at least some of the attacks.
7.3 Random Subdomain Attack Mitigation System
7.3.1 System Overview
System Placement: As depicted in Figure 7.2, our system can be placed at the ingress point of the ISP DNS resolvers. Alternatively, our system can be placed directly at the ingress point of the authoritative server, to mitigate attacks on a specific subdomain. That is, if the attack queries are of the form < random >.subdomain.victim.com, our technique would identify this attack. Additionally, if many domains are hosted on the same authoritative server, our system can detect an attack on any of these domains.
Figure 7.2: DNS Random Subdomain mitigation High-level approach
Figure 7.3: DNS Random Subdomain mitigation system overview
Attack Detection: As depicted in Figure 7.3, attack detection is done in two stages. The first stage is a preprocessing of traffic captured when there is a normal DNS query load (this is considered to be peacetime). Using our system, a baseline is created which identifies domains that have many different subdomains on a regular basis (for example, domains that use disposable domains). Additionally, a whitelist of common domain subparts (e.g., mail, maps) is identified and used during mitigation to allow the legitimate queries of targeted domains. The second stage is an analysis of traffic during an attack. The system identifies domains which are potential attack targets. If the number of distinct queries for these domains is significantly higher than the peacetime baseline, these domains are set as attack signatures.
The main component of our system is the Heavy Distinct Domain Hierarchy Extractor (Section 7.3.2.2), which is used both for the baseline creation and for the attack signature extraction.
Attack Mitigation: Once signatures have been extracted, subsequent queries are matched against the attack signatures. Queries which match an attack signature and are not whitelisted are dropped before reaching the ISP resolvers. For example, in an attack on victim.com, our system would generate the signature '*.victim.com'. Using the whitelist of common domain subparts, our system identifies that 'mail.victim.com' is not an attack query, and it is allowed. Other queries for 'victim.com' are dropped. The whitelist can be fine-tuned for each signature during the attack to further reduce false positives.
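A sketch of this matching step, with hypothetical signature and whitelist sets; the function name and the exact policy (checking the first label of the subdomain against the whitelist) are our assumptions:

```python
def should_drop(query, signatures, whitelist):
    """Mitigation filter sketch: drop a query matching an attack signature
    (e.g. '*.victim.com') unless the first label of its subdomain-prefix
    is whitelisted. (Illustrative; names are not from the thesis.)"""
    parts = query.split(".")
    for i in range(1, len(parts)):
        domain = ".".join(parts[-i:])            # candidate domain-suffix
        if "*." + domain in signatures:
            subparts = parts[:-i]                # the subdomain-prefix
            if subparts and subparts[0] in whitelist:
                return False                     # legitimate, known subpart
            return True                          # matches signature, not whitelisted
    return False                                 # no signature matched

sigs = {"*.victim.com"}
white = {"mail", "www"}
print(should_drop("xq7zk.victim.com", sigs, white))   # → True
print(should_drop("mail.victim.com", sigs, white))    # → False
```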
Our system makes no assumptions on the resource consumption or behaviour of the
resolvers making it more robust in terms of detection.
7.3.2 System Details
7.3.2.1 Preliminaries and Notations
We define domains, subdomains and subparts in the following manner: given a query q = ...d6.d5.d4.d3.d2.d1, a subpart of a domain is any individual part di (i.e., d1, d2, etc.). A domain-suffix of the query is any suffix of q composed of whole subparts; that is, domain-suffixes can be d1; d2.d1; d3.d2.d1; and so on, up to the entire query ...d6.d5.d4.d3.d2.d1. The subdomain-prefix of a domain-suffix is the prefix of q up to, and not including, the domain-suffix. Therefore, for domain-suffix d1 the subdomain-prefix is ...d6.d5.d4.d3.d2, for domain-suffix d2.d1 the subdomain-prefix is ...d6.d5.d4.d3, and so forth.

For brevity, we refer to a domain-suffix as a domain and to a subdomain-prefix as a subdomain.

Note that we refer to the length of a domain as the number of its subparts, rather than the number of characters in the domain.
We summarize the system parameters and notations used throughout this section
in Table 7.1.
7.3.2.2 Heavy Distinct Domain Hierarchy (HDDH) Extractor
Main concepts: In order to extract the hierarchy of heavy distinct domains, we need to efficiently compute how many of the distinct subdomains are contributed by each branch of the hierarchy. For each heavily distinct domain we would like to identify which, if any, of its subdomains is also heavily distinct. Furthermore, we would like to calculate the cumulative cardinality of all of its heavily distinct subdomains.

The Heavy Distinct Domain Hierarchy (HDDH) can be visualized using a trie. As can be seen in Figure 7.4, the trie holds mostly heavily distinct domains. Each edge of the trie is labeled with a domain subpart. Each node represents a domain (e.g., the domain ∗.site.org is represented by the right-most leaf in the trie). Each node is labeled with the number of distinct subdomains seen for that domain. For example, there were 500 different queries for domain ∗.com, of which 420 were for domain ∗.google.com, 60 for ∗.cnn.com, and the remaining 20 were for domains that had a cardinality below
Symbol        Meaning
k_i           The number of items in structure DHH_i
Cardest_d     Cardinality estimate of item d
min_sig       Minimal cardinality for a signature
min_base      Minimal cardinality for the baseline
min_white     Minimal subpart frequency for the whitelist
min_heavy     Minimal cardinality for the domain cover
p             Minimum ratio between an item's cardinality and the cardinality sum of its "children" nodes
t             Time interval for extracting signatures
min_attack    Minimal cardinality for an attack over the baseline
r_baseline    Required attack ratio from the baseline

Table 7.1: System Parameters and Notations
min_heavy. Note that the remaining cardinality of each node is the number indicated on the node minus the sum of its child nodes in the next level of the tree. Therefore, the remaining cardinality of ∗.com is 20 (calculated as 500 − (420 + 60)).
We would like to find a minimal set of nodes in the trie, each with a cardinality above min_heavy, that covers the leaves of the trie. Assume we would like to identify domains which have at least 50 distinct subdomains, i.e., min_heavy = 50. Intuitively, if ∗.cnn.com and ∗.google.com are signatures of our algorithm, then the node representing ∗.com only accounts for the remaining 20, which does not surpass the minimum of 50 distinct subdomains. In this case, there would be three nodes selected for the cover, and they are marked on the trie. The heavy domain cover would therefore be: ∗.cnn.com, ∗.maps.google.com, ∗.site.org.
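The cover selection on the trie of Figure 7.4 can be sketched recursively. Since the text does not give the counts of ∗.maps.google.com and ∗.site.org, the values 400 and 80 below are assumed for illustration only:

```python
def heavy_cover(node, min_heavy):
    """Bottom-up cover sketch: a child in the cover absorbs its subtree
    count; a node joins the cover only if its remaining cardinality is
    still >= min_heavy. Returns (cover, covered_count)."""
    name, count, children = node
    cover, covered = [], 0
    for child in children:
        sub_cover, sub_covered = heavy_cover(child, min_heavy)
        cover += sub_cover
        covered += sub_covered
    if count - covered >= min_heavy:     # remaining cardinality still heavy
        cover.append(name)
        covered = count                  # the whole subtree is now covered
    return cover, covered

# Figure 7.4 example; the 400 and 80 are assumed, not given in the text.
trie = ("*", 580, [
    ("*.com", 500, [
        ("*.google.com", 420, [("*.maps.google.com", 400, [])]),
        ("*.cnn.com", 60, []),
    ]),
    ("*.org", 80, [("*.site.org", 80, [])]),
])
cover, _ = heavy_cover(trie, min_heavy=50)
# → ['*.maps.google.com', '*.cnn.com', '*.site.org'] (in traversal order)
```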
HDDH Extractor Structure: Since extracting the entire hierarchy of queried domains would consume far too many resources, we provide an approximate solution that extracts the desired information mainly for the heavily distinct domains.

Figure 7.4: Hierarchy of heavy distinct domains. Bold-edged nodes are in the cover; dashed-edge nodes do not surpass the minimum cardinality.

The HDDH Extractor is composed of our fixed-size streaming Distinct Weighted Sampling (specifically, Integrated dwsHH) structures for Distinct Heavy Hitter detection. Each of these Integrated dwsHH structures maintains a constant number k of keys (domains). For each domain, an approximate distinct counter of its subkeys (subdomains) is maintained, along with a cardinality estimate (CardEst) which estimates the number of distinct subkeys seen thus far for that domain. Up to a bounded error, at any point in time, all domains with a high enough cardinality will be in the Integrated dwsHH structure. Further details are provided in Section 6.5.2.
To achieve practicality, both in terms of performance and of implementation simplicity, our structure supports domains of length at most 5. We have found that the average length of queried domains is between 2 and 3, and most legitimate domains are of length at most 5.
As seen in Figure 7.5, our structure maintains 5 Distinct Heavy Hitters (specifically
Integrated dwsHH) structures, which we denote DHH1-DHH5. We denote by ki the size of DHHi, meaning that each DHHi contains at most ki keys. Furthermore, the keys in each DHHi are domains of length i (i.e., domains of the form ∗.di.di−1. ... .d1).
Given a stream of traffic (or a traffic capture), for each query q = ...d6.d5.d4.d3.d2.d1
received, the key ∗.d1 (of length 1) is inserted into DHH1 with subkey ...d6.d5.d4.d3.d2, the key ∗.d2.d1 (of length 2) is inserted into DHH2 with subkey ...d6.d5.d4.d3, and so on.

Figure 7.5: Heavy Distinct Domain Hierarchy (HDDH) Extractor
However, an insertion is made to DHH2 only if ∗.d1 was already found in DHH1.
Similarly, an insertion is made to DHH3 only if ∗.d2.d1 was already found in DHH2
and so on. This means that a longer domain is only inserted into the structure if a shorter domain of that query was already sufficiently heavy to be an item in the structure. In this manner, the algorithm only inserts domains which are somewhat likely to become signatures.
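The conditional insertion described above can be sketched as follows. This is a simplified illustration, not the thesis's bounded-memory Integrated dwsHH: each level is modeled as an exact dictionary mapping a domain key to its set of distinct subkeys, and "already found in DHHi" is interpreted as presence from an earlier query; the names MAX_LEVELS and insert_query are ours.

```python
MAX_LEVELS = 5

def insert_query(levels, query):
    """levels: list of MAX_LEVELS dicts; levels[i-1] holds keys of length i."""
    labels = query.split('.')          # [..., d3, d2, d1]
    admit_next = True                  # a length-1 key is always inserted
    for i in range(1, MAX_LEVELS + 1):
        # Stop when no labels remain to serve as the subkey,
        # or when the shorter key was not already in the structure.
        if i > len(labels) - 1 or not admit_next:
            break
        key = '*.' + '.'.join(labels[len(labels) - i:])
        subkey = '.'.join(labels[:len(labels) - i])
        # Gate the next (longer) level on this key having been seen in an
        # earlier query -- a stand-in for "already found in DHHi" in the
        # bounded dwsHH structure, where a key may be absent or evicted.
        admit_next = key in levels[i - 1]
        levels[i - 1].setdefault(key, set()).add(subkey)
```

With this interpretation, a query populates one level deeper only once its shorter suffixes have already appeared, so rarely-queried long domains never enter the deeper structures.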
Finding a Distinct Heavy Domain Cover: Once the traffic capture has been analyzed, or after every fixed time interval t, the data in the structures needs to be processed and a heavy domain cover must be extracted from the structure. This cover will
be used to extract a domain baseline and attack signatures as shown in Section 7.3.2.3.
To identify the heavy domain cover, using only the items in our HDDH Extractor, we
build a trie as shown in Figure 7.4. Intuitively, each domain found in our extractor can
be placed on a branch of the tree, forming a sort of suffix tree. Additionally, we need to
calculate the cardinality associated with each node. The general idea is as follows: we would like to identify the longest part of the domain that is common to all of the attack queries, so that if, for example, the attack is composed of queries of the form < Randomstring >.subdomain.targetsite.com, our goal is to have the domain '*.subdomain.targetsite.com' in the cover and not a shorter domain such as '*.targetsite.com' or '*.com'. To do so, our algorithm identifies, for every branch, the deepest node that has many distinct subdomains while its child nodes do not.
Given the predefined parameters min heavy and p, the following process is performed:
• For 1 ≤ i ≤ 5, for each DHHi, for each item d ∈ DHHi: if CardEstd < min heavy, discard d.
• For 1 ≤ i ≤ 4, for each DHHi, for each item d ∈ DHHi:
– SumChildrend = Σ CardEst of all items in DHHi+1 s.t. d is their suffix
– Deltad = CardEstd − SumChildrend
– If Deltad / CardEstd ≥ p: insert d into the heavy domain cover.
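The two bullets above can be sketched as follows, assuming each DHHi has already been reduced to a map from domain to its CardEst value. The thesis's second bullet runs over levels 1-4; items at the deepest level have no children, so we apply the same rule to them with SumChildren = 0. The function name heavy_domain_cover is ours.

```python
def heavy_domain_cover(levels, min_heavy, p):
    """levels: list of 5 dicts {domain: CardEst}; levels[i] holds length-(i+1) keys."""
    # First bullet: discard items whose cardinality is below min_heavy.
    kept = [{d: c for d, c in lvl.items() if c >= min_heavy} for lvl in levels]
    cover = []
    for i in range(len(kept)):
        children = kept[i + 1] if i + 1 < len(kept) else {}
        for d, card in kept[i].items():
            # Sum the CardEst of all one-level-longer items that have d as
            # their suffix; d[1:] turns '*.google.com' into '.google.com'.
            sum_children = sum(c for child, c in children.items()
                               if child.endswith(d[1:]))
            delta = card - sum_children
            if delta / card >= p:
                cover.append(d)
    return cover
```

Run on cardinalities we invented to reproduce the example of Figure 7.4 (min heavy = 50, p = 0.3), the cover comes out as ∗.cnn.com, ∗.maps.google.com and ∗.site.org, while ∗.com and ∗.google.com are excluded because nearly all of their cardinality is accounted for by their children.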
7.3.2.3 Attack Detection
Distinct Domain Baseline: Different domains in the Internet have a highly varying
number of distinct subdomains. While the vast majority of domains have a small number of subdomains, some domains have hundreds or thousands of different subdomains. Additionally, certain Internet sites make use of disposable domains [36], meaning they create "one-time" subdomains as part of their regular operation and therefore often have thousands of subdomains on a regular basis.
In Figure 7.6 we show the distribution of the number of distinct subdomains of
domains at each of the four highest domain levels, in a 40M query trace captured at a campus server. The findings show that there are a few top level domains that have a very
high cardinality. There are many second level domains with relatively high cardinality
and cardinality gradually decreases in the third and fourth levels.
To identify domains which normally have many distinct subdomains, our system
processes peacetime (or regular load) traffic to create a baseline. This baseline can be
compared to the number of distinct subdomains queried during an attack to determine if
there is a significant rise in the number of distinct queries for a given domain. To create
a domain cardinality baseline our system uses the HDDH Extractor structure. For each
query q = ...d4.d3.d2.d1 received, the query is inserted into the HDDH Extractor module
for analysis as explained in Section 7.3.2.2. Once the entire capture has been processed
or every fixed time interval, the Distinct Heavy Domain Cover is calculated (as described
in Section 7.3.2.2). Domains identified by this process that have a cardinality over
min base compose the domain baseline.
Attack Signatures Extraction: Our system may be used to process traffic streams
or samples in an ongoing manner to quickly detect a Random Subdomain Attack soon after it starts and extract attack signatures.

Figure 7.6: Distribution of the number of distinct subdomains per domain level
For simplicity, assume attack signature extraction is done using a separate HDDH Extractor module from the one used for baseline creation. Each query received during attack detection time is inserted into the HDDH Extractor and is analyzed as explained in Section 7.3.2.2.
We wish to output a set Sa of attack signatures. As seen in Figure 7.7, every fixed
interval t, a Heavy Distinct Domain Cover is calculated (as described in Section 7.3.2.2).
To generate the signature set Sa, domains in the cover are compared to the distinct
domain baseline described above. The signature set Sa will include domains for which
there is a significant rise in the number of their distinct subdomains both nominally
and in proportion to their baseline cardinality.
This is calculated in the following manner: given the ratio rbaseline and the threshold min attack, denote by CardEstdb the baseline cardinality of domain d (CardEstdb = 0 if d is not in the baseline). For each domain d in the cover:
• If CardEstdb = 0: if CardEstd ≥ min attack, then add d to Sa.
• Else: if (CardEstd − CardEstdb) ≥ min attack AND CardEstd / CardEstdb ≥ rbaseline, then add d to Sa.
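The two cases can be sketched directly (extract_signatures, cover_cards and baseline_cards are our names; both maps associate a domain with its CardEst value):

```python
def extract_signatures(cover_cards, baseline_cards, min_attack, r_baseline):
    """Return the signature set Sa for domains in the heavy domain cover."""
    sa = set()
    for d, card in cover_cards.items():
        base = baseline_cards.get(d, 0)   # CardEstdb = 0 if d not in baseline
        if base == 0:
            # Domain was not distinctly heavy in peacetime at all.
            if card >= min_attack:
                sa.add(d)
        elif card - base >= min_attack and card / base >= r_baseline:
            # Significant rise both nominally and relative to the baseline.
            sa.add(d)
    return sa
```

A domain with a large baseline must therefore grow both by min attack queries and by the factor rbaseline before it is flagged, which keeps naturally heavy domains out of the signature set.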
Figure 7.7: Attack time signature extraction
Subdomain Whitelists: We would like to identify strings which often appear as
sub-parts in domain queries. Such strings are most likely not automatically generated.
Our system uses a classic Heavy Hitters algorithm, such as the Space-Saving algorithm of Metwally et al. [81], to identify strings which are often found as subparts of many domains.
During the baseline creation process in peacetime, for each query q = ...d4.d3.d2.d1 received, each subpart di will be independently inserted into the heavy hitters computation module. Once processing of all queries is completed, all strings with a count of over min white will be inserted into the subdomain whitelist and will later be used by the system for attack mitigation.
Once attack signatures have been extracted, a subdomain whitelist can be specifically extracted for each signature, to further reduce false positives. For each signature, we maintain a separate heavy hitter module. For each query received for a signature s, of the form q = ...dj+2.dj+1.dj.s, each subpart di will be independently inserted into the heavy hitters computation module. Once enough legitimate requests are received, the subparts of legitimate requests will be identified as the heavy hitters and they can be added to the whitelist. Note that given the nature of the attacks, where each randomly generated subdomain appears a very small number of times (typically 1 or 2), the subparts stream is very heavy tailed and therefore legitimate subparts which occur more times should stand out relatively quickly. We can choose to use a heavy hitters module which is better suited for heavy-tailed streams, such as the one presented in [27].
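A minimal Space-Saving sketch in the spirit of [81] for the subpart whitelist could look as follows; class and method names are ours, and the real module may instead use the heavy-tail-oriented variant of [27]:

```python
class SpaceSaving:
    """Track at most k candidate subparts; maintained counts are upper bounds."""

    def __init__(self, k):
        self.k = k
        self.counts = {}

    def add(self, item):
        if item in self.counts or len(self.counts) < self.k:
            self.counts[item] = self.counts.get(item, 0) + 1
        else:
            # Evict the minimum-count item; the newcomer inherits its count
            # plus one, which preserves the overestimate guarantee.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            self.counts[item] = floor + 1

    def whitelist(self, min_white):
        return {s for s, c in self.counts.items() if c >= min_white}
```

Feeding every subpart di of every peacetime query into `add` and then calling `whitelist(min_white)` yields the subdomain whitelist; rare random subparts churn through the low-count slots while legitimate repeated subparts accumulate counts above min white.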
7.3.2.4 Attack Mitigation
Mitigation is done using the signature set Sa described above along with the Subdomain
Whitelist. Once an attack is detected, each subsequent query q = ...d2.d1 is checked to see if it has a common suffix with one of the signatures in Sa. Note that the suffix has to be at least as long as one of the signatures; that is, one of the signatures should be its suffix. If, for example, ∗.google.com and ∗.maps.google.com are both signatures, then for the query amap.maps.google.com the longest common suffix should be maps.google.com, yet for the query mysite.com, .com should not be considered a common suffix since it is not a complete signature. If no common suffix is found between
q and the signatures in Sa, the query is allowed. Otherwise, a common suffix of the
form dj .dj−1....d1 is found. Denote dj+1 to be the subpart in q immediately preceding
the common suffix. If dj+1 is found in the Subdomain Whitelist, the query is allowed.
Otherwise, the query is dropped (See Fig. 7.3).
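The per-query mitigation decision can be sketched as follows (allow_query and its arguments are our naming; signatures are strings of the form '*.subdomain.target.com', and we require the query to be strictly longer than the matched signature so that a preceding subpart dj+1 exists):

```python
def allow_query(query, signatures, whitelist):
    """Return True if the query should be allowed, False if it should be dropped."""
    labels = query.split('.')
    best = None  # label list of the longest signature that is a suffix of query
    for sig in signatures:
        sig_labels = sig.split('.')[1:]          # drop the leading '*'
        if (len(labels) > len(sig_labels)
                and labels[-len(sig_labels):] == sig_labels
                and (best is None or len(sig_labels) > len(best))):
            best = sig_labels
    if best is None:
        return True                       # no complete signature is a suffix
    d_next = labels[-len(best) - 1]       # subpart immediately preceding suffix
    return d_next in whitelist            # allow only whitelisted subparts
```

On the example above, amap.maps.google.com matches the longer signature ∗.maps.google.com and is allowed only if 'amap' is whitelisted, while mysite.com matches no complete signature and passes through.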
7.3.2.5 Timely Attack Detection
Attack detection is a time-critical task. To ensure timely detection of the attack, our system works in intervals of a predetermined length l (e.g., 20 minutes) in both peacetime processing and attack time analysis. After each interval l, counters are refreshed by a complete restart. In this manner, the measurements performed by the system during an attack are comparable with those taken during peacetime.
That said, the system is required to detect an attack within seconds of attack start time. To support this requirement, counters are checked every fixed (short) time interval s. At each interval sj, the incremental cardinality estimate delta of each key since sj−1 is calculated and local cardinality peaks are identified. That is, for each key k, deltaj(k) = k.CardEstsj − k.CardEstsj−1.
During peacetime analysis, for each key k identified as being distinctly heavy during peacetime, define delta max(k) = max1≤j≤n deltaj(k), where n denotes the overall number of short intervals s that were processed during peacetime.
During the attack detection phase, at each interval sa, for each key k, we compare deltaa(k) with delta max(k). If deltaa(k) ≫ delta max(k), the key is suspected of being under attack.
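A sketch of this per-interval comparison; the "≫" test is made concrete with a multiplicative factor, which is our assumption, as is the function name detect_spikes:

```python
def detect_spikes(prev_est, curr_est, delta_max, factor):
    """Flag keys whose cardinality growth over the last short interval
    far exceeds their largest peacetime growth delta_max(k)."""
    suspects = set()
    for key, cur in curr_est.items():
        delta = cur - prev_est.get(key, 0)
        # Keys with no peacetime peak get delta_max = 0, so any growth
        # flags them; a deployment may want a minimum absolute delta too.
        if delta > factor * delta_max.get(key, 0):
            suspects.add(key)
    return suspects
```

Here prev_est and curr_est are the per-key CardEst snapshots at intervals sj−1 and sj, and delta_max holds the peacetime peaks.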
Accumulative Techniques: We provide two additional techniques which allow the
counters to maintain some history between intervals.
• Accumulative weighted counters: At every interval l it is possible to take a snapshot of the counters prior to the restart. In this manner, we can maintain an accumulative weighted counter. That is, an accumulative cardinality estimate can be calculated by adding the cardinality estimate of a key k in the snapshot, k.CardEstsnapshot, multiplied by w1 (0 < w1 < 1), to the current cardinality estimate, k.CardEstcurrent, multiplied by w2 (0 < w2 < 1), so that the accumulative cardinality estimate of a key k is equal to accCardEst(k) = w1 · k.CardEstsnapshot + w2 · k.CardEstcurrent.
• Decaying average: Decaying average may be performed by clearing one of the
buckets in the distinct counter of each item at each time interval l. In each distinct
counter, buckets are cleared in a round robin manner. In this way, the distinct
counter of each item is decremented at every interval, and items which are no
longer distinctly heavy will eventually be evicted from the structure and make
room for items that have recently grown to be distinctly heavy.
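Both techniques can be sketched together. The accumulative estimate is the formula above; for the decaying average we model the distinct counter's buckets as exact sets (the real counters are approximate), clearing one bucket per interval in round-robin order. All names here are ours.

```python
def acc_card_est(snapshot_est, current_est, w1, w2):
    # accCardEst(k) = w1 * CardEst_snapshot + w2 * CardEst_current
    return w1 * snapshot_est + w2 * current_est


class BucketedDistinctCounter:
    """Distinct counter split into buckets so that it can decay gradually."""

    def __init__(self, num_buckets=8):
        self.buckets = [set() for _ in range(num_buckets)]
        self.next_clear = 0

    def add(self, item):
        self.buckets[hash(item) % len(self.buckets)].add(item)

    def estimate(self):
        return sum(len(b) for b in self.buckets)

    def decay(self):
        # Called once per interval l: clear one bucket, round robin.
        self.buckets[self.next_clear].clear()
        self.next_clear = (self.next_clear + 1) % len(self.buckets)
```

After every interval, `decay()` removes roughly a 1/num_buckets fraction of the counter's state, so a key that stops being distinctly heavy shrinks toward zero over num_buckets intervals.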
7.4 Evaluation
7.4.1 University Network Captures
We examined a 4GB trace of nearly 40M DNS queries captured by DNS resolvers in our university network. These include both an authoritative name server for some of the domains at our university and a recursive server handling DNS queries coming from clients within the campus. The capture was taken over nearly a month and a half. In
this capture we saw a small attack on a university authoritative DNS server. Figure 7.8
shows the number of distinct queries made to this domain over the course of one day. As
can be clearly seen, at around 10AM the number of distinct domains nearly doubled,
and this persisted at different volumes for about 10 hours. After 10 hours the number
of distinct queries went back to the normal baseline. We ran the capture through our
system, using the trace from the previous day as a baseline. Our peacetime capture contained nearly 776K queries, and the capture for the day of the attack contained over 900K queries. The parameters of our system were set as follows: k1 = 100, k2 = 500,
k3 = 500, k4 = 250, k5 = 100, l = 32, min sig = 500, min base = 20, min white = 5,
min heavy = 20, p = 0.3, min attack = 500, rbaseline = 1.4.
Our system was able to identify the attack, extracting a single attack signature from those captures after processing only several dozen attack packets.
Figure 7.8: Distinct queries for campus authoritative server per hour, over 1 day.
Our current implementation is able to process up to 40K queries per second, and it has yet to be optimized for performance. Nonetheless, as our system can detect an attack after processing merely dozens or hundreds of attack packets, it can detect an attack within seconds or even less, depending on the rate of the attack.
7.4.2 ISP Attack Captures
Our Random Subdomain attack mitigation system presented above has been evaluated
on traces of actual attacks captured by a large Internet Service Provider (ISP).
We analyzed 5 captures which were sniffed during different Random Subdomain
attacks and contained both attack and legitimate DNS queries. All captures were taken
within a single month in 2014. Note that most of the captures contain 5000 queries as
that was the set amount that was sniffed for each attack spotted. The ISP manually identified the Random Subdomain attacks as they were occurring. The attacks targeted both domains hosted by the ISP's authoritative name server and domains outside the ISP's network. Hence, the attacks affected both the authoritative name servers of the ISP
and its recursive resolvers. We compare our results to the analysis performed manually
by the ISP. We use a cache size of k = 50. Note that some of the attacks analyzed had a very high percentage of distinct queries, while others had lower rates; the repetitions are of randomly generated queries that were each repeated several times in the traffic. As we did not have access to a peacetime capture, we used one of the captures to create a baseline for the others.
Consider attack 1 in Table 7.2. The capture consisted of 92469 DNS queries. Of these, 4133 are attack queries targeted at the same zone, with a randomly generated least significant domain sub-part, containing 2051 distinct queries, meaning that some of the queries were repeated. Of the 4133 queries, the system counted 4123, meaning that 10 queries for the attacked zone had gone through before the zone was placed in the structure (i.e., in the cache). Once inside, the zone was not evicted from the structure at any point and all subsequent queries were counted; hence 99.8% of the queries were identified.
Source | Queries in capture | Attack queries | Distinct attack queries | Attack queries identified
1 | 92469 | 4133 | 2051 | 99.8%
2 | 5000 | 389 | 367 | 99.7%
3 | 5000 | 602 | 567 | 100%
4 | 5000 | 334 | 330 | 100%
5 | 5000 | 3364 | 631 | 99.8%
Table 7.2: Results on Real DNS Attack Captures
Chapter 8
Discussion and Conclusion
8.1 Contributions
We provide a brief overview of the contributions we have presented in this dissertation:
Detection of Heavy Flows in Software Defined Networks: Based on different
parameters, we differentiate between heavy flows, elephant flows and bulky flows and
present innovative algorithms to detect flows of the different types in an SDN switch.
We propose the Sample&Pick algorithm, an efficient method for detecting large or heavy flows going through an SDN switch. The Sample&Pick algorithm performs a division of labour between the switch and the controller, coordinating between them to efficiently identify the large flows. Our constructions use, in a sophisticated way, the Sample and Hold [49] algorithm along with the Space Saving algorithm [81] to minimize both the switch-controller communication and the number of entries in
the switch flow table. We evaluate the performance of our Sample&Pick algorithm by
measuring its inaccuracy rates and resource consumption. Our evaluations show that
our algorithm provides a good tradeoff between the amount of communication between
the switch and the controller and the amount of space required on the switch while
being able to identify the heavy hitters.
Additionally, we consider a distributed model with multiple switches and propose
solutions for efficient scaling of our techniques.
Our methods rely on standard and optional features of OpenFlow 1.3 and can also
be implemented in the P4 language. Additionally, the techniques presented are both
flow-table size and switch-controller communication efficient.
String Heavy Hitters: We propose the String Heavy Hitters problem and present the Double Heavy Hitter algorithm for efficiently solving it. This algorithm finds popular strings of variable length in a set of messages, using the classic Heavy Hitters algorithm as a building block in a non-trivial way. The algorithm runs in a single pass over the input and uses space that depends only on predefined parameters.
Zero-Day Signature Extraction for High Volume Attacks: We present an innovative system for automatic extraction of signatures for application-level zero-day
DDoS attacks. Our system takes as input two streams (or stream samples) of traffic
collected during an attack and during peacetime. A peacetime traffic sample may be
collected as a routine scheduled procedure. The attack traffic sample can be collected
once the attack has been identified. The system then analyzes both traffic samples to
identify content that is frequent in the attack traffic sample yet appears rarely or not
at all in the peacetime traffic.
Our system makes no assumptions on traffic characteristics such as client behaviour,
address dispersion, URL statistics and so forth. Therefore, it is generic in that it can
be easily adapted to solving other network problems with similar characteristics.
We test our system on real-life traffic logs of actual attacks and of peacetime. We show that our solution performs well in practice, with an average recall of 99.95% and an average precision of 98%.
Heavy Hitters in Pairs: Our main contributions are novel and efficient sampling-based structures for distinct Heavy Hitters (dHH) and combined Heavy Hitters (cHH) detection in a stream of < key, subkey > pairs, which are able to track only O(ε−1) keys and require only a single pass over the input. Our dHH design significantly improves over existing work. We demonstrate, via experimental and theoretical evaluations, the effectiveness of each of our algorithms in terms of accuracy and memory consumption.
Random Subdomain DNS Attacks: Random subdomain DDoS attacks on the
Domain Name System service have recently become a growing threat to basic Internet functionality. In these attacks, many queries are sent for a single or a few victim domains, yet they include highly varying non-existent subdomains generated randomly.
While the attack targets one or a few authoritative name servers, it usually comes with
significant collateral damage to DNS servers of different providers on its route. We present a system for mitigation of such attacks. To the best of our knowledge this is
the first such system. The design makes use of our structures for dHH detection. We
perform extensive experimental evaluation on real DNS attack traces, demonstrating
the effectiveness of our system.
8.2 Future Work
The drastically growing scale of today’s networks requires building solutions that can
adapt to the changing needs of the network. Relatively new network concepts such as
Software Defined Networks (SDN) and Network Function Virtualization (NFV) offer
new capabilities and architectures which may be leveraged to allow for more flexible solutions. SDN and protocols such as OpenFlow [80] allow the decoupling of the network control plane from the data plane, introducing new flexibility in network management. NFV is part of the transition of network components from specialized hardware to general purpose machines [89], and therefore allows new flexibility in network functionality.
While both of these paradigms allow simplified deployment of new network tools, the
efficient transition of these tools to wide deployment, so that they may cleverly utilize
this new architecture, is not at all trivial. An interesting direction would be to expand
our solutions to a distributed network setting with the aim of building solutions that
can be scaled and virtualized, by collaboration of distributed and possibly hierarchical
network entities, located in different sites, sharing data and resources.
Detection of Heavy Flows in Software Defined Networks: Generally, it would
be interesting to study security vulnerabilities specific to SDN and research how the
tools we have presented thus far can be combined with other SDN monitoring capabil-
ities to mitigate network attacks such as DDoS, Worms etc.
Additionally, our research on the distributed setting solution can be expanded to support more complex settings and topologies, for example, a topology in which switches may have different roles in the system, such as switches found at ingress and egress points.
String Heavy Hitters: Our Double Heavy Hitter algorithm is able to find strings
of varying length. Each signature formed is a fixed string which may be searched for in
the data. While in the past, these types of signatures may have sufficed, mitigation of new attacks requires enhanced tools. Attackers are constantly making up new types of attack signatures that are more difficult to identify. It would be interesting to expand the variability of the signatures that the algorithm is able to extract, to include, for example, signatures which contain regular expressions, or signatures that contain "Don't-Care"s and mismatches. Specifically, we should devise ways to generate signatures in which part of the string is fixed and part may be randomly generated. Partially random
signatures are a major challenge facing security experts today, and such a solution would allow fine-tuning the mitigation and detection of such network attacks and anomalies.
Zero-Day Signature Extraction for High Volume Attacks: A possible direction
would be to expand our solution so that it is able to monitor traffic in different network
locations and identify signatures based on the analysis performed in the different sites.
To do so, a scalable solution for the String Heavy Hitters problem should be developed
that can be implemented as a virtualized network function (VNF). In [123], Yi et al.
propose an algorithm for identifying classical heavy hitters in a distributed setting, yet
the transition to solving the String Heavy Hitters problem in a distributed setting is
challenging due to the dependencies between the frequencies of the fixed length and
the varying length strings.
Random Subdomain DNS Attacks: According to Akamai's State of the Internet report [16], nearly 20% of DDoS attacks in Q1 of 2016 involved the DNS service, making mitigation of such attacks extremely important. Moreover, even some of the Internet's
DNS root name servers were targets of DNS-based DDoS attacks [117]. Such attacks can significantly impact the availability of websites globally. In order to detect such malicious behaviour, we must first understand the legitimate usage of the DNS by different companies and Internet entities. It would be interesting to research the characteristics of DNS traffic to identify current trends and changes in DNS usage. Due to the recent introduction of generic top level domains (gTLDs), some of the fundamental characteristics of DNS traffic are changing, which raises the need for such a study. This would also be very helpful in gaining a better understanding of how disposable domains [36] in DNS are being used today, which is significant in light of the research done so far.
Additionally, there are possible advancements which can be made in the detection of
Random Subdomain DNS attacks. For example, due to the hierarchical and distributed
architecture of DNS servers, a possible next step would be to study how different servers can collaborate to identify such attacks more efficiently.
Bibliography
[1] The CAIDA UCSD Anonymized Internet Traces 2009 - Sep. 17, 2009. http://www.caida.org/data/passive/passive_2009_dataset.xml.
[2] The CAIDA UCSD Anonymized Internet Traces 2012. http://www.caida.org/data/passive/passive_2012_dataset.xml.
[3] The CAIDA UCSD Anonymized Internet Traces 2014 - Mar. 20, 2014. http://www.caida.org/data/passive/passive_2014_dataset.xml.
[4] Cisco NetFlow. http://www.cisco.com/c/en/us/tech/quality-of-service-qos/netflow/index.html.
[5] NoviFlow's NoviKit. http://noviflow.com/products/novikit/ (accessed March 2015).
[6] NoviFlow's NoviWare. http://noviflow.com/products/noviware/ (accessed January 2017).
[7] UCLA D-WARD project: Sanitized UCLA CSD traffic traces. https://lasr.cs.ucla.edu/ddos/traces/.
[8] CERT Advisory CA-1996-21: TCP SYN flooding and IP spoofing attacks, 1996.
[9] Snort: Open source network intrusion detection system, 2002.
[10] Leading security companies. Personal communication, 2012-2013.
[11] Yehuda Afek, Anat Bremler-Barr, and Shir Landau Feibish. Automated signature extraction for high volume attacks. In Symposium on Architecture for Networking and Communications Systems, ANCS '13, San Jose, CA, USA, October 21-22, 2013, pages 147-156. IEEE Computer Society, 2013.
[12] Yehuda Afek, Anat Bremler-Barr, and Shir Landau Feibish. Zero-day signature extraction for high volume attacks. Transactions on Networking. Submitted.
[13] Yehuda Afek, Anat Bremler-Barr, Shir Landau Feibish, and Golan Parashi. Cloud-based implementation of the signature extraction system. https://www.autosigen.com/.
[14] Yehuda Afek, Anat Bremler-Barr, Shir Landau Feibish, and Liron Schiff. Detecting heavy flows in the SDN match and action model. Computer Networks Journal: special issue on Security and Performance of Software-defined Networks and Functions Virtualization, 2017. Submitted.
[15] Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid to bibliographic search. Commun. ACM, 18(6):333-340, 1975.
[16] Akamai [State of the Internet] / Security - Q1 2016 report. www.akamai.com/StateOfTheInternet, 2016.
[17] Cathy Almond. Recent authoritative exhaustion attacks, 2016. https://www.arbornetworks.com/threats/.
[18] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci., 58(1):137-147, 1999.
[19] Alberto Apostolico, Maxime Crochemore, Martin Farach-Colton, Zvi Galil, and S. Muthukrishnan. 40 years of suffix trees. Commun. ACM, 59(4):66-73, 2016.
[20] Digital Attack Map. https://www.arbornetworks.com/threats/, 2016.
[21] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues in data stream systems. In Lucian Popa, Serge Abiteboul, and Phokion G. Kolaitis, editors, Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 3-5, Madison, Wisconsin, USA, pages 1-16. ACM, 2002.
[22] Brian Babcock and Chris Olston. Distributed top-k monitoring. In Alon Y. Halevy, Zachary G. Ives, and AnHai Doan, editors, Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 9-12, 2003, pages 28-39. ACM, 2003.
[23] Chris Baker. Recent authoritative exhaustion attacks. October 2016. DNS OARC 2016, Dallas. Talk given on behalf of Dyn Inc.
[24] Nagender Bandi, Divyakant Agrawal, and Amr El Abbadi. Fast algorithms for heavy distinct hitters using associative memories. In 27th IEEE International Conference on Distributed Computing Systems (ICDCS 2007), June 25-29, 2007, Toronto, Ontario, Canada, page 6. IEEE Computer Society, 2007.
[25] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting distinct elements in a data stream. In Jose D. P. Rolim and Salil P. Vadhan, editors, Randomization and Approximation Techniques, 6th International Workshop, RANDOM 2002, Cambridge, MA, USA, September 13-15, 2002, Proceedings, volume 2483 of Lecture Notes in Computer Science, pages 1-10. Springer, 2002.
[26] Michela Becchi and Patrick Crowley. A hybrid finite automaton for practical deep packet inspection. In Jim Kurose and Henning Schulzrinne, editors, Proceedings of the 2007 ACM Conference on Emerging Network Experiment and Technology, CoNEXT 2007, New York, NY, USA, December 10-13, 2007, page 1. ACM, 2007.
[27] Ran Ben-Basat, Gil Einziger, Roy Friedman, and Yaron Kassner. Randomized admission policy for efficient top-k and frequency estimation. CoRR, abs/1612.02962, 2016.
[28] Ran Ben-Basat, Gil Einziger, Roy Friedman, and Yaron Kassner. Optimal elephant flow detection. CoRR, abs/1701.04021, 2017.
[29] Udi Ben-Porat, Anat Bremler-Barr, and Hanoch Levy. Evaluating the vulnerability of network mechanisms to sophisticated DDoS attacks. In INFOCOM, pages 2297-2305. IEEE, 2008.
[30] Pat Bosshart, Dan Daly, Glen Gibb, Martin Izzard, Nick McKeown, Jennifer Rexford, Cole Schlesinger, Dan Talayco, Amin Vahdat, George Varghese, and David Walker. P4: programming protocol-independent packet processors. Computer Communication Review, 44(3):87-95, 2014.
[31] Robert S. Boyer and J. Strother Moore. MJRTY - a fast majority vote algorithm. Technical Report, Institute of Computing Science, The University of Texas at Austin, 32, 1981.
[32] The Bro Network Security Monitor. http://bro-ids.org.
[33] N. Brownlee, C. Mills, and G. Ruth. RFC 2722, 1999. http://tools.ietf.org/html/rfc2722.
[34] Moses Charikar, Kevin C. Chen, and Martin Farach-Colton. Finding frequent items in data streams. In Peter Widmayer, Francisco Triguero Ruiz, Rafael Morales Bueno, Matthew Hennessy, Stephan Eidenbenz, and Ricardo Conejo, editors, Automata, Languages and Programming, 29th International Colloquium, ICALP 2002, Malaga, Spain, July 8-13, 2002, Proceedings, volume 2380 of Lecture Notes in Computer Science, pages 693-703. Springer, 2002.
[35] Ruiliang Chen, Jung-Min Park, and Randolph Marchany. RIM: router interface marking for IP traceback. In Proceedings of the Global Telecommunications Conference, GLOBECOM '06, San Francisco, CA, USA, 27 November - 1 December 2006. IEEE, 2006.
[36] Yizheng Chen, Manos Antonakakis, Roberto Perdisci, Yacin Nadji, David Dagon, and Wenke Lee. DNS noise: Measuring the pervasiveness of disposable domains in modern DNS traffic. In 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2014, Atlanta, GA, USA, June 23-26, 2014, pages 598-609. IEEE, 2014.
[37] Edith Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. System Sci., 55:441-453, 1997.
[38] Edith Cohen. All-distances sketches, revisited: HIP estimators for massive graphs analysis. IEEE Trans. Knowl. Data Eng., 27(9):2320-2334, 2015.
[39] Edith Cohen, Nick G. Duffield, Haim Kaplan, Carsten Lund, and Mikkel Thorup. Algorithms and estimators for summarization of unaggregated data streams. J. Comput. Syst. Sci., 80(7):1214-1244, 2014.
[40] Graham Cormode. Misra-Gries summaries. In Encyclopedia of Algorithms, pages 1334-1337. 2016.
[41] Graham Cormode and Marios Hadjieleftheriou. Finding frequent items in data streams. PVLDB, 1(2):1530-1541, 2008.
[42] Graham Cormode, Flip Korn, S. Muthukrishnan, and Divesh Srivastava. Finding hierarchical heavy hitters in data streams. In VLDB, pages 464-475, 2003.
[43] Graham Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 55(1):58-75, 2005.
[44] Andrew R. Curtis, Jeffrey C. Mogul, Jean Tourrilhes, Praveen Yalagandula, Puneet Sharma, and Sujata Banerjee. DevoFlow: scaling flow management for high-performance networks. In SIGCOMM, pages 254-265, 2011.
[45] Erik D. Demaine, Alejandro Lopez-Ortiz, and J. Ian Munro. Frequency estimation of internet packet streams with limited space. In Rolf H. Mohring and Rajeev Raman, editors, Algorithms - ESA 2002, 10th Annual European Symposium, Rome, Italy, September 17-21, 2002, Proceedings, volume 2461 of Lecture Notes in Computer Science, pages 348-360. Springer, 2002.
[46] Roland Dobbins. Mirai IoT botnet description and DDoS attack mitigation. https://www.arbornetworks.com/blog/asert/mirai-iot-botnet-description-ddos-attack-mitigation/, 2016.
[47] Marianne Durand and Philippe Flajolet. Loglog counting of large cardinalities (extended abstract). In Giuseppe Di Battista and Uri Zwick, editors, Algorithms - ESA 2003, 11th Annual European Symposium, Budapest, Hungary, September 16-19, 2003, Proceedings, volume 2832 of Lecture Notes in Computer Science, pages 605-617. Springer, 2003.
[48] Cristian Estan and George Varghese. New directions in traffic measurement and accounting. In Proceedings of the ACM SIGCOMM '02 Conference. ACM, 2002.
[49] Cristian Estan and George Varghese. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Trans. Comput. Syst., 21(3):270-313, 2003.
[50] Shir Landau Feibish, Yehuda Afek, Anat Bremler-Barr, Edith Cohen, and Michal Shagam. Mitigating DNS random subdomain DDoS attacks by distinct heavy hitters sketches. In Qun Li and Songqing Chen, editors, Proceedings of the Fifth ACM/IEEE Workshop on Hot Topics in Web Systems and Technologies, HotWeb 2017, San Jose / Silicon Valley, CA, USA, October 12-14, 2017, pages 8:1-8:6. ACM, 2017.
[50] Shir Landau Feibish, Yehuda Afek, Anat Bremler-Barr, Edith Cohen, and MichalShagam. Mitigating DNS random subdomain ddos attacks by distinct heavyhitters sketches. In Qun Li and Songqing Chen, editors, Proceedings of the fifthACM/IEEE Workshop on Hot Topics in Web Systems and Technologies, HotWeb2017, San Jose / Silicon Valley, CA, USA, October 12 - 14, 2017, pages 8:1–8:6.ACM, 2017. 1
[51] P. Ferguson and D. Senie. Rfc 2827: Network ingress filtering: Defeating de-nial of service attacks which employ ip source address spoofing, 2000. http-s://www.ietf.org/rfc/rfc2827.txt. 2.2
[52] Philippe Flajolet, Eric Fusy, Olivier Gandouet, and Frederic Meunier. Hyper-loglog: The analysis of a near-optimal cardinality estimation algorithm. In Anal-ysis of Algorithms (AOFA), 2007. 6.1.1, 6.3, 6.5.5, 6.5.5
[53] Phillip B. Gibbons and Yossi Matias. New sampling-based summary statistics for improving approximate query answers. In SIGMOD. ACM, 1998. Space-Saving Heavy Hitters, 2.1.3, 6.1.1
[54] Anna C. Gilbert, Hung Q. Ngo, Ely Porat, Atri Rudra, and Martin J. Strauss. ℓ2/ℓ2-foreach sparse recovery with low risk. In Fedor V. Fomin, Rusins Freivalds, Marta Z. Kwiatkowska, and David Peleg, editors, ICALP (1), volume 7965 of Lecture Notes in Computer Science, pages 461–472. Springer, 2013. 4.5
[55] Jesus M. Gonzalez, Mohd Anwar, and James B. D. Joshi. A trust-based approach against IP-spoofing attacks. In Ninth Annual Conference on Privacy, Security and Trust, PST 2011, 19-21 July, 2011, Montreal, Quebec, Canada, pages 63–70. IEEE, 2011. 2.2
[56] Michael Greenwald and Sanjeev Khanna. Space-efficient online computation of quantile summaries. In Sharad Mehrotra and Timos K. Sellis, editors, SIGMOD Conference, pages 58–66. ACM, 2001. 2.1.2.2
[57] Kent Griffin, Scott Schneider, Xin Hu, and Tzi-cker Chiueh. Automatic generation of string signatures for malware detection. In Engin Kirda, Somesh Jha, and Davide Balzarotti, editors, RAID, volume 5758 of Lecture Notes in Computer Science, pages 101–120. Springer, 2009. 4.5, 5.2.1
[58] Stefan Heule, Marc Nunkesser, and Alexander Hall. HyperLogLog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm. In EDBT, 2013. 1
[59] Lucas Chi Kwong Hui. Color set size problem with application to string matching. In Alberto Apostolico, Maxime Crochemore, Zvi Galil, and Udi Manber, editors, Combinatorial Pattern Matching, Third Annual Symposium, CPM 92, Tucson, Arizona, USA, April 29 - May 1, 1992, Proceedings, volume 644 of Lecture Notes in Computer Science, pages 230–243. Springer, 1992. 4.5
[60] Arbor Networks Inc. Peakflow. http://www.arbornetworks.com/products/peakflow, August 2004. 3.2.1
[61] Cloudflare Inc. How Cloudflare's architecture can scale to stop the largest attacks, 2017. https://www.cloudflare.com/media/pdf/cf-wp-dns-attacks.pdf. 7.1
[62] Infoblox Inc. Case studies: A large internet service provider. https://www.infoblox.com/resources/case-studies/large-internet-service-provider/. 7.1
[63] J. Mirkovic, S. Dietrich, D. Dittrich, and P. Reiher. Internet Denial of Service: Attack and Defense Mechanisms. Prentice Hall PTR, 2004. 2.2
[64] Lorand Jaakab and Jordi Domingo-Pascual. A selective survey of DDoS related research. Technical Report, UPC-DAC-RR-CBA-2007-3, 2007. 2.2
[65] Jeff Pollard, Joseph Blankenship, and Andras Cser. Quick take: Poor planning, not an IoT botnet, disrupted the internet: Dyn outage underscores the need to plan for failure, October 2016. Forrester Research. 1, 7.1
[66] Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In PODS, 2010. 1
[67] Jeffrey O. Kephart and William C. Arnold. Automatic extraction of computer virus signatures. In 4th International Virus Bulletin Conference, Sept. 1994. 4.5, 5.2.1
[68] Hyang-Ah Kim and Brad Karp. Autograph: Toward automated, distributed worm signature detection. In USENIX Security Symposium, pages 271–286. USENIX, 2004. 4.5, 5.2.1
[69] Tomasz Kociumaka, Tatiana A. Starikovskaya, and Hjalte Wedel Vildhøj. Sublinear space algorithms for the longest common substring problem. In Andreas S. Schulz and Dorothea Wagner, editors, Algorithms - ESA 2014 - 22nd Annual European Symposium, Wroclaw, Poland, September 8-10, 2014. Proceedings, volume 8737 of Lecture Notes in Computer Science, pages 605–617. Springer, 2014. 4.5
[70] Christian Kreibich and Jon Crowcroft. Honeycomb: creating intrusion detection signatures using honeypots. Computer Communication Review, 34(1):51–56, 2004. 4.5, 1, 5.2.1
[71] Sailesh Kumar, Sarang Dharmapurikar, Fang Yu, Patrick Crowley, and Jonathan S. Turner. Algorithms to accelerate multiple regular expressions matching for deep packet inspection. In Luigi Rizzo, Thomas E. Anderson, and Nick McKeown, editors, Proceedings of the ACM SIGCOMM 2006 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Pisa, Italy, September 11-15, 2006, pages 339–350. ACM, 2006. 2.3
[72] Heejo Lee and Kihong Park. On the effectiveness of probabilistic packet marking for IP traceback under denial of service attack. In Proceedings IEEE INFOCOM 2001, The Conference on Computer Communications, Twentieth Annual Joint Conference of the IEEE Computer and Communications Societies, Anchorage, Alaska, USA, April 22-26, 2001, pages 338–347. IEEE, 2001. 2.2
[73] Zhichun Li, Manan Sanghi, Yan Chen, Ming-Yang Kao, and Brian Chavez. Hamsa: Fast signature generation for zero-day polymorphic worms with provable attack resilience. In IEEE Symposium on Security and Privacy, pages 32–47. IEEE Computer Society, 2006. 5.2.1
[74] C. Liu. A new kind of DDoS threat: The nonsense name attack. Network World, 2015. [Online; posted 27-January-2015]. 1.1.5, 7.1
[75] Ziqian Liu. Lessons learned from May 19 China's DNS collapse, November 2009. Talk given on behalf of Dyn Inc. 7.1
[76] Thomas Locher. Finding heavy distinct hitters in data streams. In SPAA. ACM, 2011. 1.1.4, 6.1.1, 6.4, 6.7.1
[77] Matthew V. Mahoney. Network traffic anomaly detection based on packet bytes. In SAC, pages 346–350. ACM, 2003. 2.2, 5.2.2
[78] Udi Manber and Sun Wu. A fast algorithm for multi-pattern searching. Technical Report TR94-17, May 1994. 2.3
[79] Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. PVLDB, 5(12):1699, 2012. 2.1.2.2
[80] Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru M. Parulkar, Larry L. Peterson, Jennifer Rexford, Scott Shenker, and Jonathan S. Turner. OpenFlow: enabling innovation in campus networks. Computer Communication Review, 38(2):69–74, 2008. 1.1.1, 2.4, 3.3, 8.2
[81] Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. Efficient computation of frequent and top-k elements in data streams. In ICDT, pages 398–412, 2005. 1, 1.1.1, 1.1.2, 2.1.2.2, Space-Saving Heavy Hitters, 2.1.3, 3.4.1, 3.4.2.3, 3.5, 4.6.2, 5.4.3, 6.4, 7.3.2.3, 8.1
[82] Jelena Mirkovic, Gregory Prier, and Peter L. Reiher. Source-end DDoS defense. In 2nd IEEE International Symposium on Network Computing and Applications (NCA 2003), 16-18 April 2003, Cambridge, MA, USA, pages 171–178. IEEE Computer Society, 2003. 2.2
[83] Jayadev Misra and David Gries. Finding repeated elements. Sci. Comput. Program., 2(2):143–152, 1982. 1, 2.1.2.1, 2.1.2.2, 6.4
[84] J. Strother Moore. Problem 81-5. Journal of Algorithms, 2:208–209, 1981. 2.1.2
[85] Masoud Moshref, Minlan Yu, Ramesh Govindan, and Amin Vahdat. DREAM: dynamic resource allocation for software-defined measurement. In SIGCOMM, pages 419–430, 2014. 3.2.1
[86] Athicha Muthitacharoen, Benjie Chen, and David Mazieres. A low-bandwidth network file system. In SOSP, pages 174–187, 2001. 4.5
[87] A. S. Navaz, V. Sangeetha, and C. Prabhadevi. Entropy based anomaly detection system to prevent DDoS attacks in cloud. arXiv preprint arXiv:1308.6745, 2013. 2.2, 5.2.2
[88] James Newsome, Brad Karp, and Dawn Xiaodong Song. Polygraph: Automatically generating signatures for polymorphic worms. In IEEE Symposium on Security and Privacy, pages 226–241. IEEE Computer Society, 2005. 5.2.1
[89] Network functions virtualization – introductory white paper. http://portal.etsi.org/NFV/NFV_White_Paper.pdf, 2012. 8.2
[90] Hung Q. Ngo, Ely Porat, and Atri Rudra. Efficiently decodable compressed sensing by list-recoverable codes and recursion. In Christoph Durr and Thomas Wilke, editors, STACS, volume 14 of LIPIcs, pages 230–241. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2012. 4.5
[91] Latest internet plague: Random subdomain attacks. https://nominum.com/wp-content/uploads/2014/10/Nominum-Whitepaper-Latest-Internet-Plague-Random-Subdomain-Attacks.pdf, 2014. 7.1
[92] George Nychis, Vyas Sekar, David G. Andersen, Hyong Kim, and Hui Zhang. An empirical evaluation of entropy-based traffic anomaly detection. In Konstantina Papagiannaki and Zhi-Li Zhang, editors, Internet Measurement Conference, pages 151–156. ACM, 2008. 2.2, 5.2.2
[93] Hyundo Park, Peng Li, Debin Gao, Heejo Lee, and Robert H. Deng. Distinguishing between FE and DDoS using randomness check. In Tzong-Chen Wu, Chin-Laung Lei, Vincent Rijmen, and Der-Tsai Lee, editors, ISC, volume 5222 of Lecture Notes in Computer Science, pages 131–145. Springer, 2008. 1.1.3, 5.1.1
[94] Vern Paxson. Bro: a system for detecting network intruders in real-time. Computer Networks, 31(23-24):2435–2463, 1999. 1.1.3
[95] Tao Peng, Christopher Leckie, and Kotagiri Ramamohanarao. Protection from distributed denial of service attacks using history-based IP filtering. In Proceedings of IEEE International Conference on Communications, ICC 2003, Anchorage, Alaska, USA, 11-15 May, 2003, pages 482–486. IEEE, 2003. 2.2
[96] Philippe Flajolet and G. Nigel Martin. Probabilistic counting algorithms for data base applications. J. Comput. System Sci., 31:182–209, 1985. 1, 6.1.1, 6.3, 6.5.5
[97] Ely Porat and Martin J. Strauss. Sublinear time, measurement-optimal, sparse recovery for all. In Yuval Rabani, editor, SODA, pages 1215–1227. SIAM, 2012. 4.5
[98] M. Zubair Rafique and Juan Caballero. Firma: Malware clustering and network signature generation with mixed network behaviors. In Salvatore J. Stolfo, Angelos Stavrou, and Charles V. Wright, editors, RAID, volume 8145 of Lecture Notes in Computer Science, pages 144–163. Springer, 2013. 5.2.1
[99] J. Rajahalme, A. Conta, B. Carpenter, and S. Deering. RFC 3697, 2004. http://tools.ietf.org/html/rfc3697. 3.3
[100] Supranamaya Ranjan, Ram Swaminathan, Mustafa Uysal, and Edward W. Knightly. DDoS-resilient scheduling to counter application layer attacks under imperfect detection. In INFOCOM 2006, 25th IEEE International Conference on Computer Communications, Joint Conference of the IEEE Computer and Communications Societies, 23-29 April 2006, Barcelona, Catalunya, Spain. IEEE, 2006. 2.2
[101] Supranamaya Ranjan, Ram Swaminathan, Mustafa Uysal, Antonio Nucci, and Edward W. Knightly. DDoS-Shield: DDoS-resilient scheduling to counter application layer attacks. IEEE/ACM Trans. Netw., 17(1):26–39, 2009. 2.2
[102] B. Rosen. Asymptotic theory for successive sampling with varying probabilities without replacement, I. The Annals of Mathematical Statistics, 43(2):373–397, 1972. Space-Saving Heavy Hitters, 6.5.3, 6.6
[103] Secure64. Defending against DDoS attacks that target the DNS, 2017. https://secure64.com/solutions/defending-against-ddos-attacks/. 7.1
[104] Vyas Sekar, Michael K. Reiter, Walter Willinger, Hui Zhang, Ramana Rao Kompella, and David G. Andersen. cSamp: A system for network-wide flow monitoring. In USENIX NSDI, pages 233–246, 2008. 3.2.1
[105] Sajad Shirali-Shahreza and Yashar Ganjali. FleXam: flexible sampling extension for monitoring and security applications in OpenFlow. In HotSDN, pages 167–168, 2013. 3.2.1
[106] Sumeet Singh, Cristian Estan, George Varghese, and Stefan Savage. Automated worm fingerprinting. In OSDI, pages 45–60. USENIX Association, 2004. 4.5, 1, 5.2.1
[107] Mudhakar Srivatsa, Arun Iyengar, Jian Yin, and Ling Liu. Mitigating application-level denial of service attacks on web servers: A client-transparent approach. TWEB, 2(3):15:1–15:49, 2008. 2.2
[108] Brent Stephens, Alan L. Cox, Wes Felter, Colin Dixon, and John B. Carter. PAST: scalable ethernet for data centers. In Conference on emerging Networking Experiments and Technologies, CoNEXT '12, 2012. 3.1
[109] Yong Tang and Shigang Chen. Defending against internet worms: a signature-based approach. In INFOCOM, pages 1384–1394. IEEE, 2005. 5.2.1
[110] Akamai Technologies. How the Mirai botnet is fueling today's largest and most crippling DDoS attacks, 2016. https://www.akamai.com/us/en/multimedia/documents/white-paper/akamai-mirai-botnet-and-attacks-against-dns-servers-white-paper.pdf. 7.1
[111] Justin Thaler, Michael Mitzenmacher, and Thomas Steinke. Hierarchical heavy hitters with the space saving algorithm. In David A. Bader and Petra Mutzel, editors, ALENEX, pages 160–174. SIAM / Omnipress, 2012. 2.1.3, 4.5
[112] Alok Tongaonkar, Ruben Torres, Marios Iliofotou, Ram Keralapura, and Antonio Nucci. Towards self adaptive network traffic classification. Computer Communications, 56:35–46, 2015. 5.2.1
[113] Matt Torrisi. Advanced secondary DNS for the technically inclined, November 2016. Published on dyn.com. 7.1
[114] Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249–260, 1995. 4.5
[115] Niels L. M. van Adrichem, Christian Doerr, and Fernando A. Kuipers. OpenNetMon: Network monitoring in OpenFlow software-defined networks. In NOMS, pages 1–8. IEEE, 2014. 3.2.1
[116] Shobha Venkataraman, Dawn Xiaodong Song, Phillip B. Gibbons, and Avrim Blum. New streaming algorithms for fast detection of superspreaders. In Proc. Network and Distributed System Security Symposium (NDSS), 2005. 1.1.4, 6.1.1, 6.4, 6.7.2.2
[117] Verisign distributed denial of service trends report Q4 2015. https://www.verisign.com/assets/report-ddos-trends-Q42015.pdf, 2015. 7.1, 8.2
[118] Luis von Ahn, Manuel Blum, Nicholas J. Hopper, and John Langford. CAPTCHA: using hard AI problems for security. In Eli Biham, editor, Advances in Cryptology - EUROCRYPT 2003, International Conference on the Theory and Applications of Cryptographic Techniques, Warsaw, Poland, May 4-8, 2003, Proceedings, volume 2656 of Lecture Notes in Computer Science, pages 294–311. Springer, 2003. 2.2
[119] Bing Wang, Yao Zheng, Wenjing Lou, and Y. Thomas Hou. DDoS attack protection in the era of cloud computing and software-defined networking. Computer Networks, 81:308–319, 2015. 2.2
[120] Ke Wang and Salvatore J. Stolfo. Anomalous payload-based network intrusion detection. In Erland Jonsson, Alfonso Valdes, and Magnus Almgren, editors, RAID, volume 3224 of Lecture Notes in Computer Science, pages 203–222. Springer, 2004. 5.2.2
[121] Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1–11. IEEE Computer Society, 1973. 4.5
[122] Qiao Yan, F. Richard Yu, Qingxiang Gong, and Jianqiang Li. Software-defined networking (SDN) and distributed denial of service (DDoS) attacks in cloud computing environments: A survey, some research issues, and challenges. IEEE Communications Surveys and Tutorials, 18(1):602–622, 2016. 2.2
[123] Ke Yi and Qin Zhang. Optimal tracking of distributed heavy hitters and quantiles. Algorithmica, 65(1):206–223, 2013. 8.2
[124] Minlan Yu, Lavanya Jose, and Rui Miao. Software defined traffic measurement with OpenSketch. In USENIX NSDI, pages 29–42, 2013. 1, 3.1.1, 3.2.1, 3.4.3.1
[125] Ye Yu, Chen Qian, and Xin Li. Distributed and collaborative traffic monitoring in software defined networks. In HotSDN, pages 85–90, 2014. 3.2.1
[126] Ali Zand, Giovanni Vigna, Xifeng Yan, and Christopher Kruegel. Extracting probable command and control signatures for detecting botnets. In Yookun Cho, Sung Y. Shin, Sang-Wook Kim, Chih-Cheng Hung, and Jiman Hong, editors, Symposium on Applied Computing, SAC 2014, Gyeongju, Republic of Korea, March 24-28, 2014, pages 1657–1662. ACM, 2014. 5.2.1
[127] Saman Taghavi Zargar, James Joshi, and David Tipper. A survey of defense mechanisms against distributed denial of service (DDoS) flooding attacks. IEEE Communications Surveys and Tutorials, 15(4):2046–2069, 2013. 2.2, 5.2.2
[128] Qi Zhao, Zihui Ge, Jia Wang, and Jun (Jim) Xu. Robust traffic matrix estimation with imperfect information: making use of multiple data sources. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS/Performance 2006. 3.1
Tel Aviv University
The Raymond and Beverly Sackler Faculty of Exact Sciences
The Blavatnik School of Computer Science
Heavy Hitters Extensions for the Detection of Complex Network Traffic Anomalies
Thesis submitted for the degree of "Doctor of Philosophy"
by
Shir Landau Feibish
This research work was carried out under the supervision of
Prof. Yehuda Afek
Submitted to the Senate of Tel Aviv University
Av 5777 (August 2017)
Abstract
In recent years, the enormous amount of network traffic, together with new attack vectors, has raised the need for new tools that can detect and pinpoint specific phenomena in the traffic. These phenomena include new kinds of repetitions and patterns of strings and data in network traffic. In this thesis, we study some of these complex patterns and repetitions and offer fundamental techniques for identifying them, both in Software Defined Networks and in classical networks. Additionally, we examine the implications that such traffic has on network monitoring and on network security, and offer tools for dealing with these implications.
A central building block that we extend and generalize is the problem of finding frequently repeating items in data (Heavy Hitters). In this work, we study three variations of this problem. For each variation, we present new concepts and new problem definitions, along with new and efficient algorithms for solving it. First, in Chapter 3, we propose new, time-based definitions for the heavy hitters problem. Second, in Chapter 4, we study the problem of finding heavy strings of varying lengths in a stream of strings (messages). Finally, in Chapter 6, we study the problem of finding distinct heavy hitters in a stream of <key, subkey> pairs, which is the problem of finding keys that have many different subkeys.
Using these algorithms, we present three applications, each of which makes use of one of the techniques presented above. First, based on the time-based definitions, we have developed mechanisms for detecting different types of heavy flows in Software Defined Networks (Chapter 3), for the purposes of security, monitoring and management of such networks. Second, using our algorithm for finding heavy strings of varying lengths in a stream of strings, we have developed a system for detecting signatures of application-level distributed denial of service (DDoS) attacks (Chapter 5). Finally, we present a system for the mitigation of random subdomain attacks on the Domain Name System (DNS) (Chapter 7), which makes use of our algorithm for finding distinct heavy hitters, in particular in a stream of pairs. Both of these attacks, and especially the latter, have recently gained much attention due to the considerable impact they have on millions of Internet users, and our systems present new methods for mitigating these attacks.
We have evaluated our tools and demonstrate their effectiveness both analytically and empirically. To this end, we implemented our systems and algorithms using a variety of technologies and performed tests on real traffic, including samples of real attacks.
Extended Abstract
The amount of data traversing global communication networks has risen drastically over the last twenty years. As the amount of traffic grows, the enormous quantity of packets traversing the networks creates new risks to the functionality of the network. The networking community, which includes researchers, designers and operators, is in a constant struggle to enable the passage of legitimate data through the network while preventing the passage of illegitimate content.
" שיכולים big-dataסיכונים אלו יוצרים צורך מתמשך במציאת פתרונות חדשים מעולם ה"
התעבורה לבדו מציב מכשולים עשרות מיליוני פקטות בשנייה. ראשית, נפחלהתמודד עם
ות של או התפרצויות גדול eventsflashרבים בתפקוד התקין של הרשתות. אירועים כגון
מנגנונים של איזון עומסים לצורך שימור איכות אמצעות וטיפול ב תעבורה דורשים זיהוי מידי
שנית, גורמים זדוניים ברחבי העולם מבצעים אינספור התקפות מידי יום. השירות ברשת.
). exploitszero-dayתוקפים מייצרים התקפות מסוגים חדשים אשר אין לגביהן ידע מוקדם (
תוקפים עושים שימוש בקבוצות גדולות מאוד של מכונות פגועות המכונות בוטנט בנוסף, ה
)Botnet .הגורמות לעלייה מתמדת בהיקף ועוצמת ההתקפות (
In order to cope with these risks, network administrators are in effect required to find a needle in a haystack. That is, among the huge number of legitimate packets making their way through the network, even a very small percentage of anomalous packets can have a decisive impact on the network. Advanced techniques are needed that can locate these anomalous packets and handle them. Such anomalous packets can have diverse characteristics and take various forms. For example, the packets may have a strange structure: they may have an exceptionally large payload, or contain a header field that is not in common use. In other cases, a group of packets may have special characteristics; for example, many packets from different sources that are all destined for a single destination, or an anomalous number of requests to a certain site.
My thesis presents advanced techniques for the characterization and detection of some of these exceptional network phenomena that have recently been observed across the network. We provide new insights for the big data world, as well as tools and algorithms for detecting repetitions of data in network traffic for a variety of network applications. Specifically, we focus on traffic anomalies related to aspects of network security, and create mechanisms for the mitigation of various zero-day attacks, including recently observed attacks on the Domain Name System (DNS), which threaten the very core of the network's functionality [65].
ים עמו אנו מתמודדים, הינו הזיהוי של כמויות גדולות של תעבורה אחד האתגרים המרכזי
תכונה משותפת כלשהי, אשר נקראת גם זרימה גדולה או כבדה של תעבורה (ראה להש
). במובן הקלאסי, זרימה אופיינה כרצף פקטות שנשלחו ממקור אחד ליעד 3הגדרה בפרק
ו נרחבת בהרבה. כיום, זרימה הינה אחד. ברשתות כפי שאנו מכירים אותן כיום, ההגדרה הז
רצף פקטות אשר יש להן שדות כותרת זהים כלשהם. זיהוי של זרימה כבדה בתעבורה הינה
הבטחת רמת שירות אחת היכולות הבסיסיות הדרושות ברשת. זוהי יכולת מפתח לצורך
)QualityofService,זיהוי של ), תכנון נפחים ועומסים, והנדסת תעבורה יעילה. יתרה מזאת
DenialDistributedזרימה כבדה קריטית לצורך זיהוי של התקפות מניעת שירות מבוזרות (
ofService(DDoS) .ברשת (
Traditional techniques for the detection of heavy flows were based on flow measurement [4, 33], but these are not scalable [49]. Therefore, given the growing amounts of traffic, more sophisticated methods have been developed, such as those presented in [28, 49, 124]. We continue this line of research on efficient methods for detecting heavy flows, and propose various solutions that are based on the family of streaming algorithms developed for the heavy hitters problem.
The heavy hitters problem is a well-studied problem that deals with finding the popular items in a stream of items. In the classical sense of the problem, given a stream of N items, a heavy hitter is an item that appears at least θN times, for some 0 < θ < 1 [83]. Solutions such as the Space-Saving algorithm of Metwally et al. [81] or the Sample and Hold algorithm of Estan and Varghese [49] identify the heavy hitters and provide an estimate of the number of times they appeared in the stream.
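To make the classical setting concrete, the following minimal Python sketch illustrates the Space-Saving idea cited above; it is an illustrative toy, not the implementation used in this thesis:

```python
def space_saving(stream, k):
    """Space-Saving (Metwally et al. [81]): track at most k counters.
    Each tracked item's counter overestimates its true frequency by at
    most N/k, so every theta-heavy hitter with theta > 1/k is tracked."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # evict the minimum counter; the newcomer inherits its count + 1
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters

# 'a' appears 6 times in a stream of N = 10, so with k = 2 counters
# (error bound N/k = 5) it is guaranteed to remain tracked.
sketch = space_saving(['a', 'b', 'a', 'c', 'a', 'a', 'd', 'a', 'e', 'a'], k=2)
```

The key point is that the memory is fixed at k counters regardless of the stream length, at the cost of overestimated counts for evicted-and-readmitted items.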
We broaden the original scope of the problem and study the problem of finding heavy hitters in different types of traffic and in different network architectures. As we show in this thesis, detecting different types of heavy hitters requires dedicated algorithms.
First, we extend the classical definition of the problem and incorporate aspects of time locality into it. We present new definitions for the problem that take the time dimension into account. We propose methods for detecting heavy flows in Software Defined Networks (SDN). In addition, we propose algorithms for detecting different types of heavy flows in a software defined network architecture, both for a single switch and for a distributed array of switches (Chapter 3).
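One natural time-local variant of the problem restricts attention to a recent window of packets. The toy below is an exact sliding-window version (the bounded-memory approximate algorithms of Chapter 3 are the actual contribution; this sketch only illustrates the definition):

```python
from collections import Counter, deque

def windowed_heavy_hitters(stream, window, theta):
    """Exact time-local heavy hitters: after each packet, yield the set of
    flows occupying at least a theta fraction of the last `window` packets."""
    counts, recent = Counter(), deque()
    for flow in stream:
        recent.append(flow)
        counts[flow] += 1
        if len(recent) > window:
            expired = recent.popleft()
            counts[expired] -= 1
            if counts[expired] == 0:
                del counts[expired]
        yield {f for f, c in counts.items() if c >= theta * len(recent)}

# Flow 'a' dominates the start of the stream, flow 'b' its end; with a
# window of 4 packets the reported heavy flow changes accordingly.
snapshots = list(windowed_heavy_hitters(['a'] * 5 + ['b'] * 5, 4, 0.5))
```

Unlike the whole-stream definition, here an item that was heavy early on stops being reported once it falls out of the window.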
Second, while the classical heavy hitters problem deals with a stream of numbers, we study the problem of detecting heavy hitters in other forms of data. In particular, we focus on strings, and on items that are composed of a pair of data elements. We examine the notion of heavy hitters in textual data (that is, in a stream of strings) and present efficient algorithms for detecting frequent substrings of varying lengths (String Heavy Hitters) in a large stream of strings (Chapter 4). In addition, we propose new algorithms for finding distinct heavy hitters in a stream of pairs of the form <key, subkey> (Chapter 6). The approach we present for finding heavy hitters in pairs is based on algorithms for the problem of approximate distinct counting. The problem is defined as follows: given a stream of items, how many distinct items have been observed up to a certain point in the stream. Several sketch-based solutions exist for this problem, such as those presented in [25, 37, 47, 58, 66, 96].
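As a toy illustration of approximate distinct counting, the following sketch uses a k-minimum-values (bottom-k) estimator, one simple member of the family of sketches cited above (a real implementation would keep only the k smallest hashes online rather than materializing the distinct set; the function names here are ours):

```python
import hashlib

def _hash01(x):
    """Map an item to a pseudo-uniform value in [0, 1) via a stable hash."""
    digest = hashlib.sha1(str(x).encode()).digest()
    return int.from_bytes(digest[:8], 'big') / 2.0**64

def kmv_estimate(stream, k=64):
    """k-minimum-values distinct-count estimate: keep the k smallest hash
    values; if the item hashes to fewer than k distinct values, the count
    is exact, otherwise estimate (k - 1) / (k-th smallest hash)."""
    mins = sorted({_hash01(x) for x in stream})[:k]
    if len(mins) < k:
        return len(mins)
    return (k - 1) / mins[-1]

# 100000 items, but only 1000 distinct values: the estimate should be
# close to 1000 even though each value repeats 100 times.
est = kmv_estimate((i % 1000 for i in range(100000)), k=64)
```

The intuition: k uniform hash values in [0, 1) drawn from n distinct items have their k-th minimum near k/n, so inverting that minimum recovers n approximately, with relative error shrinking as k grows.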
We show the usefulness of the above algorithms in detecting anomalous sequences of packets and data patterns, and demonstrate how they assist in the detection and mitigation of various types of DDoS attacks that have been observed in recent years. Our general approach, which is depicted in Figure 1 below, consists of a two-phase process. First, peacetime traffic is analyzed in order to create a baseline of the patterns found in the traffic during peacetime. At attack time, the traffic is analyzed to detect patterns, which are then checked against the baseline in order to determine whether they are unique to the attack or are also part of the peacetime traffic. The patterns or repetitions that are unique to the attack form the signatures of the attack, and can be used for the mitigation of the attack.
Figure 1: A general overview of a system for the mitigation of distributed denial of service attacks.
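As a schematic illustration only, the two-phase process above can be sketched as follows; here the "patterns" are whole messages for simplicity, whereas the thesis systems operate on substrings and flow keys:

```python
from collections import Counter

def extract_signatures(peacetime_msgs, attack_msgs, threshold):
    """Phase 1: build a baseline of patterns seen in peacetime traffic.
    Phase 2: patterns that are frequent in attack-time traffic but absent
    from the baseline become candidate attack signatures."""
    baseline = set(peacetime_msgs)
    attack_counts = Counter(attack_msgs)
    return {pattern for pattern, count in attack_counts.items()
            if count >= threshold and pattern not in baseline}

# The attack tool's telltale trailing "\r\n" never occurs in peacetime,
# so only that variant of the request is reported as a signature.
peace = ["GET / HTTP/1.1", "GET /index HTTP/1.1"]
attack = ["GET / HTTP/1.1\r\n"] * 5 + ["GET / HTTP/1.1"] * 3
signatures = extract_signatures(peace, attack, threshold=4)
```

Checking attack-time patterns against the peacetime baseline is what keeps legitimate traffic, which also appears during the attack, out of the signature set.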
In Chapter 5 we use the algorithm for detecting frequent substrings of varying lengths in a stream of strings to build a system for the mitigation of application-level distributed denial of service attacks [11, 12]. The packets that make up these attacks usually contain a small footprint left by the tools that generate the attack packets. These footprints can be very small; for example, an extra newline (carriage return) that does not usually appear in packets of this type. Our algorithm can find these footprints within the context of the packet's content. The content of the footprint can then be used to identify subsequent attack packets, and thus stop the attack.
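An exact but memory-hungry baseline for this problem can be written directly; the thesis algorithms achieve the same goal with bounded memory, which this naive sketch does not attempt:

```python
from collections import Counter

def string_heavy_hitters(messages, min_len, max_len, theta):
    """Exact varying-length substring heavy hitters: a substring is heavy
    if it appears (counted once per message) in >= theta * N messages."""
    n = len(messages)
    counts = Counter()
    for m in messages:
        # count each substring at most once per message
        subs = {m[i:i + length]
                for length in range(min_len, max_len + 1)
                for i in range(len(m) - length + 1)}
        counts.update(subs)
    heavy = {s for s, c in counts.items() if c >= theta * n}
    # report only maximal heavy substrings: drop any heavy substring that
    # is contained in a longer heavy substring with the same support
    return {s for s in heavy
            if not any(s != t and s in t and counts[t] == counts[s]
                       for t in heavy)}

# The injected footprint "ABC" is the only maximal substring appearing
# in at least 70% of the messages.
sigs = string_heavy_hitters(["xxABCyy", "zABCz", "ABCq", "qqq"], 2, 3, 0.7)
```

The maximality filter matters: every fragment of a frequent footprint is itself frequent, so without it the output would be flooded with redundant substrings.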
In Chapter 7 we use our algorithms for finding distinct heavy hitters in pairs to build a system for the mitigation of random subdomain attacks on the Domain Name System (DNS) [50]. In these attacks, a large number of unique requests are sent to some domain, each containing a pseudo-random subdomain. Our algorithms are able to detect these requests by identifying domains (keys) that appear in the request stream together with many different subdomains (subkeys), thereby identifying the attacked domain.
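A simplified sketch of this idea keeps, per key, a bottom-k sketch of subkey hashes and reports keys whose estimated number of distinct subkeys is large. Unlike the full algorithms of Chapter 7, this toy does not bound the number of tracked keys:

```python
import hashlib

def _hash01(x):
    """Map a string to a pseudo-uniform value in [0, 1) via a stable hash."""
    digest = hashlib.sha1(x.encode()).digest()
    return int.from_bytes(digest[:8], 'big') / 2.0**64

def distinct_heavy_hitters(pairs, k=32, threshold=100):
    """Keep only the k smallest subkey hash values per key (a bottom-k
    sketch) and report keys whose estimated number of distinct subkeys
    reaches the threshold."""
    sketches = {}
    for key, subkey in pairs:
        sketch = sketches.setdefault(key, set())
        sketch.add(_hash01(subkey))
        if len(sketch) > k:
            sketch.remove(max(sketch))  # retain only the k smallest values
    report = {}
    for key, sketch in sketches.items():
        if len(sketch) < k:
            estimate = len(sketch)  # exact: fewer than k distinct subkeys
        else:
            estimate = (k - 1) / max(sketch)
        if estimate >= threshold:
            report[key] = estimate
    return report

# "victim.com" receives 1000 pseudo-random subdomains; "normal.com" is
# queried 500 times but always for the same subdomain, so only the
# attacked domain is reported.
queries = ([("victim.com", "r%d" % i) for i in range(1000)]
           + [("normal.com", "www")] * 500)
attacked = distinct_heavy_hitters(queries, k=32, threshold=100)
```

Note that plain frequency-based heavy hitters would flag "normal.com" as well; it is the distinct subkey count, not the request count, that separates the attacked domain from a merely popular one.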