The Raymond and Beverly Sackler Faculty of Exact Sciences
The Blavatnik School of Computer Science
Heavy Hitters Extensions for
Advanced Traffic Anomalies
Detection
Thesis submitted for the degree of Doctor of Philosophy
by
Shir Landau Feibish
This work was carried out under the supervision of
Prof. Yehuda Afek
Submitted to the Senate of Tel Aviv University
August 2017
Abstract
In recent years, the explosion of network traffic, together with new attack vectors, has
created the need for new tools that can detect and pinpoint specific phenomena in
traffic, such as particular repetitions and patterns. In this dissertation we study some
of these new complex data repetitions and offer fundamental techniques for identifying
them, both in Software Defined Networks and in classical network settings. Additionally,
we examine the implications of such traffic for both network monitoring and security,
and offer mechanisms for attending to them.
A principal building block that we expand and generalize is the Heavy Hitters
problem. We consider three variations of Heavy Hitters, proposing new concepts and
problem definitions for each, as well as new efficient algorithms to detect and output
them. First, in Chapter 3, we suggest new definitions for Heavy Hitters based on time
locality. Second, in Chapter 4, we study the problem of varying length string heavy
hitters in a stream of strings (messages). Finally, in Chapter 6, we explore the problem
of distinct heavy hitters in a stream of ⟨key, subkey⟩ pairs, which is the problem of
detecting a key with many different subkeys.
Using these algorithms, we provide three applications, each utilizing one of the
above new techniques. First, based on our time locality definitions, we have developed
mechanisms for the detection of different types of large flows in Software Defined Net-
works (Chapter 3), for the purpose of network security, measurement and monitoring.
Second, using our algorithms for the detection of varying length string heavy hitters
we have developed a zero-day signature extraction system for mitigation of application
level DDoS attacks (Chapter 5). Finally, we present a system for mitigation of randomized
attacks on the Domain Name System (Chapter 7), which makes use of our
algorithms for finding distinct heavy hitters. Both of these attacks, and especially the
latter, have gained much interest recently, due to the impact they have had on
millions of Internet users, and our systems offer a new approach to their mitigation.
We evaluate our tools and prove their effectiveness both analytically and empirically.
To do so, we have implemented our systems and tools using various technologies,
and have performed testing on real traffic traces, including captures of actual attacks.
Acknowledgements
First I would like to express my deepest gratitude to my advisor Prof. Yehuda Afek, for
teaching me that research itself is a science, and that every problem and solution should
be explored from all angles. And, for pushing me to do my very best while reminding
me to keep focused on the important things in life. It has truly been a pleasure.
Second, I would very much like to thank my mentor Prof. Anat Bremler-Barr. An
outstanding woman researcher in a field dominated by men, and especially in Israel, you
are my role model. Thank you for always reminding me that research is a marathon, not
a sprint, and for endless conversations of encouragement and guidance about research,
academia, reviewers, students, kids, life and so much more.
Third, I would like to thank my collaborators, with whom I have had the great
pleasure to work over these years, Prof. Edith Cohen, Dr. Liron Schiff, Moshe Sulamy
and Michal Shagam. You have brought teamwork into my research, with enlightening
conversations and great insight. I am also grateful to the wonderful members of the
DEEPNESS Lab, and especially Prof. David Hay, Dr. Yaron Koral and Dr. Yotam
Harchol for their help, guidance and friendship over the years.
Last but certainly not least, I would like to thank my family. My parents, Gadi
and Orith, who have been there at every step of the way, making sure that I find my
path and that I endure it. Standing by my side through bright days and darker ones.
My amazing kids, Nadav, Michal and Noa who make everything in life so much more
interesting, and who always put both success and failure into clear perspective. And
finally, to my husband and closest friend Nir, who has known this day would come since
we met at age 15 and has done everything in his power to help me get here.
This work was partially supported by the Chief Scientist of the Israeli Ministry of
Industry, Trade and Labor and by the Ministry of Science and Technology, Israel.
Contents
1 Introduction 1
1.1 Overview of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.1 Time locality in Heavy Hitters and Detection of Heavy Flows in
Software Defined Networks . . . . . . . . . . . . . . . . . . . . . 4
1.1.2 Heavy Hitters in Textual Data: String Heavy Hitters . . . . . . . 6
1.1.3 Zero-Day Signature Extraction for High Volume Attacks Using
String Heavy Hitters . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.4 Heavy Hitters in a Stream of Pairs: Distinct and Combined Heavy
Hitters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.5 Random Subdomain DNS Attacks Mitigation using Distinct
Heavy Hitters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Published Material . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Background 13
2.1 Frequent Items in Data Streams . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 The Data Streaming Model . . . . . . . . . . . . . . . . . . . . . 13
2.1.2 The Heavy Hitters Problem . . . . . . . . . . . . . . . . . . . . . 13
2.1.3 Related Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 DDoS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Deep Packet Inspection . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Software Defined Networks . . . . . . . . . . . . . . . . . . . . . . . . . 22
3 Time Locality in Heavy Hitters and Detection of Heavy Flows in
Software Defined Networks Match and Action Model 23
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.1 Our Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.1 Network Measurement . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Time Locality Definitions for Heavy Hitters . . . . . . . . . . . . . . . . 27
3.4 Heavy Flows Detection in SDN . . . . . . . . . . . . . . . . . . . . . . . 28
3.4.1 Towards a Solution . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4.2 The Sample&Pick Algorithm . . . . . . . . . . . . . . . . . . . . 29
3.4.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5 Interval Heavy Flow and Bulky Flow Detection . . . . . . . . . . . . . . 38
3.6 Distributed Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4 Finding Heavy Hitters in a Stream of Strings 45
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1.1 Our Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 String Heavy Hitters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.4 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.6 The Double Heavy Hitters Algorithm . . . . . . . . . . . . . . . . . . . . 50
4.6.1 Algorithm Overview . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.6.2 Algorithm Details . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.6.3 Error Rate Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 55
5 Zero-Day Signature Extraction for High Volume Attacks 57
5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.1.1 Our Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2.1 Automated Signature Extraction . . . . . . . . . . . . . . . . . . 60
5.2.2 DDoS Defense Mechanisms . . . . . . . . . . . . . . . . . . . . . 61
5.3 The Zero-Day High-Volume Attack Detection System . . . . . . . . . . 62
5.3.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3.2 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3.3 System Requirements . . . . . . . . . . . . . . . . . . . . . . . . 63
5.3.4 System Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.3.5 Identifying Common Combinations of Signatures . . . . . . . . . 67
5.4 Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4.1 Test Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4.2 System Quality Test Results . . . . . . . . . . . . . . . . . . . . 73
5.4.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.4.4 Frequency Estimation . . . . . . . . . . . . . . . . . . . . . . . . 76
5.4.5 Threshold Testing . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.4.6 Testing Frequent Signature Combinations . . . . . . . . . . . . . 77
5.4.7 Signature Examples . . . . . . . . . . . . . . . . . . . . . . . . . 78
6 Heavy Hitters in a Stream of Pairs: Distinct and Combined Heavy
Hitters 80
6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.1.1 Our Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.2.1 Problem Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.2.2 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.3 Background - Approximate Distinct Counters . . . . . . . . . . . . . . . 83
6.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.5 The Distinct Weighted Sampling Algorithms . . . . . . . . . . . . . . . 85
6.5.1 Fixed-Threshold Distinct Heavy Hitters . . . . . . . . . . . . . . 86
6.5.2 Fixed-Size Distinct Weighted Sampling . . . . . . . . . . . . . . 86
6.5.3 Analysis and Estimates . . . . . . . . . . . . . . . . . . . . . . . 87
6.5.4 Estimate Quality and Confidence Interval . . . . . . . . . . . . . 89
6.5.5 Integrated dwsHH Design . . . . . . . . . . . . . . . . . . . . . . 90
6.6 The Combined Weighted Sampling Algorithm . . . . . . . . . . . . . . . 92
6.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.7.1 Theoretical Comparison . . . . . . . . . . . . . . . . . . . . . . . 94
6.7.2 Practical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 94
7 Mitigating DNS Random Subdomain DDoS Attacks Using Distinct
Heavy Hitters 100
7.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
7.1.1 Our Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.2 Attack Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.2.1 Current Detection Techniques . . . . . . . . . . . . . . . . . . . . 104
7.3 Random Subdomain Attack Mitigation System . . . . . . . . . . . . . . 105
7.3.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.3.2 System Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.4.1 University Network Captures . . . . . . . . . . . . . . . . . . . . 115
7.4.2 ISP Attack Captures . . . . . . . . . . . . . . . . . . . . . . . . . 116
8 Discussion and Conclusion 118
8.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Bibliography 123
List of Figures
1.1 DDoS mitigation systems overview . . . . . . . . . . . . . . . . . . . . . 4
3.1 Sample&Pick overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Resource consumption and accuracy comparison . . . . . . . . . . . . . 35
3.3 Effect of varying t values . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Effect of varying T values . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5 Effect of varying v values . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.6 The modified heavy hitters data structure using counter arrays. In this
example the active counter is currently c1. . . . . . . . . . . . . . . . . . 41
3.7 Marking sampled packets in the distributed setting. . . . . . . . . . . . 44
4.1 An example of the process of creating varying length strings from con-
secutive k-grams. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2 Non-consecutive heavy hitters . . . . . . . . . . . . . . . . . . . . . . . . 56
5.1 Signatures requirement overview . . . . . . . . . . . . . . . . . . . . . . 64
5.2 The process of extracting attack content signatures. . . . . . . . . . . . 66
5.3 Extracting attack signatures with the additional minimization process. . 69
5.4 An example of different sets of signatures found in different packet types. 70
5.5 Signature frequency: algorithm estimation vs. the actual frequency . . . 77
5.6 Comparing peace-high values. . . . . . . . . . . . . . . . . . . . . . . . . 78
5.7 Testing the algorithm for minimizing the number of signatures . . . . . 79
6.1 Distinct Weighted Sampling (dWS): Modified cache size . . . . . . . . . 95
6.2 Distinct Weighted Sampling (dWS): Modified Number of Buckets . . . . 96
6.3 Distinct Weighted Sampling (dWS): 32 Buckets, 1000 Items . . . . . . . 97
6.4 Combined Weighted Sampling (cWSHH) Modified rho: accuracy . . . . 98
6.5 Combined Weighted Sampling (cWSHH) Modified rho: combined weight 98
6.6 Distinct Weighted Sampling (dWS): Modified cache size . . . . . . . . . 99
7.1 DNS Random Subdomain attack overview . . . . . . . . . . . . . . . . . 101
7.2 DNS Random Subdomain mitigation High-level approach . . . . . . . . 105
7.3 DNS Random Subdomain mitigation system overview . . . . . . . . . . 106
7.4 Hierarchy of heavy distinct domains. Bold-edged nodes are in the cover;
dashed-edge nodes do not surpass the minimum cardinality. . . . . . . . 109
7.5 Heavy Distinct Domain Hierarchy (HDDH) Extractor . . . . . . . . . . 110
7.6 Distribution of the number of distinct subdomains per domain level . . 112
7.7 Attack time signature extraction . . . . . . . . . . . . . . . . . . . . . . 113
7.8 Distinct queries for campus authoritative server per hour, over 1 day. . . 116
List of Tables
3.1 Matrix definitions for large flows . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Comparison of the heavy flow detection techniques presented in this
work; t denotes the threshold for candidate heavy hitters in Sample&Pick. 30
3.3 Illustration of switch flow table configuration. Rule priority decreases
from top to bottom. Actions: 1- increment counter; 2 - apply sampling
technique (goto sampling tables / apply group) . . . . . . . . . . . . . . 30
3.4 Resource consumption test results . . . . . . . . . . . . . . . . . . . . . 34
4.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2 Summary of the statistics of the tests performed. Note that the captures
are samples of the traffic. . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.2 Theoretic Comparison between methods . . . . . . . . . . . . . . . . . . 94
7.1 System Parameters and Notations . . . . . . . . . . . . . . . . . . . . . 108
7.2 Results on Real DNS Attack Captures . . . . . . . . . . . . . . . . . . . 117
Chapter 1
Introduction
The amount of data that traverses the global communication networks has increased
tremendously over the past twenty years. As the amount of traffic grows, the huge
amounts of packets going through the networks create new risks to network function-
ality, and the networking community is in a constant battle to keep legitimate data
flowing.
First, the sheer volume of the traffic makes it difficult to keep the networks up
and running. Flash events or large bursts of traffic need to be promptly detected and
carefully handled, using load balancing and other mechanisms in order to maintain
quality of service. Second, malicious entities around the world perform countless attacks
daily, which are becoming increasingly sophisticated. Attackers are creating new kinds
of attacks, for which no previous knowledge exists (zero-day exploits). Furthermore,
attackers are using large groups of compromised machines called botnets, causing a
continuous rise in the volume and intensity of the attacks.
These risks create an ongoing need for new big-data solutions which can handle
tens of millions of packets per second. While the vast majority of packets works its
way around the network harmlessly, even a small percentage of abnormal packets
may have a tremendous impact on the network, and in order to mitigate the risks, these
packets need to be skillfully handled.
These unusual packet phenomena take many different shapes. Packets may be
oddly formed, having, for instance, a larger-than-usual payload or a header field that
is rarely used. In other cases, a sequence or group of
packets may have some special characteristics. For example, many packets headed to a
single destination from many different sources, or unusually many requests for a single
site.
My dissertation presents advanced techniques for characterizing and identifying
some of the extraordinary network phenomena observed recently. We provide new
insight into the world of big-data, providing fundamental tools and algorithms for
identifying data repetitions in network traffic for a variety of network applications. Specifically,
we focus on traffic abnormalities that are security related, and devise mechanisms for
the mitigation of different types of zero day attacks, including recent attacks on the
Domain Name System (DNS) which threaten the very core of the Internet's functionality [65].
One of our main challenges is the identification of large amounts of traffic that
share some similar properties, often referred to as a large or heavy flow of traffic (See
definition in Chapter 3). In the classic sense, a flow has often been characterized by
packets sent from a single source to a single destination. In today’s networks, this
definition has been expanded to a sequence of packets sharing some common header
fields. Heavy flow detection in traffic is one of the fundamental capabilities required in a
network. It is a key capability in providing Quality of Service (QoS), capacity planning
and efficient traffic engineering. Furthermore, heavy flow detection is crucial for the
detection of Distributed Denial of Service (DDoS) attacks in the network.
Traditional heavy flow detection was based on flow measurements [4, 33], yet these
suffer from a significant lack of scalability [49]. Therefore, with the rising amounts
of traffic, more sophisticated techniques are being developed, such as those presented
in [28, 49, 124]. We continue the research of efficient methods for heavy flow detection
and propose various solutions, which are based on a family of streaming algorithms
devised for the Heavy Hitters problem.
The Heavy Hitters problem is the well studied problem of finding the popular items
in a data stream. In the classic definition, given a stream of N items, a heavy hitter
is an item which appears at least θN times, for some given 0 < θ < 1 [83]. Solutions
such as the Space-Saving algorithm of Metwally et al. [81] or the Sample and Hold
algorithm of Estan et al. [49] detect the heavy hitters with a certain probability and
provide an estimate of the amount of times they appeared in the stream.
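For concreteness, the counter-based approach of Space-Saving [81] can be sketched in a few lines of Python. This is a simplified reference (exact when the number of distinct items fits in the m counters), not the optimized data structures developed later in this dissertation:

```python
class SpaceSaving:
    """Sketch of the Space-Saving algorithm [81]: tracks at most m items,
    overestimating each tracked item's count by at most N/m."""

    def __init__(self, m):
        self.m = m
        self.counts = {}  # item -> (count, error)

    def update(self, item):
        if item in self.counts:
            c, e = self.counts[item]
            self.counts[item] = (c + 1, e)
        elif len(self.counts) < self.m:
            self.counts[item] = (1, 0)
        else:
            # evict the minimum-count item; the newcomer inherits its
            # count, which is an upper bound on the true frequency
            victim = min(self.counts, key=lambda k: self.counts[k][0])
            c_min, _ = self.counts.pop(victim)
            self.counts[item] = (c_min + 1, c_min)

    def heavy_hitters(self, theta, n):
        # report every tracked item whose (over)estimate reaches theta*n
        return [x for x, (c, e) in self.counts.items() if c >= theta * n]
```

For example, feeding a stream of 100 items in which 'a' appears 60 times reports 'a' as a heavy hitter for θ = 0.5.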
We broaden the traditional scope of the problem, and examine heavy hitters in
different types of traffic or network architectures. As we exhibit throughout this disser-
tation, identifying different types of heavy flows requires specially crafted algorithms.
First, we expand the classic definition of heavy hitters to introduce time locality,
providing different problem definitions. We propose methods for identifying heavy traffic
in the context of recent Software Defined Networking (SDN) technology and provide
algorithms for the detection of different types of heavy flows within a software defined
architecture, for both a single switch and a distributed setting (Chapter 3).
Furthermore, while heavy hitters algorithms have traditionally been developed for
streams of numbers, we study the problem of identifying heavy hitters in other forms
of data, namely streams of strings or pairs. We explore the concept of heavy hitters in
textual data (i.e., in a stream of strings), and present efficient algorithms for identifying
frequent substrings of varying length or String Heavy Hitters in a large stream of strings
(Chapter 4). Additionally, we propose new algorithms for finding Distinct Heavy Hitters
in a stream of 〈key, subkey〉 pairs (Chapter 6). Our approach for heavy hitters in pairs
makes use of algorithms for the approximate distinct counting problem, which is the
problem of estimating the number of distinct items seen; that is, given a stream of
items, how many unique items have been encountered up to a given point in the stream.
Various sketch-based solutions have been proposed for this problem such as [25, 37, 47,
58, 66, 96].
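To illustrate the flavor of such sketches, the following is a simplified k-minimum-values style distinct counter in Python. It is one possible sketch among those cited above, not the specific construction used in Chapter 6, and the parameter k = 64 is illustrative:

```python
import hashlib
import heapq

class KMVDistinctCounter:
    """k-minimum-values sketch: keep the k smallest hash values seen and
    estimate the number of distinct items as (k-1)/kth_smallest_hash."""

    def __init__(self, k=64):
        self.k = k
        self.heap = []     # max-heap (negated values) of the k smallest hashes
        self.seen = set()  # hash values currently held in the heap

    def _hash(self, item):
        h = hashlib.sha1(str(item).encode()).digest()
        return int.from_bytes(h[:8], 'big') / 2**64  # uniform in [0, 1)

    def add(self, item):
        v = self._hash(item)
        if v in self.seen:
            return                     # duplicate of a tracked item
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, -v)
            self.seen.add(v)
        elif v < -self.heap[0]:        # smaller than the kth smallest so far
            evicted = -heapq.heappushpop(self.heap, -v)
            self.seen.discard(evicted)
            self.seen.add(v)

    def estimate(self):
        if len(self.heap) < self.k:
            return len(self.heap)      # exact while under capacity
        return (self.k - 1) / (-self.heap[0])
```

The estimate is exact while fewer than k distinct items have been seen, and afterwards has relative error roughly 1/√k.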
We exhibit the usefulness of the above algorithms in the identification of unusual
sequences of packets and data patterns and show they are instrumental in the detection
and mitigation of new types of DDoS attacks witnessed in recent years. Our general
approach, depicted in Figure 1.1, is a two-stage process. First, peacetime traffic is
analyzed to create a baseline of patterns which are found in the traffic on a normal basis.
Second, during an attack, the traffic is analyzed to detect repetitions. The peacetime
baseline is used to identify patterns which are exhibited during the attack but are not
part of normal traffic. These repetitions form the attack signatures, which
can then be used to mitigate the attack.
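The two-stage process can be sketched as follows, using exact k-gram counts in place of the streaming algorithms developed in later chapters; the k-gram length and both thresholds are illustrative values, not the ones used in the thesis:

```python
from collections import Counter

def extract_signatures(peace_pkts, attack_pkts, k=8,
                       attack_thresh=0.5, peace_thresh=0.01):
    """Toy version of the two-stage approach: a k-gram is a signature if it
    is frequent in attack traffic but rare or absent in peacetime traffic."""
    def kgrams(pkts):
        c = Counter()
        for p in pkts:
            # count each k-gram once per packet (presence, not multiplicity)
            c.update({p[i:i + k] for i in range(len(p) - k + 1)})
        return c

    peace, attack = kgrams(peace_pkts), kgrams(attack_pkts)
    return {g for g, n in attack.items()
            if n >= attack_thresh * len(attack_pkts)
            and peace.get(g, 0) < peace_thresh * max(len(peace_pkts), 1)}
```

Given attack packets that all carry a tool footprint absent from peacetime traffic, the footprint is reported while common benign content (shared by both samples) is filtered out by the baseline.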
In Chapter 5, we use our algorithm for identification of String Heavy Hitters to build
a system for the mitigation of application level DDoS attacks [11, 12]. The packets which
comprise these attacks often contain a small footprint caused by the attack generation
tools. This footprint can be as small as an extra carriage return (newline) not normally
found in such packets. Our algorithms are able to find this footprint within the context
of the attack packets to allow the detection of consequent attack packets and therefore
mitigate the attack.
[Figure 1.1 depicts the two-stage flow: peacetime traffic is used to generate a whitelist/baseline; repetitions are found in attack-time traffic; only signatures found in the attack and not in peacetime are kept, and these are passed on to a NIDS, firewall, mitigation unit, etc.]
Figure 1.1: DDoS mitigation systems overview
In Chapter 7, we utilize our distinct heavy hitters algorithms for finding heavy
hitters in pairs to build a system for the mitigation of randomized attacks on the Domain
Name System [50]. In these attacks many unique requests containing pseudo-random
subdomains are sent for a specific domain. Our algorithms detect pseudo-randomly
generated traffic by identifying keys which appear in the stream with a large number
of different subkeys, and can therefore be used to detect such attacks.
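As a non-streaming point of reference for this idea, the sketch below reports keys whose number of distinct subkeys is at least a θ fraction of all distinct pairs; the precise problem definitions and the space-efficient algorithms appear in Chapter 6:

```python
def distinct_heavy_keys(pairs, theta):
    """Exact reference for the distinct-heavy-hitter idea: report a key
    when its number of *distinct* subkeys reaches a theta fraction of the
    total number of distinct (key, subkey) pairs (simplified threshold)."""
    subkeys = {}
    for key, sub in pairs:
        subkeys.setdefault(key, set()).add(sub)
    total = sum(len(s) for s in subkeys.values())  # distinct pairs overall
    return {k for k, s in subkeys.items() if len(s) >= theta * total}
```

In the DNS setting, a domain queried with a thousand pseudo-random subdomains stands out against a domain queried repeatedly with the same subdomain, regardless of raw query volume.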
1.1 Overview of Results
1.1.1 Time locality in Heavy Hitters and Detection of Heavy Flows
in Software Defined Networks
Software Defined Networks (SDN) have emerged in recent years as a framework for
creating configurable networks with improved network management abilities. While
SDN is not limited to OpenFlow [80], OpenFlow is currently the de facto SDN standard
in both industry and academia.
OpenFlow is based on the notion of match-action rules. The OpenFlow switch
maintains rule tables which are mostly TCAM based, called flow tables. Each rule in
these tables contains a match and an action. If a packet matches a certain rule, the
rule’s action will be applied to it. An action can modify parts of the packet, specify a
forwarding port, assign the packet to a group, etc. The controller, which manages the
switch flow tables, defines and installs these flow table rules.
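The match-action abstraction can be illustrated with the minimal sketch below; the field names and the install/process API are invented for this illustration and do not correspond to the OpenFlow specification:

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Rule:
    priority: int
    match: Dict[str, str]              # header field -> required value
    action: Callable[[dict], None]     # applied on a matching packet

class FlowTable:
    """Minimal illustration of priority-ordered match-action lookup."""

    def __init__(self):
        self.rules = []

    def install(self, rule):
        self.rules.append(rule)
        self.rules.sort(key=lambda r: -r.priority)  # highest priority first

    def process(self, packet):
        for rule in self.rules:
            # a rule matches when all of its fields agree with the packet
            if all(packet.get(f) == v for f, v in rule.match.items()):
                rule.action(packet)
                return rule
        return None  # table miss: a real switch would consult the controller
```

An empty match acts as a wildcard, so a low-priority catch-all rule plays the role of the default (table-miss) entry.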
While SDN switches are very efficient and considerably simpler to manage than
existing routers and switches, they do not offer direct means for sampling and detection
of large flows, which are both important for various basic network applications. These
applications include: traffic monitoring and QoS, security (DDoS detection and other
high volume attacks), anomaly detection, Deep Packet Inspection (DPI) and billing.
Naive use of OpenFlow for sampling and traffic measurements results in excessive
use of two important resources: the number of flow entries in the flow-tables, and the
amount of traffic between the switch and the controller and/or other monitoring devices.
Placing a flow-table entry for every flow does not scale and may even be infeasible,
and therefore more efficient solutions are required.
In Chapter 3 we propose Sample&Pick, an efficient algorithm which detects large
or heavy flows going through an SDN switch. The Sample&Pick algorithm divides the
detection and monitoring work between the switch and the controller, coordinating
between them to efficiently identify large flows. Our constructions are based on the
paradigm of the Sample and Hold [49] algorithm along with other classic heavy hitters
algorithms such as the Space-Saving algorithm [81]. In our solution, sampled packets at
the switch are sent to the controller where the suspected heavy flows are detected. For
each suspected heavy flow, an exact count or hold is placed in the switch. Subsequent
packets of this flow are not sampled, and therefore sampled packets from this flow are no
longer sent to the controller. Counters accumulated in the switch are periodically sent to
the controller, which integrates these counters into the general heavy hitters structure.
In this manner, our algorithm minimizes both the switch-controller communication
and the number of entries in the switch flow table.
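This division of labor can be sketched as a single-process simulation; the sampling probability and candidate threshold t below are illustrative, and the real algorithm splits this state between the switch and the controller:

```python
import random
from collections import Counter

class SamplePickSketch:
    """Schematic Sample&Pick simulation: sampled packets reach the
    'controller'; once a flow accumulates t samples, an exact counter
    ('hold') is installed at the 'switch' and the flow stops being sampled."""

    def __init__(self, sample_prob=0.01, t=5, seed=1):
        self.rng = random.Random(seed)
        self.sample_prob = sample_prob
        self.t = t
        self.controller_samples = Counter()  # controller: per-flow sample counts
        self.switch_holds = Counter()        # switch: exact counters for suspects

    def packet(self, flow_id):
        if flow_id in self.switch_holds:
            self.switch_holds[flow_id] += 1  # held flows counted exactly
        elif self.rng.random() < self.sample_prob:
            self.controller_samples[flow_id] += 1
            if self.controller_samples[flow_id] >= self.t:
                # controller suspects a heavy flow: install a hold
                self.switch_holds[flow_id] = 0

    def heavy_flows(self, min_count):
        return {f for f, c in self.switch_holds.items() if c >= min_count}
```

A flow of 100,000 packets is sampled a few hundred times before a hold is installed, after which it is counted exactly, while small flows rarely trigger a hold and never accumulate a large exact count.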
Based on different parameters, we differentiate between long lasting and short lived
flows and accordingly define heavy flows, elephant flows and bulky flows and present
innovative algorithms to detect the different types of flows in an SDN switch. Addition-
ally, we consider a distributed model with multiple switches and propose initial methods
for scaling out our techniques, to support both sampling and large flow detection in
the distributed setting.
Our methods rely on standard and optional features of OpenFlow 1.3 and can also
be implemented in the P4 language. Additionally, the techniques presented are efficient
both in terms of flow-table size and switch-controller communication.
We evaluate the performance of our Sample&Pick algorithm by measuring its inaccuracy
rates and resource consumption. Our evaluations demonstrate that our algorithm
is able to identify the heavy hitters while providing a good trade-off between the
amount of switch-controller communication and the amount of space required in the
switch.
1.1.2 Heavy Hitters in Textual Data: String Heavy Hitters
In Chapter 4, we consider the problem of finding popular substrings in a stream of
strings. That is, given a stream of strings of different lengths, we would like to find
substrings that appear in some fraction of the strings in the stream.
We define the String Heavy Hitters problem. The input is a sequence S =
⟨S1, . . . , SN⟩ of N strings, a constant k > 0 and θ s.t. 0 < θ < 1. The strings in S
may be of different lengths. A string s of length at least k is referred to as a string
heavy hitter when it is a substring of at least θN strings in S.
Define the weight b_s of a string s as b_s = Σ_{y=1..N} (1 if s ⊆ S_y, else 0), that is,
the number of strings in S of which s is a substring. Given this definition, a string s
is a string heavy hitter if b_s ≥ θN.
We present the Double Heavy Hitters algorithm for efficiently solving the String
Heavy Hitters problem. This algorithm finds popular strings of variable length in a set
of messages, using classic algorithms for heavy hitters detection (e.g. the Space-Saving
algorithm [81]) as a building block. Our algorithm uses a construction of two separate
instances of the classic heavy hitters algorithm. The first instance is used to identify
popular strings of a fixed length k and the second is used to identify popular strings
of varying length. This algorithm runs in a single pass over the input and its space
depends only on the predefined heavy hitters threshold, as in [81].
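The two-instance construction can be sketched as follows, with exact Counters standing in for the two Space-Saving instances; the rule of stitching maximal runs of consecutive heavy k-grams is a plausible simplification of the algorithm detailed in Chapter 4:

```python
from collections import Counter

def double_heavy_hitters(messages, k, theta1, theta2):
    """Sketch of the two-instance construction: instance 1 finds heavy
    k-grams; instance 2 counts maximal runs of consecutive heavy k-grams,
    stitched into variable-length strings."""
    # Instance 1: frequency of every k-gram (exact stand-in for Space-Saving)
    hh1 = Counter()
    for m in messages:
        hh1.update(m[i:i + k] for i in range(len(m) - k + 1))
    heavy_k = {g for g, c in hh1.items() if c >= theta1 * len(messages)}

    # Instance 2: stitch consecutive heavy k-grams into one candidate string
    hh2 = Counter()
    for m in messages:
        run_start, i = None, 0
        while i <= len(m) - k:
            if m[i:i + k] in heavy_k:
                if run_start is None:
                    run_start = i
            elif run_start is not None:
                hh2[m[run_start:i + k - 1]] += 1  # close the maximal run
                run_start = None
            i += 1
        if run_start is not None:
            hh2[m[run_start:]] += 1               # run reached message end
    return {s for s, c in hh2.items() if c >= theta2 * len(messages)}
```

For messages that share the embedded substring "HELLOWORLD" inside otherwise unique content, the stitched run recovers the full variable-length string even though only 4-grams are counted by the first instance.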
1.1.3 Zero-Day Signature Extraction for High Volume Attacks Using
String Heavy Hitters
Content signatures are a widely used tool in computer networks. Attack signatures
are one or more precise strings or regular expressions that are common to packets in
an attack. They are usually generated a priori and then kept in large databases in
order to identify the attack in future traffic. Traditional intrusion detection systems
(IDS) such as Snort [9] and Bro [94] maintain a large database of signatures of attacks
and malware. Traffic that goes through the IDS is compared to the known signatures
and traffic containing one or more signatures is dropped, thus preventing attacks that
are similar to previous ones. This mechanism is very effective in identifying future
recurrences of past attacks. However, in order to prevent yet-unknown attacks, new
signatures must be created and inserted into the database.
Two basic techniques are traditionally used to identify DDoS attacks: flow authen-
tication based on challenge response and flow behavioral analysis based on statistics
and various machine learning methods. Recent attacks, with millions of zombies
generating seemingly legitimate flows, fly under both radars. In these types of attacks,
behavioral analysis fails to detect the malicious traffic, as each zombie generates
little traffic, which in itself may appear benign. Furthermore, the huge
amount of attack sources makes it infeasible to stop the attack at the source. This
therefore leaves a loophole in the defense mechanisms and creates the demand for zero
day DDoS attack signature extraction.
Identifying signatures for unknown DDoS attacks is extremely difficult due to the
seemingly legitimate content found in the packets which comprise the attack. Most
traditional signatures are based on the malicious code that is expected in the attack
packets, which may not be the case with DDoS attacks. Leading industry experts
confirm that the signatures found in recent zero-day application-level DDoS attacks
are usually a by-product of the attack tools which the attackers use. These tools often
leave some footprint caused unintentionally by the program, such as a short string or
some (protocol-complying) anomaly in the packet content structure. Extracting such
signatures allows fine grained identification of attack packets during an attack with
minimal false positives or negatives.
These subtle signatures are not identified by the current automated defense mech-
anisms, but rather by a manual process which may take hours or days. Clearly, in
order to stop such unknown attacks while they are occurring, such signatures must be
extracted quickly and automatically.
In Chapter 5 we present an innovative system for automatic extraction of signatures
for high volume attacks, using a single pass over the input, and space dependent only
on the predetermined size of the data structures used by the heavy hitters detection
algorithm. This system is based on our Double Heavy Hitters algorithm.
Our system takes as input two streams (or stream samples) of traffic collected
during an attack and during peacetime. A peacetime traffic sample may be collected
as a routine scheduled procedure. The attack traffic sample can be collected once the
attack has been detected. We note that for DDoS attacks there are existing mechanisms
(for example in [93]) for identifying when an attack has started and for differentiating
between Flash events and DDoS attacks. The system then analyzes both traffic samples
to identify content that is frequent in the attack traffic sample yet appears rarely or
not at all in the peacetime traffic (as illustrated in Figure 5.1).
Our system makes no assumptions on traffic characteristics such as client behaviour,
address dispersion, URL statistics and so forth. Therefore, it is generic in that it can
be easily adapted to solving other network problems with similar characteristics.
We test our system on traffic from real attacks that have occurred recently. We show
that our solution performs well in real life, with an average recall of 99.95% and
an average precision of 98%.
In [12] we improve this solution by minimizing the number of signatures required
to identify malicious packets.
1.1.4 Heavy Hitters in a Stream of Pairs: Distinct and Combined
Heavy Hitters
Formally, our input is modeled as a stream of elements, where each element has a
primary key x from a domain X and a subkey y from a domain Dx. For each key, the
(classic) weight hx is the number of elements with key x, the distinct weight wx is the
number of different subkeys in elements with key x, and, for a parameter ρ ≥ 1, the
combined weight is b(ρ)x ≡ ρ·hx + wx. Combined weights are interesting as they can be a
more accurate measure of the load due to key x than either hx or wx in isolation: all
hx requests are processed, but the wx distinct ones are costlier.
A key x with weight that is at least an ε fraction of the (respective) total is referred
to as a heavy hitter: when hx ≥ ε·∑y hy, x is a (classic) heavy hitter (HH); when
wx ≥ ε·∑y wy, x is a distinct heavy hitter (dHH) or superspreader [116]; and when
b(ρ)x ≥ ε·∑y b(ρ)y, x is a combined heavy hitter (cHH).
The Distinct Heavy Hitters problem (also known as the Superspreaders problem
[116]) was formulated by Venkataraman et al. [116] and studied further [24, 76] but
existing algorithms do not match those of classic heavy hitters detection in performance
and practicality.
Our algorithms, presented in Chapter 6, are novel and efficient sampling-based
structures for dHH and cHH detection which are able to track only O(1/ε) keys. Our
dHH design significantly improves over existing work. We demonstrate, via experimen-
tal evaluations, the effectiveness of each of our algorithms.
1.1.5 Random Subdomain DNS Attacks Mitigation using Distinct
Heavy Hitters
The Domain Name System (DNS) service is one of the core services in internet func-
tionality. Attacks on the DNS service typically consist of many queries coming from a
large botnet. These queries are sent to the root name server or an authoritative name
server along the domain chain. The targeted name server will receive a high volume of
requests, which can degrade its performance or disable it completely. Such attacks may
also contain spoofed source addresses which would cause a reflection of the attack or
may send requests that generate large responses (such as an ANY request) to use the
DNS for amplification.
In randomized attacks on the DNS service, queries for many different non-existent
subdomains (subkeys) of the same primary domain (key) are issued [74]. Since the re-
sult of a query to a new subdomain is not cached at the DNS resolver, these queries are
propagated to the authoritative server for the domain, overloading both these servers
and the resolvers of the internet service provider (ISP). Such attacks have recently
become increasingly common and pose a challenge to Internet service providers.
In the described attacks, and other anomalies such as flash crowds, the impacted
keys are characterized by a large number of requests, number of distinct subkeys, or a
combination of the two. Fast and automated detection and mitigation of such attacks
or anomalies is important for maintaining robustness of the network service. Efficient
detection requires streaming algorithms which maintain a state (memory) that is much
smaller than the number of distinct keys and/or subkeys, which can grow rapidly during
an attack.
In Chapter 7, we design a system to detect Random Subdomain attacks on the
DNS service. The design makes use of our structures for dHH and cHH detection to
identify domains that have many different subdomains in DNS queries for that domain.
Our system generates a baseline during times of normal traffic load which allows it to
differentiate between domains that have many different subdomains on a regular basis
and those that are possibly being attacked.
We demonstrate the effectiveness of our DNS Random Subdomain attack detection
system as an application-specific tool, which we test on actual attack traces captured
by a large ISP, as well as attack traces captured at our university. Our evaluation shows
that an attack may be identified by our system with high accuracy after processing only
a small number of attack packets.
1.2 Methods
For our research we have used several types of methodologies from a variety of disci-
plines.
Algorithms design: One of our main focuses has been algorithm design, using tools
from various fields. Our algorithms are based on methods from the field of Stream-
ing and big data such as techniques for identifying frequent items, distinct counters
and sampling techniques. Additionally, all of our algorithms are designed to function
in communication networks and are therefore designed with consideration for the
different limitations imposed by these networks, such as traffic volume and speed, and
the resources available in them such as memory and computational capability. Our
algorithms for the String Heavy Hitters problem (Chapter 4) and Randomized Subdo-
main Attacks (Chapter 7) are inspired by techniques from the field of Stringology, and
our mechanisms for identifying heavy hitters in software defined networks (Chapter 3)
make use of tools offered by Programmable data planes, as well as ideas from the field
of Distributed Computing.
Algorithms analysis: We have analyzed our algorithms theoretically in terms of
correctness and quality, using standard measures such as false positive rate, false neg-
ative rate, recall and precision. Define the universe of items as U , the set of items that
need to be in the output of our system or algorithm as S (hence the set of items that
do not need to be in the output is U \ S), and the actual set of items output by our
system or algorithm as S′.
We define the following measures:
1. False positive rate: an item j is a false positive if j ∈ S′ and j /∈ S. The false
positive rate is defined as |{j : j is a false positive}| / |U \ S|.
2. False negative rate: an item j is a false negative if j ∈ S and j /∈ S′. The false
negative rate is defined as |{j : j is a false negative}| / |S|.
3. Recall: Defined as |S′ ∩ S| / |S|. Intuitively, it measures how many of the relevant items
have been selected.
4. Precision: Defined as |S′ ∩ S| / |S′|. Intuitively, it measures how many of the selected items
are relevant.
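For concreteness, a small Python example computing the four measures on hypothetical sets U, S, and S′ (called S_out below; the set contents are illustrative only):

```python
# Toy illustration of the four measures; U, S, S_out are hypothetical sets.
U = set(range(10))              # universe of items
S = {0, 1, 2, 3}                # items that should be output
S_out = {0, 1, 2, 8, 9}         # items actually output

false_pos = S_out - S           # output but should not be
false_neg = S - S_out           # should be output but missed

fpr = len(false_pos) / len(U - S)        # false positive rate
fnr = len(false_neg) / len(S)            # false negative rate
recall = len(S_out & S) / len(S)
precision = len(S_out & S) / len(S_out)
```

In this toy case the false positive rate is 2/6, the false negative rate 1/4, recall 3/4, and precision 3/5.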
Our analysis has been done on worst-case settings as well as an analysis of expected
behavior using reasonable assumptions on the traffic. Performance has been analyzed
using complexity measures of time and space as well as the amount of traffic overhead
generated by our systems.
Implementation and evaluation: We have put great emphasis on testing the qual-
ity and performance of our algorithms using experimentation and simulations. We have
implemented our algorithms using various programming languages including: C, C++
and Python. An implementation of our system for automatic signature extraction can
be found in [13].
We have used many different technologies for our implementations and simulations
including: Wireshark, WinPcap, Python 2.7, Python 3.3, Microsoft Visual Studio and
a large number of open source libraries.
We have performed testing on both simulated attacks and on real ones, using both
real and synthetic traffic. Our data sources include:
1. CAIDA traces [1–3].
2. Traces captured on campus at Tel-Aviv University.
3. Traces captured on campus at the Interdisciplinary Center, Herzeliya.
4. Traces captured locally using tools such as Wireshark.
5. Traces from various companies in the industry including top security companies
and a large ISP.
6. Traces containing DDoS attacks captured by UCLA’s DWARD lab [7].
1.3 Published Material
This thesis is based on the following published works:
• Yehuda Afek, Anat Bremler-Barr, and Shir Landau Feibish. Automated Signature
Extraction For High Volume Attacks. In Symposium on Architecture for Network-
ing and Communications Systems, ANCS ’13, San Jose, CA, USA, October 21-22,
2013, pages 147-156. IEEE Computer Society, 2013.
• Yehuda Afek, Anat Bremler-Barr, and Shir Landau Feibish. Zero-Day Signature
Extraction For High Volume Attacks. In IEEE/ACM Transactions on Networking
(TON), Submitted.
• Yehuda Afek, Anat Bremler-Barr, Edith Cohen, Shir Landau Feibish, and Michal
Shagam. Mitigating DNS random subdomain DDoS attacks by
distinct heavy hitters sketches. In Proceedings of the fifth ACM/IEEE Workshop
on Hot Topics in Web Systems and Technologies, HotWeb 2017, San Jose, CA,
USA, October 12 - 14, 2017, pages 8:1-8:6. ACM/IEEE Computer Society, 2017.
• Yehuda Afek, Anat Bremler-Barr, Edith Cohen, Shir Landau Feibish, Michal
Shagam: Efficient Distinct Heavy Hitters for DNS DDoS Attack Detection. In
CoRR abs/1612.02636, 2016.
• Yehuda Afek, Anat Bremler-Barr, Shir Landau Feibish, and Liron Schiff. Sampling
and large flow detection in SDN. In Proceedings of the 2015 ACM Conference
on Special Interest Group on Data Communication, SIGCOMM 2015, London,
United Kingdom, August 17-21, 2015, pages 345-346. ACM, 2015.
• Yehuda Afek, Anat Bremler-Barr, Shir Landau Feibish, and Liron Schiff. Detect-
ing Heavy Flows in the SDN Match and Action Model. In CoRR abs/1702.08037,
2017.
• Yehuda Afek, Anat Bremler-Barr, Shir Landau Feibish, and Liron Schiff. Detect-
ing Heavy Flows in the SDN Match and Action Model. In Computer Networks
Journal (ComNet): Special Issue on Security and Performance of Software-defined
Networks and Functions Virtualization, Submitted.
Chapter 2
Background
2.1 Frequent Items in Data Streams
2.1.1 The Data Streaming Model
Network traffic is often referred to as a data stream. The data streaming model, as
defined in [21], is a model in which the input data is not available for access locally,
but rather arrives online as a continuous data stream. The order of the elements or
the size of the stream is not known a-priori and can not be determined by the system.
Furthermore, the data usually can not be stored in its entirety, and therefore once the system
has completed processing an element it is discarded, such that often only a single pass
over the input is possible.
2.1.2 The Heavy Hitters Problem
The problem of finding the frequent items in a stream of data evolved from the problem
of finding the majority value in a stream, which was first introduced by Moore and
Boyer [31, 84].
2.1.2.1 The Majority Problem
The Majority problem is defined as follows: given a sequence of N values from universe
U , using a constant amount of space, and in one pass over the values decide if a single
value appears more than N/2 times.
Formally, given a sequence of N values, the algorithm should output:
• If ∃j : fj > N/2: output j
• Else: output null
In [83] Misra and Gries presented a novel algorithm for this problem. Their algorithm,
which is detailed in Procedure Misra and Gries Majority, maintains a single
counter initialized to zero and a VAL variable. Upon the first value that is seen, VAL
is set to that value and the counter is increased by 1. For each value in the stream, if
the value is equal to VAL, the counter is incremented. Otherwise, if the counter equals
zero, VAL is set to be the new value and the counter is incremented. Otherwise, the
counter is decremented.
Procedure Misra and Gries Majority
Data: 〈α1, ..., αN〉
Result: the majority value, if it exists
VAL = NULL; count = 0
VAL = α1; count = 1
for i = 2 → N do
    if count == 0 then VAL = αi; count = 1
    else if VAL == αi then count++
    else count−−
if count > 0 then return VAL
else return NULL
If after passing through all of the values the counter is greater than zero, then the
value of VAL is the only candidate for being the majority. To verify that it is indeed
the majority, a second pass is needed to count the number of times it appears.
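The two passes can be sketched in Python (an illustrative reimplementation, not the thesis code):

```python
def majority(values):
    # Pass 1 (Misra-Gries): maintain a single candidate and a counter.
    val, count = None, 0
    for v in values:
        if count == 0:
            val, count = v, 1
        elif val == v:
            count += 1
        else:
            count -= 1
    # Pass 2: verify that the surviving candidate really is a majority.
    if val is not None and sum(1 for v in values if v == val) > len(values) / 2:
        return val
    return None
```

For example, majority([1, 2, 1, 1, 3, 1, 1]) returns 1, while majority([1, 2, 3]) returns None since no value appears more than N/2 times.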
2.1.2.2 Heavy Hitters Algorithms
We present the following definitions which form the basis of our discussion.
Definition 1. Item frequency: Given a sequence 〈α1, ..., αN〉 of N items from universe
U , the frequency of item x is denoted fx = |{j : αj = x}|.
Definition 2. Heavy Hitter: Given a sequence of N values α = 〈α1, .....αN 〉 from
universe U and a threshold 0 ≤ θ ≤ 1, x is a heavy hitter if fx > θN .
Finding all of the heavy hitters in a stream with an exact solution would require
maintaining knowledge about the frequency of every item in the stream, which would
take up O(|U|) space [34, 45] and is impractical for applications in which U is sufficiently
large.
Following [40], we provide the following approximation problem definition:
Definition 3. The Heavy Hitters Problem (also known as the ε-Approximate Frequent
Items Problem): Given a sequence S of N values α = 〈α1, .....αN 〉 from universe U , a
threshold 0 ≤ θ ≤ 1 and an error value ε, find a set of items F such that for any item
x ∈ F , fx > (θ − ε)N and for any item j such that j ∈ S and j /∈ F , fj < θN .
Note that a streaming algorithm for the Heavy Hitters problem can use only a
constant amount of space and make a single pass over the input.
Heavy hitter detection in streams was widely studied and deployed. Many solutions
have been proposed for the classical Heavy Hitters problem, for example, the solutions
suggested in [18, 43, 56, 79, 81, 83]. We provide a detailed explanation of two of these
algorithms, specifically those presented in [83] and [81]. Most of the algorithms for
the Heavy Hitters problem can be categorized as counter-based, such as [81, 83] or
sketch-based. A well known example of a sketch-based algorithm is the Count-Min
algorithm of Cormode et al. which maintains a sketch of the stream. A sketch in this
sense is a data structure which at any time allows us to quickly compute the estimated
frequency of any item. A description of some of the substantial algorithms for heavy
hitters, as well as other significant results regarding the heavy hitters problem can be
found in [41].
The Misra-Gries Algorithm One of the first solutions for this problem was proposed
by Misra and Gries in [83]. Given θ such that θN = N/(k + 1), the algorithm maintains a structure T
of k items, each item consists of a VAL and a counter. The algorithm works as follows:
Upon the first stream element v, the VAL of the first item in T is set to v and its
counter is set to 1. For each subsequent element v′ in the stream:
1. If v′ is already the VAL of an item in T , its counter is incremented by 1.
2. Otherwise, that is, if v′ is not the VAL of one of the k items:
(a) If one of the items in T has a counter equal to zero, the VAL of that item is
set to be v′ and its counter is set to 1.
(b) Otherwise decrement all k counters.
After completing the pass over all of the elements, the VALs found in the k items of
T are the candidates for being the heavy hitters. The time complexity of the algorithm
is O(1) for the dictionary operations plus the cost of decrementing all of the counters.
Misra and Gries propose an amortized O(1) solution using a balanced search tree.
Demaine et al. propose a worst case O(1) solution using linked lists [45].
The intuition behind this procedure is that if an element occurs at least N/(k + 1) times
then it must get inserted at some point and there are not enough elements to erase
it completely. It should be noted that the algorithm potentially produces many false
positives. Some or even many of the candidates found in T may not be real heavy
hitters. Suppose for example, that the stream is made up of some k− 1 distinct values,
followed by a stream of a single value. The output in this case will include k − 1
candidates which appeared only one time in the stream and another candidate which
is the real heavy hitter and appeared N − k+ 1 times. To deal with this we perform an
additional procedure on the results to identify the real heavy hitters. In addition, it is
possible to maintain meta-data while performing the above algorithm to provide some
intermediate indication of who the heavy hitters actually are.
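The k-counter procedure just described can be sketched in Python (an illustrative variant that evicts counters reaching zero rather than reusing fixed slots; as noted, the survivors are only candidates and still require a verification pass):

```python
def misra_gries(stream, k):
    """Return candidate heavy hitters: any value occurring more than
    N/(k+1) times is guaranteed to be among the survivors."""
    counters = {}
    for v in stream:
        if v in counters:
            counters[v] += 1
        elif len(counters) < k:
            counters[v] = 1
        else:
            # No free slot: decrement every counter, dropping zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)
```

For example, with k = 2 and the stream [1, 1, 1, 1, 1, 2, 3, 4, 5] (N = 9, N/(k+1) = 3), the value 1 is guaranteed to survive.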
The Space Saving Algorithm In [81] an additional counter-based algorithm for
the Heavy Hitters problem, called the Space-Saving algorithm, is proposed; it is
detailed in Procedure Space-Saving Heavy Hitters.
As in the above algorithm, a structure T of k items is maintained, each item consists
of a VAL and a counter. For each element v in the stream, if v is already the VAL of
an item in T , its counter is incremented. Otherwise, the item having the lowest count
in T is replaced by v and its counter is incremented.
The error rate of this algorithm is ε = N/nv [81], meaning that each counter in the
output of the algorithm is at most ε higher than the actual number of times that the
value appeared in the stream. The algorithm requires O(1) time per stream item; it
makes only a single pass over the input, therefore running in O(N) time, and requires
constant space.
For our systems' implementations (see Sections 5.4 and 7.3.2), we chose to implement
the Space-Saving algorithm of Metwally et al. [81], since it provides quite accurate
counter estimations for values seen early in the stream [41].
Procedure Space-Saving Heavy Hitters
Data: 〈α1, ..., αN〉, constant nv << N
Result: nv heavy hitter candidates
// Maintain nv heavy hitter candidates.
Frequent[1...nv]: item = NULL and count = 0
for i = 1 → N do
    // If αi is in Frequent, increment its count.
    if ∃j s.t. Frequent[j].item == αi then Frequent[j].count++
    else
        // Find the item with the smallest count, and replace it.
        find j s.t. ∀h: Frequent[j].count ≤ Frequent[h].count
        Frequent[j].item := αi
        Frequent[j].count++
return Frequent
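The same procedure in Python (a minimal dictionary-based sketch; a production version would use the stream-summary structure of [81] for constant-time minimum lookup):

```python
def space_saving(stream, nv):
    """Space-Saving with nv counters: each reported count overestimates
    the true frequency by at most N/nv."""
    counts = {}
    for v in stream:
        if v in counts:
            counts[v] += 1
        elif len(counts) < nv:
            counts[v] = 1
        else:
            # Evict the item with the smallest count; the newcomer
            # inherits that count plus one (hence the overestimate).
            loser = min(counts, key=counts.get)
            counts[v] = counts.pop(loser) + 1
    return counts
```

For example, space_saving(["a"] * 6 + ["b", "c", "b"], 2) keeps "a" with its exact count 6, while "b" (true count 2) is reported as 3, within the N/nv error bound.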
Frequent Items using Sampling: Sample and Hold The Sample and Hold family
of streaming algorithms [39, 48, 53] consists of sampling-based solutions for frequent item
detection in streams.
We provide an overview of these algorithms. Given a stream of elements, a set
of cached elements or keys is maintained (the cached elements make up the sample).
A counter cx is maintained for each cached key which tracks the number of times it
occurred in the stream since it entered the cache. When an element with key x that is
not cached is processed, a biased coin flip is used to determine whether to add it to the
cache.
Two basic designs are the fixed threshold and fixed-size paradigms. The fixed thresh-
old design is specified for a threshold τ . The algorithm maintains a cache S of keys,
which is initially empty, and a counter cx for each cached key x. A new element with
key x is processed as follows: If x ∈ S is in the cache, the counter cx is incremented.
Otherwise, a counter cx ← 1 is initialized with probability τ . In the fixed-threshold
paradigm, the bias of the coin is specified, yet it has the disadvantage that the memory
usage (sample size) can increase.
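A minimal Python sketch of the fixed-threshold design (τ, the seed, and the streams are illustrative; the fixed-size variant would additionally evict keys and re-bias the coin as the stream is processed):

```python
import random

def sample_and_hold(stream, tau, seed=0):
    """Fixed-threshold Sample and Hold: a cached key counts every later
    occurrence; an uncached key enters the cache with probability tau."""
    rng = random.Random(seed)
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1          # already held: count deterministically
        elif rng.random() < tau:
            counters[x] = 1           # biased coin flip to start caching x
    return counters
```

With tau = 1.0 every key is cached on first sight, so the counters equal the exact frequencies; smaller tau trades accuracy for fewer cached keys.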
The fixed-size design is specified for a fixed sample (cache) size k and works by
effectively lowering the threshold τ to the value that would have resulted in k cached
keys. In this case the bias is modified as the stream is processed.
An important property of Sample and Hold is that the set of sampled keys is a
probability proportional to size without replacement (ppswor) sample of keys according
to weights hx [102].
2.1.3 Related Problems
Data stream analysis and particularly, item frequency in streams, has received much
attention in both the research community and in the industry. We mention a few
problems which offer a variety of different flavors of the heavy hitters problem. These
problems are sometimes confused with the Heavy Hitters Problem and we define them
here to disambiguate them from the problems we deal with.
The Top-k Problem Following [34], we define the problem as follows: Given a se-
quence S of N values α = 〈α1, ..., αN〉 from universe U and values k and ε, assume that
value vi occurs ni times and that n1 ≥ n2 ≥ n3 ≥ .... The Top-k Problem is to find a set
T of k values from S such that for every value vj ∈ T , nj > (1 − ε)nk. This problem has been
widely studied and we refer the interested reader to solutions proposed, for example,
in [22, 34, 53, 81].
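As an offline baseline (not a streaming solution), the exact top-k by frequency can be computed in a few lines:

```python
from collections import Counter
import heapq

def top_k(stream, k):
    # Count exact frequencies, then take the k most frequent values.
    freq = Counter(stream)
    return heapq.nlargest(k, freq.items(), key=lambda kv: kv[1])
```

For example, top_k([1, 1, 1, 2, 2, 3], 2) returns [(1, 3), (2, 2)]. The streaming solutions cited above approximate this output without counting every distinct value.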
The Item Frequency Problem Given a sequence S of N values α = 〈α1, .....αN 〉
from universe U and a value ε, the Item Frequency Problem is: given any j, return
f′j such that f′j ≤ fj ≤ f′j + εN [40]. This problem requires a different processing of
the stream than the solutions proposed for the Heavy Hitters problem, and its solutions may
be useful for different types of applications. A discussion of this problem in various
streaming models can be found in [45].
Hierarchical Heavy Hitters The last variant which we discuss offers a variation on
the type of data which is handled, which creates the need for a different definition of the
Heavy Hitters problem as well as different algorithms for its solution. The Hierarchical
Heavy Hitters problem (HHH) [42, 111] seeks to find the heavy hitters for data that
has a well defined hierarchical structure such as IP addresses. For example, IP addresses
are formed in a way that 123.*.*.* includes 123.45.*.*, therefore forming a hierarchy
of the data.
Following the definition of [42], the problem is defined as follows: Given a set S of
N items from a hierarchical domain D of height h. For a set P of prefixes from D,
define elements(P ) to be the union of items that are descendants of P in the hierarchy.
Given a threshold θ, the set of Hierarchical Heavy Hitters is defined inductively:
• Level 0: HHH0 is the set of Hierarchical Heavy Hitters at level 0. This is simply
the heavy hitters of S by Definition 3.
• Level i: HHHi is the set of Hierarchical Heavy Hitters at level i. Given a prefix p in
level i of the hierarchy, define Fp = ∑{f(e) : e ∈ elements(p) ∧ e /∈ ∪l=0..i−1 HHHl}.
Then HHHi is the set {p : Fp ≥ θN}.
The HHH of S is the set HHH0 ∪ HHH1 ∪ ... ∪ HHHh.
An elegant and efficient solution for this problem can be found in [111].
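Under one plausible reading of the inductive definition (items already covered by a lower-level HHH are excluded from higher-level sums), an exact non-streaming computation for dotted addresses can be sketched as follows; this is illustrative only, not the algorithm of [111]:

```python
from collections import Counter

def hhh(items, theta, height=4):
    """Exact HHH over a dotted hierarchy (level 0 = full address,
    level height-1 = first label)."""
    n = len(items)
    freq = Counter(items)
    covered = set()          # items already claimed by a lower level
    levels = {}
    for level in range(height):
        keep = height - level                 # leading labels kept
        agg = Counter()
        for item, f in freq.items():
            if item not in covered:
                agg[".".join(item.split(".")[:keep])] += f
        levels[level] = {p for p, f in agg.items() if f >= theta * n}
        for item in freq:
            if ".".join(item.split(".")[:keep]) in levels[level]:
                covered.add(item)
    return levels
```

For example, with five copies of "1.2.3.4" among eight items and θ = 0.5, only level 0 contains a heavy hitter; the remaining items are too spread out for any prefix to reach θN at higher levels.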
2.2 DDoS
Denial of Service (DoS) attacks have been threatening the security of Internet users and
services for approximately two decades. One third of service downtime in the Internet
is caused by Distributed Denial of Service (DDoS) attacks and over 2000 DDoS attacks
are witnessed globally each day [20].
In networks, denial of service occurs when a network entity can not reach another
node in the network or can not get a legitimate service from that node [64]. In the early
’90s, such attacks were used in online gaming and Internet relay chat communities
[63]. Since then, the Internet and the attackers have evolved significantly, and in recent
years, DDoS attacks have been posing a significant risk to the Internet and its users.
A DDoS attack occurs when the attacker uses many computers to launch the denial of
service attack. Another important difference between the two types of attacks, is that
while in DoS attacks the attacker sends a relatively small number of packets targeting
a bug in the victim’s program or application, in DDoS attacks, the attacker sends a
huge amount of traffic to a valid, seemingly unexposed victim [29].
DDoS flooding attacks can generally be classified into Network/Transport level attacks
and application level attacks [127]. Network/Transport level attacks generally
refer to attacks which consume a large portion of the bandwidth or exploit some fea-
ture or bug of a protocol to consume resources such as in a TCP SYN flood. The
TCP SYN attack [8] is a well known DDoS attack in which the attacker floods the
server with TCP/SYN packets from forged senders, causing the server to keep half-
open connections for responses that will never arrive, using up the available connection
resources.
Application level attacks refer to attempts to consume the server resources such as
CPU, ports, sockets, bandwidth, memory etc. In these attacks, the attacker floods the
network with seemingly legitimate traffic aimed at a server’s incoming link, in order
to consume as much of the server’s resources as possible. This prevents the server
from handling traffic from legitimate users, or at least significantly impairs its ability to do so. This can
be done for example, by using an army of zombies in which each zombie sends traffic
to the server, and in itself, acts as a legitimate user. More sophisticated attackers can
send more complicated requests to the server, thereby causing it to use up computation
resources as well. Attacks against the Domain Name System (DNS) service may also
contain spoofed source addresses which would cause a reflection of the attack or may
send requests that generate large responses (such as an ANY request) to use the DNS
for amplification.
There has been a great deal of work done on mitigation of different types of DDoS
attacks. Recent advances include solutions for mitigation of DDoS attacks in Software
Defined Networks or cloud environments (For example [119, 122]).
In existing DDoS defense mechanisms, the system has several layers of detection and
defense [10]. Detection layers may look for different types of anomalies and suspicious
findings in the traffic. The defense layers may include several levels of escalation. If
some traffic is considered suspicious, it can be escalated and the source of the traffic
is then presented with different types of challenges, based on the level of escalation. It
is important to understand that escalated traffic is not rejected but rather challenged.
One type of challenge is known as a CAPTCHA (Completely Automated Public Turing
test to tell Computers and Humans Apart [118]), though there are many other types
as well. During a DDoS attack, the attacked resources become unavailable to most if
not all legitimate users. The system then needs to quickly identify the attack traffic so
that it can be escalated and challenged appropriately. Using the signatures extracted
by our algorithm, traffic is filtered. Content that contains some minimal number (1 or
more) of signatures is considered malicious and the rest is considered legitimate.
DDoS attacks have been widely studied in the literature. Following [127], defense
mechanisms may be classified by location as source based, destination based, or network
based.
For network level attacks, source based solutions include traffic filtering or moni-
toring at the source’s edge routers or networks, such as in [51, 82]. However, with the
increasing use of botnets these mechanisms are becoming less useful for today’s DDoS
attacks and therefore network and destination based defenses are more effective [127].
Network based solutions include route based packet filtering (e.g. [72]) and detection of
malicious routers (e.g. [55]). Destination based solutions include packet marking (e.g.
[35]) and filtering (e.g. [95]).
Our work focuses on defenses against application level attacks. Destination based
solutions include various mechanisms which protect a server or group of servers against
potential threats. Examples of such mechanisms include [100, 101]. Hybrid based so-
lutions are also common for application level DDoS attacks, and can be a combined
mechanism in different locations, such as detection at the destination and mitigation
at the network. Examples of such solutions are traffic anomaly detection (e.g. [77, 92]),
admission control (e.g. [107]) and methods for differentiating bots from humans using
mechanisms such as CAPTCHA. Recent SDN and cloud technologies have also brought
the development of various cloud based solutions such as [87, 119].
DDoS attacks are constantly growing in both number and strength; therefore,
DDoS detection and mitigation continue to be extensively researched.
2.3 Deep Packet Inspection
Deep packet inspection (DPI) is one of the core techniques used by security tools such
as Web Application Firewalls (WAF), Network Intrusion Detection/Prevention Systems
(NIDS/IPS) (e.g. Snort [9] and Bro [32]) and others, to detect malicious traffic.
DPI refers to the process of examining the payload and the header of packets, as
they go through different components of the communication network, and indicating
when traffic may contain a malicious signature. To do so, packets are searched for
signatures of malicious traffic using pattern matching techniques.
Signatures can be either precise strings or regular expressions. DPI makes use of
algorithms which can efficiently match multiple signatures. For exact strings, signatures
are detected using classic pattern matching algorithms, usually based on deterministic
finite automata (DFA). Commonly used algorithms include Aho-Corasick [15] and
Wu-Manber [78]. Regular expression matching can be performed using deterministic
finite automata (DFA) or non-deterministic finite automata (NFA) [26, 71].
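To illustrate multi-pattern exact matching of the Aho-Corasick kind, the following Python sketch builds a goto trie with failure links and scans a text in a single pass; it is a didactic version, far from an optimized DPI engine:

```python
from collections import deque

def build_ac(patterns):
    # Trie nodes: goto transitions, failure link, and matched patterns.
    trie = [{"next": {}, "fail": 0, "out": []}]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in trie[node]["next"]:
                trie[node]["next"][ch] = len(trie)
                trie.append({"next": {}, "fail": 0, "out": []})
            node = trie[node]["next"][ch]
        trie[node]["out"].append(pat)
    # BFS to set failure links (longest proper suffix present in the trie).
    queue = deque(trie[0]["next"].values())
    while queue:
        u = queue.popleft()
        for ch, v in trie[u]["next"].items():
            queue.append(v)
            f = trie[u]["fail"]
            while f and ch not in trie[f]["next"]:
                f = trie[f]["fail"]
            trie[v]["fail"] = trie[f]["next"].get(ch, 0)
            trie[v]["out"] += trie[trie[v]["fail"]]["out"]
    return trie

def search(trie, text):
    # One pass over the text, reporting (start_index, pattern) matches.
    hits, node = [], 0
    for i, ch in enumerate(text):
        while node and ch not in trie[node]["next"]:
            node = trie[node]["fail"]
        node = trie[node]["next"].get(ch, 0)
        for pat in trie[node]["out"]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

For the classic example patterns {"he", "she", "his", "hers"}, scanning "ushers" reports "she" at offset 1 and both "he" and "hers" at offset 2, all in a single pass over the text.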
Signatures can be generated offline and then inserted into the DPI engine for match-
ing against future traffic. As explained in Chapter 5, signatures generated by our system
can be used in such a manner to detect and mitigate zero-day application level DDoS
attacks in any middlebox that contains a DPI component.
It should be noted that DPI is one of the most resource- and time-consuming
processes within different network security components. It is usually the string manipulation
or pattern matching procedures which account for much of the time and resource
demands of DPI. There is ongoing research being done to improve the efficiency of these
processes, such as [71].
2.4 Software Defined Networks
SDN has emerged in recent years as a framework for creating configurable networks
with improved network management abilities. While SDN is not limited to OpenFlow [80],
OpenFlow is currently the de-facto SDN standard both in industry and academia.
OpenFlow switches operate flow tables, mostly TCAM based, that are used to match
packet header fields, with a limited set of actions such as set field and add label.
In general, the OpenFlow protocol is based on a match-action concept: OpenFlow
switches store rules (installed by the controller) consisting of a match and an action
part. A packet matched by a certain rule will be subject to the associated action. For
example, an action can define a port to which the matched packet should be forwarded.
An action can also add or change a tag of a packet (a certain part in the packet header).
The controller, which manages the switch flow tables, defines and installs these flow
table rules and uses them to manage the traffic going through the switch.
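The match-action concept described above can be sketched as follows. The rule encoding, field names and `lookup` helper are illustrative only, not the OpenFlow wire protocol; a real switch performs this lookup in hardware.

```python
# Hypothetical sketch of OpenFlow-style match-action lookup.

def make_rule(match, action, priority):
    """A rule matches a packet if every specified field agrees.
    `match` maps field name -> required value; omitted fields are wildcards."""
    return {"match": match, "action": action, "priority": priority}

def lookup(flow_table, packet):
    """Return the action of the highest-priority matching rule, or None."""
    for rule in sorted(flow_table, key=lambda r: -r["priority"]):
        if all(packet.get(f) == v for f, v in rule["match"].items()):
            return rule["action"]
    return None

table = [
    make_rule({"dst_ip": "10.0.0.5"}, ("output", 3), priority=10),
    make_rule({}, ("output", "controller"), priority=0),  # table-miss rule
]
```

A packet to 10.0.0.5 hits the specific rule and is forwarded to port 3; any other packet falls through to the table-miss rule and is sent to the controller.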
We will assume that each switch has a group table: a forwarding table whose rules
include an ordered list of action buckets. Each action bucket contains a set of actions to
execute, and the buckets provide the ability to define multiple forwarding behaviors. Each
bucket in a fast-failover type table is associated with a parameter that determines whether
the bucket is live; a switch will always forward traffic to the first live bucket. As the
parameter that determines liveness, the programmer specifies either an output port or a
group number (to allow several groups to be chained together).
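The fast-failover selection described above can be sketched as follows. This is a toy model with illustrative names; in a real switch, liveness is tracked by the data plane and the bucket is selected per packet.

```python
# Hypothetical sketch of fast-failover bucket selection: forward to the
# first bucket whose watched port is live.

def select_bucket(buckets, live_ports):
    """buckets: ordered list of (watch_port, actions).
    Return the actions of the first live bucket, or None (drop)."""
    for watch_port, actions in buckets:
        if watch_port in live_ports:
            return actions
    return None

# Primary path via port 1, backup via port 2.
ff_group = [(1, ["output:1"]), (2, ["output:2"])]
```

When port 1 goes down, traffic shifts to the port-2 bucket without any controller involvement, which is the point of the fast-failover group type.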
Chapter 3
Time Locality in Heavy Hitters
and Detection of Heavy Flows in
Software Defined Networks
Match and Action Model
3.1 Overview
We present techniques for detecting large flows in traffic that passes through an SDN
switch running OpenFlow. While SDN switches are very efficient and considerably simpler
to manage than existing routers and switches, they do not offer direct means for the
detection of large flows.
Existing network monitoring tools for classic IP networks have been available for
over 20 years, with one of the earliest tools being Cisco NetFlow [4]. Over the years,
traffic visibility, and specifically measurement and monitoring in IP networks, has become
an increasingly difficult task due to the overwhelming amounts of traffic and
flows [128]. While existing tools may be very useful for classic networks, monitoring
in SDN networks requires new tools and technology. The SDN network architecture
places the controller at the focal point of the network. Therefore, using existing tools
would require extensive communication between the controller and the monitoring
tools, which would place significant overhead on the controller. It is therefore necessary
to provide new monitoring methods for SDN networks based on the SDN architecture.
We design ways to implement monitoring methods with the widespread OpenFlow
standard and the recent P4 [30] standard for SDN switches. OpenFlow switches provide
counters that measure the number of bytes and packets per flow entry, yet traffic
measurement remains a difficult task in SDN for two reasons. First, the hardware
constraints (usually of Ternary Content Addressable Memories (TCAMs)) limit the
number of flows which the switch can maintain and follow. Second, the switch can
process only a limited number of updates per second [108], which limits the number of
updates that the controller can make to the flow table. The algorithms provided herein
overcome these limitations by providing efficient building blocks for large flow detection
and sampling which may be used by various monitoring applications.
3.1.1 Our Contribution
First, we propose our Sample&Pick algorithm, which is an efficient method to detect
large or heavy flows going through an SDN switch. The Sample&Pick algorithm is
designed for protocols which are based on the match and action model (e.g., OpenFlow,
P4, etc.), and performs a division of labour between the switch and the controller,
coordinating between them to identify the large flows. Sample&Pick achieves very high
accuracy using a fixed number of rules in the switch while requiring little communication
between the switch and the controller.
Second, we consider a distributed model with multiple switches and propose
solutions for efficient scaling of our techniques, to support large flow detection in the
distributed setting.
Finally, we have implemented and evaluated our Sample&Pick algorithm, comparing
it with OpenSketch [124]. The sampling methods rely on standard and optional
features of OpenFlow 1.3 (or the P4 language) and are implemented with the NoviKit
(hardware) switch [5] (operated with NoviWare switching software [6]). The heavy flow
detection also relies on a standard OpenFlow controller and was evaluated as a
whole using a dedicated virtual-time simulation for both the data and control planes.
Additionally, the techniques presented are efficient in both flow-table size and
switch-controller communication.
3.2 Related Work
3.2.1 Network Measurement
Network measurement tools are a key component in creating quality networks and
are crucial for providing advanced network abilities such as QoS and security. Cisco
NetFlow [4] was one of the earliest network monitoring tools. It provided a variety
of monitoring capabilities allowing the collection of IP flow level statistics. NetFlow
provided the ability to gather information from the router about every IP flow, including
byte and packet counts, yet suffered from high processing and collection overheads. In
the variant Sampled NetFlow, sampling was used to partially decrease these overheads,
yet Sampled NetFlow provided reduced accuracy caused by the straightforward use of
sampling [49]. In [49], Estan and Varghese significantly improve the accuracy of the
sampling process by introducing the Sample and Hold algorithm, which provides better
accuracy while reducing the processing and collection overhead. The Sample and Hold
algorithm is essentially sampling with a "twist". As in regular sampling, each packet is
sampled with some probability, and if there is no entry for the packet's flow, an entry
is created. Once an entry for a flow exists, it is updated for every packet thereafter in
that flow.
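The Sample and Hold logic of [49] can be sketched as follows. This is an illustrative in-memory simulation; in a router (or, in Chapter 3, an SDN switch), the per-flow entries are hardware counters rather than a dictionary.

```python
import random

def sample_and_hold(packets, p, seed=0):
    """Sketch of Estan & Varghese's Sample and Hold [49]: sample each
    packet with probability p; once a flow has an entry, every subsequent
    packet of that flow is counted exactly.
    `packets` is an iterable of flow identifiers (e.g. 5-tuples)."""
    rng = random.Random(seed)   # seeded for reproducibility
    counters = {}
    for flow in packets:
        if flow in counters:
            counters[flow] += 1          # held flow: counted exactly
        elif rng.random() < p:
            counters[flow] = 1           # sampled: create an entry
    return counters
```

A heavy flow is sampled early with high probability and is then counted almost exactly, which is why Sample and Hold is far more accurate than plain sampling for the same sampling rate.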
In a usual setup, monitoring devices are placed in central locations in the network
(such as Arbor’s Peekflow [60], or other security detection devices) and samples of traffic
are being sent to the monitoring devices for various additional processing for which the
switch/router are not suitable, such as heavy hitters analysis, DPI, and behavioral
analysis. These monitoring devices usually cannot absorb and process all the traffic.
Therefore, traffic must be sampled, and only the samples or relevant flows should be
forwarded to these devices.
As networks evolved, network monitoring tools with more advanced capabilities
were developed. In [104], for example, a flow monitoring tool was presented, which
adds flow sampling abilities as an inherent capability of the routers. They provide a
framework for distributing the monitoring across routers, allowing for network-wide
monitoring. By using uniform hash functions, flow sampling is not duplicated across
different routers which route the same flow.
In OpenFlow, the flow table allows us to define rules which support counting of
bytes and packets per flow. However, this is not sufficient for more advanced
measurements. Recently there have been several works that discuss or suggest
enhancements to network measurement capabilities for both OpenFlow and for SDN in
general. FleXam, a sampling infrastructure for OpenFlow proposed in [105], adds
sampling capabilities using random number generation. OpenSketch [124] provides a
simple approach to collect and use measurement data, separating the measurement data
plane from the control plane. The paper suggests a new architecture, where in the data
plane a pipeline of three essential building blocks is provided: hashing, filtering and
counting, and in the control plane a wide library of measurement tasks is provided. The
above works suggest an alternative to the OpenFlow architecture, while our work relies
on features that already appear in the current OpenFlow standard as required or optional
features, in addition to common extensions such as matching on an extra field in
the packet. These extensions follow the concepts described in [44], which suggests that
the OpenFlow standard should allow the user to configure the headers that the switch
can examine. All our modifications are in the spirit of the OpenFlow architecture. We
note that there are works that do not require changes to the OpenFlow standard. For
instance, OpenNetMon, described in [115], is a controller module for monitoring
flow-level metrics, such as packet loss, delay and throughput in OpenFlow networks.
A recent work [125], proposes a method for distributing the monitoring tasks be-
tween different switches in order to reduce the number of rules needed in each switch.
This method is orthogonal to our distributed solution (see Section 3.6), and can be
combined to further reduce the number of switch entries.
Another recent work, [85], proposes DREAM, a framework for identifying heavy
hitters (see Section 2.1) in traffic using TCAM based hardware. As shown in [85], the
algorithm they use for heavy hitters detection may require more TCAM entries than
a commodity switch may have available. Therefore DREAM performs efficient resource
allocation between multiple switches to achieve the desired accuracy rates. The
Sample&Pick algorithm we propose (Section 3.4.2) requires significantly fewer counters
in the switch and can be used by DREAM to reduce the overall number of switch
entries used.
3.3 Time Locality Definitions for Heavy Hitters
In the past, a flow was defined as a sequence of packets considered to be logically
equivalent to a call [33]. A slightly broader definition of a flow is a sequence of packets
from a specific source to a specific unicast, anycast or multicast destination [99]. A more
robust definition may be found in [49], where a flow is considered to be a sequence of
packets defined by a set of header field values, which act as the flow identifier and
identify the flow as well as an optional pattern which identifies the packets that make
up the flow. Using different combinations of identifiers and patterns many different flow
types can be defined.
We follow [80] where a flow is defined to be any sequence of packets which can
be matched to rules in the flow table, such as, for example, those defined by a set
of header field values. Note that our algorithms can be used for any flow definition,
including those which pertain to matches in the payload or any of the headers as long
as it is supported by the controller and switch implementation.
A flow entry in an OpenFlow flow table can be defined to match packets according
to (almost) any selection of header field bits thereby allowing various flow definitions.
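One concrete flow definition, matching the flow-table entries used later in Table 3.3, maps each packet to the 4-tuple of its addressing fields. The packet representation and field names here are illustrative.

```python
# Minimal sketch of a flow identifier as a tuple of header fields.
# Any subset of header fields defines a valid flow type.

def flow_key(pkt):
    """Map a packet (dict of header fields, illustrative) to its flow id."""
    return (pkt["src_ip"], pkt["src_port"], pkt["dst_ip"], pkt["dst_port"])
```

Two packets belong to the same flow exactly when their keys are equal; coarser flows (e.g., all traffic to one destination) are obtained by keeping fewer fields.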
A large flow is usually defined as a flow that takes up more than a certain
percentage of the link traffic during a given time interval [49]. For some applications other
definitions of large flows are required; for instance, network analysis tools may need to
identify flows that consist of a certain number of packets regardless of link capacity.
Therefore we refine the large flow definition, considering both the time aspect as well
as the type of measurement performed.
We consider the following definitions of large flows, which are summarized in
Table 3.1:
Definition 4. Heavy flow: Given a stream of packets S, a heavy flow is a flow which
includes more than T percent of the packets since the beginning of the measurement.
Considering the definition of flow provided above, this can be useful for identifying
flows which remain heavy over a significant period of time, for example in Distributed
Denial of Service (DDoS) attacks. On the other hand, this will miss large flows if the
measurement continues for a very long period of time.
Definition 5. Interval Heavy flow (Elephants): Given a stream of packets S, and
a length of time m, an interval heavy flow is a flow that includes more than T percent
of the packets seen in the previous m time units.
This can be used for standard traffic management and resource allocation.
Definition 6. Bulky flow at a point of time: Given a stream of packets S, and a
length of time m, a bulky flow is a flow that contains at least B packets in the previous
m time units.
                       Limited time interval    Unlimited time
Percent of traffic     Interval Heavy flow      Heavy flow
Amount of traffic      Bulky flow               —

Table 3.1: Matrix definitions for large flows (rows: count type; columns: time).
The algorithms we present for large flows follow the above definitions which consider
traffic volume measurements in terms of packets. Nevertheless, we note that certain
traffic management capabilities require volume, i.e., byte size, analysis. For instance, if
we wish to identify the flow which takes up the most bandwidth, then we are required
to count the number of bytes in the flow rather than the number of packets. The
algorithms presented here work well for both definitions.
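The three definitions above can be stated as simple reference predicates over per-flow packet timestamps. This is an offline check for exposition, not a streaming algorithm; the function and parameter names are illustrative, with T a fraction, B a packet count, and m the window length as in the text.

```python
# Illustrative reference checks for Definitions 4-6.

def is_heavy(flow_ts, all_ts, T):
    """Heavy flow: more than fraction T of all packets since the start."""
    return len(flow_ts) > T * len(all_ts)

def is_interval_heavy(flow_ts, all_ts, T, m, now):
    """Interval heavy flow (elephant): more than fraction T of the
    packets seen in the previous m time units."""
    recent_flow = [t for t in flow_ts if now - m <= t <= now]
    recent_all = [t for t in all_ts if now - m <= t <= now]
    return len(recent_flow) > T * len(recent_all)

def is_bulky(flow_ts, B, m, now):
    """Bulky flow: at least B packets in the previous m time units."""
    return sum(1 for t in flow_ts if now - m <= t <= now) >= B
```

Counting bytes instead of packets only changes what is summed per timestamp, which is why the algorithms in this chapter handle both variants.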
3.4 Heavy Flows Detection in SDN
3.4.1 Towards a Solution
Fundamental counter based algorithms for finding heavy hitters (or flows) such as the
Space-Saving algorithm [81], cannot be directly implemented in the SDN framework
since in the worst case they would require rule changes for every packet that traverses
the switch. A different approach is therefore needed.
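For reference, the Space-Saving structure [81] that the controller-side heavy hitters module builds on can be sketched as follows. This is a dictionary-based toy version (a real implementation uses a linked "stream-summary" structure for O(1) minimum eviction); the `weight` parameter is our addition, anticipating the batched counter updates of Section 3.4.2.

```python
class SpaceSaving:
    """Toy sketch of the Space-Saving heavy hitters algorithm [81].
    With v counters, any flow with frequency above total/v is retained."""
    def __init__(self, v):
        self.v = v            # maximum number of monitored items
        self.counters = {}    # item -> estimated count (overestimate)

    def update(self, item, weight=1):
        if item in self.counters:
            self.counters[item] += weight
        elif len(self.counters) < self.v:
            self.counters[item] = weight
        else:
            # evict the minimum item; the newcomer inherits its count
            victim = min(self.counters, key=self.counters.get)
            self.counters[item] = self.counters.pop(victim) + weight

    def heavy_hitters(self, threshold, total):
        """Items whose estimated count exceeds threshold * total."""
        return {k: c for k, c in self.counters.items() if c > threshold * total}
```

The point made in the text is visible here: every update may evict and replace an entry, so running this per packet inside a switch would require a rule change per packet, which the SDN architecture cannot sustain.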
First we consider a naive solution, which we name Sample&HH, that samples packets
in the switch and then sends all sampled packets to the controller. The controller
computes the heavy flows using a heavy hitters algorithm. However, as can be seen in
Figure 3.2a (and in other works [49]), relying solely on the samples is not accurate enough.
Next, we consider a solution based on the Sample&Hold paradigm of [49], which was
devised for identifying elephant flows in traffic of classic IP networks. In Sample&Hold,
sampled packets are sent to the controller, which installs a counter rule for each new
flow that is sampled. Every subsequent packet from that flow will be counted by the rule
and will not be sampled. By using sampling together with accurate in-band counters
for sampled flows, Sample&Hold achieves very accurate results, yet the large number of
counters and the rate at which they are installed make Sample&Hold incompatible with
the SDN switch architecture. Therefore we only consider it as a reference point to
evaluate our algorithm.
To deal with the problems of the above solutions, we present our Sample&Pick
algorithm. Sample&Pick uses sampling to identify flows that are suspected of being heavy.
For these suspected flows a special rule is placed in the switch flow table, providing
exact counters for them. The Sample&Pick algorithm considers both the bounded rule
space in the switch as well as the time it takes for the controller to install a rule in the
switch. Therefore we use two separate thresholds: the first, T, for determining which
flows are heavy, and a second, lower threshold, t, for detecting potentially large flows.
This lower threshold allows us to install rules in the switch early enough to get an
accurate count of the large flows, yet we do not install rules for too many flows that
will remain small. The Sample&Pick algorithm is described in detail in Section 3.4.2.
Table 3.2 depicts the conceptual differences and the resource consumption overhead
of the Sample&Pick algorithm, the SDN Sample&Hold algorithm and the Sample&HH
algorithm.
3.4.2 The Sample&Pick Algorithm
3.4.2.1 Algorithm Overview
Our algorithm operates as follows: in the first step we sample the flows going through
the switch. Note that sampling can be achieved using OpenFlow weighted groups, as
explained in [14]. As can be seen in Fig. 3.1, these samples are sent to the controller,
which feeds them as input to a heavy hitters computation module in order to identify
the suspected heavy flows (steps 2 and 3). Once a flow's counter in the heavy hitters
module has passed some predefined threshold t, a rule is inserted in the switch to
maintain an exact packet counter for that flow (steps 4 and 5). This counter is polled
by the controller at fixed intervals and stored in the controller (steps 6 and 7). Finally,
Sample&Pick
  Switch memory usage:          sampling rules + at most 1/t count rules
  Controller functionality:     heavy hitters computation + counter aggregation
  Controller-to-switch traffic: every interval, at most 1/t new count rules
  Switch-to-controller traffic: sample of all non-hold packets + counters each interval

Sample&Hold (OpenFlow variant)
  Switch memory usage:          sampling rules + unlimited count rules
  Controller functionality:     counter aggregation
  Controller-to-switch traffic: every new sample creates a message with a new count rule
  Switch-to-controller traffic: sample of all non-hold packets + final counters

Sample&HH
  Switch memory usage:          sampling rules
  Controller functionality:     heavy hitters computation
  Controller-to-switch traffic: none
  Switch-to-controller traffic: sample of all packets

Table 3.2: Comparison of the heavy flow detection techniques presented in this work. Denote by t the threshold for candidate heavy hitters in Sample&Pick.
name          match                                           actions
Count_flow1   (src ip, src port, dst ip, dst port) = flow1    1
...           ...                                             ...
Count_flowm   (src ip, src port, dst ip, dst port) = flowm    1
Sample        (src ip, src port, dst ip, dst port) = *        2

Table 3.3: Illustration of switch flow table configuration. Rule priority decreases from top to bottom. Actions: 1 - increment counter; 2 - apply sampling technique (go to sampling tables / apply group).
the last step increments the counters that are processed by the Heavy Hitters module
to maintain correct counters of non-sampled flows.
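The control loop above (steps 1 through 8) can be sketched end to end as follows. The switch is mocked, a plain dict stands in for the Space-Saving module, and all class and method names are illustrative; a real controller would drive this with OpenFlow messages and a periodic polling timer.

```python
class DictHH:
    """Stand-in for the approximate heavy hitters module (Space-Saving [81])."""
    def __init__(self):
        self.counters = {}
    def update(self, flow, weight=1):
        self.counters[flow] = self.counters.get(flow, 0) + weight

class MockSwitch:
    """Mocked switch: exact counter rules plus a poll interface."""
    def __init__(self):
        self.exact = {}
    def install_counter_rule(self, flow):
        self.exact.setdefault(flow, 0)
    def count_packet(self, flow):
        if flow in self.exact:
            self.exact[flow] += 1
            return True               # matched a counter rule: not sampled
        return False
    def poll_counters(self):
        return dict(self.exact)

class SamplePickController:
    def __init__(self, switch, t, p):
        self.hh, self.switch = DictHH(), switch
        self.t, self.p = t, p         # candidate threshold, sampling ratio
        self.total = 0.0              # total weight fed to the HH module
        self.last = {}                # counter values at the previous poll
    def on_sample(self, flow):
        """Steps 2-5: update the HH module; once a flow passes t,
        install an exact counter rule for it in the switch."""
        self.hh.update(flow)
        self.total += 1
        if self.hh.counters[flow] > self.t * self.total and flow not in self.last:
            self.switch.install_counter_rule(flow)
            self.last[flow] = self.switch.poll_counters()[flow]
    def on_poll(self):
        """Steps 6-8: feed counter deltas, scaled by the sampling
        ratio p, back into the HH module (simulated sampling)."""
        for flow, c in self.switch.poll_counters().items():
            delta = c - self.last[flow]
            self.hh.update(flow, weight=delta * self.p)
            self.total += delta * self.p
            self.last[flow] = c
    def heavy_flows(self, T):
        return [f for f, c in self.hh.counters.items() if c > T * self.total]
```

The scaling by p in `on_poll` is what keeps the exactly-counted flows comparable to the sampled ones inside the heavy hitters module.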
3.4.2.2 Switch Components Design
As seen in Fig. 3.1, two kinds of rules are used in the switch flow tables: the
sampling rules, which are created as needed by the sampling algorithm, and the counter
rules, used for precisely counting packets of potentially heavy flows. An example of this
configuration can be seen in Table 3.3.
First, each packet is matched against the counter rules. In case of a successful match,
the relevant counter is increased. Only if the packet does not match any counter rule
is it matched against the sampling rules, and if the packet is selected by the sampling
rules, it (or only its headers) is sent to the controller. Counters of the counter rules are
only sent to the controller when polled by the controller.

Figure 3.1: Sample&Pick overview
3.4.2.3 Controller Components Design
As seen in Fig. 3.1, the controller maintains the heavy hitters computation module and
a collection of accumulated exact counters.
The heavy hitters computation module: maintains the data structure used for
detection of heavy hitters according to the Space-Saving algorithm [81] (described in
Section 2.1).
Since the heavy hitters module only receives the sampled data which is sent to the
controller from the switch, the traffic of the heavy flows which are not sampled is not
inserted into the heavy hitters module at all, and therefore it may seem as though these
flows are no longer heavy. To simulate the sampling of these heavy flows, when the
controller polls the switch for the updated counters, it uses those counters to update
the heavy hitters module accordingly. That is, we simulate a sampling of the heavy
flows by updating the heavy hitters module with the number of new packets that have
been counted since the previous polling, multiplied by the sampling ratio p. As noted,
this mechanism saves a substantial amount of sample traffic from the switch to the
controller.
The exact count data structure: the accumulated counters of the flows that are
suspected to be heavy are maintained in a simple ordered data structure. It is used to
compute the delta from the previous time the counters were polled. This delta is then
fed (with a factor) into the heavy hitters module.
An additional counter is maintained in the controller to count the total number of
items inserted into the heavy hitters module, which is necessary to calculate the rates
from the individual counters inside the heavy hitters module. At any point the heavy
flows may be identified as the flows in the heavy hitters module that have passed the
threshold T, relative to the total counter.
3.4.2.4 Analysis
Here we discuss how to choose the parameters t and v of the Sample&Pick algorithm
for given problem parameters: the threshold T for heavy flows and the sampling
probability p.
By definition, if a total of N packets have passed so far, each heavy hitter flow
contains at least TN packets. Our controller receives each packet with probability p.
The number of samples is then on average (or exactly, depending on the sampling
method) n := Np. The number of packets sampled out of x original packets is a
binomial random variable with average xp and variance xp(1 − p). When x is high
this converges to a normal distribution with similar parameters. For a normal distribution,
w.h.p. the random variable is within a distance of 3 standard deviations from
the average. Therefore the number of packets sampled from x packets is w.h.p. greater
than xp − 3√(xp(1 − p)).
Our scheme uses a threshold t < T, in order to detect possible heavy flows that
might be missed due to sampling errors. For a heavy flow (with at least T·N packets),
w.h.p. at least TNp − 3√(TNp(1 − p)) packets are sampled. We need to set t to ensure
that the above expression is higher than t·n. Thus,

    t < T − 3√(T(1 − p)) / √(Np)        (3.1)

Since t must be a positive number, we get the following constraint on the flow weight
(ratio) our scheme is expected to detect: T² − 9T(1 − p)/(Np) > 0, which is valid when

    T > 9(1 − p)/(Np)        (3.2)
For example, assuming a line rate of 6 · 10^5 packets per second and a controller
throughput of only a few thousand messages per second, we need a sampling rate of
at most 1:100, i.e., p < 10^-2. Assuming that the tested interval is at least 10 seconds
long, more than six million packets pass through the switch during the interval, i.e.,
N > 10^6. From Equation 3.2 we get that the threshold, T, can then be roughly 10^-3
or more.
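The example can be checked numerically with a small helper script (the function names are ours):

```python
# Numeric check of Equations 3.1 and 3.2 for the example parameters
# in the text: p = 10^-2 and N = 10^6 give a minimal detectable
# heavy-flow ratio T of roughly 10^-3.

def min_detectable_T(N, p):
    """Smallest heavy-flow ratio T for which a positive candidate
    threshold t exists (Equation 3.2): T > 9(1 - p) / (Np)."""
    return 9 * (1 - p) / (N * p)

def candidate_threshold(T, N, p):
    """Upper bound on the candidate threshold t (Equation 3.1)."""
    return T - 3 * (T * (1 - p)) ** 0.5 / (N * p) ** 0.5

bound = min_detectable_T(N=1e6, p=1e-2)   # ~8.9e-4, i.e. roughly 10^-3
```

Note that for T = 5 · 10^-3 and these N and p, the Equation 3.1 bound on t comes out above 2 · 10^-3, consistent with the t value used in the evaluation below.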
Next we consider the fact that the flows that are monitored by exact counters are
updated in batches (when reading the switch flow entry counters). To make sure that
their counters in the approximate HH structure are not evicted between updates, we set
the number of entries, v, to be high enough considering the threshold, t, for monitored
flows.
Next we show that by choosing v = 2/t, the number of samples that would cause
the eviction of one of the monitored flows, that is, a flow that is located at the top part
of the approximate heavy hitters structure, is very high.
Assume we have k monitored flows; the sum of their counters is at least k·n·t. The
number of other values in the table is v − k, and their sum is at most n − knt. In
order for the minimal monitored flow to be evicted, all lower values in the table should
exceed it, i.e., all smaller counts need to become higher than nt. Their sum should thus
be at least (v − k)nt, increasing by at least (v − k)·nt − (n − knt) = vnt − n. Since the
counts change by the number of incoming samples, if we set v = 2/t then the number of
new samples received between batch updates should be as large as the number of all
samples received so far (n), which is highly unlikely.
3.4.3 Evaluation
3.4.3.1 Comparison of Algorithms
We compare our Sample&Pick algorithm to the two additional solutions described
above, Sample&Hold and Sample&HH (see the algorithms overview in Table 3.2). We
analyze the resource consumption and accuracy of each of the algorithms in fixed time
intervals. We use 10 intervals of 5 seconds each, and we collect the counters of each
algorithm at the end of each interval. In addition, we compare the results of these
algorithms to that of the OpenSketch heavy hitters detection mechanism [124]. For our
analysis, we use a one-hour packet trace collected at a backbone link of a Tier-1 ISP in
San Jose, CA, at 12pm on September 17, 2009 [1].
We chose the following simulation parameters: T = 5 · 10^-3, p = 1/(1024 · 10^2) per byte,
t = 2 · 10^-3, v = 2000.
Figure 3.2a shows a comparison of the three algorithms based on accuracy criteria.
The counter error refers to the ratio between the real count of the heavy hitters and the
algorithm's estimates. The false negative and false positive errors are the ratio of
heavy hitter (HH) flows missed to the total number of HH flows, and of the HH flows
wrongly detected to the total number of HH flows, respectively. Figure 3.2b shows a
comparison of the three algorithms based on the amount of traffic they generate and the
amount of memory they use in the switch. As can be seen, while Sample&Hold provides
the best accuracy results, it requires an increasing amount of counters and therefore its
switch memory consumption is significantly higher than that of the other algorithms. In
contrast, Sample&HH requires the least amount of switch memory, since all of the heavy
hitters computation is performed in the controller, yet it relies on sampling alone and
provides significantly lower accuracy results. Our testing shows that Sample&Pick
provides accuracy results only slightly inferior to those of Sample&Hold yet requires
significantly less switch memory.
Technique          OpenFlow compatibility   Error rate    Switch memory usage   Controller↔Switch traffic
Sample&Pick        Yes                      3.3%          2KB                   220KB/s
Sample&Hold        Yes                      1.15%         400KB                 140KB/s
Sample&HH          Yes                      11.3%         ≤ 1KB                 270KB/s
OpenSketch [124]   No                       0.05−10%      94KB − 600KB          NA

Table 3.4: Resource consumption test results
As can be seen in Table 3.4, Sample&Hold gives the smallest error rate, since it
performs an actual count of all flows that it samples, yet it uses significantly more
switch memory. Sample&HH uses only samples for the counter estimates without using
any counters in the switch yet incurs significantly higher error rates. Sample&Pick has
relatively small error rates due to the actual counting of potentially heavy flows, yet due
to the careful selection of which counters to place in the switch, the switch memory usage
in Sample&Pick is very low. According to our testing, the error rate of Sample&Pick
(a) Comparison of algorithms by counter error, false negative errors and false positive errors.
(b) Comparison of algorithms by overall traffic (between switch and controller) and switch memory usage.
Figure 3.2: Resource consumption and accuracy comparison
may be further reduced with an increased sampling rate or counter polling rate, yet the
switch memory requirement remains steady at 2KB, as determined by our parameters.
The controller↔switch traffic (sum of traffic in both directions) of each of the presented
algorithms is directly influenced by the sampling rate (recall that in this case
p = 1/(1024 · 10^2) per byte) and the counter polling rate of the controller. In the case
of Sample&Pick the polling rate is set to every 0.1 seconds in these tests, while in
Sample&Hold the controller only polls for the counters once at the end of the interval.
As can be seen, Sample&HH produces a larger traffic overhead since all sampled messages
are sent to the controller, whereas in the other two algorithms the counters in the switch
perform the aggregation locally.
Additionally, we compare our results to testing done on the OpenSketch heavy hitters
detection mechanism [124]. OpenSketch is a very efficient measurement architecture,
yet it is not compliant with the OpenFlow standard. Our Sample&Pick algorithm
was designed with the current OpenFlow and P4 abilities in mind and it can therefore
be implemented using the current standards. We base our comparison on the evaluation
results shown in [124]. Note that while we perform our test on the same data as used
in [124], we provide an average of 10 intervals of 5 seconds each, as opposed to the 120
intervals used in the OpenSketch evaluation. As can be seen in Table 3.4, Sample&Pick
requires very little switch memory while achieving counter errors similar to those
achieved by OpenSketch, which uses significantly more switch memory. The traffic
overhead for OpenSketch is not provided in [124] and therefore we do not indicate it.
3.4.3.2 Parameters Evaluation
We evaluate the affects of different parameters on our system. For our analysis, we use
a one-hour packet trace collected at a backbone link of a Tier-1 ISP in San Jose, CA,
at 12pm on March 20, 2014 [3].
The base setting of our system for all of the following tests used the following
parameters: T = 0.01, p = 1256 Packets, t = 0.005, v = 400.
The first parameter we examine is t, which is the threshold for detecting potentially
large flows. The output line marked as Sample&Pick1 uses the parameters as indicated
above, hence t = 0.005. The output line marked as Sample&Pick2 uses t = 0.0025, and
the output line marked as Sample&Pick3 uses t = 0.00125.
As can be seen in Figure 3.3a, the smaller the value of t is, the lower the error rate.
(a) Comparison of error rate and convergence (Error Rate (%) vs. #Packets (K)).
(b) Comparison of PacketIn messages (samples) from switch to controller (PacketIn Rate (msg/sec) vs. #Packets (K)).
Figure 3.3: Effect of varying t values
This is due to the fact that exact counters are placed for smaller flows, therefore allowing
them to be counted exactly earlier in the stream, hence increasing accuracy. Figure 3.3b
shows the PacketIn messages generated by the system when using different values of
t. As can be seen, a smaller value of t causes more flows to have exact counters and
therefore not be sampled. This causes a decrease in the number of PacketIn messages
as t decreases.
We now examine the effect of different T values. Recall that T is the threshold
for determining which flows are heavy. Using the base parameters indicated above, the
output line marked as Sample&Pick1 uses T = 0.01, with t = 0.005. The output line
marked as Sample&Pick2 uses T = 0.005 and t = 0.0025, and the output line marked
as Sample&Pick3 uses T = 0.0025 and t = 0.00125.
As can be seen in Figure 3.4a, a smaller T incurs a larger error, even with a smaller
value of t. This is due to the fact that a smaller T calls for the detection of smaller
flows. Figure 3.4b shows the PacketIn message rate. The rates achieved by each test
are similar to those in Figure 3.3b due to the values of t, which significantly influence
the PacketIn message rate.
Additionally, we look at different values of v (the number of items maintained by
the heavy hitters module in the controller) and its effect on the system. Using the base
parameters indicated above, the output line marked as Sample&Pick1 uses v = 400.
The output line marked as Sample&Pick2 uses v = 800, and the output line marked as
Sample&Pick3 uses v = 1600.
As can be seen in Figure 3.5a, the more space allocated in the controller, meaning
the higher v is, the lower the error rate. Figure 3.5b shows the PacketIn messages
generated by the system when using different values of v. As can be seen, the value of
v alone does not affect the PacketIn message rate.
3.5 Interval Heavy Flow and Bulky Flow Detection
Recall that an interval heavy flow is a flow whose volume is more than T percent of
the traffic seen in the last time interval of length m. While the problem is defined in a
continuous manner, that is, an interval can begin at any point in time, considering the
inherent subtle delays caused by the OpenFlow architecture, an approximate solution
is sufficient.
(a) Comparison of error rate and convergence (Error Rate (%) vs. #Packets (K)).
(b) Comparison of PacketIn messages (samples) from switch to controller (PacketIn Rate (msg/sec) vs. #Packets (K)).
Figure 3.4: Effect of varying T values
(a) Comparison of error rate and convergence (Error Rate (%) vs. #Packets (K)).
(b) Comparison of PacketIn messages (samples) from switch to controller (PacketIn Rate (msg/sec) vs. #Packets (K)).
Figure 3.5: Effect of varying v values
Figure 3.6: The modified heavy hitters data structure using counter arrays. In this example the active counter is currently c1.
Our solution builds on the Sample&Pick algorithm; specifically, we take the array
of counters in the heavy hitters module in the controller as the starting point. We
modify this structure so that instead of maintaining one counter per item (flow), an
array of counters is maintained for each flow that is kept in the heavy hitters module.
In addition, for each flow we maintain an additional accumulative counter. The updated
counter structure is depicted in Fig. 3.6.
The array of counters for each flow maintains the history of the flow's counter values
in fixed intervals of time. The flow's accumulative counter is the sum of all the counters
in the flow's array. Let m seconds be the selected time interval, and let there be r history
counters maintained for each flow; we thus get sub-intervals that are m/r seconds long.
The basic idea is that in each sub-interval a different counter in the array is updated by
the HH module, in addition to updating the accumulative counter. Thereby, (cyclically)
consecutive counters in the array can be used to calculate the number of times the value
appeared in the entire interval. At the beginning of each sub-interval, for each flow, the
value of the new active counter is subtracted from the accumulative counter, and then
the active counters of all flows are reset to zero. In this manner, at the end of each
sub-interval, for any flow, the active counter equals the number of times the flow was
sampled during that sub-interval, and the value of the accumulative counter equals the
number of times the flow was sampled in the last interval of length m. It follows that if
the index of the active counter is a, such that 0 ≤ a ≤ r − 1, then for any r′ ≤ r − 1
the sum of the cyclically consecutive counters between index (a − r′) mod r and a equals
the number of times the item was seen during the r′ previous sub-intervals.
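The bookkeeping described above can be sketched as follows; this is an illustrative Python model of a single flow's counter array (the class and method names are ours, not the thesis implementation):

```python
class IntervalCounter:
    """Sliding-window counter sketch: r history counters plus an
    accumulative counter that always equals the sum of the array,
    i.e. the number of samples in the last r sub-intervals."""

    def __init__(self, r):
        self.r = r
        self.counters = [0] * r
        self.accumulative = 0
        self.active = 0  # index of the counter for the current sub-interval

    def sample(self):
        """Record one sampled packet in the current sub-interval."""
        self.counters[self.active] += 1
        self.accumulative += 1

    def advance_sub_interval(self):
        """Called every m/r seconds: move (cyclically) to the next counter,
        subtract its stale value from the accumulative counter, then reset it."""
        self.active = (self.active + 1) % self.r
        self.accumulative -= self.counters[self.active]
        self.counters[self.active] = 0

    def last_sub_intervals(self, r_prime):
        """Samples seen during the r' most recent sub-intervals
        (cyclically consecutive counters ending at the active index)."""
        return sum(self.counters[(self.active - i) % self.r]
                   for i in range(r_prime))
```

Note that the subtraction in `advance_sub_interval` is exactly what keeps the accumulative counter equal to the window sum without ever rescanning the array.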
42 CHAPTER 3. TIME LOCALITY IN HH AND HEAVY FLOWS IN SDN
Note that if an interval does not begin exactly at the start of a sub-interval, we
consider it to begin at the start of either the current or the subsequent sub-interval.
The accumulative counter has two additional important uses: 1) it is used to main-
tain the threshold ratio; 2) it is used by the heavy hitters algorithm as the de-facto
counter for deciding which flow has the minimum counter and should be evicted.
Using the accumulative counter in this manner is the basis for the correctness of our
algorithm, which we now briefly show. Given an interval i of length m, denote by N
the number of items seen in i. If i is made up only of whole sub-intervals, it is easy to
see that at the end of interval i the accumulative counter of each flow in the structure
is equal to what its counter would be had we reset all of the counters at the beginning
of the interval. Therefore, using the accumulative counters as described above provides
us with a heavy hitters mechanism which supports the same counter error rate (i.e., N/v)
as that of [81]. If, however, i begins in the middle of a sub-interval, the counter error
rate is slightly higher. In this case, i contains some number of complete sub-intervals
and at most 2 partial sub-intervals. The additional error consists of appearances of the
flow which occurred in the partial sub-intervals, and is at most N/v, since otherwise the
flow would be heavy for an interval comprised of only complete sub-intervals as well;
the overall error rate in this case is therefore 2N/v.
Notice that bulky flows can be detected using the above mechanism as well: instead
of dividing a flow's counter sum by the relevant total of the counters, the absolute
counter values are compared directly against the threshold.
3.6 Distributed Setting
In many cases, in order to achieve a comprehensive view of the network, traffic must
be monitored distributively at multiple switches. There are two main challenges when
detecting large flows in this distributed setting: false negatives due to split flows, and
false positives due to sequential flows. Split flows are large flows whose traffic is split
into small sub-flows, each going through a different monitoring switch, and therefore
monitored in parallel. Sequential flows are small flows whose packets each traverse
multiple monitoring switches, and are therefore over-sampled or over-counted.
In this section we extend our Sample&Pick solution to support this distributed
setting. We describe the changes that need to be made to the sampling and to the
large flow detection scheme. We note that our solution easily scales with the number
of monitoring switches. To support multiple controllers, a hierarchy of controllers
needs to be defined, with data collected by the controllers and forwarded up the
hierarchy.
Sampling: In order to handle over-sampling of sequential flows (flows whose packets
each traverse multiple switches), we need to prevent each packet from being sampled
more than once. We suggest doing so by marking packets after they are sampled
(whether selected or not) and by applying sampling only to unmarked packets. Marking
of packets can easily be managed in SDNs (with OpenFlow and especially with P4), for
example by utilizing one bit in the VLAN tag. Matching the VLAN tag of each packet
is easy and allows sampled packets to be skipped. Note that the marks should be
removed at egress ports so that they do not affect the traffic leaving the network.
Heavy Flow Detection: As described in Section 3.4.2, our Sample&Pick algorithm
makes use of both sampling and exact counter rules in the switch. To support the
distributed setting, and to handle split flows, whose sub-flows each go through a
different monitoring switch, all of the samples and counter values from all monitoring
switches should be aggregated centrally by the controller. The controller receives the
samples and counter values from the different switches and treats them as if they were
generated by a single monitoring switch. One implication of this is that when a flow
becomes suspected of being large, exact counter rules should be installed on all
monitoring switches, to ensure that all subsequent packets going through the network
are counted.
Similarly to sampling, in the case of sequential flows that traverse multiple switches,
exact counters (on different switches) should not count the same packet more than
once. The same packet-marking technique we suggest for avoiding over-sampling can be
used to prevent multiple counting (see Figure 3.7): marked packets are neither matched
against exact counter rules nor sampled. Moreover, packets which match exact count
rules are marked even if they have not been sampled.
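The mark-once rule for both sampling and exact counting can be sketched as follows. This is a minimal simulation, not switch code: a packet is modeled as a dict with an assumed 'marked' bit (in practice, e.g., a bit in the VLAN tag, as suggested above), and the function name and sampling probability are illustrative.

```python
import random

def process_at_switch(packet, exact_count_rules, sample_prob=0.01):
    """Per-switch handling sketch for the distributed setting: a packet is
    matched against exact counter rules or sampled only if unmarked, and is
    then marked so downstream monitoring switches on its path skip it.
    Returns the flow ID when the packet is reported to the controller."""
    if packet['marked']:
        return None  # already handled at an upstream monitoring switch
    packet['marked'] = True  # mark whether sampled, selected, or counted
    if packet['flow'] in exact_count_rules:
        exact_count_rules[packet['flow']] += 1  # counted exactly, not sampled
        return None
    if random.random() < sample_prob:
        return packet['flow']  # report sample (PacketIn) to the controller
    return None
```

Because the mark is set unconditionally on first processing, a sequential flow's packet contributes at most one sample or one counter increment network-wide, which is exactly the property the text requires.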
Figure 3.7: Marking sampled packets in the distributed setting.
Chapter 4
Finding Heavy Hitters in a
Stream of Strings
4.1 Overview
Often times in network management and security applications, identifying recurring
content is important. While in some cases, working with a predefined content length is
sufficient, the length of the recurring content is not always known. Developing streaming
algorithms for identifying recurring varying-length strings poses inherent difficulties
especially given the tight space and time requirements. We tackle the problem of finding
popular strings of varying lengths in textual data streams using counter-based heavy
hitters algorithms.
4.1.1 Our Contribution
We propose the String Heavy Hitters problem. Additionally, we propose an efficient
algorithm for solving this problem. This algorithm finds popular strings of variable
length in a set of messages, building upon the classic algorithms for heavy hitters
detection. The algorithm runs in linear time, requiring one pass over the input, and
uses a constant amount of memory. In addition to the detection of various types of
DDoS attacks, such an algorithm can be very useful for additional applications both
in networking and in other fields. For example, identifying recurring content in emails
can assist in identification of spam messages. Furthermore, it can be used in detecting
common parts of worm code or even in DNA sequence analysis.
4.2 String Heavy Hitters
Heavy hitters algorithms are usually performed on numeric data, whereas our work
focuses on textual values. We present the following definitions which form the basis of
our discussion.
Definition 7. String frequency: Given a sequence S = 〈S1, ..., SN〉 of N strings, and
some string s, the frequency of s, denoted fs, is the number of strings in S in which s
appears. That is, fs = ∑y=1..N [s ⊆ Sy], where [s ⊆ Sy] is 1 if s is a substring of Sy
and 0 otherwise. Note that the frequency of a substring s could alternatively be defined
as the total number of times s appears in S; for our purposes we use the number of
strings in S in which s appears.
Definition 8. String Heavy Hitter: Given a sequence S = 〈S1, ..., SN〉 of N strings
and constants k and θ, a string s is a string heavy hitter if the following hold:
1. s is a substring of one or more strings in S.
2. The length of s is at least k (|s| ≥ k).
3. s has a frequency above the threshold: fs ≥ θN.
Definition 9. The String Heavy Hitters Problem: Given a sequence S = 〈S1, ..., SN〉
of N strings and constants k, θ and v, using a constant amount of space and a single
pass over the sequence S, find at most v string heavy hitters, such that no output string
is contained in another output string. We denote these output strings as signatures.
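For intuition, Definition 9 can be checked against a brute-force, non-streaming reference solution written directly from the definitions. This sketch is illustrative only; it uses space linear (indeed super-linear) in the input, unlike the streaming algorithm developed below, and its function name and tie-breaking by length are our choices.

```python
from collections import Counter

def exact_string_heavy_hitters(S, k, theta, v):
    """Brute-force reference for the String Heavy Hitters problem: count,
    for every substring of length >= k, the number of strings of S that
    contain it, then keep maximal strings whose frequency is >= theta*|S|."""
    freq = Counter()
    for s in S:
        # collect each qualifying substring once, so a string contributes
        # at most 1 to any substring's frequency (Definition 7)
        subs = {s[i:j] for i in range(len(s))
                for j in range(i + k, len(s) + 1)}
        for sub in subs:
            freq[sub] += 1
    hh = [s for s, f in freq.items() if f >= theta * len(S)]
    # drop any heavy hitter contained in another heavy hitter (Definition 9)
    maximal = [s for s in hh if not any(s != t and s in t for t in hh)]
    maximal.sort(key=len, reverse=True)  # prefer longer signatures
    return maximal[:v]
```

The quadratic substring enumeration per string is exactly the cost that the streaming algorithm of Section 4.6 avoids.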
4.3 Challenges
The String Heavy Hitters problem is closely related to the Heavy Hitters problem
defined in Section 2.1; however, applying the known heavy hitters algorithms to textual
data is not at all trivial.
The following problems arise and must be resolved:
1. The substring pollution problem: The textual data needs to be somehow converted
into a sequence of values. One approach might be to consider every substring in
the text as a value in the sequence. This would, however, make the size of the
sequence quadratic in the size of the input, causing a significant decrease in the
time efficiency of the algorithm. Another approach, which we use in our algorithm,
is to consider the constant-length k-gram at each position in the text as a value
in the sequence, where a k-gram is a string of length exactly k. If a string
s, |s| > k, appears many times in the input text, then all the k-grams which
are substrings of s show up as heavy hitters and are output by the heavy hitters
algorithm. We name this problem the substring pollution problem. The following
is an example of the problem: suppose the string is abcabc and k = 4; then all the
4-grams which make up the string, i.e., abca, bcab and cabc, will be heavy hitters
and will therefore pollute the data structure. As explained in Section 4.6.1, our
Double Heavy Hitters algorithm deals with this problem by combining k-grams
that have repeatedly appeared in a sequence, thereby creating varying-length
grams. The process of creating a string from consecutive k-grams is a key factor
in substantially reducing the substring pollution in the output. For each such
consecutive sequence, the process creates a single input of varying length to HH2,
which has been naturally filtered by a preceding heavy hitters procedure, HH1.
2. The frequency estimation problem: Another problem which arises when creating
values from textual data is that heavy hitters may be substrings of one another.
This can occur, for example, if both the strings ABCDEF and BCDE recur
frequently in separate locations in the text. The counter of BCDE provided
by the algorithm would not reflect the times that BCDE appeared as part of
ABCDEF. In order to provide a better estimation of the frequency of each
string, the algorithm described in Section 4.6.1 must be modified accordingly.
We treat this issue using an additional procedure, which we describe in
Section 4.6.2.1.
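The substring pollution described in item 1 can be reproduced in a few lines of Python; this toy count is purely illustrative and is not part of the thesis system.

```python
def kgrams(text, k):
    """All k-grams of a text, one per starting index."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]

# If the string "abcabc" recurs often, every one of its k-grams recurs
# just as often, so each surfaces as a separate heavy hitter and
# pollutes the data structure.
stream = "abcabc" * 100
counts = {}
for g in kgrams(stream, 4):
    counts[g] = counts.get(g, 0) + 1
```

Running this yields exactly the 4-grams abca, bcab and cabc from the example, each with a near-identical high count, which is what the Double Heavy Hitters algorithm collapses into a single varying-length string.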
4.4 Notations
The notations we use throughout this section are summarized in Table 4.1.
k — minimal signature length (gram length)
r — ratio between the frequencies of consecutive k-grams
m — desired number of signatures
HHj — heavy hitters module j (j ∈ {1, 2})
nHHj — number of items in the HHj data structure (j ∈ {1, 2})
Table 4.1: Notations
4.5 Related Work
The String Heavy Hitters problem which we present here has roots in a variety of
problems in the field of Stringology. Searching for a common substring (or suffix or
prefix) in two texts is a classic Stringology problem [19]. This problem has been
extended to multiple texts in the well-studied Longest Common Substring problem. In
this problem, given m documents of total length n, we wish to find the longest substring
common to at least d ≥ 2 documents. Classical solutions such as those in [121] and [59]
require O(n) time and O(n) space. In [69] a time-space tradeoff is provided, giving a
solution requiring O(τ) space and O(n²/τ) time for any 1 ≤ τ ≤ n. The Stringology
solutions are well suited for biological, data mining and other applications.
Most of the previous works which focus on identifying recurring strings in a network
setting were done for signatures of a fixed length [57, 67, 106]. Finding varying-length
popular strings poses inherent difficulties. We note two works which generate varying-
length strings. The first is Honeycomb [70], presented by Kreibich and Crowcroft, in
which signatures are created for suspicious traffic using pattern matching techniques,
specifically searches for longest common substrings within packet payloads using suffix
trees. While this method allows creating signatures composed of varying-length strings,
and the suffix tree can be created in linear time using Ukkonen's online suffix tree
construction algorithm [114], the space complexity of the suffix tree is at least linear in
the size of the input, and therefore not scalable when dealing with large amounts of
data. This is perhaps the most substantial difference from our solution, which uses a
configurable fixed amount of space while still maintaining a time complexity which is
linear in the size of the input.
Another work in which signatures composed of varying-length strings are generated
is Autograph [68], presented by Kim et al. To generate varying-length signatures, the
payload of suspicious traffic is divided into variable-length content blocks based on the
Content-based Payload Partitioning method first presented in [86]. Content blocks are
chosen as signatures based on their prevalence in the traffic flows. While the signatures
produced are indeed of varying length, the Content-based Payload Partitioning is
performed using a predetermined breakmark which is used to partition the payload
into blocks whose size is bounded between predefined minimum and maximum values.
Additionally, the average block size is also predetermined. Evaluation done in [68]
shows that a larger minimum content block, such as 32 or 64 bytes, is needed to avoid
a high false positive rate. Signature structure is therefore based on predefined
parameters which determine the breakmark and the signature length. The system
presented in our work allows shorter signatures to be generated, and more importantly,
does not use a predefined breakmark for content partition, so that signatures can vary
significantly from one another.
Another variation on heavy hitters that has been discussed is Hierarchical Heavy
Hitters [42, 111]. Notice that although the Hierarchical Heavy Hitters algorithms (see
Chapter 2) may seem suited for textual data, they work well on data which forms a
well-defined hierarchical structure, such as a sequence of IP addresses. Since our
algorithm searches for recurring strings in the traffic, and the context of the strings is
not relevant for our purposes, identical strings need to be grouped together regardless
of what comes before or after them in the content. Inserting a string into a hierarchical
structure would not account for common substrings that appear in different places in
different strings. For example, a common substring found both at index 10 of string s1
and at index 100 of string s2 would need to be inserted in the same place in the
hierarchy in order for both appearances to be counted by the same counter. Furthermore,
identical substrings which appear multiple times in the same string would also need to
be grouped together in the hierarchy. Additionally, inserting a string into a hierarchical
structure would require inserting each character into a separate hierarchical level,
making the hierarchy both very wide and very deep. Alternatively, inserting more than
one character into each level may cause the algorithm to miss common substrings,
thereby causing errors in the counter estimations of the algorithm.
Another related problem is that of compressed sensing. Many interesting works
have been done in this field, such as [54, 90, 97]. It has yet to be seen whether the
solutions presented for the compressed sensing problem can be adapted to outperform
the above heavy hitters algorithms for the frequent items problem.
4.6 The Double Heavy Hitters Algorithm
We propose the Double Heavy Hitters algorithm. The purpose of this algorithm is to
identify frequent substrings of varying lengths in the given packets.
4.6.0.1 Packet String Heavy Hitters
Based on Definition 9, we define the Packet String Heavy Hitters problem as follows:
given a sequence P = 〈P1, ..., PN〉 of N packets and constants k, θ and v, using a
constant amount of space and a single pass over the sequence P, find at most v string
heavy hitters, each of which has a packet frequency (the number of packets in which
it appears) over the threshold θN, that is, pfs = ∑y=1..N [s ⊆ Py] ≥ θN, such that
no output string is contained in another output string.
Solutions which perform an exact count of the strings would use at least a linear
amount of space [41]; a more efficient solution must therefore be found.
4.6.1 Algorithm Overview
Our algorithm makes use of the Heavy Hitters algorithm as a building block. Denote as
HH a component which performs the heavy hitters algorithm. The Double Heavy Hit-
ters algorithm, denoted DHH, makes use of two independent heavy hitters components,
HH1 and HH2, as follows:
1. HH1 finds k-grams that appear frequently, i.e., that are heavy hitters.
2. HH2 finds varying length strings that occur frequently in the input (which are
combinations of the strings found in step 1).
Define a k-gram to be a string of characters of length exactly k. The input to the DHH
algorithm is a sequence of np packets, a constant k which determines the size of the
k-grams used, and a constant r which is the ratio between the frequencies of consecutive
k-grams, explained shortly. Conceptually, the process works as follows: the algorithm
traverses the packets one by one. For each index in the packet, a k-gram is formed
by taking the k characters starting from that index. These k-grams are given as an
input to HH1. To form the varying length strings which are the input to HH2, while
HH1 processes the k-grams, the algorithm seeks to find the longest run of consecutive
k-grams such that:
1. They are all already in HH1 (i.e., at this stage they are heavy hitters).
2. They have similar counters. The objective is that combining two k-grams should
occur only if they should be part of the same signature. Without the ratio r, if
some k-gram appears very frequently but the character that usually follows it is
inconsistent, the preferred signature should not combine this k-gram with the one
that follows it. Specifically, the counters of two consecutive k-grams must maintain
a ratio of at least r. In our experiments we tested values of r from 0 to 1. Since for
our purposes a longer signature was preferable, we use a ratio of 0.1 in our testing;
testing with a ratio of 0.5 or higher produced significantly shorter signatures. An
example of this process can be seen in Figure 4.1.
Once the entire input has been traversed, the algorithm outputs the items found in
HH2.
[Figure 4.1: An example of the process of creating varying length strings from consecutive k-grams. For the input abcabcabcd with k = 4, each k-gram is checked for membership in HH1; the consecutive k-grams abca, bcab and cabc, which are already in HH1 and pass the ratio check, are combined into the varying-length string abcabc, while abcd is not in HH1.]
4.6.2 Algorithm Details
The pseudo code of the DHH algorithm is found in Procedure DoubleHeavyHitters,
and makes use of the support functions Init(nv) and Update(α), which are derived
from the algorithm in [81], and the sub-procedure InputToHH2. The output of the
DoubleHeavyHitters procedure is the list of heavy hitter values found in HH2 at the
end of the procedure.
The input provided to the algorithm is a sequence of np packets and constants k and
r as explained above, and nHH1 and nHH2, which indicate the number of items HH1
and HH2 are configured to hold, respectively.
The algorithm works as follows: the packets are traversed one by one. For each
index in the packet, a k-gram is formed by taking the k characters starting from that
index. The k-gram is given as an input to HH1, which in turn returns the k-gram's
counter in HH1 (a return value of zero indicates that this is a new k-gram).
In order to account for varying length strings, while performing the above traversal,
an additional string stemp is maintained. For any location in the packet, stemp is the
last longest heavy hitter string found until that location. stemp is maintained in the
following manner: at the beginning of each packet, the string stemp is empty. For each
k-gram that is inserted to HH1, we check its returned value:
1. If stemp is empty and the returned value is greater than zero, stemp is set to be
this k-gram.
2. Otherwise, if stemp is not empty, one of the following two occur:
(a) If the returned value is equal to zero, stemp, which is the longest "heavy"
string found up to this point, is given as an input to HH2, and stemp is
reset to empty.
(b) Otherwise, the returned value is greater than zero. In this case, this value is
compared with the counter value of the previous k-gram. If the ratio between
the two values is over some predefined ratio r, stemp is concatenated with
the last character in the current k-gram. Else, stemp is given as an input to
HH2, and stemp is set to be this k-gram.
The algorithm then proceeds to treat the next index. When all of the packets have
been traversed, the algorithm outputs the items in HH2.
We note that the algorithm also maintains a set of all the treated strings in each
packet so that each string is counted only once. This allows us to find strings that
appeared frequently in different packets rather than strings that have a high overall
frequency.
The strings are checked for uniqueness before being inserted into HH2 to ensure
that each signature is only counted once per packet.
Function Init(V)
    Items[V];
    for i = 1 → V do
        Items[i].count = 0; Items[i].ID = null;
    end

Function Update(α)
    if ∃j : Items[j].ID == α then
        Items[j].count++; output = Items[j].count;
    else
        find j s.t. ∀h : Items[j].count ≤ Items[h].count;
        Items[j].ID = α; Items[j].count++; output = 0;
    end
    return output;

Procedure DoubleHeavyHitters
    Data: sequence of np packets, constants k, nHH1, nHH2, and ratio r
    Result: the nHH2 candidates for being the heavy hitters
    stemp = empty; temp_counter = 0;
    HH1.Init(nHH1); HH2.Init(nHH2);
    for i = 1 → np do
        // Denote α1, ..., αh the bytes of packet pi
        for j = 1 → h − k + 1 do
            counter = HH1.Update(αj ... αj+k−1);
            if counter > 0 then
                if stemp == empty then
                    stemp = (αj ... αj+k−1); temp_counter = counter;
                else if counter > r · temp_counter then
                    stemp = stemp || αj+k−1; temp_counter = counter;
                else
                    InputToHH2; stemp = (αj ... αj+k−1); temp_counter = counter;
            else
                InputToHH2;
    FixSubstringFrequencyInHH2;

Procedure InputToHH2
    temp_counter = 0;
    if stemp != empty then
        HH2.Update(stemp); stemp = empty;

Procedure FixSubstringFrequencyInHH2
    for i = 1 → nHH2 do
        for j = 1 → nHH2 do
            if i != j and item[i].ID is a substring of item[j].ID then
                item[i].count += item[j].count;
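The pseudocode above can be rendered as executable Python. This is a sketch under the assumptions stated in the text (stemp is reset at each packet boundary, each distinct string is inserted into HH2 at most once per packet, and Update follows the min-eviction Init/Update functions); the class and variable names are ours.

```python
class HH:
    """Counter-based heavy hitters module (the Init/Update pseudocode):
    on a miss with a full table, the minimum-counter item is evicted and
    the new item inherits that counter plus one."""

    def __init__(self, n):
        self.n = n
        self.items = {}  # ID -> count

    def update(self, key):
        if key in self.items:
            self.items[key] += 1
            return self.items[key]
        if len(self.items) < self.n:
            self.items[key] = 1
        else:
            victim = min(self.items, key=self.items.get)
            count = self.items.pop(victim)
            self.items[key] = count + 1
        return 0  # zero signals a previously untracked item

def double_heavy_hitters(packets, k, n_hh1, n_hh2, r):
    hh1, hh2 = HH(n_hh1), HH(n_hh2)

    def flush(state, seen):
        # InputToHH2: emit stemp, counting each string once per packet
        s = state[0]
        if s and s not in seen:
            hh2.update(s)
            seen.add(s)
        state[0], state[1] = "", 0

    for packet in packets:
        state = ["", 0]  # [stemp, temp_counter], reset per packet
        seen = set()
        for j in range(len(packet) - k + 1):
            gram = packet[j:j + k]
            counter = hh1.update(gram)
            if counter > 0:
                if not state[0]:
                    state[0], state[1] = gram, counter
                elif counter > r * state[1]:
                    state[0] += gram[-1]  # extend by the gram's last char
                    state[1] = counter
                else:
                    flush(state, seen)
                    state[0], state[1] = gram, counter
            else:
                flush(state, seen)
        flush(state, seen)  # end of packet
    # FixSubstringFrequencyInHH2: fold containing strings' counters
    # into the counters of their substrings
    fixed = dict(hh2.items)
    for s in hh2.items:
        for t in hh2.items:
            if s != t and s in t:
                fixed[s] += hh2.items[t]
    return fixed
```

With r = 0.1 (the value used in the evaluation), consecutive grams are merged as long as the new gram's counter is at least a tenth of the previous one's.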
4.6.2.1 Improving the Frequency Estimation
Due to the frequency estimation problem, as explained in Section 4.6.0.1, it is possible
that a string t in HH2 may contain a substring t′ which is also a string in HH2. However,
when processing t in HH2, the counter of t′ is not incremented. The reason for this is
that only disjoint strings from each packet may be inserted into HH2. Therefore, if, for
example, in packet pi at index j the string "example" appears and the algorithm decides
to insert this string into HH2, the string "exam", which is a substring of "example" and
also appears in packet pi at index j, is not inserted into HH2. If the string "exam" is an
item in HH2 (i.e., it has been inserted into HH2 when processing a different appearance
of the string in a different packet or at another index in the same packet), its counter
does not account for its appearance in packet pi at index j.
The goal of our algorithm is to provide an estimate of the actual number of times
that a string was encountered. In order to achieve a better estimation, we perform an
additional procedure on the strings found in HH2 at the end of the above algorithm,
to find which items in HH2 are substrings of other items in HH2. The counter of the
contained item is incremented by all of the counters of the items that contain it. In this
manner, our final counters provide a better estimation of the number of packets in which
each string was encountered. Note that since only disjoint strings from each packet may
be inserted into HH2, this procedure does not result in an additional overestimation
of the counters.
4.6.3 Error Rate Analysis
The heavy hitters algorithm that we use is an approximation algorithm, and therefore
the DHH algorithm is also an approximation. As the analysis below shows, the error
rate of our algorithm is only a factor of 3 higher than that of the heavy hitters
algorithm that we use as a building block. In fact, as can be seen in the experimental
results in Section 5.4, the error rate of our algorithm is significantly smaller in practice.
Theorem 10. Bounds of the Double Heavy Hitters Algorithm: The final counters
provided by the algorithm may incur an error of at most 3nk/nHH, where nHH =
min{nHH1, nHH2} and nk denotes the total number of k-grams processed by the
algorithm.
Proof. In order to analyze the error rate of our algorithm, we must first analyze the
error rate of each of its components. As described in Section 2.1, the error rate of
each of the HH items is ε = N/nHH, where nHH is the number of items maintained by
the HH, and N is the number of values in the input. We have defined the number of
items maintained by HH1 and HH2 to be nHH1 and nHH2 respectively. Given an input
sequence of packets, the size of the input is calculated as follows:
1. For HH1: Define the total number of k-grams in all the packets in the sequence
to be nk, which is the bound on the size of the input to HH1.
2. For HH2: The input to HH2 is made up of the strings which are a sequence
of consecutive k-grams. Denote by nc the number of such strings. nc is maximized
when the inputs to HH2 are all a single k-gram. To understand how these strings
can be formed, let us look at the example in Fig. 4.2. Suppose the k-gram abcd is a
heavy hitter. In order for the string beginning with this occurrence of abcd to be
made up of a single k-gram, the following character e must be of high variability
in this context throughout the input. Otherwise, the k-gram bcde would also be a
heavy hitter, and therefore abcd would be merged with bcde, meaning the string
would be longer than a single k-gram. One can see that this would be true for all
the following k-grams which contain the character e, and therefore they too cannot
be heavy hitters. The closest following k-gram that can be a valid candidate
for being a heavy hitter is the k-gram following the character e. It follows that
nc ≤ nk/(k+1).
[Figure 4.2: Non-consecutive heavy hitters. The k-gram abcd is a heavy hitter followed by a high-variability character e (otherwise we would get a longer consecutive run); the next possible heavy hitter is the k-gram that begins after e.]
It follows from the above calculation that the error rate of HH1 is nk/nHH1, and the
error rate of HH2 is nc/nHH2 ≤ nk/((k+1) · nHH2).
In order to complete the analysis, it remains to account for occurrences of strings
that are not produced as part of the input to HH2. Generally, a string s is produced
as an input to HH2 if the k-grams that comprise it are already found in HH1. Let us
look at the sequence of k-grams processed by HH1. For some index j, the jth
k-gram will be found in HH1 only if its frequency is over j/nHH1. Since this must be true
for all k-grams that comprise s, it follows that there can be at most nk/nHH1 appearances
of s that are not produced as part of the input to HH2.
It follows that the overall error rate of our algorithm is 2nk/nHH1 + nk/((k+1) · nHH2).
Taking nHH = min{nHH1, nHH2}, we get that the error rate of the algorithm is bounded
by 3nk/nHH.
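The three error contributions derived in the proof can be collected into a single display (notation as in Table 4.1; nk is the number of k-grams processed and nc the number of inputs to HH2):

```latex
\varepsilon_{\mathrm{DHH}}
  \;\le\; \underbrace{\frac{n_k}{n_{HH_1}}}_{HH_1\text{ counter error}}
  \;+\; \underbrace{\frac{n_k}{n_{HH_1}}}_{\substack{\text{occurrences missed}\\\text{as inputs to }HH_2}}
  \;+\; \underbrace{\frac{n_c}{n_{HH_2}} \le \frac{n_k}{(k+1)\,n_{HH_2}}}_{HH_2\text{ counter error}}
  \;\le\; \frac{3\,n_k}{n_{HH}},
\qquad n_{HH} = \min\{n_{HH_1}, n_{HH_2}\}.
```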
Chapter 5
Zero-Day Signature Extraction
for High Volume Attacks
5.1 Overview
Signature extraction is an important tool in several network security problems. In
Distributed Denial of Service (DDoS) mitigation, for example, there has recently been
a growing demand for zero-day attack signature extraction solutions.
Two basic techniques are traditionally used to identify DDoS attacks: flow authentication
based on challenge-response, and flow behavioural analysis based on statistics and
learning (further details are provided in Chapter 2). Recent attacks, with millions of
zombies generating seemingly legitimate flows, fly under the behavioural radar. In
these types of attacks, behavioural analysis fails to detect the malicious traffic, as each
zombie generates little traffic, which in itself may appear benign. Furthermore, the
huge number of attack sources makes it infeasible to stop the attack at the source.
The recent use of Internet-of-Things (IoT) devices in botnets has further increased the
number of compromised machines which may take part in an attack [46]. This leaves
a loophole in the defense mechanisms and creates the demand for a DDoS zero-day
attack signature extraction solution.
Identifying signatures for unknown DDoS attacks is extremely difficult due to the
seemingly legitimate content found in the packets which comprise the attack. Most
traditional signatures are based on the malicious code that is expected in the attack
packets, which may not be present in modern DDoS attacks. Leading industry experts
confirm that the signatures found in recent zero-day application-level DDoS attacks
are usually a by-product of the attack tools which the attackers use. These tools
often leave some footprint caused unintentionally by the program, such as a short string
or some (protocol-complying) anomaly in the packet content structure. Such signatures
allow fine-grained identification of attack packets during an attack with minimal false
positives or negatives.
These subtle signatures are not identified by the current automated defense mech-
anisms, but rather by a manual process which may take hours or days.
Generally speaking, leading security companies provide systems which offer several
layers of defense against high-volume attacks. When all layers of defense fail, the at-
tacked customer contacts the security company’s support team to alert them and get
their assistance in stopping the attack. This manual assistance may be composed of
a number of procedures, including the identification of attack signatures. The attack
mitigation process is therefore long and may take hours to days; in addition, it is
labor-intensive. Moreover, in many cases the human eye misses the identifying string,
which could be an extra space, a line-feed, etc.
Clearly, in order to stop such unknown attacks while they are occurring, such sig-
natures must be extracted quickly and automatically.
5.1.1 Our Contribution
We present a system for automatic extraction of signatures for high volume attacks,
using a single pass over the input, and space dependent only on the predetermined
size of the heavy hitters data structure. Our system takes as input two streams (or
stream samples): one of traffic collected during an attack and a second collected during
peacetime. A peacetime traffic sample may be collected a priori as a routine scheduled
procedure. The attack traffic sample can be collected in real attack time, once the attack
has been identified. We note that for DDoS attacks there are existing mechanisms for
identifying when an attack has started and for differentiating between Flash events
and DDoS attacks, for instance that of Park et al. [93]. The system then analyzes
both traffic samples to identify content that is frequent in the attack traffic sample yet
appears rarely or not at all in the peacetime traffic.
Our system makes no assumptions on traffic characteristics such as client behaviour,
address dispersion, URL statistics and so forth. Therefore, it is generic in that it can
be easily adapted to solving other network problems with similar characteristics. That
said, while our algorithms can generically work on different data types, our evaluation
focuses on application-level DDoS attacks.
The following are the basic requirements of our system:
1. Signatures should not be found frequently in legitimate traffic. One of the main
difficulties in differentiating between malicious traffic and traffic from legitimate
sources lies in the fact that malicious requests may have legitimate payloads.
Identifying these malicious requests therefore becomes a significant challenge.
2. Allow signatures of varying lengths. The signatures produced by the algorithm
must be of varying length. Setting a predefined constant length for signatures
would create very problematic outcomes, as described in Section 4.6.0.1.
3. Find a minimal set of signatures. Since filtering devices may have a limited ca-
pacity, the algorithm must aim to produce a small number of signatures.
4. Minimize space and time usage. Our solution must maintain a high level of effi-
ciency, such that the attack can be stopped quickly with minimal space usage.
More specifically, given some constant k we wish to find all strings s1, ..., sm, s.t.
∀i, 1 ≤ i ≤ m:
1. |si| ≥ k
2. si appears frequently enough in the attack traffic.
3. Either one of the following holds:
(a) The frequency of si in peacetime is very low.
(b) The frequency of si in peacetime is moderate, yet in the attack traffic its
frequency is significantly higher.
4. In order to have a minimal set, no string si is contained in another string sj .
These requirements are formally explained in Section 5.3.3.
In Section 5.4, we test our system on real-life attack and peacetime traffic logs
from real attacks that have occurred in recent years. We show that our solution
performs well in practice, with an average recall rate of 99.95% and an average
precision rate of 98%.
Additionally, our system makes use of an algorithm we have devised for finding
heavy hitters in textual data which is described in Chapter 4.
An implementation of our solution is publicly available and may be used for sig-
nature extraction from user uploaded files. It is found on our website [13]. Users are
advised to prepare a peacetime pcap file and an attack time pcap file which they may
upload to the website for immediate signature extraction.
5.2 Related Work
5.2.1 Automated Signature Extraction
In the past, automated signature extraction has been mostly used as a tool for identify-
ing computer malware such as worms and viruses. As such, most algorithms presented
for this problem generally consist of two stages:
1) Identifying suspicious traffic which contains malware with high probability. This is
done using methods such as honeypots [70], behavioural traffic analysis [106], etc.
2) Generating signatures for the suspicious content.
Therefore, the signature generation process of the previous works [57, 67, 68, 70, 98,
106] designed for malware identification, was based on the use of traffic that is known
to be malicious. Contrary to the scenario where the suspicious traffic is identified
beforehand, our work deals with the case in which the suspicious traffic cannot be
detected a priori; rather, the suspicious traffic contains some unique prevalent content
which needs to be identified. Our solution does require a sample of peacetime traffic to be
collected prior to the attack, which can be collected by the system on a routine basis
when it is experiencing regular load.
The attack-time traffic that is analyzed may contain both malicious and legitimate
parts. Therefore, it is crucial to identify which prevalent content is found only in
malicious packets and create signatures for that content alone. Furthermore, our
methods allow us to identify malicious content that not only seems legitimate, but
may in fact be legitimate in other traffic. For example, in HTTP-level attacks, an attacker
can make use of a legitimate yet uncommonly used HTTP header field. Use of this
field can, in this case, be an identifier of malicious traffic, yet in a different context be
completely legitimate.
An interesting variation of the above problem is that of signature extraction solu-
tions with the ability to support morphisms in malware. This problem was addressed
in various works [68, 73, 88, 109], where different algorithms for automatic signature
generation for polymorphic worms are presented. We are currently in the process of
expanding our solution so that it may deal with such variations as well.
In [112], the authors present an automated system for detection of new application
signatures for the purpose of traffic classification. In this work, the authors present
a system for automatically identifying keywords of unknown applications. The key
difference between the solution presented in [112] and the solution we present here, is
that in [112] it is assumed that flows of the same application can be identified a priori
and therefore the analysis can look for the common strings in the specified flows. In our
solution, one of the main difficulties is that we do not know which packets belong
to the attack and are therefore malicious; hence we cannot process
these packets alone to find the attack signature.
In [126], a mechanism is presented for botnet C&C signature extraction. The mech-
anism identifies frequent strings in the traffic and then ranks the frequent strings based
on traffic clustering methods. While in [126] it is not assumed that the C&C connections
can be identified a priori, their analysis is based on characteristics of the connection
and the traffic. Our solution makes no such assumptions and is therefore more robust
for dealing with specially crafted packets and attacks.
5.2.2 DDoS Defense Mechanisms
In order to place our solution on the map of available DDoS solutions, we follow the clas-
sification of DDoS defense mechanisms according to place and time presented in [127].
The solutions we present are generally destination based solutions used during an at-
tack (with a preparation stage to be performed before an attack) targeting application
level attacks. Our solution is a content based packet filtering method and is not based
on the packet route or parameters.
It may seem natural to compare our solution to solutions based on traffic anomaly
detection. While our method does look for changes in content from peace time to attack
time that exceed some predefined threshold, traffic anomaly detection methods for DDoS
attacks are usually network or destination based solutions searching for abnormal traffic
patterns. Unusual traffic patterns may be detected using techniques such as machine
learning [77, 120] or entropy [87, 92]. Our solution is not based on traffic behaviour
and makes no assumptions on normal patterns of traffic. Solutions which use traffic
behavioural analysis, may fail to detect large-scale DDoS attacks that simulate normal
traffic behaviour. Since our solution makes no assumptions on the traffic behaviour it
may be used to detect such attacks.
5.3 The Zero-Day High-Volume Attack Detection System
The main purpose of our system is to efficiently extract a minimal set of signatures
that distinguish malicious packets from legitimate ones. Therefore, a major factor in
producing signatures which achieve both a low false negative rate (i.e., a high detection
rate) and a low false positive rate (i.e., a low rate of legitimate traffic that is wrongly
identified as malicious), is the algorithm’s ability to identify strings which appear very
frequently in malicious traffic and which are hardly found in legitimate traffic.
5.3.1 Notations
The notations we use throughout this section are summarized in Table 5.1.
k       minimal signature length (gram length)
r       ratio between the frequencies of consecutive k-grams
m       desired number of signatures
HHj     heavy hitters module j (j ∈ {1, 2, 3})
nHHj    number of items in the HHj data structure (j ∈ {1, 2, 3})

Table 5.1: Notations
5.3.2 System Overview
Given a sample of peacetime traffic and a sample of the attack traffic, the following
three stages are performed:
1. Analyzing peacetime traffic: the peacetime traffic is analyzed to identify strings
which appear frequently during peacetime.
2. Analyzing attack traffic: the attack traffic is analyzed to identify strings that are
very frequent in the attack traffic yet seldom or not found at all during peacetime.
3. Filtering the signature candidates: the strings found in the above step are fil-
tered according to predefined frequency and containment requirements as will be
explained in the following sections.
Note that for DDoS mitigation for example, the traffic that will be analyzed by our
system can either be captured in the DDoS mitigation apparatus or in the cloud by
sampling the traffic from several collectors. The signatures produced by our algorithm
can be used by the anti-DDoS devices and firewalls to stop the attack. Using our
algorithm, mitigation can be achieved in minutes, allowing proper defense against such
attacks. Also, since DDoS attacks are usually high-volume attacks, a sample of the
traffic is sufficient.
5.3.3 System Requirements
The system classifies strings into a white-list, a maybe-white-list and a not-white-list using the following thresholds:
1. Attack-high: a string s can only be an attack signature if its frequency in the
attack traffic is greater than attack-high.
2. Peace-high: a string s with a peacetime frequency over peace-high cannot be a
signature for the malicious traffic, and will enter the white-list.
3. Peace-low: a peacetime frequency below peace-low is deemed irrelevant for the
attack signature selection process, and the string will be placed in the not-white-
list.
4. Delta: a string s with a peacetime frequency between peace-low and peace-high
can be considered as a possible signature for the malicious traffic only if its fre-
quency in the attack traffic is at least delta higher than its peacetime frequency;
in this case it will enter the maybe-white-list.
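These threshold rules can be sketched as a small classification routine. This is an illustrative sketch, not the thesis implementation: the function name is hypothetical and per-string peacetime frequencies are assumed to be precomputed.

```python
def classify_peacetime(freqs, peace_high, peace_low):
    """Split peacetime strings into white-list, maybe-white-list and
    not-white-list according to the thresholds described above."""
    white, maybe_white, not_white = set(), {}, set()
    for s, f in freqs.items():
        if f > peace_high:
            white.add(s)          # too frequent in peacetime to be a signature
        elif f < peace_low:
            not_white.add(s)      # irrelevant for the signature selection
        else:
            maybe_white[s] = f    # kept with its frequency for the delta test
    return white, maybe_white, not_white
```

For example, with peace_high = 0.5 and peace_low = 0.01, a string seen in 90% of peacetime packets enters the white-list, one seen in 5% enters the maybe-white-list (together with its frequency, needed later for the delta test), and one seen in 0.1% enters the not-white-list.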
As illustrated in Figure 5.1, given a sequence of packets P of traffic captured during
peace time and a sequence of packets A of traffic captured during an attack, and
given the thresholds peace-high, peace-low, delta and attack-high, and some constant
gram size k, the problem is formally defined as follows: Find all strings s1, ..., sm, s.t.
∀i, 1 ≤ i ≤ m:
Figure 5.1: Signatures requirement overview
1. |si| ≥ k
2. The frequency of si in the attack traffic is at least attack-high.
3. One of the following holds:
(a) The frequency of si in peace time is less than peace-low.
(b) Both of the following hold: 1) The frequency of si in peace time is between
peace-low and peace-high. 2) The difference between the frequencies of si in
the attack traffic and in the peacetime traffic is at least delta.
4. To avoid redundancy, no string is contained in another (i.e., ∄j ≠ i such that
sj ⊆ si or si ⊆ sj).
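Conditions 1-3 can be checked per string once the frequencies are counted, and condition 4 is a separate pass over the surviving candidates. The following is a minimal sketch under the assumption that the frequencies are given; the names are illustrative, and keeping the longer string of a containment pair is only one of the possible policies.

```python
def is_signature(s, attack_freq, peace_freq, k,
                 attack_high, peace_high, peace_low, delta):
    """Conditions 1-3 of the formal definition for a single string."""
    if len(s) < k:                     # condition 1: minimal length
        return False
    if attack_freq < attack_high:      # condition 2: frequent in the attack
        return False
    if peace_freq < peace_low:         # condition 3a: rare in peacetime
        return True
    # condition 3b: moderate in peacetime, much more frequent in the attack
    return peace_freq < peace_high and attack_freq - peace_freq >= delta

def drop_contained(candidates):
    """Condition 4: keep only strings not contained in another candidate
    (here the containing, i.e. longer, string survives)."""
    return [s for s in candidates
            if not any(s != t and s in t for t in candidates)]
```

For instance, drop_contained(["bad", "badguy"]) keeps only "badguy", since "bad" is contained in it.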
5.3.4 System Details
Our zero-day high-volume attack detection system makes use of our DHH algorithm,
to analyze both the peace-time traffic and the attack traffic.
5.3.4.1 Analyzing Peacetime Traffic
The DHH algorithm is performed with the peacetime traffic as input. The strings in
the output are categorized into three lists of strings, white-list, maybe-white-list and
not-white-list, as explained above.
Note that to speed up mitigation, the peacetime traffic can be analyzed in advance
to produce these lists. Additionally, we note that in some cases it is difficult to get a
capture of peacetime traffic in advance since the mitigation device only receives attack
time traffic. As can be seen in our evaluation (Section 5.4), those cases can be handled
by other means.
5.3.4.2 Analyzing Attack Traffic
The DHH algorithm is performed with the attack traffic as input, with the modification
that the algorithm omits potential output strings if they are equal to or contained in
a string in the white-list, to reduce false-positives. The other way around is allowed
(i.e., www.facebook.com may appear frequently in the legitimate traffic, yet the string
www.facebook.com/BadPerson could appear frequently in the malicious traffic). We
name this property the one-way containment property. Due to this property, we cannot
filter out strings which appear frequently in legitimate traffic a priori; rather,
a more intricate solution is needed. Intuitively, the algorithm performs as follows: it
receives as an input the sequence of packets captured during an attack, and a list of
white-list strings. In order to avoid creating a signature for the attack traffic which
appears as a string or a substring of a string in the white-list, the algorithm will only
add a string to the input of HH2 if it is not contained in a white-list string.
The main difference, therefore, between the DHH algorithm and the Attack-DHH
algorithm, is that the Attack-DHH is provided with the white-list. Therefore, HH2 is
now updated with a string stemp only if stemp is not found (as a whole white-list string or
as part of one) in the white-list (see Fig.5.2). The only change therefore is in the sub-
procedure InputToHH2. The pseudo-code of the modified sub-procedure can be seen
in Procedure Modified-InputToHH2. We note that there can be numerous options for
creating a data structure to support the search in the white-list. In our implementation
we chose to maintain a hash table of all of the substrings in the white-list of length
greater than k. This implementation is very good in terms of time complexity, though
there is a tradeoff in that it takes a bit more space than other possible solutions.
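This indexing approach can be sketched as follows: every substring of length at least k of every white-list string is inserted into a set once, after which each containment test is a single average-O(1) lookup, since a candidate stemp (whose length is always at least k) is contained in a white-list string exactly when it was indexed. Function names are illustrative.

```python
def build_whitelist_index(white_list, k):
    """Index all substrings of length >= k of the white-list strings;
    space grows quadratically in the string lengths, which is the
    tradeoff mentioned above."""
    index = set()
    for w in white_list:
        for length in range(k, len(w) + 1):
            for i in range(len(w) - length + 1):
                index.add(w[i:i + length])
    return index

def in_whitelist(index, stemp):
    # stemp is equal to or contained in a white-list string iff indexed
    return stemp in index
```

With the earlier example, an index built from "www.facebook.com" contains "facebook" but not "BadPerson", so the malicious URL suffix is not discarded.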
The strings output by the attack traffic analysis will be referred to as the signature
candidates. A graphical depiction of the attack traffic analysis process and the filtering
process described in the following step can be seen in Fig.5.2.
Procedure Modified-InputToHH2
    temp_counter = 0;
    if stemp != empty then
        if stemp is not a string or part of a string in the white-list then
            HH2.Update(stemp);
        stemp = empty;
Figure 5.2: The process of extracting attack content signatures.
5.3.4.3 Filtering the Signature Candidates
Notice that all signature candidates in the output of the attack traffic analysis have a
frequency below peace-high in the peacetime traffic. The strings in the output of the
above step are narrowed down as follows:
1. Discard strings with a frequency in the attack traffic that is below the threshold
attack-high.
2. Check if any of the strings are equal to or contained in a string in the maybe-
white-list. For such strings, calculate the difference between the frequency of the
string during the attack and the frequency during peacetime of the relevant string
in the maybe-white-list. If this difference is greater than the threshold delta, the
string is kept, otherwise, it is discarded. We note that strings not found in the
maybe-white-list must have a frequency below peace-low in the peacetime traffic.
3. Once the final signature candidates are acquired by the above process, they are
checked for containment. If a signature candidate is contained in another signa-
ture candidate, the algorithm will only choose one signature based on user policy
(i.e., the longest, the shortest, the one that produces the smaller number of false
positives, etc.). Furthermore, the algorithm may further reduce the number of
signatures by finding which signatures usually appear together in the same pack-
ets, therefore removing the redundant signatures. Selecting which signatures to
discard can also be done based on user policy as described above.
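The three filtering steps can be sketched as below; the data layout and names are illustrative, and the longest-string rule stands in for the user policy of step 3.

```python
def filter_candidates(cands, attack_freq, maybe_white, attack_high, delta):
    """Apply the three filtering steps to the signature candidates."""
    # Step 1: drop candidates that are not frequent enough in the attack
    kept = [s for s in cands if attack_freq[s] >= attack_high]
    # Step 2: the delta test against the maybe-white-list
    survivors = []
    for s in kept:
        peace = next((f for w, f in maybe_white.items() if s in w), None)
        if peace is None or attack_freq[s] - peace >= delta:
            survivors.append(s)
    # Step 3: containment filtering (policy here: keep the longest string)
    final = []
    for s in survivors:
        group = [t for t in survivors if s in t or t in s]
        if s == max(group, key=len):
            final.append(s)
    return final
```

For example, with candidates "bad" and "badguy" both above attack-high and an empty maybe-white-list, only "badguy" survives step 3 under the longest-string policy.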
5.3.5 Identifying Common Combinations of Signatures
In many cases it is interesting to identify signature combinations which are often found
together in the same packets. These combinations can be of great use in attack detec-
tion mechanisms. First, they can be used to minimize the number of signatures which
are needed to identify the attack (see subsection 5.3.5.2). Second, signature combina-
tions can be used to increase the confidence level in the detection of the attack (see
subsection 5.3.5.3).
5.3.5.1 The Triple Heavy Hitters Algorithm
In order to identify the frequent signature combinations, we propose the Triple Heavy
Hitters algorithm denoted THH. This algorithm makes use of three heavy hitter mod-
ules. Two modules will be used as in the Double Heavy Hitters algorithm (denoted
DHH), and the third module will be used to find heavy hitters of signature com-
binations. While performing the DHH algorithm, for each packet treated, the THH
algorithm maintains the set of strings which were identified as potential signatures, and
therefore inserted into HH2 while processing the packet. Once the entire packet has
been processed, this set contains all of the signatures found in the packet and it will be
inserted into the third heavy hitters calculation unit HH3. To do so, each string in the
set is concatenated with a special end-of-string delimiter and the delimited strings are
concatenated together in lexicographical order to form a single string which is inserted
into HH3. Once all of the packets have been traversed, HH3 contains the heavy hitter
sets of signatures. This procedure is illustrated in Figure 5.3. The pseudo code can
be seen in Procedure TripleHeavyHitters, which makes use of a sub-procedure called
Procedure InputToHH3.
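The canonical encoding of a packet's signature set takes only a few lines; the NUL delimiter below is an illustrative choice and must not occur inside a signature.

```python
def encode_signature_set(signature_set, delim="\x00"):
    """Map a set of signatures to a single HH3 key: each signature is
    suffixed with the delimiter and the delimited strings are
    concatenated in lexicographical order, so the same set always
    yields the same key regardless of discovery order."""
    return "".join(s + delim for s in sorted(signature_set))
```

Note that encode_signature_set({"guy", "bad"}) and encode_signature_set({"bad", "guy"}) produce the same key, so HH3 counts the two occurrences together.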
The THH algorithm has the same time complexity as the DHH algorithm, since
the input to HH3 is created as the strings are inserted into HH2. The space complexity
of the THH algorithm is dependent on the number of items in each of the HH modules.

Procedure TripleHeavyHitters
    Data: sequence of np packets, constants k, nHH1, nHH2, nHH3, and ratio r
    Result: the nHH2 candidates for being the heavy hitters, and the nHH3 frequent signature sets
    stemp = empty, temp_counter = 0, strings_counter = 0, signature_set = empty;
    HH1.Init(nHH1), HH2.Init(nHH2), HH3.Init(nHH3);
    for i = 1 → np do
        signature_set = empty;
        for j = 1 → h − k + 1 do
            counter1 = HH1.Update(αj...αj+k−1);
            if counter1 > 0 then
                if stemp == empty then
                    stemp = (αj...αj+k−1); temp_counter = counter1;
                else if counter1 > r · temp_counter then
                    stemp = stemp || αj+k−1; temp_counter = counter1;
                else
                    counter2 = Procedure InputToHH2;
                    if counter2 > r · strings_counter then
                        signature_set.Add(stemp); strings_counter = counter2;
                    stemp = (αj...αj+k−1); temp_counter = counter1;
            else
                counter2 = Procedure InputToHH2;
                if counter2 > r · strings_counter then
                    signature_set.Add(stemp); strings_counter = counter2;
        if signature_set.Size > 0 then Procedure InputToHH3;
    Procedure FixSubstringFrequencyInHH2;

Procedure InputToHH3
    Data: set of ns signatures, delimiter string s′
    stemp = empty;
    for i = 1 → ns do stemp = stemp || sigi || s′;
    HH3.Update(stemp);
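Any counter-based heavy-hitters module exposing an Update(item) → count interface fits this pseudo code. The sketch below uses the textbook Space-Saving scheme as one possible stand-in, not necessarily the module used in this thesis: when the table is full, the minimum-count entry is evicted and its count inherited, so reported counts over-estimate by at most the evicted value.

```python
class SpaceSaving:
    """Illustrative heavy-hitters module with the Update -> count
    interface assumed by the THH pseudo code."""

    def __init__(self, n):
        self.capacity = n     # the nHHj parameter: max tracked items
        self.counts = {}

    def Update(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
        else:
            # evict the current minimum and inherit its count
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1
        return self.counts[item]
```

With capacity 2, after updates a, a, b, updating a new item c evicts b (count 1) and reports count 2 for c, illustrating the over-estimation.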
Figure 5.3: Extracting attack signatures with the additional minimization process.
5.3.5.2 Minimizing the Number of Signatures
Minimizing the number of signatures can be very significant, as some of the filtering
mechanisms have a limited capacity. In addition, having fewer signatures can reduce
the false positive rate of the signatures.
The ability to minimize the number of signatures is depicted in the example presented
in Figure 5.4, which shows a scenario with six different types of attack packets. In this
example, four signatures have been extracted. However, since every attack packet type
contains either the signature "bad" or the signature "guy", these two alone suffice;
hence the number of signatures can be minimized without creating false negatives.
Once the signatures and the frequent signature sets are extracted by the system,
we would like to check if the number of signatures can be minimized. To do so, we
propose a greedy process. The pseudo code for this process is shown in Procedure
MinimizeSignatures. Each such set represents a packet type that had been found in the
traffic sample. Intuitively, if some group of signatures appears together in some packet
Packet type                          Frequency   Signatures contained
1: … bad … guy …                        10%      bad, guy
2: … really … bad … guy …               20%      really, bad, guy
3: … mean … guy …                       20%      mean, guy
4: … really … bad …                     25%      really, bad
5: … bad … mean … guy …                 15%      bad, mean, guy
6: … bad …                               1%      bad

Signature frequencies: bad 71%, guy 65%, really 45%, mean 35%.

Figure 5.4: An example of different sets of signatures found in different packet types.
type, then only the signature with the highest frequency is needed to cover this packet
type. To identify these highest-frequency signatures, the process sorts all of
the signatures in order of decreasing frequency. The signatures are then traversed
one by one, and each is checked to see how many "un-covered" packet types it covers.
We call a packet type "covered" if it contains at least one signature that has been
chosen as a final signature. Looking at the example shown in Figure 5.4, the process
would work as follows. The signatures are sorted in decreasing frequency. The most
frequent signature is "bad", therefore we start with it. Since "bad" is the first
signature we deal with, no packet types have been covered yet. Denote the cover rate
of a signature to be the percent of the packets that it covers in this calculation.
"bad" covers packet types 1, 2, 4, 5, 6, and therefore we indicate its cover rate
to be 71%. The next signature we traverse is "guy". The only packet type that remains
un-covered is number 3. The signature "guy" covers packet type number 3, and
therefore we indicate its cover rate to be 20%. Since all of the packet types have
now been covered, the cover rate of the remaining signatures is 0%. Therefore, the
only signatures needed to cover all of the packet types are "bad" and "guy".
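This walkthrough can be reproduced with a short sketch over the data of Figure 5.4 (packet-type frequencies and signature sets as transcribed from the figure; names are illustrative).

```python
# Each packet type from Figure 5.4: its signature set and its share of traffic.
packet_types = [
    ({"bad", "guy"}, 10), ({"really", "bad", "guy"}, 20),
    ({"mean", "guy"}, 20), ({"really", "bad"}, 25),
    ({"bad", "mean", "guy"}, 15), ({"bad"}, 1),
]
# Signatures already sorted by decreasing overall frequency.
signatures = ["bad", "guy", "really", "mean"]

def minimize_signatures(signatures, packet_types):
    """Greedy cover: keep a signature only if it covers a still
    un-covered packet type, and record its cover rate."""
    remaining = list(packet_types)
    chosen = []
    for sig in signatures:
        covered = [pt for pt in remaining if sig in pt[0]]
        if covered:
            chosen.append((sig, sum(freq for _, freq in covered)))
            remaining = [pt for pt in remaining if sig not in pt[0]]
    return chosen

# Reproduces the walkthrough: "bad" has cover rate 71%, "guy" covers the
# remaining packet type 3 with cover rate 20%, and the rest cover 0%.
```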
The worst-case time complexity of this procedure is O(number of signatures
× number of sets) = O(nHH2 · nHH3), which is therefore dependent only on the
predefined size of each HH module. The space requirements are linear in nHH1, nHH2
and nHH3, which are configurable parameters. Since this procedure is only done one
time it only adds a constant overhead to the time complexity of the THH algorithm.
Procedure MinimizeSignatures
    Data: list Lsigs of nHH2 signatures, list Lsets of nHH3 sets of signatures
    Result: the final list of signatures
    // Initialize the cover rate of all signatures to be zero
    list Lfinal = empty;
    for i = 1 → nHH2 do Lsigs[i].cover_rate = 0;
    Sort Lsigs by decreasing frequency;
    i = 0;
    while i < nHH2 and Lsets not empty do
        for j = 0 → Lsets.size() do
            if Lsets[j] contains Lsigs[i] then
                Lsigs[i].cover_rate += Lsets[j].frequency;
                remove set Lsets[j] from Lsets;
        i = i + 1;
    for i = 0 → nHH2 do
        if Lsigs[i].cover_rate > 0 then Lfinal.insert(Lsigs[i]);
    return Lfinal;
5.3.5.3 Reducing the False Positives
Signature combinations can be used to increase the confidence level in the detection of
the attack. This can be done by creating rules which are meant to identify specific attack
content. Such specific rules reduce the chance of falsely identifying benign content as
malicious, therefore making the identification of the attack traffic more certain.
In the example presented in Figure 5.4, suppose packet type 6 does not contain
malicious content. The signature "bad" is found in packet types 1, 2, 4, 5, 6; therefore,
if we create a rule which simply searches for the signature "bad", it will catch packets
of type 6 as well, creating false positives. These false positives can be eliminated using
detection rules which combine signatures with "AND" and can therefore be used to
catch specific types of attack packets. For example, we can create a rule which catches
packets that contain "bad" AND "really". Such a rule will catch only packets of types 2 and 4.
If we specify an additional rule that catches "bad" AND "guy", the packet types which
will be caught are 1, 2, 4, 5, whereas packet type 6 will not be caught. Using such rules
therefore reduces the likelihood of false positives and increases the confidence level of
the detection.
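Such AND rules reduce to subset tests over the signatures found in a packet; the sketch below works over the packet types of Figure 5.4, with illustrative names.

```python
def matches_rule(packet_signatures, rule):
    """A rule is a set of signatures combined with AND: a packet matches
    only if it contains every signature in the rule."""
    return rule <= packet_signatures

# Packet types 1-6 of Figure 5.4, by the signatures they contain.
types = {1: {"bad", "guy"}, 2: {"really", "bad", "guy"}, 3: {"mean", "guy"},
         4: {"really", "bad"}, 5: {"bad", "mean", "guy"}, 6: {"bad"}}

caught = {t for t, sigs in types.items()
          if matches_rule(sigs, {"bad", "guy"}) or matches_rule(sigs, {"bad", "really"})}
# Types 1, 2, 4 and 5 are caught; the benign type 6 is not.
```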
5.4 Evaluations
In our evaluation, we focus on high volume DDoS attacks, and specifically on unknown
application layer attacks in HTTP requests, commonly known as HTTP-GET flooding
attacks.
5.4.1 Test Setup
In our evaluations we used real captures from a top security company. Each test
included a real HTTP-GET flooding attack-time capture and a peacetime capture
containing either real or synthetically generated traffic. In some cases peacetime
traffic was not available; in those cases we synthetically generated a peacetime
capture by crawling the victim site, i.e., by sending requests to the attacked server
and capturing the traffic we created. Our evaluation included 11 different attacks,
as follows:
1. We tested 3 attacks for which both the peace time capture and the attack time
capture were recorded on the same server during a time of normal functioning
and then later during an actual DDoS attack. We name these tests real-real.
2. We tested 6 attacks for which the attack time capture was recorded during an
actual DDoS attack, and the peace time capture was created after the attack
by recording traffic created by crawling the victim’s site. We name these tests
real-synthetic.
3. We tested a single attack which included textual log files of the HTTP GET
requests during an actual DDoS attack, and a log file of HTTP GET requests
which were identified as being legitimate during the time of the attack, which was
used as the peace time traffic. We name these tests log.
4. We tested a single synthetic attack which was made up of peace time traffic which
was created by us and then a synthetic attack was merged into the peacetime
traffic. We name these tests synthetic-synthetic.
For each of the above tests, the zero-day high-volume attack detection system was
used to extract attack signatures. In order to evaluate the system’s results, for each of
the above scenarios, we performed three tests:
1. System quality testing: Performed by evaluating both the recall and precision
rates of the signatures extracted by the system. Recall and precision, defined in
Chapter 1, are standard measures of relevance in fields such as pattern recognition
and information retrieval.
2. Frequency estimation accuracy test of the DHH algorithm: Performed by count-
ing the number of packets in the attack traffic in which each of the attack sig-
natures appears, and comparing the counters with the counters of the DHH
algorithm.
3. Threshold testing: Several threshold value sets were tested.
A summary of the test statistics can be found in Table 5.2, which is explained
in the next section. In addition, we performed separate testing of the use of the
Triple Heavy Hitters algorithm (explained in Section 5.3.5) for identifying frequent
signature combinations to minimize the number of signatures needed, as described in
Section 5.4.6.
5.4.2 System Quality Test Results
A summary of the test statistics is presented in Table 5.2. All of the attacks analyzed
are attacks that were not detected by any automated defense mechanism; these
attack samples were therefore analyzed manually by a human expert. The columns in
the results section of the table are as follows:
1. Manual attack rate estimation: the estimated percent of the packets in the attack
traffic capture, that were identified as attack packets by the manual analysis.
2. System attack rate estimation: the percent of the packets in the attack traffic
capture, that contain one or more of the signatures extracted by the system.
3. Recall rate estimation: the percent of packets identified as attack packets by the
manual analysis which were identified by the signatures extracted by our system.
The aim is to have a recall of 100%, since the recall is an indication of how many
of the relevant results were identified.
4. Precision rate estimation: we estimate the precision rate of our system by two
methods, for both of which the aim is to have a precision of 100%, as precision
is an indication of how many relevant results were returned as opposed to
non-relevant results:
(a) Peacetime based precision: the percent of peacetime traffic packets that were
not identified by the signatures extracted by our system either.
(b) Attack based precision: the percent of attack traffic packets which were not
identified by the manual analysis that were not identified by the signatures
extracted by our system either.
Test Statistics

Test | Target category | Attack time | Test type (attack-peace) | Packets in sample (attack) | Packets in sample (peace) | Manual attack rate est. | System attack rate est. | Recall rate est. | Peacetime based precision | Attack based precision
1 | Telephony | Nov 2011 | Real-Real | 407 | 2347 | 59% | 59% | 100% | 100% | 100%
2 | eGaming | Jul 2012 | Real-Real | 157560 | 2468 | 98% | 98% | 99.8% | 100% | 100%
3 | eGaming | May 2012 | Real-Real | 191192 | 47168 | 75% | 75% | 99.8% | 100% | 100%
4 | National bank | Jan 2012 | Real-Syn. | 7050 | 369 | 78% | 99% | 100% | 100% | 79%
5 | News | Mar 2012 | Real-Syn. | 47569 | 216 | 99.9% | 100% | 100% | 100% | 99.9%
6 | eCommerce | Jan 2013 | Real-Syn. | 35014 | 253 | NA | 98% | NA | 100% | NA
7 | Mobile | May 2013 | Real-Syn. | 608 | 497 | 93% | 94% | 100% | 100% | 99%
8 | Government | Mar 2012 | Real-Syn. | 6875 | 318 | 69.5% | 90% | 100% | 100% | 79.5%
9 | Government | Mar 2012 | Real-Syn. | 5867 | 77 | NA | 92% | NA | 100% | NA
10 | News | May 2013 | Log | 34721 | 70322 | 47% | 47% | 100% | 100% | 100%
11 | Synthetic | NA | Syn-Syn | 57112 | 9016 | 84% | 84% | 100% | 100% | 100%

Table 5.2: Summary of the statistics of the tests performed. Note that the captures are samples of the traffic.
We note several comments and conclusions regarding the results: 1) For each test,
the system identified the signatures that were found by the human expert in addition
to other signatures which were not identified by the expert.
2) For all of the attacks tested, one or more signatures were found that create a
false positive rate of 0%, meaning they do not appear in the peacetime traffic at all. As
explained in Section 5.3.4.3, the final signature candidates may be filtered according to
user policy. We chose to select the candidates with the lowest frequency in peacetime
traffic, i.e., those with the lowest false-positive rate. The final filtering process of the
signature candidates selected these signatures alone to achieve the results shown in the
table; it was done by searching through the peacetime traffic for the final signature
candidates and selecting those with the lowest false positive rate. Another option is to
minimize the signatures based on frequent signature combinations, as shown in Section
5.4.6, which also gives good results.
3) If both the attack and the peacetime captures are real, the system's attack
detection rate is most likely to be very close to, or equal to, the estimated detection rate
of the manual analysis. On the other hand, as can be seen in tests 4 and 8 for example,
a synthetic peacetime capture may cause a system detection rate which is higher than
the manual estimation. The difference between them could indicate the false positive
rate caused by the system’s signatures.
4) All tests were performed with the thresholds attack-high = 50%, peace-high = 3%,
peace-low = 2%, delta = 90%, except for test 10, which was done with attack-high =
10%, peace-high = 3%, peace-low = 2%, delta = 90%. The value of attack-high was
selected based on the characteristics of the attacks themselves and can be selected based
on, for example, performance variations in the attacked site and so forth. The rest of
the thresholds were selected based on the testing presented in Section 5.4.5, where it
is shown that a peace-high value of 3% should be selected; determining the other two
thresholds follows from setting this value.
Our testing included a preliminary phase for determining the settings and parame-
ters of the DHH algorithm. These include the values of k, nHH1 , nHH2 , r, attack-high,
peace-high, peace-low and delta. The value k indicates the length of the k-grams, and
nHH1 and nHH2 indicate the number of items each of the HH modules is configured to
hold. The value of k was set to 8, since testing showed that longer signatures are likely
to increase the rate of false negatives, while shorter signatures are often not substantial
enough, thereby increasing the possibility of false positives. The values of nHH1 and
nHH2 were both set to 3000. Our tests included values ranging from 1000 to 10000,
and it was found that 3000 was sufficient for our purposes.
Note that, as a rule of thumb, the size of nHH1 can be determined according to the
expected frequency of the signature that the system should identify and the average
length of each packet. In general, in order to extract a signature which is found in a
fraction x of the packets (0 ≤ x ≤ 1), with the average packet length being len, we would
need to set nHH1 to be no more than len/x. Furthermore, the sizes of nHH2 and nHH3
are bounded by the size of nHH1.
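To make the rule of thumb concrete, here is a small illustrative calculation (the numbers are hypothetical and are not taken from the experiments above):

```python
def max_nhh1(avg_packet_len, signature_fraction):
    """Upper bound on nHH1 from the rule of thumb above: to catch a
    signature present in a fraction x of the packets, with average
    packet length len, nHH1 need be no more than len/x."""
    return avg_packet_len / signature_fraction

# Hypothetical numbers: average packet length 400 bytes, signature
# present in 2% of the packets.
print(max_nhh1(400, 0.02))  # 20000.0
```

With a more frequent signature the bound shrinks accordingly, which is why 3000 sufficed in practice.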
The above values of these parameters were kept unchanged throughout the testing
of the detection system. An additional parameter used by the DHH algorithm is the
ratio r explained in Section 4.6.1. This value was tested within the detection system
with values ranging from 0 to 1. It was found that values closer to 1 yielded the
extraction of shorter signatures. This value should therefore be chosen based on the
desired characteristics of the output. The thresholds which are used to determine the
white-lists and the chosen signatures are configurable in the system and we discuss
some tested values of these thresholds in Section 5.4.5.
5.4.3 Performance
Our implementation was done in C++, making use of the implementation provided
in [41] of the heavy hitters algorithm presented in [81]. The code was compiled using
g++. We ran experiments on a 4-core Intel(R) Core i7(R) 2.7 GHz machine with 16 GB
of RAM running Mac OS X 10.9 (Mavericks). Running on a variety of real traffic
captures, our algorithm was able to process between 144 and 232 Mbps.
When running our algorithm on synthetically generated traffic with skewed data
frequencies, the performance reaches approximately 1.1 Gbps. The space re-
quired by our system is linear in nHH1 , nHH2 and nHH3 , which were set to nHH1 = 3000,
nHH2 = 200 and nHH3 = 100.
5.4.4 Frequency Estimation
Recall that to test the accuracy of the frequency estimation provided by the algorithm,
the estimated frequency of each signature was compared to an actual count of the
signature in the attack traffic. Figure 5.5 shows this comparison for the signatures of
a single test. We also note that the average difference exhibited in this test between
the estimated frequency and the actual frequency was under 1% over all of the 3000
signature candidates that were produced. This is much better than the analytical error
bounds of the algorithm, which is probably due to the fact that the number of strings
in the input to HH2 is significantly smaller than the worst-case bound provided in the
analysis in Section 4.6.3. The results of the comparison in the other tests were similar.
Figure 5.5: Signature frequency (percent), per signature: algorithm estimation vs. the actual frequency
5.4.5 Threshold Testing
Both the false positive and false negative rate achieved by our system are influenced by
the values of the thresholds discussed in Section 5.3.3. As part of our testing, a range of
thresholds was tested. While intuitively it may seem reasonable to take a peace-high
threshold that is relatively high (i.e., at least 50%), testing showed that this would lead
to a very high false positive rate. An example of this can be seen in Fig. 5.6. This
graph shows testing of different peace-high values, on a single set of files. The graph
shows the false positive rates caused by the different peace-high values when all other
values remain unchanged. The false positive rate shown by the dotted line measures the
percent of peacetime packets identified by the generated signatures. The false positive
rate shown by the solid line measures the percent of attack traffic packets identified by
the generated signatures which are not malicious. As can be seen, a peace-high value
of 3% is the highest value that minimizes both false positive rates, and this is therefore
the value that was chosen for our tests.
5.4.6 Testing Frequent Signature Combinations
We have performed testing on our enhanced system which makes use of the Triple
Heavy Hitters algorithm (explained in Section 5.3.5) for identifying frequent signature
Figure 5.6: Comparing peace-high values: the false positive rate in peacetime (dotted line) and the false positive attack time detection rate (solid line), as a function of the peace-high threshold (percent).
combinations. The graphs in Figure 5.7 depict the results of tests performed on two
different attacks, for which we have both real attack traffic captures and real peacetime
traffic captures as described above. The system identified the frequent signature com-
binations and then performed the algorithm for minimizing the number of signatures
presented in Section 5.3.5.2. The results presented show the tradeoff between the pre-
cision and recall rates when selecting an increasing number of signatures. The results
shown indicate that for the tested samples the number of signatures can be decreased
substantially, thereby increasing precision significantly with almost no reduction of the
recall rates.
5.4.7 Signature Examples
An interesting aspect of testing real attacks is to see the actual signatures for these
attacks. Some examples of signatures include: An extra carriage-return (i.e., newline)
somewhere in the packet payload where it was not usually found; Use of upper-case
characters in a field which is normally found in legitimate traffic with lower-case charac-
ters; Use of an HTTP field that is rarely used; Use of a rare user agent. These signatures
are a clear indication of the importance of analyzing the peacetime traffic.
(a) Test 2: best recall-precision tradeoff achieved for 3 signatures
(b) Test 3: best recall-precision tradeoff achieved for 1 signature
Figure 5.7: Testing the algorithm for minimizing the number of signatures: recall and precision (percent) as a function of the number of selected signatures
Chapter 6
Heavy Hitters in a Stream of
Pairs: Distinct and Combined
Heavy Hitters
6.1 Overview
Consider a stream of IP packet headers going through some upstream point. A des-
tination IP address (the key) which receives a large number of packets constitutes a
"classic" heavy hitter (HH) of the request stream. We can also consider the associated
source IP address of each packet as an associated subkey. Destination addresses with
many different source IP addresses are then distinct heavy hitters (dHH). Destination
addresses which both receive many packets and have many different source IP addresses
are combined heavy hitters (cHH); formal definitions are given in Section 6.2.
Exact detection of dHH and cHH would require large amounts of resources, and
therefore approximate solutions are needed. Generally, approximate distinct or com-
bined heavy hitters algorithms exhibit a tradeoff between detection accuracy and the
amount of space they require. Cardinality estimation accuracy is even more difficult to
achieve with a fixed-size structure, since a key may be evicted from the cache and
later re-enter it, which introduces uncertainty with regard to its cardinality.
We provide solutions for approximate detection of dHH and cHH using a fixed-size
structure which outperforms known solutions both in terms of cardinality accuracy
and practicality.
6.1.1 Our Contribution
Our main contributions are novel, practical, sampling-based structures for distinct
heavy hitter (dHH) and combined heavy hitter (cHH) detection, which track only
O(1/ε) keys. Our dHH design significantly improves over existing work. Our cHH
structure is substantially better than the naive approach of maintaining separate
structures for HH and dHH: the latter has overhead due to the overlap between the
sets of cached keys in the two structures, and also requires much larger structures.
Our algorithm uses the basic principle of Sample and Hold (S&H) used in algorithms
applied to streams of elements with keys [39, 48, 53]. The streaming algorithm maintains
a set of cached keys, which constitutes a sample. With each cached key, we maintain
a counter that tracks the number of occurrences of the key in the stream since it has
entered the cache. When an element with key x that is not cached is processed, a biased
coin flip is used to determine whether to add it to the cache. S&H is better known in
its fixed-threshold form, where we specify the bias of the coin; this has the disadvantage
that the memory usage (sample size) can grow. There is also a fixed-size scheme, where
we specify the sample size (and hence the memory usage of the algorithm) and instead
modify the bias.
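As a point of reference, the classic fixed-threshold S&H scheme can be sketched in a few lines (an illustrative sketch, not the implementation used in this work):

```python
import random

def sample_and_hold(stream, p):
    """Classic fixed-threshold Sample & Hold: an uncached key is admitted
    by a biased coin flip with probability p; once cached, every later
    occurrence of the key is counted exactly."""
    cache = {}
    for key in stream:
        if key in cache:
            cache[key] += 1          # occurrences since entering the cache
        elif random.random() < p:
            cache[key] = 1           # key enters the cache (the sample)
    return cache

random.seed(1)
# A heavy key "a" is almost certain to be sampled; rare keys mostly are not.
counts = sample_and_hold(["a"] * 1000 + ["b", "c"] * 5, p=0.05)
```

Note that the memory usage here is not fixed: the more coin flips succeed, the larger the cache grows, which is exactly the drawback the fixed-size variant removes.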
S&H sampling was originally proposed for domain sum queries and our application
here of a S&H based scheme for heavy hitter detection is novel, even in the context of
classic heavy hitters. An important property of S&H, which makes it suitable for HH
detection, is that the set of sampled keys realizes a weighted sample taken according to
hx (defined in Section 6.2) [39]. In this weighted sample, the heavier keys, in particular
the heavy hitters, are much more likely to be included than other keys.
For the purpose of distinct HH detection, we would like to obtain a weighted sample
with respect to the distinct weights wx (defined in Section 6.2). For this purpose,
S&H (or other classic HH algorithms) can not be used out of the box. Our proposed
distinct S&H design replaces the random coin flips by a random hash function applied
to the key and subkey pair. This ensures that repeated occurrences do not affect the
sample. Moreover, instead of the simple counters we use approximate distinct counters
[25, 37, 38, 52, 96], which use space that is only logarithmic or double logarithmic in
the number of distinct elements. We also propose a combined S&H algorithm, designed
for cHH detection, which maintains both a basic counter and an approximate distinct
counter for each cached key.
We show that our distinct S&H and combined S&H schemes have the property that
the set of cached keys realizes a respective weighted sample. More precisely, distinct
S&H computes a sample with respect to the distinct weights wx, whereas combined S&H
does so with respect to the combined weights bx^(ρ) (defined in Section 6.2). Therefore, a
sample of size c/ε (for a given constant c) will include each heavy hitter with probability
at least 1− exp(−c). Note that this is a worst case lower bound on the probability. The
detection probability is higher for heavier keys and more critically, thanks to without-
replacement sampling, also increases for the more skewed distributions that prevail in
practice. If the goal is only to return a short list of candidates which includes heavy
keys, then we do not need to maintain the approximate counting structures and the
total size of our structure is only O(c/ε) (dominated by the storage of key IDs of the c/ε
sampled keys). The distinct counting structures are needed when we are also interested
in estimates on the weight of included keys. In this case, for each key x, with distinct
counters of size c² + log log n, where n is the sum of the weights of all keys, we can
estimate the weight of x within a well-concentrated absolute error of εn/c.
We demonstrate, via experimental evaluations, the effectiveness of our distinct and
combined S&H algorithms.
Our proposed fixed-size dHH algorithm, named Distinct Weighted Sampling (dwsHH),
requires a constant amount of memory, as opposed to the well known Superspreaders
solution [116], which uses memory linear in the length of the input stream. Moreover,
our use of sampling-based distinct counters is a significant practical improvement over
Locher's relatively new fixed-size solution [76], which utilizes linear-sketch based distinct
counters that are much less efficient in practice. In addition, our dHH algorithm pro-
duces a cardinality estimate for each key. This estimate is of much higher accuracy than
the estimate produced by Locher, while the Superspreaders algorithm does not provide
comparable estimates.
6.2 Preliminaries
6.2.1 Problem Definitions
As mentioned in Section 1.1.4, our input is modeled as a stream of elements, where
each element is a pair 〈x, y〉. The primary key x is from a domain X and the subkey y
is from a domain Dx.
We differentiate between three types of weights for each key x:
1. The (classic) weight hx: the number of elements in the stream with key x.
2. The distinct weight wx: the number of different subkeys over all elements in the
stream having the same key x.
3. The combined weight bx^(ρ): a combination of the classic and the distinct weight.
Given a parameter ρ ≥ 1, bx^(ρ) ≡ ρ·hx + wx.
Accordingly, we define a key x as being heavy in one of three forms:
1. x is a heavy hitter when hx ≥ ε·Σy hy.
2. x is a distinct heavy hitter when wx ≥ ε·Σy wy.
3. x is a combined heavy hitter when bx^(ρ) ≥ ε·Σy by^(ρ).
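The three weights, and the corresponding heavy hitter sets, can be computed exactly (with unbounded memory) in a direct pass over the stream; the following illustrative sketch mirrors the definitions above:

```python
from collections import Counter

def exact_weights(stream, rho=1.0):
    """Exact h_x, w_x and b_x^(rho) for a stream of (key, subkey) pairs."""
    h = Counter(x for x, _ in stream)   # classic weight: occurrences per key
    w = Counter()                       # distinct weight: distinct subkeys per key
    seen = set()
    for x, y in stream:
        if (x, y) not in seen:
            seen.add((x, y))
            w[x] += 1
    b = {x: rho * h[x] + w[x] for x in h}  # combined weight
    return h, w, b

# Toy stream: destination "D" is contacted by many distinct sources, while
# destination "E" receives many packets from a single source.
stream = [("D", "s%d" % i) for i in range(8)] + [("E", "s0")] * 8
h, w, b = exact_weights(stream)

eps = 0.4
hh = [x for x in h if h[x] >= eps * sum(h.values())]    # classic heavy hitters
dhh = [x for x in w if w[x] >= eps * sum(w.values())]   # distinct heavy hitters
```

Here both keys are classic heavy hitters, but only "D" is a distinct heavy hitter, which is exactly the distinction that matters for DDoS-style detection.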
6.2.2 Notations
The notations used throughout this section are summarized in Table 6.1.
Symbol   Meaning
x        key
y        subkey
hx       number of elements with key x
wx       number of different subkeys in elements with key x
m        max_x wx
τ        detection threshold
k        cache size
ℓ        number of buckets
ρ        combined weight parameter
Table 6.1: Notations
6.3 Background - Approximate Distinct Counters
A distinct counter is an algorithm that maintains the number of different keys in a
stream of elements. An exact distinct counter requires state that is proportional to the
number of different keys in the stream. Fortunately, there are many existing designs and
implementations of approximate distinct counters that have a small relative error but
use state size that is only logarithmic or double logarithmic in the number of distinct
elements [25, 37, 38, 52, 96]. The basic idea is elegant and simple: We apply a random
hash function to each element, and retain the smallest hash value. We can see that
this value, in expectation, would be smaller when there are more distinct elements,
and thus can be used to estimate this number. The different proposed structures have
different ways of enhancing this approach to control the error. The tradeoff between
structure size and error is controlled by a parameter ℓ: a structure of size proportional
to ℓ has a normalized root mean square error (NRMSE) of 1/√ℓ. In Section 6.5 we use distinct
counters as a black box in our dHH structures, abstracted as a class of objects that
support the following operations:
• Init: Initializes a sketch of an empty set
• Merge(x): merge the string x into the set (x could already be a member of the
set or a new string).
• CardEst: return an estimate on the cardinality of the set (with a confidence
interval)
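As an illustration of this interface, the following sketch implements an approximate distinct counter using the related k-minimum-values (bottom-k) idea: retain the k smallest hash values, and if the k-th smallest is v, estimate the cardinality as (k − 1)/v. This is a stand-in chosen for brevity, not one of the specific counters of [25, 37, 38, 52, 96]:

```python
import hashlib
import heapq

class KMVCounter:
    """Approximate distinct counter exposing the Init/Merge/CardEst
    interface, via k-minimum-values: keep the k smallest hashes seen."""

    def __init__(self, k=64):                      # Init
        self.k = k
        self.neg_heap = []                         # min-heap of negated hashes
        self.kept = set()                          # the k smallest hash values

    @staticmethod
    def _hash01(s):
        d = hashlib.sha1(s.encode()).digest()[:8]
        return int.from_bytes(d, "big") / 2.0**64  # deterministic, ~U[0,1)

    def merge(self, s):                            # Merge(x)
        v = self._hash01(s)
        if v in self.kept:
            return                                 # repetitions have no effect
        if len(self.neg_heap) < self.k:
            heapq.heappush(self.neg_heap, -v)
            self.kept.add(v)
        elif v < -self.neg_heap[0]:                # below current k-th minimum
            evicted = -heapq.heappushpop(self.neg_heap, -v)
            self.kept.discard(evicted)
            self.kept.add(v)

    def card_est(self):                            # CardEst (point estimate)
        if len(self.neg_heap) < self.k:
            return float(len(self.neg_heap))       # exact while underfull
        return (self.k - 1) / (-self.neg_heap[0])

c = KMVCounter(k=64)
for i in range(10000):
    c.merge("key-%d" % i)
    c.merge("key-%d" % i)   # duplicate insertion: must not change the state
```

The relative error behaves like 1/√k, matching the 1/√ℓ NRMSE tradeoff described above.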
In Section 6.5.5, we also propose a design where a particular algorithm for approximate
distinct counting is integrated in the dHH detection structure.
6.4 Related Work
The concept of distinct heavy hitters, together with the motivation for DDoS attack
detection, was introduced in a seminal paper of Venkataraman et al. [116]. Their algo-
rithm, aimed at detection of fixed-threshold heavy hitters, returns as candidate heavy
hitters the keys with an (initialized) Bloom filter that is filled beyond some thresh-
old. Keys with a high count in the sample are likely to be heavy hitters and almost
saturate their Bloom filter. A related work adapts dHH schemes to TCAMs [24]. Our
fixed-threshold scheme is conceptually related to [116]. Some key differences are the
better tradeoffs we obtain by using approximate distinct counters instead of Bloom
filters, and our simpler structure with analysis that ties it directly to classic analysis of
weighted sampling, which also simplifies the use of parameters. More importantly, we
provide a solution to the fixed-size problem and also address the estimation problem.
The estimates on the weight of the heavy keys that can be obtained from the Bloom
filters in [116] are much weaker, since once the filter is saturated, it can not distinguish
between heavy and very heavy keys.
Locher [76] recently presented two designs for dHH detection which make use of
approximate distinct counters. The first design is sampling-based and builds on the
distinct pair sampling approach of [116]. This design also applies only to the fixed-
threshold problem. The other design uses linear sketches and applies to the fixed-size
problem. Locher’s designs are weaker than ours both in terms of practicality and in
terms of theoretical bounds. The linear-sketch based design utilizes linear-sketch based
distinct counters, which are much less efficient in practice than the sampling-based ones.
The designs have a quadratically worse dependence of structure size on the detection
threshold τ, which is Ω(1/τ²) instead of our O(1/τ). Finally, multiple copies of the same
structure are maintained to boost up confidence, which results in a large overhead, since
heavy hitters are accounted for in most copies. Locher’s code was not available for a
direct comparison.
Another conceivable approach is to convert classic fixed-size deterministic HH
streaming algorithms, such as Misra-Gries [83] or the Space Saving algorithm [81], to
dHH by replacing their counters with approximate distinct counters. The difficulty that
arises is that the same distinct element may affect the structure multiple times when
the same key re-enters the cache, resulting in much weaker guarantees on the quality of
the results.
6.5 The Distinct Weighted Sampling Algorithms
We now present our distinct weighted sampling schemes, which take as input elements
that are key and subkey pairs. We build on the fixed-threshold and fixed-size classic
S&H schemes but make some critical adjustments: First, we apply hashing so that
we can sample the distinct stream instead of the classic stream. Second, instead of
using simple counters cx for cached keys as in classic S&H, we use approximate distinct
counters applied to subkeys. Third, we maintain state per key that is suitable for
estimating the weight of heavy cached keys (whereas classic S&H was designed for
unbiased domain queries).
Our algorithms, in essence, compute heavy hitters using weighted sampling. A sam-
ple set of the keys is maintained during the execution of each of the algorithms (HH,
dHH, or cHH). The sample set constitutes a weighted sample according to the respective
counts so that the heavier keys, in particular the heavy hitters, are much more likely to
be included than other keys. The counts in each of the algorithms are different; number
of repetitions, measure of distinctness, and a combined measure, respectively. The al-
gorithms maintain counts with each cached key which allow to produce the cardinality
estimate for each output key.
6.5.1 Fixed-Threshold Distinct Heavy Hitters
Our fixed-threshold distinct heavy hitters algorithm is applied with respect to a spec-
ified threshold parameter τ . We make use of a random hash function Hash ∼ U [0, 1].
An element (x, y) is processed as follows. If the key x is not cached, then if Hash(x, y)
(applied to the key and subkey pair (x, y)) is below τ , we initialize a dCounters[x ] object
(and say that now x is cached) and insert the string (x, y). If the key x is already in
the cache, we merge the string (x, y) into the distinct counter dCounters[x]. The
pseudocode is provided as Algorithm 1.
Algorithm 1: Fixed-threshold Distinct Heavy Hitters
Data: threshold τ; stream of elements of the form (key, subkey), where keys are from domain X
Output: set of pairs (x, cx) where x ∈ X
dCounters ← ∅                                        // Initialize a cache of distinct counters
foreach stream element with key x and subkey y do    // Process a stream element
    if x is in dCounters then
        dCounters[x].Merge(x, y)
    else if Hash(x, y) < τ then                      // Create dCounters[x]
        dCounters[x].Init
        dCounters[x].Merge(x, y)
return (for x ∈ dCounters: (x, dCounters[x].CardEst))
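A direct transcription of Algorithm 1 can be sketched as follows (illustrative only; for clarity, an exact set of pair hashes stands in for the approximate distinct counter):

```python
import hashlib

def hash01(x, y):
    """Shared random-looking hash of the (key, subkey) pair into [0, 1)."""
    d = hashlib.sha1(("%s|%s" % (x, y)).encode()).digest()[:8]
    return int.from_bytes(d, "big") / 2.0**64

def fixed_threshold_dhh(stream, tau):
    """Algorithm 1 sketch: a key enters the cache only when the hash of
    one of its (key, subkey) pairs falls below tau, so repeated pairs
    cannot bias the sample."""
    dcounters = {}
    for x, y in stream:
        v = hash01(x, y)
        if x in dcounters:
            dcounters[x].add(v)        # Merge: a repeated pair is a no-op
        elif v < tau:                  # Create dCounters[x]
            dcounters[x] = {v}         # Init + Merge
    return {x: len(s) for x, s in dcounters.items()}

# Key "D" has 2000 distinct subkeys; key "E" repeats a single subkey.
stream = [("D", "s%d" % i) for i in range(2000)] + [("E", "s0")] * 2000
cands = fixed_threshold_dhh(stream, tau=0.01)
```

With τ = 0.01, "D" almost surely enters the cache early (it contributes 2000 independent chances), while "E" only ever gets one chance, so the many repetitions of its single pair do not help it.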
6.5.2 Fixed-Size Distinct Weighted Sampling
The fixed-size Distinct Weighted Sampling (dwsHH) algorithm is specified for a cache
size k. Compared with the fixed-threshold algorithm, we keep some additional state for
each cached key:
• The threshold τx when x entered the cache (represented in the pseudocode as
dCounters[x].τ). The purpose of maintaining τx is to derive confidence intervals on
wx. Intuitively, τx captures the prefix of elements with key x which were seen before
the distinct structure for x was initialized, and is used to estimate the number of
distinct subkeys in this prefix.
• A value seed(x) ≡ min_{(x,y) ∈ stream} Hash(x, y), which is the minimum Hash(x, y)
over all elements with key x (in the pseudocode, dCounters[x].seed represents
seed(x)). Note that it suffices to track seed(x) only after the key x is inserted
into the cache, since all elements that occurred before the key entered the cache
necessarily had Hash(x,y) > τx, as the entry threshold τ can only decrease over
time.
The fixed-size dwsHH algorithm retains in the cache only the k keys with lowest
seeds. The effective threshold value τ that we work with is the seed of the most recently
evicted key. The effective threshold has the same role as the fixed threshold since it
determines the (conditional) probability on inclusion in the sample for a key with
certain wx. Pseudocode is provided as Algorithm 2.
Algorithm 2: Fixed-size streaming Distinct Weighted Sampling (dwsHH)
Data: cache size k; stream of elements of the form (key, subkey), where keys are from domain X
Output: set of (x, cx, τx) where x ∈ X
dCounters ← ∅; τ ← 1                                 // Initialize a cache of distinct counters
foreach stream element with key x and subkey y do    // Process a stream element
    if x is in dCounters then
        dCounters[x].Merge(x, y)
        dCounters[x].seed ← min(dCounters[x].seed, Hash(x, y))
    else if Hash(x, y) < τ then                      // Create dCounters[x]
        dCounters[x].Init
        dCounters[x].Merge(x, y)
        dCounters[x].seed ← Hash(x, y)
        dCounters[x].τ ← τ
        if |dCounters| > k then                      // Evict the key with the largest seed
            z ← arg max_{y ∈ dCounters} dCounters[y].seed
            τ ← dCounters[z].seed
            delete dCounters[z]
return (for x in dCounters: (x, dCounters[x].CardEst, dCounters[x].τ))
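The fixed-size variant can be sketched in the same way (illustrative; exact sets again stand in for the approximate distinct counters):

```python
import hashlib

def hash01(x, y):
    """Hash of the (key, subkey) pair into [0, 1)."""
    d = hashlib.sha1(("%s|%s" % (x, y)).encode()).digest()[:8]
    return int.from_bytes(d, "big") / 2.0**64

def dwshh(stream, k):
    """Algorithm 2 sketch (fixed-size dwsHH): keep the k keys with the
    smallest seeds; the effective threshold tau is the seed of the most
    recently evicted key and can only decrease over time."""
    cache = {}   # key -> {"hashes": set, "seed": float, "tau": float}
    tau = 1.0
    for x, y in stream:
        v = hash01(x, y)
        if x in cache:
            e = cache[x]
            e["hashes"].add(v)                       # Merge
            e["seed"] = min(e["seed"], v)
        elif v < tau:
            cache[x] = {"hashes": {v}, "seed": v, "tau": tau}
            if len(cache) > k:                       # evict the largest seed
                worst = max(cache, key=lambda z: cache[z]["seed"])
                tau = cache[worst]["seed"]
                del cache[worst]
    return {x: (len(e["hashes"]), e["tau"]) for x, e in cache.items()}

# Key "D" has 500 distinct subkeys; 50 other keys have one subkey each.
stream = [("D", "s%d" % i) for i in range(500)]
stream += [("f%d" % j, "s0") for j in range(50)]
out = dwshh(stream, k=10)
```

The seed of "D" is the minimum of 500 hashes, so "D" survives in the fixed-size cache while most of the light keys are evicted.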
6.5.3 Analysis and Estimates
We first consider the sample distribution S of dwsHH. As we mentioned (in Chapter
2), it is known that classic S&H applied with weights hx has the property that the set
of sampled keys is a ppswor sample according to hx [39]. A ppswor sampling scheme
with respect to weights hx can be described as an iterative process, where in each
step a key x ∉ S is selected with probability hx/W, where W = Σ_{x∉S} hx is the
total weight of the keys that are not yet in the sample S. After selection, the key is
added to the sample S. Surprisingly, the sample distribution properties of S&H carry
over from being with respect to hx (classic S&H) to being with respect to wx (distinct S&H):
Theorem 11. The set of keys cached by dwsHH is a ppswor sample taken according
to the weights wx.
Proof. First note that repeated (key, subkey) pairs can not affect the structure, so the
structure only depends on the distinct stream of (key, subkey) pairs with all repetitions
omitted.
The set of cached keys can be fully characterized in terms of the set of seed(x)
values, as it contains a prefix of keys with smallest seeds: In the fixed τ scheme, a key
x is cached if and only if seed(x) < τ . In the fixed k scheme, the set of cached keys
corresponds to the k keys that have smallest seed values.
We now consider the distribution of seed(x), which is the minimum of wx inde-
pendent random variables selected uniformly from U [0, 1]. If we transform each hash u
to − ln(1− u), we obtain that each is exponentially distributed with parameter 1. The
minimum of wx such selections is exponentially distributed with parameter wx. This
transformation is monotone, so we can work with the uniform hashes and then trans-
form the seed, obtaining that − ln(1−seed(x)) ∼ Exp[wx] is exponentially distributed
with parameter wx.
Now note that the seed(x) values for different keys are independent. We now apply a
classic result of Rosén, which shows that ppswor can be realized by associating with
each key an independent exponential random variable, with parameter equal to the
weight of the key, as its seed, and taking as our sample the keys with the smallest seed
values [102]. Note that since the transformation of the seeds is monotone, the order
according to seed(x) is the same as the order according to − ln(1 − seed(x)).
A ppswor sample with respect to weights wx provides the following guarantees on
inclusion probabilities of keys:
Lemma 12. When working with a fixed k, a key with weight wx is selected with prob-
ability ≥ 1 − (1 − wx/m)^k, where m = Σx wx is the sum of the weights of all keys. If the
threshold is τ, a key with weight wx is selected with probability 1 − exp(−τ·wx).
Proof. From the definition of ppswor, the probability that a key is selected at each step
is at least wx/m. Therefore, the probability that it is not selected in any of k steps is
at most (1 − wx/m)^k, and hence it is selected with probability at least 1 − (1 − wx/m)^k.
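For intuition, the fixed-threshold inclusion probability of Lemma 12 is easy to evaluate numerically (illustrative sketch):

```python
from math import exp

def detection_prob(w, tau):
    """Lemma 12, fixed-threshold case: a key with distinct weight w is
    cached with probability 1 - exp(-tau * w), i.e., the probability that
    the minimum of w uniform hashes falls below tau (after the
    exponential transform of the seeds)."""
    return 1.0 - exp(-tau * w)

# With tau = 0.01, a key with 1000 distinct subkeys is all but certain
# to be cached, while a key with a single subkey almost never is.
heavy = detection_prob(1000, 0.01)
light = detection_prob(1, 0.01)
```

The detection probability is monotone in the weight, which is what makes the sampled set a useful candidate list for the heavy hitters.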
It follows that a key x is likely to be sampled when wx ≫ m/k. We can tighten this
bound when there are keys with weight much larger than m/k. We obtain that key x
is very likely to be sampled when:

wx ≫ max_{i ∈ [0, k−1]} (m − Σ_{x′ ∈ top_i} wx′) / (k − i)     (6.1)

where top_i is the set of the i heaviest keys.
6.5.4 Estimate Quality and Confidence Interval
The set of sampled keys can be viewed as dHH candidates. Note that the sample can
be computed by only maintaining seed values for keys, without including the distinct
counters. The candidates include the heavy hitters but may also include keys with small
weight: With the fixed-threshold scheme, we expect the sample size to include τ∑
y wy
keys even when all keys have wx = 1. With the fixed-size scheme, we expect the cache
to include keys with wx ∑
y wy/k but it may also include some keys with small
weight.
For many applications, including the detection of DDoS attacks which we discussed
in the introduction, it is important to identify the actual distinct heavy hitters in
our candidate list by returning an estimate on their weight wx. We compute an esti-
mate with a confidence interval on wx for each cached key x, using the entry thresh-
old τ (or dCounters[x ].τ in the fixed-size scheme) and the approximate distinct count
dCounters[x ].CardEst.
The count dCounters[x ].CardEst estimates the number of distinct subkeys processed
after x entered the cache. This component is subject to the solution quality provided
by our approximate distinct counter. The variance of this estimate, σ2², depends on
the specific distinct counter implementation. The implementation we worked with has
σ2² = n²/(2(ℓ − 1)), where ℓ is the distinct counter parameter and n is the estimated
cardinality.
The other component is bounding or estimating the number of distinct subkeys
processed before x entered the cache. We obtain this bound using the entry threshold
τ : In expectation, τ−1 distinct subkeys are processed before x enters the cache. As
with classic S&H, but considering distinct subkeys this time, the actual distribution is
geometric with parameter τ, and its variance is σ1² = (1 − τ)/τ².
These two estimates are well concentrated and we can apply the normal approximation to obtain confidence intervals. Now we observe that the set of subkeys seen before x enters the cache can be disjoint from, or can overlap with, the subkeys processed after x entered the cache. Because of this, we have uncertainty in our estimate and also cannot provide an unbiased estimate. Combining it all, we have the confidence interval

    [dCounters[x].CardEst − aδ·σ2,  dCounters[x].CardEst − 1 + 1/τ + aδ·√(σ1² + σ2²)],    (6.2)

where aδ is the coefficient for confidence 1 − δ according to the normal approximation; e.g., for 95% confidence we can use aδ = 2.
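The interval of Eq. (6.2) is straightforward to compute from the cached quantities. The sketch below assumes the ℓ-partition counter error σ2² = CardEst²/(2(ℓ − 1)) given above; the function name is illustrative:

```python
import math

def dhh_confidence_interval(card_est, tau, ell, a_delta=2.0):
    """Confidence interval of Eq. (6.2) for the distinct weight wx of a
    cached key (sketch; `ell` is the distinct-counter parameter)."""
    sigma1_sq = (1.0 - tau) / tau ** 2              # geometric: prefix subkeys
    sigma2_sq = card_est ** 2 / (2.0 * (ell - 1))   # approximate-counter error
    lo = card_est - a_delta * math.sqrt(sigma2_sq)
    hi = card_est - 1.0 + 1.0 / tau + a_delta * math.sqrt(sigma1_sq + sigma2_sq)
    return lo, hi

lo, hi = dhh_confidence_interval(card_est=1000.0, tau=0.01, ell=50)
# for these values the interval is roughly [798, 1383]
```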
We note that while the set of cached keys does not depend on the stream arrangement (it is a ppswor sample by wx), the confidence intervals are tighter (and thus better) for keys that are presented earlier and thus have τx > τ.
6.5.5 Integrated dwsHH Design
We propose a seamless design (Integrated dwsHH) which integrates the hashing performed for the weighted-sampling component with the hashing performed for the approximate distinct counters. We use a particular type of distinct counter based on stochastic averaging (ℓ-partition) [52, 96] (see [38] for an overview). This design hashes strings to ℓ buckets and maintains the minimum hash in each bucket. These counters are the industry's choice as they use fewer hash computations. We estimate the distinct counts using the tighter HIP estimators [38]. Pseudocode for the fixed-size Integrated dwsHH is provided as Algorithm 3. The parameter k is the sample size and the parameter ℓ is the number of buckets. Note that we use two independent random hash functions applied to strings: BucketOf returns an integer selected uniformly at random from [0, ℓ − 1], and Hash returns a value ∼ U[0, 1] (O(log m) bits suffice).
As in the generic Algorithm 2, we maintain an object dCounters[x ] for each cached
key x. The object includes the entry threshold dCounters[x ].τ and dCounters[x ].seed,
6.5. THE DISTINCT WEIGHTED SAMPLING ALGORITHMS 91
which is the minimum Hash(x, y) of all elements (x, y) with key x. The object also maintains ℓ values c[i], for i = 0, …, ℓ − 1, from the range of Hash, where c[i] is the minimum Hash over all elements (x, y) such that the element was processed after x was cached and BucketOf(x, y) equals i (c[i] = 1 when this set is empty). Note that dCounters[x].seed ≡ min_{i ∈ [0,ℓ−1]} c[i]. The object also maintains a HIP estimate CardEst of the number of distinct subkeys since the counter was created.
Algorithm 3: Integrated dwsHH
Data: cache size k, distinct-structure parameter ℓ, stream of (key, subkey) pairs
Output: set of (x, cx, τx) where x ∈ X
dCounters ← ∅; τ ← 1                                  // Initialize a cache of distinct counters
foreach stream element with key x and subkey y do      // Process a stream element
    if x is in dCounters then
        if Hash(x, y) < dCounters[x].c[BucketOf(x, y)] then
            dCounters[x].CardEst += ℓ / ∑_{i=0}^{ℓ−1} dCounters[x].c[i]
            dCounters[x].c[BucketOf(x, y)] ← Hash(x, y)
            dCounters[x].seed ← min{dCounters[x].seed, Hash(x, y)}
    else if Hash(x, y) < τ then                        // Initialize dCounters[x]
        for i = 0, …, ℓ − 1 do dCounters[x].c[i] ← 1
        dCounters[x].CardEst ← 0
        dCounters[x].c[BucketOf(x, y)] ← Hash(x, y)
        dCounters[x].seed ← Hash(x, y)
        dCounters[x].τ ← τ
        if |dCounters| > k then                        // Evict the key with maximum seed
            x′ ← arg max_{y ∈ dCounters} dCounters[y].seed
            τ ← dCounters[x′].seed
            Delete dCounters[x′]
return (for each x ∈ dCounters: (x, dCounters[x].CardEst, dCounters[x].τ))
For a sampled x, we can obtain a confidence interval on wx using the lower end point dCounters[x].CardEst + 1, with error controlled by the distinct counter, and the upper end point dCounters[x].CardEst + 1/dCounters[x].τ, with error controlled by both the distinct counter and the entry threshold. The errors are combined as explained in Section 6.5.4, using the HIP error of σ2 ≈ (2ℓ)^{−1/2} · dCounters[x].CardEst.
The size of our structure is O(kℓ log m) plus the representation of the k cached keys. Note that the parameter ℓ can be a constant for DDoS applications: a choice of ℓ = 50 gives an NRMSE of 10%.
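For concreteness, here is a minimal Python sketch of the fixed-size Integrated dwsHH of Algorithm 3. Salted SHA-256 hashes stand in for the idealized Hash and BucketOf functions, and the class and helper names are ours:

```python
import hashlib

def _h(salt, s):
    """Deterministic stand-in for the idealized hash: string -> (0, 1]."""
    d = hashlib.sha256((salt + s).encode()).digest()
    return (int.from_bytes(d[:8], "big") + 1) / 2.0 ** 64

class IntegratedDwsHH:
    """Sketch of fixed-size Integrated dwsHH (Algorithm 3)."""

    def __init__(self, k, ell):
        self.k, self.ell = k, ell
        self.tau = 1.0                      # entry threshold
        self.cache = {}                     # key -> {"c", "card", "seed", "tau"}

    def process(self, x, y):
        h = _h("hash:", x + "|" + y)                       # Hash(x, y)
        b = min(int(_h("bucket:", x + "|" + y) * self.ell),
                self.ell - 1)                              # BucketOf(x, y)
        if x in self.cache:
            c = self.cache[x]
            if h < c["c"][b]:
                c["card"] += self.ell / sum(c["c"])        # HIP increment
                c["c"][b] = h
                c["seed"] = min(c["seed"], h)
        elif h < self.tau:                                 # initialize dCounters[x]
            c = {"c": [1.0] * self.ell, "card": 0.0, "seed": h, "tau": self.tau}
            c["c"][b] = h
            self.cache[x] = c
            if len(self.cache) > self.k:                   # evict max-seed key
                worst = max(self.cache, key=lambda z: self.cache[z]["seed"])
                self.tau = self.cache[worst]["seed"]
                del self.cache[worst]

    def result(self):
        return {x: (c["card"], c["tau"]) for x, c in self.cache.items()}

s = IntegratedDwsHH(k=2, ell=32)
for i in range(2000):
    s.process("heavy", f"sub{i}")   # one key with 2000 distinct subkeys
for i in range(20):
    s.process(f"light{i}", "a")     # many keys with a single subkey each
# "heavy" survives in the cache; its CardEst is close to 2000
```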
We can further optimize this design according to the most constrained resource in our application, be it processing time, memory, or maximum processing time per element. For example, to control element processing time we can evict more keys (a fraction of the cache) when it is full. When memory is highly constrained we can instead use the exponent representation (round c[i] to an integral power of 2), as done with HyperLogLog [52], and apply an appropriate HIP estimate as described in [38]. This reduces the structure size to O(k log m + kℓ log log m).
6.6 The Combined Weighted Sampling Algorithm
We now present our cwsHH algorithm for combined heavy hitters detection. The pseudocode, which builds on our Integrated dwsHH design (Algorithm 3), is presented in Algorithm 4 and works with a specified parameter ρ. For each cached key x, the combined weighted sampling (cwsHH) algorithm also includes a classic counter dCounters[x].f of the number of elements with key x processed after x entered the cache.
Theorem 13. The sample computed by Algorithm 4 is a ppswor sample with respect to the combined weights b_x^{(ρ)} ≡ ρ·hx + wx.

Proof. We will show that the seed value (transformed appropriately) is exponentially distributed with parameter b_x^{(ρ)}:

    − ln(1 − seed(x)) ∼ Exp[b_x^{(ρ)}]
This will conclude the proof using [102], as in the proof of Theorem 11.

The value seed(x) can be expressed as the minimum of two components: seed_w(x), which is the minimum over elements with key x of Hash(x, y), and seed_h(x), which is the minimum over elements with key x of independent draws of erand ← 1 − (1 − rand())^{1/ρ}.

It follows from the proof of Theorem 11 that − ln(1 − seed_w(x)) ∼ Exp[wx]. We will now show that

    − ln(1 − seed_h(x)) ∼ Exp[ρ·hx].

Since the minimum of two exponential random variables is exponential with the sum of the parameters, this will conclude our proof.
For each element we can draw an exponentially distributed random variable with parameter ρ using z = − ln(1 − rand())/ρ. But since the algorithm takes the minimum with uniform random variables, we apply the corresponding inverse transformation 1 − exp(−z), obtaining that we need to draw the random variables

    erand = 1 − exp(ln(1 − rand())/ρ) = 1 − (1 − rand())^{1/ρ},

as used by the algorithm.
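The inverse-transform identity above can be checked numerically: applying − ln(1 − ·) to erand recovers exactly the Exp(ρ) draw − ln(1 − u)/ρ. A small sketch:

```python
import math

rho = 0.9
for u in [0.1, 0.5, 0.93]:
    erand = 1.0 - (1.0 - u) ** (1.0 / rho)    # the draw used by Algorithm 4
    z = -math.log(1.0 - u) / rho              # Exp(rho) via inverse CDF
    # mapping z back through 1 - exp(-z) recovers erand, so the two agree
    assert math.isclose(-math.log(1.0 - erand), z)
```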
Algorithm 4: Streaming cwsHH
Data: cache size k, distinct-structure parameter ℓ, parameter ρ, stream of (key, subkey) pairs
Output: set of (x, cx, fx, τx) where x ∈ X
dCounters ← ∅; τ ← 1                                  // Initialize a cache of distinct counters
foreach stream element with key x and subkey y do      // Process a stream element
    erand ← 1 − (1 − rand())^{1/ρ}                     // Randomization for hx count
    if x is in dCounters then
        dCounters[x].f += 1                            // Increment count
        if Hash(x, y) < dCounters[x].c[BucketOf(x, y)] then
            dCounters[x].CardEst += ℓ / ∑_{i=0}^{ℓ−1} dCounters[x].c[i]
            dCounters[x].c[BucketOf(x, y)] ← Hash(x, y)
        dCounters[x].seed ← min{dCounters[x].seed, Hash(x, y), erand}
    else if min{erand, Hash(x, y)} < τ then            // Initialize dCounters[x]
        for i = 0, …, ℓ − 1 do dCounters[x].c[i] ← 1
        dCounters[x].CardEst ← 0
        dCounters[x].f ← 1
        dCounters[x].c[BucketOf(x, y)] ← Hash(x, y)
        dCounters[x].seed ← min{Hash(x, y), erand}
        dCounters[x].τ ← τ
        if |dCounters| > k then                        // Evict the key with maximum seed
            x′ ← arg max_{y ∈ dCounters} dCounters[y].seed
            τ ← dCounters[x′].seed
            Delete dCounters[x′]
return (for each x ∈ dCounters: (x, dCounters[x].CardEst, dCounters[x].f, dCounters[x].τ))
Similarly to dwsHH, if we are only interested in the set of sampled keys (cHH candidates), it suffices to maintain the seed values of cached keys without the counting and distinct-counting structures. The counters are useful for obtaining estimates and confidence intervals on the combined weights of cached keys, for a desired confidence level 1 − δ: the lower end of the interval is dCounters[x].CardEst + ρ·dCounters[x].f − aδ·σ1, where σ1 is the standard error of the distinct count. For the higher end, we bound the contribution of the prefix, which has expectation bounded by 1/τ − 1 and is subject both to the S&H error and to the approximate-distinct-counter error, obtaining

    dCounters[x].CardEst + ρ·dCounters[x].f − 1 + 1/τ + aδ·√(σ1² + σ2²).
6.7 Evaluation
6.7.1 Theoretical Comparison
In Table 6.2 we show a theoretical memory-usage comparison of our distinct weighted sampling algorithms, the Superspreaders algorithms [116] and Locher's algorithm [76], assuming all algorithms use the same distinct-count primitive. We use the notations in Table 6.1, with δ as the probability that a given source becomes a false negative or a false positive, N as the number of distinct pairs, r as the number of estimates, s as the number of pairs of distinct-counting primitives used to compute each estimate, and c for a c-superspreader (i.e., we want to find keys with more than c distinct elements), choosing c = τ^{−1}. As can also be seen from the table, the cache size affects the distinct-weight estimation error for the keys. Note that the Superspreaders algorithm does not provide an estimate of the distinct weight of the keys, but rather only reports which keys have high enough weights. Locher's algorithm provides an estimation error which is theoretically incomparable to ours, and significantly higher in practice.
Algorithm                                 Memory usage                      Keys' distinct weight estimation error
Fixed-threshold distinct WS               O(τ·∑y wy · ℓ log m) (exp.)       τ^{−1} + wy/√(2ℓ)
Fixed-size dwsHH                          O(kℓ log m)                       (1/k)·∑y wy + wy/√(2ℓ)
Superspreaders 1-Level Filtering [116]    O(N/c)                            N/A
Superspreaders 2-Level Filtering [116]    O((N/c)·ln(1/δ))                  N/A
Locher [76]                               O(rs·2ℓ + |k|)                    N/A

Table 6.2: Theoretical comparison between methods
6.7.2 Practical Evaluation
6.7.2.1 Accuracy and Parameters
The following tests were done using a 4GB trace of 40M DNS queries captured at our campus network. For each DNS query q = ...p6.p5.p4.p3.p2.p1, we sliced the query at most 5 times to produce the < key, subkey > pairs < p1, ...p6.p5.p4.p3.p2 >, < p2.p1, ...p6.p5.p4.p3 >, …, < p5.p4.p3.p2.p1, ...p6 >. This process gave us a total of over 120M pairs, comprising nearly 1M distinct pairs.
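The slicing step can be sketched as follows (the helper name is ours; it reproduces the at-most-5 < key, subkey > pairs described above):

```python
def slice_query(q, max_len=5):
    """Slice a DNS query into (key, subkey) pairs: the key is a suffix of
    up to max_len labels, the subkey is the remaining prefix.
    (Illustrative helper, matching the evaluation setup.)"""
    parts = q.split(".")
    pairs = []
    for i in range(1, min(max_len, len(parts) - 1) + 1):
        key = ".".join(parts[-i:])        # suffix of i labels
        subkey = ".".join(parts[:-i])     # everything before it
        pairs.append((key, subkey))
    return pairs

pairs = slice_query("p6.p5.p4.p3.p2.p1")
# first pair: ("p1", "p6.p5.p4.p3.p2"); last pair: ("p5.p4.p3.p2.p1", "p6")
```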
Figure 6.1: Distinct Weighted Sampling (dWS): Modified cache size
We compare the effect of different cache sizes (k) on the output of our dwsHH algorithm. As shown in Fig. 6.1, we set the number of buckets to 32 and use cache sizes of 100, 500, 1000 and 10000. Using a cache size of 100, our algorithm reports keys with cardinality at least 0.005 of the total number of distinct items, with a false negative rate under 5%. A false negative rate under 5% is also achieved with a cache size of 500 for cardinality over 0.0008 of the total number of distinct items. Using caches of 1000 and 10000, our algorithm reports keys with cardinality at least 0.0004 of the total, with false negative rates of 2% and 0% respectively. Furthermore, for the reported keys, a cache of 100 gave an average distinct-weight estimation error of 24% over all keys, and caches of 500, 1000 and 10000 gave errors of 22%, 17% and 15% respectively.
Additionally, we compare the effect of different numbers of buckets (ℓ) on the output of our dwsHH algorithm. As shown in Fig. 6.2, we set the cache size to 1000 and use 4, 8, 16, 32 and 64 buckets. For the reported keys, using 4 buckets gave an average distinct-weight estimation error of 77% over all reported keys, and 8, 16, 32 and 64 buckets gave average errors of 37%, 23%, 17% and 15% respectively. The median error of the estimated distinct weights of the structure using 4, 8, 16, 32 and 64 buckets is 49%, 33%, 18%, 13% and 9% respectively.
Figure 6.2: Distinct Weighted Sampling (dWS): Modified Number of Buckets
To report, for example, all keys which have a weight of at least 0.001% of the total number of distinct pairs using the dwsHH algorithm, we use a cache size of 1000, providing a false negative rate of 0 and a false positive rate of 0. Using 32 buckets, the weight estimates provided by the algorithm have a median error of less than 0.1% of the item cardinality for the reported keys. This test is shown in Figure 6.3.
Figures 6.4 and 6.5 compare results of our cwsHH algorithm on the above data, using a cache of 1000 and 32 buckets, with ρ = 0.1 in test 1 and ρ = 0.9 in test 2. Fig. 6.4 shows the cardinality estimates of both tests for the 50 most frequent elements in the data. Test 2, with ρ = 0.9, reported all of the top 50 elements, whereas test 1 had a 2% false negative rate (enlarged icons indicate items reported only by test 2). Fig. 6.5 shows the combined weight per item in each test, compared to both the frequency and the cardinality of the items in the data. The smaller ρ is, the closer the combined weights are to the cardinality.
6.7.2.2 Memory Usage
In Fig. 6.6, we compare dwsHH to a simple and highly inefficient algorithm which counts the number of distinct values associated with each key, as well as to the One-Filter Superspreaders algorithm [116]. Our algorithm consumes a constant amount of space, while the simple algorithm consumes space that is linear in the number of distinct pairs seen. The Superspreaders algorithm does slightly better than the simple algorithm, yet consumes significantly more space than ours. We note that the two-filter variant of the Superspreaders algorithm achieves a better asymptotic memory-usage model, yet its memory usage still grows linearly with the stream length. It is also far more complicated, and its memory usage is more susceptible to implementation factors.

Figure 6.3: Distinct Weighted Sampling (dWS): 32 Buckets, 1000 Items
Figure 6.4: Combined Weighted Sampling (cWSHH) Modified rho: accuracy
Figure 6.5: Combined Weighted Sampling (cWSHH) Modified rho: combined weight
Figure 6.6: Memory usage [bytes] as a function of the number of distinct pairs, for Our Algorithm, One Filter Superspreaders and Simple Counting.
Chapter 7
Mitigating DNS Random
Subdomain DDoS Attacks Using
Distinct Heavy Hitters
7.1 Overview
The Domain Name System (DNS) service is a critical element of Internet functionality. Distributed Denial of Service (DDoS) attacks on the DNS service typically consist of many queries coming from a large botnet and sent to the root name servers or to an authoritative name server along the domain chain. According to Akamai's State of the Internet report [16], nearly 20% of DDoS attacks in Q1 of 2016 involved the DNS service, some of them targeting the root name servers [117].
One type of particularly hard to mitigate DDoS attack is the randomized attack on the DNS service called the Random Subdomain Attack [91] (also known as the Authoritative Exhaustion Attack [23], Nonsense Name Attack [74], or Pseudo-random Subdomain Attack [17]). In this attack, queries for many different pseudorandom non-existent subdomains (subkeys) of the same primary domain (key) are issued [17]. Since the response to a query for a new subdomain is not cached at the DNS resolver, these queries are propagated to the domain's authoritative server, overloading those servers and collaterally impacting the recursive resolvers of the Internet service provider.
Figure 7.1: DNS Random Subdomain attack overview

Random Subdomain attacks were first witnessed in China in 2009 [75], yet they remained sporadic for several years. In 2014 they started to make a significant ongoing impact on the network. In [91] it is shown that in the beginning of 2014, the number of distinct domains seen per day at ISP resolvers worldwide began to rise significantly, with substantial peaks witnessed later that year. While these attacks have been witnessed constantly since then, in October 2016 they made headlines, when hundreds of sites were drastically affected by the Mirai IoT botnet attack on domains delegated to the Dyn DNS resolvers [23]. Mitigation of these attacks took hours. Following the Mirai attack in 2016, Forrester Research discussed the crippling effect this attack could have on critical Internet infrastructure [65], claiming it is one way in which the Internet could die.

Currently, Random Subdomain attacks have become very common and continue to baffle administrators of both recursive and authoritative DNS resolvers. Indeed, data collected in our own campus network revealed such an attack on an authoritative DNS resolver on our campus in January 2017.
Top companies involved in DNS security have addressed these attacks, clearly stating the need for efficient mitigation of such attacks and discussing some of their current solutions. Dyn has suggested that customers obtain additional secondary DNS providers [113]. Security specialists at Akamai Technologies, which operates an authoritative DNS service similar to Dyn Managed DNS, discuss the need to protect the DNS infrastructure and how these attacks can indirectly affect organizations relying on an attacked vendor. They claim that their segregated and distributed DNS architecture is a major factor in protecting against such large-scale DDoS attacks, yet they do not provide insight as to how they may provide a solution specially crafted for such attacks [110]. Cloudflare discuss how their solution can easily scale as a form of defense against these high-volume attacks [61], yet again, no specialized solution is discussed. Additional companies such as Secure64 [103] and Infoblox [62] indicate that they offer solutions for these attacks, yet little detail is provided as to how the attacks are detected or mitigated.
Mitigation of Random Subdomain attacks is difficult since the packets in the attack are correctly formed DNS requests. Furthermore, the queries are normally received from legitimate ISP clients, and therefore source-based filtering cannot be used. The solution of Internet providers so far has been to identify the targeted zone manually by analyzing query logs, which can take a significant amount of time, and to temporarily prevent the name server from handling queries for this zone [17, 74] (or, alternatively, to reduce the number of queries handled using rate limiting).
7.1.1 Our Contribution
We present a system for the mitigation of this attack. Our system is based on the
observation that the number of distinct subdomains in queries for targeted domains
significantly increases during attack time due to the random part of the query. Our
system detects this sudden rise in the number of distinct subdomains, and therefore
identifies the targeted domain automatically. Depending on the rate of the attack, our
system can detect an attack within seconds of attack start time.
During normal network operation, the number of distinct subdomains for each domain is usually relatively constant and typically small. One exception to this is the increasing usage of disposable domains. These are large volumes of automatically generated domains, legitimately created by top sites and services (e.g., social networks and search engines), to give some signal to their servers [36]. By analyzing traffic during normal server load (i.e., "peacetime"), our system creates a baseline of the normal number of distinct subdomains, so that it can detect the abnormal rise during an attack. Using this baseline, our system can identify attacks while significantly reducing the false positives which may be caused by the use of disposable domains.
Furthermore, by analyzing the peacetime traffic we are able to automatically identify most of the legitimate requests for the targeted domain. For example, suppose the query mail.targetsite.com is often found during peacetime. During an attack composed of queries of the form < Randomstring >.targetsite.com, our system identifies queries for legitimate subdomains of targetsite.com, therefore allowing queries such as mail.targetsite.com to be handled and eliminating many false positives. Attack signatures extracted by our system can be matched against subsequent queries so that attacks can be mitigated quickly and accurately.
We make use of our distinct heavy hitter algorithm (Chapter 6), using the suffix of the domain as the key and the prefix as the subkey, as we shortly explain. Consider a stream of DNS queries, with the top-level domain serving as the key. A key that appears a large number of times in the query stream constitutes a "classic" heavy hitter (e.g., google.com, cnn.com, etc.). If each query's subdomain serves as the subkey (e.g., mail., home., game1., etc.), a key with many different subkeys is then a distinct heavy hitter (dHH).
At the core of our system is a mechanism that, given a stream of DNS queries, extracts the hierarchy of heavy distinct domains. There are two main challenges in this hierarchy extraction. The first is that there are very many different queried domains. Naively identifying the number of distinct subdomains for each domain requires maintaining the set of distinct subdomains for each queried domain, which would take far too much space. The second is that constructing this hierarchy exactly requires placing all of the queried domains in a structure which would quickly become very large. To solve the first issue, our system identifies the heavily distinct domains using our distinct heavy hitter algorithm (Chapter 6). To solve the second issue, we have devised a structure which builds an approximate hierarchy efficiently, using only the heavily distinct domains detected by our algorithm (Section 7.3).
7.2 Attack Overview
As depicted in Figure 7.1, the attack works in the following manner: the initiator of the attack (the attacker) utilizes botnets, causing the compromised machines to send many different (unique) queries for the same target domain. For example, attack queries may be of the form < Randomstring >.targetdomain.com. These queries are sent by the clients directly, or through the clients' open resolvers, to their Internet Service Provider's (ISP) resolvers. Since each request is unique and non-existent, the ISP's resolvers recursively query the target domain's authoritative server.
Initially, the authoritative server is able to respond, and typically answers with an "NXDOMAIN" response, indicating that the domain cannot be found. At some point, when the authoritative server becomes overwhelmed, it will either crash or implement a response rate-limiting mechanism. Either way, no response will be received from the authoritative server and it will appear unresponsive to the ISP.

Once this occurs, the ISP servers, which store each recursive request until a response is received, will exhaust all available storage space and also become debilitated. In this state, the ISP resolvers can no longer handle legitimate requests from non-compromised clients, severely degrading their service capabilities.
7.2.1 Current Detection Techniques
Detection mechanisms for this attack mostly consist of manual identification of the targeted domain through anomalies in the server's resource consumption and in the backlog of recursive client queries.

Another possible detection technique is to detect the rise in the number of "NXDOMAIN" responses. While this can help in some cases, in many cases the attack rate is extremely high, causing the authoritative server to crash very quickly. As we mentioned above, once the server crashes, no response will be received for queries for the targeted domain, nor for other domains hosted by that server. At this point no more "NXDOMAIN" responses are received from the server. This temporary rise needs to be detected very quickly and can easily be missed. Furthermore, this kind of detection mechanism needs to be placed at a location in the network which allows it to view both the queries and the responses. Due to routing constraints, this is not always possible.
Authoritative servers have little ability to defend against these attacks. In light of recent attacks, industry specialists have advised companies to have their DNS authoritative server hosted by more than one DNS provider, in hopes of withstanding at least some of the attacks.
7.3 Random Subdomain Attack Mitigation System
7.3.1 System Overview
System Placement: As depicted in Figure 7.2, our system can be placed at the ingress point of the ISP DNS resolvers. Alternatively, our system can be placed directly at the ingress point of the authoritative server, to mitigate attacks on a specific subdomain. That is, if the attack queries are of the form < random >.subdomain.victim.com, our technique would identify this attack. Additionally, if many domains are hosted on the same authoritative server, our system can detect an attack on any of these domains.
Figure 7.2: DNS Random Subdomain mitigation High-level approach
Figure 7.3: DNS Random Subdomain mitigation system overview
Attack Detection: As depicted in Figure 7.3, attack detection is done in two stages. The first stage is a preprocessing of traffic captured when there is a normal DNS query load (this is considered to be peacetime). Using our system, a baseline is created which identifies domains that have many different subdomains on a regular basis (for example, domains that use disposable domains). Additionally, a whitelist of common domain subparts (e.g., mail, maps) is identified and used during mitigation to allow the legitimate queries of targeted domains. The second stage is an analysis of traffic during an attack. The system identifies domains which are potential attack targets. If the number of distinct queries for these domains is significantly higher than the peacetime baseline, these domains are set as attack signatures.
The main component of our system is the Heavy Distinct Domain Hierarchy Extractor (Section 7.3.2.2), which is used both for the baseline creation and for the attack signature extraction.
Attack Mitigation: Once signatures have been extracted, subsequent queries are matched against the attack signatures. Queries which match an attack signature and are not whitelisted are dropped before reaching the ISP resolvers. For example, in an attack on victim.com, our system would generate the signature '*.victim.com'. Using the whitelist of common domain subparts, our system identifies that 'mail.victim.com' is not an attack query, and it is allowed. Other queries for 'victim.com' are dropped. The whitelist can be fine-tuned for each signature during the attack to further reduce false positives.
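A sketch of this matching step, with hypothetical signature and whitelist sets; the function name and the exact policy (checking the first label of the subdomain against the whitelist) are our assumptions:

```python
def should_drop(query, signatures, whitelist):
    """Mitigation filter sketch: drop a query matching an attack signature
    (e.g. '*.victim.com') unless the first label of its subdomain-prefix
    is whitelisted. (Illustrative; names are not from the thesis.)"""
    parts = query.split(".")
    for i in range(1, len(parts)):
        domain = ".".join(parts[-i:])            # candidate domain-suffix
        if "*." + domain in signatures:
            subparts = parts[:-i]                # the subdomain-prefix
            if subparts and subparts[0] in whitelist:
                return False                     # legitimate, known subpart
            return True                          # matches signature, not whitelisted
    return False                                 # no signature matched

sigs = {"*.victim.com"}
white = {"mail", "www"}
print(should_drop("xq7zk.victim.com", sigs, white))   # → True
print(should_drop("mail.victim.com", sigs, white))    # → False
```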
Our system makes no assumptions on the resource consumption or behaviour of the
resolvers making it more robust in terms of detection.
7.3.2 System Details
7.3.2.1 Preliminaries and Notations
We define domains, subdomains and subparts in the following manner: given a query q = ...d6.d5.d4.d3.d2.d1, a subpart of a domain is any individual part di (i.e., d1, d2, etc.). A domain-suffix of the query is any suffix of q composed of whole subparts; that is, domain-suffixes can be d1; d2.d1; d3.d2.d1; and so on, up to the entire query ...d6.d5.d4.d3.d2.d1. The subdomain-prefix of a domain-suffix is the prefix of q up to, and not including, the domain-suffix. Therefore, for domain-suffix d1 the subdomain-prefix is ...d6.d5.d4.d3.d2, for domain-suffix d2.d1 the subdomain-prefix is ...d6.d5.d4.d3, and so forth.

For brevity, we refer to a domain-suffix as a domain and to a subdomain-prefix as a subdomain.

Note that we refer to the length of a domain as the number of its subparts, rather than the number of characters in the domain.
We summarize the system parameters and notations used throughout this section
in Table 7.1.
7.3.2.2 Heavy Distinct Domain Hierarchy (HDDH) Extractor
Main concepts: In order to extract the hierarchy of heavy distinct domains, we need to efficiently compute how many of the distinct subdomains are contributed by each branch of the hierarchy. For each heavily distinct domain we would like to identify which, if any, of its subdomains is also heavily distinct. Furthermore, we would like to calculate the cumulative cardinality of all of its heavily distinct subdomains.

The Heavy Distinct Domain Hierarchy (HDDH) can be visualized using a trie. As can be seen in Figure 7.4, the trie holds mostly heavily distinct domains. Each edge of the trie is labeled with a domain subpart. Each node represents a domain (e.g., the domain ∗.site.org is represented by the right-most leaf in the trie). Each node is labeled with the number of distinct subdomains seen for that domain. For example, there were 500 different queries for domain ∗.com, of which 420 were for domain ∗.google.com, 60 for ∗.cnn.com, and the remaining 20 were for domains that had a cardinality below
Symbol        Meaning
k_i           The number of items in structure DHH_i
Cardest_d     Cardinality estimate of item d
min_sig       Minimal cardinality for a signature
min_base      Minimal cardinality for the baseline
min_white     Minimal subpart frequency for the whitelist
min_heavy     Minimal cardinality for the domain cover
p             Minimum ratio between an item's cardinality and the cardinality sum of its "children" nodes
t             Time interval for extracting signatures
min_attack    Minimal cardinality for an attack over the baseline
r_baseline    Required attack ratio from the baseline

Table 7.1: System Parameters and Notations
min_heavy. Note that the remaining cardinality of each node is the number indicated on the node minus the sum of its child nodes in the next level of the tree. Therefore, the remaining cardinality of ∗.com is 20 (calculated as 500 − (420 + 60)).
We would like to find a minimal set of nodes in the trie, each with a cardinality above min_heavy, that covers the leaves of the trie. Assume we would like to identify domains which have at least 50 distinct subdomains, i.e., min_heavy = 50. Intuitively, if ∗.cnn.com and ∗.google.com are signatures of our algorithm, then the node representing ∗.com only accounts for the remaining 20, which does not surpass the minimum of 50 distinct subdomains. In this case, there would be three nodes selected for the cover, and they are marked on the trie. The heavy domain cover would therefore be: ∗.cnn.com, ∗.maps.google.com, ∗.site.org.
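The cover selection on the trie of Figure 7.4 can be sketched recursively. Since the text does not give the counts of ∗.maps.google.com and ∗.site.org, the values 400 and 80 below are assumed for illustration only:

```python
def heavy_cover(node, min_heavy):
    """Bottom-up cover sketch: a child in the cover absorbs its subtree
    count; a node joins the cover only if its remaining cardinality is
    still >= min_heavy. Returns (cover, covered_count)."""
    name, count, children = node
    cover, covered = [], 0
    for child in children:
        sub_cover, sub_covered = heavy_cover(child, min_heavy)
        cover += sub_cover
        covered += sub_covered
    if count - covered >= min_heavy:     # remaining cardinality still heavy
        cover.append(name)
        covered = count                  # the whole subtree is now covered
    return cover, covered

# Figure 7.4 example; the 400 and 80 are assumed, not given in the text.
trie = ("*", 580, [
    ("*.com", 500, [
        ("*.google.com", 420, [("*.maps.google.com", 400, [])]),
        ("*.cnn.com", 60, []),
    ]),
    ("*.org", 80, [("*.site.org", 80, [])]),
])
cover, _ = heavy_cover(trie, min_heavy=50)
# → ['*.maps.google.com', '*.cnn.com', '*.site.org'] (in traversal order)
```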
HDDH Extractor Structure: Since extracting the entire hierarchy of queried domains would consume far too many resources, we provide an approximate solution that extracts the desired information mainly for the heavily distinct domains.

Figure 7.4: Hierarchy of heavy distinct domains. Bold-edged nodes are in the cover; dashed-edge nodes do not surpass the minimum cardinality.

The HDDH Extractor is composed of our fixed-size streaming Distinct Weighted Sampling (specifically, Integrated dwsHH) structures for Distinct Heavy Hitter detection. Each of these Integrated dwsHH structures maintains a constant number k of keys (domains). For each domain, an approximate distinct counter of its subkeys (subdomains) is maintained, along with a cardinality estimate (CardEst) which estimates the number of distinct subkeys seen thus far for that domain. Up to a bounded error, at any point in time, all domains with a high enough cardinality will be in the Integrated dwsHH structure. Further details are provided in Section 6.5.2.
To achieve practicality, both in terms of performance and of implementation simplicity, our structure supports domains of length at most 5. We have found that the average length of queried domains is between 2 and 3, and most legitimate domains are of length at most 5.
As seen in Figure 7.5, our structure maintains 5 Distinct Heavy Hitters (specifically
Integrated dwsHH) structures, which we denote DHH1-DHH5. We denote by ki the size of DHHi, meaning that each DHHi contains at most ki keys. Furthermore, the keys in each DHHi are domains of length i (i.e., domains of the form ∗.di.di−1. ... .d1).
Given a stream of traffic (or a traffic capture), for each query q = ...d6.d5.d4.d3.d2.d1
received, the key ∗.d1 (of length 1) is inserted into DHH1 with subkey ...d6.d5.d4.d3.d2, the key ∗.d2.d1 (of length 2) is inserted into DHH2 with subkey ...d6.d5.d4.d3, and so on.

Figure 7.5: Heavy Distinct Domain Hierarchy (HDDH) Extractor
However, an insertion is made to DHH2 only if ∗.d1 was already found in DHH1.
Similarly, an insertion is made to DHH3 only if ∗.d2.d1 was already found in DHH2
and so on. This means that a longer domain is only inserted into the structure if a shorter domain of that query was already sufficiently heavy to be an item in the structure. In this manner, the algorithm only inserts domains which are somewhat likely to become signatures.
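The conditional insertion described above can be sketched as follows. This is a simplified illustration, not the thesis's bounded-memory Integrated dwsHH: each level is modeled as an exact dictionary mapping a domain key to its set of distinct subkeys, and "already found in DHHi" is interpreted as presence from an earlier query; the names MAX_LEVELS and insert_query are ours.

```python
MAX_LEVELS = 5

def insert_query(levels, query):
    """levels: list of MAX_LEVELS dicts; levels[i-1] holds keys of length i."""
    labels = query.split('.')          # [..., d3, d2, d1]
    admit_next = True                  # a length-1 key is always inserted
    for i in range(1, MAX_LEVELS + 1):
        # Stop when no labels remain to serve as the subkey,
        # or when the shorter key was not already in the structure.
        if i > len(labels) - 1 or not admit_next:
            break
        key = '*.' + '.'.join(labels[len(labels) - i:])
        subkey = '.'.join(labels[:len(labels) - i])
        # Gate the next (longer) level on this key having been seen in an
        # earlier query -- a stand-in for "already found in DHHi" in the
        # bounded dwsHH structure, where a key may be absent or evicted.
        admit_next = key in levels[i - 1]
        levels[i - 1].setdefault(key, set()).add(subkey)
```

With this interpretation, a query populates one level deeper only once its shorter suffixes have already appeared, so rarely-queried long domains never enter the deeper structures.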
Finding a Distinct Heavy Domain Cover: Once the traffic capture has been analyzed, or after every fixed time interval t, the data in the structures needs to be processed and a heavy domain cover must be extracted from the structure. This cover will
be used to extract a domain baseline and attack signatures as shown in Section 7.3.2.3.
To identify the heavy domain cover, using only the items in our HDDH Extractor, we
build a trie as shown in Figure 7.4. Intuitively, each domain found in our extractor can
be placed on a branch of the tree, forming a sort of suffix tree. Additionally, we need to
calculate the cardinality associated with each node. The general idea is as follows: we would like to identify the longest part of the domain that is common to all of the attack queries, so that if, for example, the attack is composed of queries of the form < Randomstring >.subdomain.targetsite.com, our goal is to have the domain '*.subdomain.targetsite.com' in the cover and not a shorter domain such as '*.targetsite.com' or '*.com'. To do so, our algorithm identifies, for every branch, the deepest node that has many distinct subdomains while its child nodes do not.
Given the predefined parameters min heavy and p, the following process is performed:
• For 1 ≤ i ≤ 5, for each DHHi, for each item d ∈ DHHi: if CardEstd < min heavy, discard d.
• For 1 ≤ i ≤ 4, for each DHHi, for each item d ∈ DHHi:
– SumChildrend = Σ CardEst of all items in DHHi+1 s.t. d is their suffix
– Deltad = CardEstd − SumChildrend
– If Deltad / CardEstd ≥ p: insert d into the heavy domain cover.
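The two bullets above can be sketched as follows, assuming each DHHi has already been reduced to a map from domain to its CardEst value. The thesis's second bullet runs over levels 1-4; items at the deepest level have no children, so we apply the same rule to them with SumChildren = 0. The function name heavy_domain_cover is ours.

```python
def heavy_domain_cover(levels, min_heavy, p):
    """levels: list of 5 dicts {domain: CardEst}; levels[i] holds length-(i+1) keys."""
    # First bullet: discard items whose cardinality is below min_heavy.
    kept = [{d: c for d, c in lvl.items() if c >= min_heavy} for lvl in levels]
    cover = []
    for i in range(len(kept)):
        children = kept[i + 1] if i + 1 < len(kept) else {}
        for d, card in kept[i].items():
            # Sum the CardEst of all one-level-longer items that have d as
            # their suffix; d[1:] turns '*.google.com' into '.google.com'.
            sum_children = sum(c for child, c in children.items()
                               if child.endswith(d[1:]))
            delta = card - sum_children
            if delta / card >= p:
                cover.append(d)
    return cover
```

Run on cardinalities we invented to reproduce the example of Figure 7.4 (min heavy = 50, p = 0.3), the cover comes out as ∗.cnn.com, ∗.maps.google.com and ∗.site.org, while ∗.com and ∗.google.com are excluded because nearly all of their cardinality is accounted for by their children.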
7.3.2.3 Attack Detection
Distinct Domain Baseline: Different domains in the Internet have a highly varying
number of distinct subdomains. While the vast majority of domains have a small number of subdomains, some domains have hundreds or thousands of different subdomains. Additionally, certain Internet sites make use of disposable domains [36], meaning they create "one-time" subdomains as part of their regular operation and therefore often have thousands of subdomains on a regular basis.
In Figure 7.6 we show the distribution of the number of distinct subdomains of
domains at each of the four highest domain levels, in a 40M query trace captured at a campus server. The findings show that there are a few top level domains that have a very
high cardinality. There are many second level domains with relatively high cardinality
and cardinality gradually decreases in the third and fourth levels.
To identify domains which normally have many distinct subdomains, our system
processes peacetime (or regular load) traffic to create a baseline. This baseline can be
compared to the number of distinct subdomains queried during an attack to determine if
there is a significant rise in the number of distinct queries for a given domain. To create
a domain cardinality baseline our system uses the HDDH Extractor structure. For each
query q = ...d4.d3.d2.d1 received, the query is inserted into the HDDH Extractor module
for analysis as explained in Section 7.3.2.2. Once the entire capture has been processed
or every fixed time interval, the Distinct Heavy Domain Cover is calculated (as described
in Section 7.3.2.2). Domains identified by this process that have a cardinality over
min base compose the domain baseline.
Attack Signatures Extraction: Our system may be used to process traffic streams
or samples in an ongoing manner to quickly detect a Random Subdomain Attack soon after it starts and extract attack signatures.

Figure 7.6: Distribution of the number of distinct subdomains per domain level
For simplicity, assume attack signature extraction is done using a separate HDDH Extractor module from the one used for baseline creation. Each query received during attack detection time is inserted into the HDDH Extractor and is analyzed as explained in Section 7.3.2.2.
We wish to output a set Sa of attack signatures. As seen in Figure 7.7, every fixed
interval t, a Heavy Distinct Domain Cover is calculated (as described in Section 7.3.2.2).
To generate the signature set Sa, domains in the cover are compared to the distinct
domain baseline described above. The signature set Sa will include domains for which
there is a significant rise in the number of their distinct subdomains both nominally
and in proportion to their baseline cardinality.
This is calculated in the following manner: given the ratio rbaseline and the threshold min attack, denote by CardEstdb the baseline cardinality of domain d (CardEstdb = 0 if d is not in the baseline). For each domain d in the cover:
• If CardEstdb = 0: if CardEstd ≥ min attack, then add d to Sa.
• Else: if (CardEstd − CardEstdb) ≥ min attack AND CardEstd / CardEstdb ≥ rbaseline, then add d to Sa.
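The two cases can be sketched directly (extract_signatures, cover_cards and baseline_cards are our names; both maps associate a domain with its CardEst value):

```python
def extract_signatures(cover_cards, baseline_cards, min_attack, r_baseline):
    """Return the signature set Sa for domains in the heavy domain cover."""
    sa = set()
    for d, card in cover_cards.items():
        base = baseline_cards.get(d, 0)   # CardEstdb = 0 if d not in baseline
        if base == 0:
            # Domain was not distinctly heavy in peacetime at all.
            if card >= min_attack:
                sa.add(d)
        elif card - base >= min_attack and card / base >= r_baseline:
            # Significant rise both nominally and relative to the baseline.
            sa.add(d)
    return sa
```

A domain with a large baseline must therefore grow both by min attack queries and by the factor rbaseline before it is flagged, which keeps naturally heavy domains out of the signature set.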
Figure 7.7: Attack time signature extraction
Subdomain Whitelists: We would like to identify strings which often appear as
sub-parts in domain queries. Such strings are most likely not automatically generated.
Our system uses a classic Heavy Hitters algorithm, such as the Space-Saving algorithm of Metwally et al. [81], to identify strings which are often found as subparts of many domains.
During the baseline creation process in peacetime, for each query q = ...d4.d3.d2.d1 received, each subpart di will be independently inserted into the heavy hitters computation module. Once processing of all queries is completed, all strings with a count of over min white will be inserted into the subdomain whitelist and will later be used by the system for attack mitigation.
Once attack signatures have been extracted, a subdomain whitelist can be specifically extracted for each signature, to further reduce false positives. For each signature, we maintain a separate heavy hitter module. For each query received for a signature s, of the form q = ...dj+2.dj+1.dj.s, each subpart di will be independently inserted into the heavy hitters computation module. Once enough legitimate requests are received, the subparts of legitimate requests will be identified as the heavy hitters and they can be added to the whitelist. Note that given the nature of the attacks, where each randomly generated subdomain appears a very small number of times (typically 1 or 2), the subparts stream is very heavy tailed and therefore legitimate subparts which occur more times should stand out relatively quickly. We can choose to use a heavy hitters module which is better suited for heavy-tailed streams, such as the one presented in [27].
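A minimal Space-Saving sketch in the spirit of [81] for the subpart whitelist could look as follows; class and method names are ours, and the real module may instead use the heavy-tail-oriented variant of [27]:

```python
class SpaceSaving:
    """Track at most k candidate subparts; maintained counts are upper bounds."""

    def __init__(self, k):
        self.k = k
        self.counts = {}

    def add(self, item):
        if item in self.counts or len(self.counts) < self.k:
            self.counts[item] = self.counts.get(item, 0) + 1
        else:
            # Evict the minimum-count item; the newcomer inherits its count
            # plus one, which preserves the overestimate guarantee.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            self.counts[item] = floor + 1

    def whitelist(self, min_white):
        return {s for s, c in self.counts.items() if c >= min_white}
```

Feeding every subpart di of every peacetime query into `add` and then calling `whitelist(min_white)` yields the subdomain whitelist; rare random subparts churn through the low-count slots while legitimate repeated subparts accumulate counts above min white.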
7.3.2.4 Attack Mitigation
Mitigation is done using the signature set Sa described above along with the Subdomain
Whitelist. Once an attack is detected, each subsequent query q = ...d2.d1 is checked to see if it has a common suffix with one of the signatures in Sa. Note that the suffix has to be at least as long as one of the signatures; that is, one of the signatures should be its suffix. If, for example, ∗.google.com and ∗.maps.google.com are both signatures, then for the query amap.maps.google.com the longest common suffix should be maps.google.com, yet for the query mysite.com, .com should not be considered a common suffix since it is not a complete signature. If no common suffix is found between
q and the signatures in Sa, the query is allowed. Otherwise, a common suffix of the
form dj .dj−1....d1 is found. Denote dj+1 to be the subpart in q immediately preceding
the common suffix. If dj+1 is found in the Subdomain Whitelist, the query is allowed.
Otherwise, the query is dropped (See Fig. 7.3).
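The per-query mitigation decision can be sketched as follows (allow_query and its arguments are our naming; signatures are strings of the form '*.subdomain.target.com', and we require the query to be strictly longer than the matched signature so that a preceding subpart dj+1 exists):

```python
def allow_query(query, signatures, whitelist):
    """Return True if the query should be allowed, False if it should be dropped."""
    labels = query.split('.')
    best = None  # label list of the longest signature that is a suffix of query
    for sig in signatures:
        sig_labels = sig.split('.')[1:]          # drop the leading '*'
        if (len(labels) > len(sig_labels)
                and labels[-len(sig_labels):] == sig_labels
                and (best is None or len(sig_labels) > len(best))):
            best = sig_labels
    if best is None:
        return True                       # no complete signature is a suffix
    d_next = labels[-len(best) - 1]       # subpart immediately preceding suffix
    return d_next in whitelist            # allow only whitelisted subparts
```

On the example above, amap.maps.google.com matches the longer signature ∗.maps.google.com and is allowed only if 'amap' is whitelisted, while mysite.com matches no complete signature and passes through.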
7.3.2.5 Timely Attack Detection
Attack detection is a time-critical task. To ensure timely detection of the attack, our system works in intervals of a predetermined length l (e.g., 20 minutes) in both peacetime processing and attack time analysis. After each interval l, counters are refreshed by a complete restart. In this manner, the measurements performed by the system during an attack are comparable with those taken during peacetime.
That said, the system is required to detect an attack within seconds of attack start time. To support this requirement, counters are checked every fixed (short) time interval s. At each interval sj, the incremental cardinality estimate delta of each key since sj−1 is calculated and local cardinality peaks are identified. That is, for each key k, deltaj(k) = k.CardEstsj − k.CardEstsj−1.
During peacetime analysis, for each key k identified as being distinctly heavy during peacetime, define delta max(k) = max1≤j≤n deltaj(k), where n denotes the overall number of short intervals s that were processed during peacetime.
During the attack detection phase, at each interval sa, for each key k, we compare deltaa(k) with delta max(k). If deltaa(k) ≫ delta max(k), the key is suspected of being under attack.
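A sketch of this per-interval comparison; the "≫" test is made concrete with a multiplicative factor, which is our assumption, as is the function name detect_spikes:

```python
def detect_spikes(prev_est, curr_est, delta_max, factor):
    """Flag keys whose cardinality growth over the last short interval
    far exceeds their largest peacetime growth delta_max(k)."""
    suspects = set()
    for key, cur in curr_est.items():
        delta = cur - prev_est.get(key, 0)
        # Keys with no peacetime peak get delta_max = 0, so any growth
        # flags them; a deployment may want a minimum absolute delta too.
        if delta > factor * delta_max.get(key, 0):
            suspects.add(key)
    return suspects
```

Here prev_est and curr_est are the per-key CardEst snapshots at intervals sj−1 and sj, and delta_max holds the peacetime peaks.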
Accumulative Techniques: We provide two additional techniques which allow the
counters to maintain some history between intervals.
• Accumulative weighted counters: At every interval l it is possible to take a snapshot of the counters prior to the restart. In this manner, we can maintain an accumulative weighted counter. That is, an accumulative cardinality estimate can be calculated by adding the cardinality estimate of a key k in the snapshot, k.CardEstsnapshot, multiplied by w1 (0 < w1 < 1), to the current cardinality estimate, k.CardEstcurrent, multiplied by w2 (0 < w2 < 1), so that the accumulative cardinality estimate of a key k is equal to accCardEst(k) = w1 · k.CardEstsnapshot + w2 · k.CardEstcurrent.
• Decaying average: Decaying average may be performed by clearing one of the
buckets in the distinct counter of each item at each time interval l. In each distinct
counter, buckets are cleared in a round robin manner. In this way, the distinct
counter of each item is decremented at every interval, and items which are no
longer distinctly heavy will eventually be evicted from the structure and make
room for items that have recently grown to be distinctly heavy.
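Both techniques can be sketched together. The accumulative estimate is the formula above; for the decaying average we model the distinct counter's buckets as exact sets (the real counters are approximate), clearing one bucket per interval in round-robin order. All names here are ours.

```python
def acc_card_est(snapshot_est, current_est, w1, w2):
    # accCardEst(k) = w1 * CardEst_snapshot + w2 * CardEst_current
    return w1 * snapshot_est + w2 * current_est


class BucketedDistinctCounter:
    """Distinct counter split into buckets so that it can decay gradually."""

    def __init__(self, num_buckets=8):
        self.buckets = [set() for _ in range(num_buckets)]
        self.next_clear = 0

    def add(self, item):
        self.buckets[hash(item) % len(self.buckets)].add(item)

    def estimate(self):
        return sum(len(b) for b in self.buckets)

    def decay(self):
        # Called once per interval l: clear one bucket, round robin.
        self.buckets[self.next_clear].clear()
        self.next_clear = (self.next_clear + 1) % len(self.buckets)
```

After every interval, `decay()` removes roughly a 1/num_buckets fraction of the counter's state, so a key that stops being distinctly heavy shrinks toward zero over num_buckets intervals.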
7.4 Evaluation
7.4.1 University Network Captures
We examined a 4GB trace of nearly 40M DNS queries captured by DNS resolvers in our university network. These include both an authoritative name server for some of the domains at our university and a recursive server handling DNS queries coming from clients within the campus. The capture was taken over nearly a month and a half. In
this capture we saw a small attack on a university authoritative DNS server. Figure 7.8
shows the number of distinct queries made to this domain over the course of one day. As
can be clearly seen, at around 10AM the number of distinct domains nearly doubled,
and this persisted at different volumes for about 10 hours. After 10 hours the number
of distinct queries went back to the normal baseline. We ran the capture through our
system, using the trace from the previous day as a baseline. Our peacetime capture contained nearly 776K queries, and the capture for the day of the attack contained over 900K queries. The parameters of our system were set as follows: k1 = 100, k2 = 500,
k3 = 500, k4 = 250, k5 = 100, l = 32, min sig = 500, min base = 20, min white = 5,
min heavy = 20, p = 0.3, min attack = 500, rbaseline = 1.4.
Our system was able to identify the attack, extracting a single attack signature from those captures after processing only several dozen attack packets.
Figure 7.8: Distinct queries for campus authoritative server per hour, over 1 day.
Our current implementation is able to process up to 40K queries per second, and it has yet to be optimized for performance. Nonetheless, as our system can detect an attack after processing merely dozens or hundreds of attack packets, it can detect an attack within seconds or even less, depending on the rate of the attack.
7.4.2 ISP Attack Captures
Our Random Subdomain attack mitigation system presented above has been evaluated
on traces of actual attacks captured by a large Internet Service Provider (ISP).
We analyzed 5 captures which were sniffed during different Random Subdomain
attacks and contained both attack and legitimate DNS queries. All captures were taken
within a single month in 2014. Note that most of the captures contain 5000 queries as
that was the set amount that was sniffed for each attack spotted. The ISP manually identified the Random Subdomain attacks as they were occurring. The attacks targeted both domains hosted by the ISP's authoritative name server and domains outside the ISP's network. Hence, the attacks affected both the authoritative name servers of the ISP
and its recursive resolvers. We compare our results to the analysis performed manually
by the ISP. We use a cache size of k = 50. Note that some of the attacks analyzed had a very high percentage of distinct queries, while others had lower rates; the repetitions are of randomly generated queries that were each repeated several times in the traffic. As we did not have access to a peacetime capture, we used one of the captures to create a baseline for the others.
Consider attack 1 in Table 7.2. The capture consisted of 92469 DNS queries. Of these, 4133 are attack queries targeted at the same zone, with a randomly generated least significant domain sub-part, containing 2051 distinct queries, meaning that some of the queries were repeated. Of the 4133 queries, the system counted 4123, meaning that 10 queries for the attacked zone had gone through before the zone was placed in the structure (i.e., in the cache). Once inside, the zone was not evicted from the structure at any point and all subsequent queries were counted; hence 99.8% of the queries were identified.
Source | Queries in capture | Attack queries | Distinct attack queries | Attack queries identified
1 | 92469 | 4133 | 2051 | 99.8%
2 | 5000 | 389 | 367 | 99.7%
3 | 5000 | 602 | 567 | 100%
4 | 5000 | 334 | 330 | 100%
5 | 5000 | 3364 | 631 | 99.8%
Table 7.2: Results on Real DNS Attack Captures
Chapter 8
Discussion and Conclusion
8.1 Contributions
We provide a brief overview of the contributions we have presented in this dissertation:
Detection of Heavy Flows in Software Defined Networks: Based on different
parameters, we differentiate between heavy flows, elephant flows and bulky flows and
present innovative algorithms to detect flows of the different types in an SDN switch.
We propose the Sample&Pick algorithm, an efficient method for detecting large or heavy flows going through an SDN switch. The Sample&Pick algorithm performs a division of labour between the switch and the controller, coordinating between them to efficiently identify the large flows. Our constructions use, in a sophisticated way, the Sample and Hold [49] algorithm along with the Space Saving algorithm [81] to minimize both the switch-controller communication and the number of entries in
the switch flow table. We evaluate the performance of our Sample&Pick algorithm by
measuring its inaccuracy rates and resource consumption. Our evaluations show that
our algorithm provides a good tradeoff between the amount of communication between
the switch and the controller and the amount of space required on the switch while
being able to identify the heavy hitters.
Additionally, we consider a distributed model with multiple switches and propose
solutions for efficient scaling of our techniques.
Our methods rely on standard and optional features of OpenFlow 1.3 and can also
be implemented in the P4 language. Additionally, the techniques presented are both
flow-table size and switch-controller communication efficient.
String Heavy Hitters: We propose the String Heavy Hitters problem and present the Double Heavy Hitter algorithm for efficiently solving it. This algorithm finds popular strings of variable length in a set of messages, using the classic Heavy Hitters algorithm as a building block in a non-trivial way. The algorithm runs in a single pass over the input and uses space that depends only on predefined parameters.
Zero-Day Signature Extraction for High Volume Attacks: We present an innovative system for automatic extraction of signatures for application-level zero-day
DDoS attacks. Our system takes as input two streams (or stream samples) of traffic
collected during an attack and during peacetime. A peacetime traffic sample may be
collected as a routine scheduled procedure. The attack traffic sample can be collected
once the attack has been identified. The system then analyzes both traffic samples to
identify content that is frequent in the attack traffic sample yet appears rarely or not
at all in the peacetime traffic.
Our system makes no assumptions on traffic characteristics such as client behaviour,
address dispersion, URL statistics and so forth. Therefore, it is generic in that it can
be easily adapted to solving other network problems with similar characteristics.
We test our system on real-life traffic logs of actual attacks and of peacetime. We show that our solution performs well in practice, with an average recall of 99.95% and an average precision of 98%.
Heavy Hitters in Pairs: Our main contributions are novel and efficient sampling-based structures for distinct Heavy Hitters (dHH) and combined Heavy Hitters (cHH) detection in a stream of < key, subkey > pairs, which are able to track only O(ε−1) keys and require only a single pass over the input. Our dHH design significantly improves over existing work. We demonstrate, via experimental and theoretical evaluations, the effectiveness of each of our algorithms in terms of accuracy and memory consumption.
Random Subdomain DNS Attacks: Random subdomain DDoS attacks on the
Domain Name System service have recently become a growing threat to basic Internet functionality. In these attacks, many queries are sent for a single or a few victim domains, yet they include highly varying non-existent subdomains generated randomly.
While the attack targets one or a few authoritative name servers, it usually comes with
significant collateral damage to DNS servers of different providers on its route. We present a system for mitigation of such attacks. To the best of our knowledge this is
the first such system. The design makes use of our structures for dHH detection. We
perform extensive experimental evaluation on real DNS attack traces, demonstrating
the effectiveness of our system.
8.2 Future Work
The drastically growing scale of today’s networks requires building solutions that can
adapt to the changing needs of the network. Relatively new network concepts such as
Software Defined Networks (SDN) and Network Function Virtualization (NFV) offer
new capabilities and architectures which may be leveraged to allow for more flexible solutions. SDN and protocols such as OpenFlow [80] allow the decoupling of the network control plane from the data plane, introducing new flexibility in network management. NFV is part of the transition of network components from specialized hardware to general purpose machines [89], and therefore allows new flexibility in network functionality.
While both of these paradigms allow simplified deployment of new network tools, the
efficient transition of these tools to wide deployment, so that they may cleverly utilize
this new architecture, is not at all trivial. An interesting direction would be to expand
our solutions to a distributed network setting with the aim of building solutions that
can be scaled and virtualized, by collaboration of distributed and possibly hierarchical
network entities, located in different sites, sharing data and resources.
Detection of Heavy Flows in Software Defined Networks: Generally, it would
be interesting to study security vulnerabilities specific to SDN and research how the
tools we have presented thus far can be combined with other SDN monitoring capabil-
ities to mitigate network attacks such as DDoS, Worms etc.
Additionally, our research on the distributed setting solution can be expanded to support more complex settings and topologies, for example, a topology in which switches may have different roles in the system, such as switches found at ingress and egress points.
String Heavy Hitters: Our Double Heavy Hitter algorithm is able to find strings
of varying length. Each signature formed is a fixed string which may be searched for in
the data. While in the past, these types of signatures may have sufficed, mitigation of new attacks requires enhanced tools. Attackers are constantly making up new types of attack signatures that are more difficult to identify. It would be interesting to expand the variability of the signatures that the algorithm is able to extract, to include, for example, signatures which contain regular expressions, or signatures that contain "Don't-Care"s and mismatches. Specifically, we should devise ways to generate signatures in which part of the string is fixed and part may be randomly generated. Partially random
signatures are a major challenge facing security experts today, and such a solution would allow fine-tuning the mitigation and detection of such network attacks and anomalies.
Zero-Day Signature Extraction for High Volume Attacks: A possible direction
would be to expand our solution so that it is able to monitor traffic in different network
locations and identify signatures based on the analysis performed in the different sites.
To do so, a scalable solution for the String Heavy Hitters problem should be developed
that can be implemented as a virtualized network function (VNF). In [123], Yi et al.
propose an algorithm for identifying classical heavy hitters in a distributed setting, yet
the transition to solving the String Heavy Hitters problem in a distributed setting is
challenging due to the dependencies between the frequencies of the fixed length and
the varying length strings.
Random Subdomain DNS Attacks: According to Akamai's State of the Internet report [16], nearly 20% of DDoS attacks in Q1 of 2016 involved the DNS service, making mitigation of such attacks extremely important. Moreover, even some of the Internet's
DNS root name servers were targets of DNS-based DDoS attacks [117]. Such attacks can significantly impact the availability of websites globally. In order to detect such malicious behaviour, we must first understand the legitimate usage of the DNS by different companies and Internet entities. It would be interesting to research the characteristics of DNS traffic to identify current trends and changes in DNS usage. Due to the recent introduction of generic top level domains (gTLDs), some of the fundamental characteristics of DNS traffic are changing, which raises the need for such a study. This would also be very helpful in gaining a better understanding of how disposable domains [36] in DNS are being used today, which is significant in light of the research done so far.
Additionally, there are possible advancements which can be made in the detection of
Random Subdomain DNS attacks. For example, due to the hierarchical and distributed
architecture of DNS servers, a possible next step would be to study how different servers can collaborate to identify such attacks more efficiently.
Bibliography
[1] The CAIDA UCSD Anonymized Internet Traces 2009 - Sep. 17, 2009. http://www.caida.org/data/passive/passive_2009_dataset.xml.
[2] The CAIDA UCSD Anonymized Internet Traces 2012. http://www.caida.org/data/passive/passive_2012_dataset.xml.
[3] The CAIDA UCSD Anonymized Internet Traces 2014 - Mar. 20, 2014. http://www.caida.org/data/passive/passive_2014_dataset.xml.
[4] Cisco NetFlow. http://www.cisco.com/c/en/us/tech/quality-of-service-qos/netflow/index.html.
[5] NoviFlow's NoviKit. http://noviflow.com/products/novikit/ (accessed March 2015).
[6] NoviFlow's NoviWare. http://noviflow.com/products/noviware/ (accessed January 2017).
[7] UCLA D-WARD project: Sanitized UCLA CSD traffic traces. https://lasr.cs.ucla.edu/ddos/traces/.
[8] CERT Advisory CA-1996-21: TCP SYN flooding and IP spoofing attacks, 1996.
[9] Snort: Open source network intrusion detection system, 2002.
[10] Leading security companies. Personal communication, 2012-2013.
[11] Yehuda Afek, Anat Bremler-Barr, and Shir Landau Feibish. Automated signature extraction for high volume attacks. In Symposium on Architecture for Networking and Communications Systems, ANCS '13, San Jose, CA, USA, October 21-22, 2013, pages 147-156. IEEE Computer Society, 2013.
[12] Yehuda Afek, Anat Bremler-Barr, and Shir Landau Feibish. Zero-day signature extraction for high volume attacks. Transactions on Networking. Submitted.
[13] Yehuda Afek, Anat Bremler-Barr, Shir Landau Feibish, and Golan Parashi. Cloud-based implementation of the signature extraction system. https://www.autosigen.com/.
[14] Yehuda Afek, Anat Bremler-Barr, Shir Landau Feibish, and Liron Schiff. Detecting heavy flows in the SDN match and action model. Computer Networks Journal: special issue on Security and Performance of Software-defined Networks and Functions Virtualization, 2017. Submitted.
[15] Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid to bibliographic search. Commun. ACM, 18(6):333-340, 1975.
[16] Akamai [State of the Internet] / Security - Q1 2016 report. www.akamai.com/StateOfTheInternet, 2016.
[17] Cathy Almond. Recent authoritative exhaustion attacks, 2016. https://www.arbornetworks.com/threats/.
[18] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci., 58(1):137-147, 1999.
[19] Alberto Apostolico, Maxime Crochemore, Martin Farach-Colton, Zvi Galil, and S. Muthukrishnan. 40 years of suffix trees. Commun. ACM, 59(4):66-73, 2016.
[20] Digital Attack Map. https://www.arbornetworks.com/threats/, 2016.
[21] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues in data stream systems. In Lucian Popa, Serge Abiteboul, and Phokion G. Kolaitis, editors, Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 3-5, Madison, Wisconsin, USA, pages 1-16. ACM, 2002.
[22] Brian Babcock and Chris Olston. Distributed top-k monitoring. In Alon Y. Halevy, Zachary G. Ives, and AnHai Doan, editors, Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 9-12, 2003, pages 28-39. ACM, 2003.
[23] Chris Baker. Recent authoritative exhaustion attacks. October 2016. DNS OARC 2016, Dallas. Talk given on behalf of Dyn Inc.
[24] Nagender Bandi, Divyakant Agrawal, and Amr El Abbadi. Fast algorithms for heavy distinct hitters using associative memories. In 27th IEEE International Conference on Distributed Computing Systems (ICDCS 2007), June 25-29, 2007, Toronto, Ontario, Canada, page 6. IEEE Computer Society, 2007.
[25] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting distinct elements in a data stream. In Jose D. P. Rolim and Salil P. Vadhan, editors, Randomization and Approximation Techniques, 6th International Workshop, RANDOM 2002, Cambridge, MA, USA, September 13-15, 2002, Proceedings, volume 2483 of Lecture Notes in Computer Science, pages 1-10. Springer, 2002.
[26] Michela Becchi and Patrick Crowley. A hybrid finite automaton for practical deep packet inspection. In Jim Kurose and Henning Schulzrinne, editors, Proceedings of the 2007 ACM Conference on Emerging Network Experiment and Technology, CoNEXT 2007, New York, NY, USA, December 10-13, 2007, page 1. ACM, 2007.
[27] Ran Ben-Basat, Gil Einziger, Roy Friedman, and Yaron Kassner. Randomized admission policy for efficient top-k and frequency estimation. CoRR, abs/1612.02962, 2016.
[28] Ran Ben-Basat, Gil Einziger, Roy Friedman, and Yaron Kassner. Optimal elephant flow detection. CoRR, abs/1701.04021, 2017.
[29] Udi Ben-Porat, Anat Bremler-Barr, and Hanoch Levy. Evaluating the vulnerability of network mechanisms to sophisticated DDoS attacks. In INFOCOM, pages 2297-2305. IEEE, 2008.
[30] Pat Bosshart, Dan Daly, Glen Gibb, Martin Izzard, Nick McKeown, Jennifer Rexford, Cole Schlesinger, Dan Talayco, Amin Vahdat, George Varghese, and David Walker. P4: programming protocol-independent packet processors. Computer Communication Review, 44(3):87-95, 2014.
[31] Robert S. Boyer and J. Strother Moore. MJRTY - a fast majority vote algorithm. Technical Report, Institute of Computing Science, The University of Texas at Austin, 32, 1981.
[32] The Bro Network Security Monitor. http://bro-ids.org.
[33] N. Brownlee, C. Mills, and G. Ruth. RFC 2722, 1999. http://tools.ietf.org/html/rfc2722.
[34] Moses Charikar, Kevin C. Chen, and Martin Farach-Colton. Finding frequent items in data streams. In Peter Widmayer, Francisco Triguero Ruiz, Rafael Morales Bueno, Matthew Hennessy, Stephan Eidenbenz, and Ricardo Conejo, editors, Automata, Languages and Programming, 29th International Colloquium, ICALP 2002, Malaga, Spain, July 8-13, 2002, Proceedings, volume 2380 of Lecture Notes in Computer Science, pages 693-703. Springer, 2002.
[35] Ruiliang Chen, Jung-Min Park, and Randolph Marchany. RIM: router interface marking for IP traceback. In Proceedings of the Global Telecommunications Conference, GLOBECOM '06, San Francisco, CA, USA, 27 November - 1 December 2006. IEEE, 2006.
[36] Yizheng Chen, Manos Antonakakis, Roberto Perdisci, Yacin Nadji, David Dagon, and Wenke Lee. DNS noise: Measuring the pervasiveness of disposable domains in modern DNS traffic. In 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2014, Atlanta, GA, USA, June 23-26, 2014, pages 598-609. IEEE, 2014.
[37] Edith Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. System Sci., 55:441-453, 1997.
[38] Edith Cohen. All-distances sketches, revisited: HIP estimators for massive graphs analysis. IEEE Trans. Knowl. Data Eng., 27(9):2320-2334, 2015.
[39] Edith Cohen, Nick G. Duffield, Haim Kaplan, Carsten Lund, and Mikkel Thorup. Algorithms and estimators for summarization of unaggregated data streams. J. Comput. Syst. Sci., 80(7):1214-1244, 2014.
[40] Graham Cormode. Misra-Gries summaries. In Encyclopedia of Algorithms, pages 1334-1337. 2016.
[41] Graham Cormode and Marios Hadjieleftheriou. Finding frequent items in data streams. PVLDB, 1(2):1530-1541, 2008.
[42] Graham Cormode, Flip Korn, S. Muthukrishnan, and Divesh Srivastava. Finding hierarchical heavy hitters in data streams. In VLDB, pages 464-475, 2003.
[43] Graham Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 55(1):58-75, 2005.
[44] Andrew R. Curtis, Jeffrey C. Mogul, Jean Tourrilhes, Praveen Yalagandula, Puneet Sharma, and Sujata Banerjee. DevoFlow: scaling flow management for high-performance networks. In SIGCOMM, pages 254-265, 2011.
[45] Erik D. Demaine, Alejandro Lopez-Ortiz, and J. Ian Munro. Frequency estimation of internet packet streams with limited space. In Rolf H. Mohring and Rajeev Raman, editors, Algorithms - ESA 2002, 10th Annual European Symposium, Rome, Italy, September 17-21, 2002, Proceedings, volume 2461 of Lecture Notes in Computer Science, pages 348-360. Springer, 2002.
[46] Roland Dobbins. Mirai IoT botnet description and DDoS attack mitigation. https://www.arbornetworks.com/blog/asert/mirai-iot-botnet-description-ddos-attack-mitigation/, 2016.
[47] Marianne Durand and Philippe Flajolet. Loglog counting of large cardinalities (extended abstract). In Giuseppe Di Battista and Uri Zwick, editors, Algorithms - ESA 2003, 11th Annual European Symposium, Budapest, Hungary, September 16-19, 2003, Proceedings, volume 2832 of Lecture Notes in Computer Science, pages 605-617. Springer, 2003.
[48] Cristian Estan and George Varghese. New directions in traffic measurement and accounting. In Proceedings of the ACM SIGCOMM '02 Conference. ACM, 2002.
[49] Cristian Estan and George Varghese. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Trans. Comput. Syst., 21(3):270-313, 2003.
[50] Shir Landau Feibish, Yehuda Afek, Anat Bremler-Barr, Edith Cohen, and Michal Shagam. Mitigating DNS random subdomain DDoS attacks by distinct heavy hitters sketches. In Qun Li and Songqing Chen, editors, Proceedings of the Fifth ACM/IEEE Workshop on Hot Topics in Web Systems and Technologies, HotWeb 2017, San Jose / Silicon Valley, CA, USA, October 12-14, 2017, pages 8:1-8:6. ACM, 2017.
[50] Shir Landau Feibish, Yehuda Afek, Anat Bremler-Barr, Edith Cohen, and MichalShagam. Mitigating DNS random subdomain ddos attacks by distinct heavyhitters sketches. In Qun Li and Songqing Chen, editors, Proceedings of the fifthACM/IEEE Workshop on Hot Topics in Web Systems and Technologies, HotWeb2017, San Jose / Silicon Valley, CA, USA, October 12 - 14, 2017, pages 8:1–8:6.ACM, 2017. 1
[51] P. Ferguson and D. Senie. Rfc 2827: Network ingress filtering: Defeating de-nial of service attacks which employ ip source address spoofing, 2000. http-s://www.ietf.org/rfc/rfc2827.txt. 2.2
[52] Philippe Flajolet, Eric Fusy, Olivier Gandouet, and Frederic Meunier. Hyper-loglog: The analysis of a near-optimal cardinality estimation algorithm. In Anal-ysis of Algorithms (AOFA), 2007. 6.1.1, 6.3, 6.5.5, 6.5.5
[53] Phillip B. Gibbons and Yossi Matias. New sampling-based summary statistics for improving approximate query answers. In SIGMOD. ACM, 1998. Space-Saving Heavy Hitters, 2.1.3, 6.1.1
[54] Anna C. Gilbert, Hung Q. Ngo, Ely Porat, Atri Rudra, and Martin J. Strauss. ℓ2/ℓ2-foreach sparse recovery with low risk. In Fedor V. Fomin, Rusins Freivalds, Marta Z. Kwiatkowska, and David Peleg, editors, ICALP (1), volume 7965 of Lecture Notes in Computer Science, pages 461–472. Springer, 2013. 4.5
[55] Jesus M. Gonzalez, Mohd Anwar, and James B. D. Joshi. A trust-based approach against IP-spoofing attacks. In Ninth Annual Conference on Privacy, Security and Trust, PST 2011, 19-21 July, 2011, Montreal, Quebec, Canada, pages 63–70. IEEE, 2011. 2.2
[56] Michael Greenwald and Sanjeev Khanna. Space-efficient online computation of quantile summaries. In Sharad Mehrotra and Timos K. Sellis, editors, SIGMOD Conference, pages 58–66. ACM, 2001. 2.1.2.2
[57] Kent Griffin, Scott Schneider, Xin Hu, and Tzi-cker Chiueh. Automatic generation of string signatures for malware detection. In Engin Kirda, Somesh Jha, and Davide Balzarotti, editors, RAID, volume 5758 of Lecture Notes in Computer Science, pages 101–120. Springer, 2009. 4.5, 5.2.1
[58] Stefan Heule, Marc Nunkesser, and Alexander Hall. HyperLogLog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm. In EDBT, 2013. 1
[59] Lucas Chi Kwong Hui. Color set size problem with application to string matching. In Alberto Apostolico, Maxime Crochemore, Zvi Galil, and Udi Manber, editors, Combinatorial Pattern Matching, Third Annual Symposium, CPM 92, Tucson, Arizona, USA, April 29 - May 1, 1992, Proceedings, volume 644 of Lecture Notes in Computer Science, pages 230–243. Springer, 1992. 4.5
[60] Arbor Networks Inc. Peakflow. http://www.arbornetworks.com/products/peakflow, August 2004. 3.2.1
[61] Cloudflare Inc. How Cloudflare's architecture can scale to stop the largest attacks, 2017. https://www.cloudflare.com/media/pdf/cf-wp-dns-attacks.pdf. 7.1
[62] Infoblox Inc. Case studies: A large internet service provider. https://www.infoblox.com/resources/case-studies/large-internet-service-provider/. 7.1
[63] J. Mirkovic, S. Dietrich, D. Dittrich, and P. Reiher. Internet Denial of Service: Attack and Defense Mechanisms. Prentice Hall PTR, 2004. 2.2
[64] Lorand Jaakab and Jordi Domingo-Pascual. A selective survey of DDoS related research. Technical Report, UPC-DAC-RR-CBA-2007-3, 2007. 2.2
[65] Jeff Pollard, Joseph Blankenship, and Andras Cser. Quick take: Poor planning, not an IoT botnet, disrupted the internet: Dyn outage underscores the need to plan for failure, October 2016. Forrester Research. 1, 7.1
[66] Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In PODS, 2010. 1
[67] Jeffrey O. Kephart and William C. Arnold. Automatic extraction of computer virus signatures. In 4th International Virus Bulletin Conference, Sept. 1994. 4.5, 5.2.1
[68] Hyang-Ah Kim and Brad Karp. Autograph: Toward automated, distributed worm signature detection. In USENIX Security Symposium, pages 271–286. USENIX, 2004. 4.5, 5.2.1
[69] Tomasz Kociumaka, Tatiana A. Starikovskaya, and Hjalte Wedel Vildhøj. Sublinear space algorithms for the longest common substring problem. In Andreas S. Schulz and Dorothea Wagner, editors, Algorithms - ESA 2014 - 22nd Annual European Symposium, Wroclaw, Poland, September 8-10, 2014. Proceedings, volume 8737 of Lecture Notes in Computer Science, pages 605–617. Springer, 2014. 4.5
[70] Christian Kreibich and Jon Crowcroft. Honeycomb: creating intrusion detection signatures using honeypots. Computer Communication Review, 34(1):51–56, 2004. 4.5, 1, 5.2.1
[71] Sailesh Kumar, Sarang Dharmapurikar, Fang Yu, Patrick Crowley, and Jonathan S. Turner. Algorithms to accelerate multiple regular expressions matching for deep packet inspection. In Luigi Rizzo, Thomas E. Anderson, and Nick McKeown, editors, Proceedings of the ACM SIGCOMM 2006 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Pisa, Italy, September 11-15, 2006, pages 339–350. ACM, 2006. 2.3
[72] Heejo Lee and Kihong Park. On the effectiveness of probabilistic packet marking for IP traceback under denial of service attack. In Proceedings IEEE INFOCOM 2001, The Conference on Computer Communications, Twentieth Annual Joint Conference of the IEEE Computer and Communications Societies, Anchorage, Alaska, USA, April 22-26, 2001, pages 338–347. IEEE, 2001. 2.2
[73] Zhichun Li, Manan Sanghi, Yan Chen, Ming-Yang Kao, and Brian Chavez. Hamsa: Fast signature generation for zero-day polymorphic worms with provable attack resilience. In IEEE Symposium on Security and Privacy, pages 32–47. IEEE Computer Society, 2006. 5.2.1
[74] C. Liu. A new kind of DDoS threat: The nonsense name attack. Network World, 2015. [Online; posted 27-January-2015]. 1.1.5, 7.1
[75] Ziqian Liu. Lessons learned from May 19 China's DNS collapse, November 2009. Talk given on behalf of Dyn Inc. 7.1
[76] Thomas Locher. Finding heavy distinct hitters in data streams. In SPAA. ACM, 2011. 1.1.4, 6.1.1, 6.4, 6.7.1
[77] Matthew V. Mahoney. Network traffic anomaly detection based on packet bytes. In SAC, pages 346–350. ACM, 2003. 2.2, 5.2.2
[78] Udi Manber and Sun Wu. A fast algorithm for multi-pattern searching. Technical Report TR94-17, May 1994. 2.3
[79] Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. PVLDB, 5(12):1699, 2012. 2.1.2.2
[80] Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru M. Parulkar, Larry L. Peterson, Jennifer Rexford, Scott Shenker, and Jonathan S. Turner. OpenFlow: enabling innovation in campus networks. Computer Communication Review, 38(2):69–74, 2008. 1.1.1, 2.4, 3.3, 8.2
[81] Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. Efficient computation of frequent and top-k elements in data streams. In ICDT, pages 398–412, 2005. 1, 1.1.1, 1.1.2, 2.1.2.2, Space-Saving Heavy Hitters, 2.1.3, 3.4.1, 3.4.2.3, 3.5, 4.6.2, 5.4.3, 6.4, 7.3.2.3, 8.1
[82] Jelena Mirkovic, Gregory Prier, and Peter L. Reiher. Source-end DDoS defense. In 2nd IEEE International Symposium on Network Computing and Applications (NCA 2003), 16-18 April 2003, Cambridge, MA, USA, pages 171–178. IEEE Computer Society, 2003. 2.2
[83] Jayadev Misra and David Gries. Finding repeated elements. Sci. Comput. Program., 2(2):143–152, 1982. 1, 2.1.2.1, 2.1.2.2, 6.4
[84] J. Strother Moore. Problem 81-5. Journal of Algorithms, 2:208–209, 1981. 2.1.2
[85] Masoud Moshref, Minlan Yu, Ramesh Govindan, and Amin Vahdat. DREAM: dynamic resource allocation for software-defined measurement. In SIGCOMM, pages 419–430, 2014. 3.2.1
[86] Athicha Muthitacharoen, Benjie Chen, and David Mazieres. A low-bandwidth network file system. In SOSP, pages 174–187, 2001. 4.5
[87] A. S. Navaz, V. Sangeetha, and C. Prabhadevi. Entropy based anomaly detection system to prevent DDoS attacks in cloud. arXiv preprint arXiv:1308.6745, 2013. 2.2, 5.2.2
[88] James Newsome, Brad Karp, and Dawn Xiaodong Song. Polygraph: Automatically generating signatures for polymorphic worms. In IEEE Symposium on Security and Privacy, pages 226–241. IEEE Computer Society, 2005. 5.2.1
[89] Network functions virtualization – introductory white paper. http://portal.etsi.org/NFV/NFV_White_Paper.pdf, 2012. 8.2
[90] Hung Q. Ngo, Ely Porat, and Atri Rudra. Efficiently decodable compressed sensing by list-recoverable codes and recursion. In Christoph Durr and Thomas Wilke, editors, STACS, volume 14 of LIPIcs, pages 230–241. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2012. 4.5
[91] Latest internet plague: Random subdomain attacks. https://nominum.com/wp-content/uploads/2014/10/Nominum-Whitepaper-Latest-Internet-Plague-Random-Subdomain-Attacks.pdf, 2014. 7.1
[92] George Nychis, Vyas Sekar, David G. Andersen, Hyong Kim, and Hui Zhang. An empirical evaluation of entropy-based traffic anomaly detection. In Konstantina Papagiannaki and Zhi-Li Zhang, editors, Internet Measurement Conference, pages 151–156. ACM, 2008. 2.2, 5.2.2
[93] Hyundo Park, Peng Li, Debin Gao, Heejo Lee, and Robert H. Deng. Distinguishing between FE and DDoS using randomness check. In Tzong-Chen Wu, Chin-Laung Lei, Vincent Rijmen, and Der-Tsai Lee, editors, ISC, volume 5222 of Lecture Notes in Computer Science, pages 131–145. Springer, 2008. 1.1.3, 5.1.1
[94] Vern Paxson. Bro: a system for detecting network intruders in real-time. Computer Networks, 31(23-24):2435–2463, 1999. 1.1.3
[95] Tao Peng, Christopher Leckie, and Kotagiri Ramamohanarao. Protection from distributed denial of service attacks using history-based IP filtering. In Proceedings of IEEE International Conference on Communications, ICC 2003, Anchorage, Alaska, USA, 11-15 May, 2003, pages 482–486. IEEE, 2003. 2.2
[96] Philippe Flajolet and G. Nigel Martin. Probabilistic counting algorithms for data base applications. J. Comput. System Sci., 31:182–209, 1985. 1, 6.1.1, 6.3, 6.5.5
[97] Ely Porat and Martin J. Strauss. Sublinear time, measurement-optimal, sparse recovery for all. In Yuval Rabani, editor, SODA, pages 1215–1227. SIAM, 2012. 4.5
[98] M. Zubair Rafique and Juan Caballero. Firma: Malware clustering and network signature generation with mixed network behaviors. In Salvatore J. Stolfo, Angelos Stavrou, and Charles V. Wright, editors, RAID, volume 8145 of Lecture Notes in Computer Science, pages 144–163. Springer, 2013. 5.2.1
[99] J. Rajahalme, A. Conta, B. Carpenter, and S. Deering. RFC 3697, 2004. http://tools.ietf.org/html/rfc3697. 3.3
[100] Supranamaya Ranjan, Ram Swaminathan, Mustafa Uysal, and Edward W. Knightly. DDoS-resilient scheduling to counter application layer attacks under imperfect detection. In INFOCOM 2006, 25th IEEE International Conference on Computer Communications, Joint Conference of the IEEE Computer and Communications Societies, 23-29 April 2006, Barcelona, Catalunya, Spain. IEEE, 2006. 2.2
[101] Supranamaya Ranjan, Ram Swaminathan, Mustafa Uysal, Antonio Nucci, and Edward W. Knightly. DDoS-Shield: DDoS-resilient scheduling to counter application layer attacks. IEEE/ACM Trans. Netw., 17(1):26–39, 2009. 2.2
[102] B. Rosen. Asymptotic theory for successive sampling with varying probabilities without replacement, I. The Annals of Mathematical Statistics, 43(2):373–397, 1972. Space-Saving Heavy Hitters, 6.5.3, 6.6
[103] Secure64. Defending against DDoS attacks that target the DNS, 2017. https://secure64.com/solutions/defending-against-ddos-attacks/. 7.1
[104] Vyas Sekar, Michael K. Reiter, Walter Willinger, Hui Zhang, Ramana Rao Kompella, and David G. Andersen. cSamp: A system for network-wide flow monitoring. In USENIX NSDI, pages 233–246, 2008. 3.2.1
[105] Sajad Shirali-Shahreza and Yashar Ganjali. FleXam: flexible sampling extension for monitoring and security applications in OpenFlow. In HotSDN, pages 167–168, 2013. 3.2.1
[106] Sumeet Singh, Cristian Estan, George Varghese, and Stefan Savage. Automated worm fingerprinting. In OSDI, pages 45–60. USENIX Association, 2004. 4.5, 1, 5.2.1
[107] Mudhakar Srivatsa, Arun Iyengar, Jian Yin, and Ling Liu. Mitigating application-level denial of service attacks on web servers: A client-transparent approach. TWEB, 2(3):15:1–15:49, 2008. 2.2
[108] Brent Stephens, Alan L. Cox, Wes Felter, Colin Dixon, and John B. Carter. PAST: scalable ethernet for data centers. In Conference on emerging Networking Experiments and Technologies, CoNEXT '12, 2012. 3.1
[109] Yong Tang and Shigang Chen. Defending against internet worms: a signature-based approach. In INFOCOM, pages 1384–1394. IEEE, 2005. 5.2.1
[110] Akamai Technologies. How the Mirai botnet is fueling today's largest and most crippling DDoS attacks, 2016. https://www.akamai.com/us/en/multimedia/documents/white-paper/akamai-mirai-botnet-and-attacks-against-dns-servers-white-paper.pdf. 7.1
[111] Justin Thaler, Michael Mitzenmacher, and Thomas Steinke. Hierarchical heavy hitters with the space saving algorithm. In David A. Bader and Petra Mutzel, editors, ALENEX, pages 160–174. SIAM / Omnipress, 2012. 2.1.3, 4.5
[112] Alok Tongaonkar, Ruben Torres, Marios Iliofotou, Ram Keralapura, and Antonio Nucci. Towards self adaptive network traffic classification. Computer Communications, 56:35–46, 2015. 5.2.1
[113] Matt Torrisi. Advanced secondary DNS for the technically inclined, November 2016. Published on dyn.com. 7.1
[114] Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249–260, 1995. 4.5
[115] Niels L. M. van Adrichem, Christian Doerr, and Fernando A. Kuipers. OpenNetMon: Network monitoring in OpenFlow software-defined networks. In NOMS, pages 1–8. IEEE, 2014. 3.2.1
[116] Shobha Venkataraman, Dawn Xiaodong Song, Phillip B. Gibbons, and Avrim Blum. New streaming algorithms for fast detection of superspreaders. In Proc. Network and Distributed System Security Symposium (NDSS), 2005. 1.1.4, 6.1.1, 6.4, 6.7.2.2
[117] Verisign distributed denial of service trends report Q4 2015. https://www.verisign.com/assets/report-ddos-trends-Q42015.pdf, 2015. 7.1, 8.2
[118] Luis von Ahn, Manuel Blum, Nicholas J. Hopper, and John Langford. CAPTCHA: using hard AI problems for security. In Eli Biham, editor, Advances in Cryptology - EUROCRYPT 2003, International Conference on the Theory and Applications of Cryptographic Techniques, Warsaw, Poland, May 4-8, 2003, Proceedings, volume 2656 of Lecture Notes in Computer Science, pages 294–311. Springer, 2003. 2.2
[119] Bing Wang, Yao Zheng, Wenjing Lou, and Y. Thomas Hou. DDoS attack protection in the era of cloud computing and software-defined networking. Computer Networks, 81:308–319, 2015. 2.2
[120] Ke Wang and Salvatore J. Stolfo. Anomalous payload-based network intrusion detection. In Erland Jonsson, Alfonso Valdes, and Magnus Almgren, editors, RAID, volume 3224 of Lecture Notes in Computer Science, pages 203–222. Springer, 2004. 5.2.2
[121] Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1–11. IEEE Computer Society, 1973. 4.5
[122] Qiao Yan, F. Richard Yu, Qingxiang Gong, and Jianqiang Li. Software-defined networking (SDN) and distributed denial of service (DDoS) attacks in cloud computing environments: A survey, some research issues, and challenges. IEEE Communications Surveys and Tutorials, 18(1):602–622, 2016. 2.2
[123] Ke Yi and Qin Zhang. Optimal tracking of distributed heavy hitters and quantiles. Algorithmica, 65(1):206–223, 2013. 8.2
[124] Minlan Yu, Lavanya Jose, and Rui Miao. Software defined traffic measurement with OpenSketch. In USENIX NSDI, pages 29–42, 2013. 1, 3.1.1, 3.2.1, 3.4.3.1
[125] Ye Yu, Chen Qian, and Xin Li. Distributed and collaborative traffic monitoring in software defined networks. In HotSDN, pages 85–90, 2014. 3.2.1
[126] Ali Zand, Giovanni Vigna, Xifeng Yan, and Christopher Kruegel. Extracting probable command and control signatures for detecting botnets. In Yookun Cho, Sung Y. Shin, Sang-Wook Kim, Chih-Cheng Hung, and Jiman Hong, editors, Symposium on Applied Computing, SAC 2014, Gyeongju, Republic of Korea, March 24-28, 2014, pages 1657–1662. ACM, 2014. 5.2.1
[127] Saman Taghavi Zargar, James Joshi, and David Tipper. A survey of defense mechanisms against distributed denial of service (DDoS) flooding attacks. IEEE Communications Surveys and Tutorials, 15(4):2046–2069, 2013. 2.2, 5.2.2
[128] Qi Zhao, Zihui Ge, Jia Wang, and Jun (Jim) Xu. Robust traffic matrix estimation with imperfect information: making use of multiple data sources. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS/Performance 2006. 3.1
Tel Aviv University
The Raymond and Beverly Sackler Faculty of Exact Sciences
The Blavatnik School of Computer Science
Heavy Hitters Extensions for the Detection of Complex Network Traffic Anomalies
Thesis submitted for the degree of "Doctor of Philosophy"
by
Shir Landau Feibish
This research work was carried out under the supervision of
Prof. Yehuda Afek
Submitted to the Senate of Tel Aviv University
Av 5777 (August 2017)
Abstract
In recent years, the enormous amount of network traffic, together with new attack vectors, has raised the need for new tools that can detect and pinpoint specific phenomena in the traffic. These phenomena include new kinds of repetitions and patterns of strings and data in network traffic. In this thesis, we study some of these complex patterns and repetitions and offer fundamental techniques for identifying them, both in Software Defined Networks and in classical networks. Additionally, we examine the implications that such traffic has on network monitoring and on network security, and offer tools for dealing with these implications.
A central building block that we extend and generalize is the problem of finding frequently repeating items in data (Heavy Hitters). In this work, we study three variations of this problem. For each variation, we present new concepts and new problem definitions, along with new and efficient algorithms for solving it. First, in Chapter 3, we propose new, time-based definitions for the heavy hitters problem. Second, in Chapter 4, we study the problem of finding heavy strings of varying lengths in a stream of strings (messages). Finally, in Chapter 6, we study the problem of finding distinct heavy hitters in a stream of <key, subkey> pairs, which is the problem of finding keys that have many different subkeys.
Using these algorithms, we present three applications, each of which makes use of one of the techniques presented above. First, based on the time-based definitions, we have developed mechanisms for detecting different types of heavy flows in Software Defined Networks (Chapter 3), for the purposes of security, monitoring and management of such networks. Second, using our algorithm for finding heavy strings of varying lengths in a stream of strings, we have developed a system for detecting signatures of application-level distributed denial of service (DDoS) attacks (Chapter 5). Finally, we present a system for the mitigation of random subdomain attacks on the Domain Name System (DNS) (Chapter 7), which makes use of our algorithm for finding distinct heavy hitters, in particular in a stream of pairs. Both of these attacks, and especially the latter, have recently gained much attention due to the considerable impact they have on millions of Internet users, and our systems present new methods for mitigating these attacks.
We have evaluated our tools and demonstrate their effectiveness both analytically and empirically. To this end, we implemented our systems and algorithms using a variety of technologies and performed tests on real traffic, including samples of real attacks.
Extended Abstract
The amount of data traversing global communication networks has risen drastically over the last twenty years. As the amount of traffic grows, the enormous quantity of packets traversing the networks creates new risks to the functionality of the network. The networking community, which includes researchers, designers and operators, is in a constant struggle to enable the passage of legitimate data through the network while preventing the passage of illegitimate content.
" שיכולים big-dataסיכונים אלו יוצרים צורך מתמשך במציאת פתרונות חדשים מעולם ה"
התעבורה לבדו מציב מכשולים עשרות מיליוני פקטות בשנייה. ראשית, נפחלהתמודד עם
ות של או התפרצויות גדול eventsflashרבים בתפקוד התקין של הרשתות. אירועים כגון
מנגנונים של איזון עומסים לצורך שימור איכות אמצעות וטיפול ב תעבורה דורשים זיהוי מידי
שנית, גורמים זדוניים ברחבי העולם מבצעים אינספור התקפות מידי יום. השירות ברשת.
). exploitszero-dayתוקפים מייצרים התקפות מסוגים חדשים אשר אין לגביהן ידע מוקדם (
תוקפים עושים שימוש בקבוצות גדולות מאוד של מכונות פגועות המכונות בוטנט בנוסף, ה
)Botnet .הגורמות לעלייה מתמדת בהיקף ועוצמת ההתקפות (
In order to cope with these risks, network administrators are in effect required to find a needle in a haystack. That is, among the huge number of legitimate packets making their way through the network, even a very small percentage of anomalous packets can have a decisive impact on the network. Advanced techniques are needed that can locate these anomalous packets and handle them. Such anomalous packets can have diverse characteristics and take various forms. For example, the packets may have a strange structure: they may have an exceptionally large payload, or contain a header field that is not in common use. In other cases, a group of packets may have special characteristics; for example, many packets from different sources that are all destined for a single destination, or an anomalous number of requests to a certain site.
My thesis presents advanced techniques for the characterization and detection of some of these exceptional network phenomena that have recently been observed across the network. We provide new insights for the big data world, as well as tools and algorithms for detecting repetitions of data in network traffic for a variety of network applications. Specifically, we focus on traffic anomalies related to aspects of network security, and create mechanisms for the mitigation of various zero-day attacks, including recently observed attacks on the Domain Name System (DNS), which threaten the very core of the network's functionality [65].
ים עמו אנו מתמודדים, הינו הזיהוי של כמויות גדולות של תעבורה אחד האתגרים המרכזי
תכונה משותפת כלשהי, אשר נקראת גם זרימה גדולה או כבדה של תעבורה (ראה להש
). במובן הקלאסי, זרימה אופיינה כרצף פקטות שנשלחו ממקור אחד ליעד 3הגדרה בפרק
ו נרחבת בהרבה. כיום, זרימה הינה אחד. ברשתות כפי שאנו מכירים אותן כיום, ההגדרה הז
רצף פקטות אשר יש להן שדות כותרת זהים כלשהם. זיהוי של זרימה כבדה בתעבורה הינה
הבטחת רמת שירות אחת היכולות הבסיסיות הדרושות ברשת. זוהי יכולת מפתח לצורך
)QualityofService,זיהוי של ), תכנון נפחים ועומסים, והנדסת תעבורה יעילה. יתרה מזאת
DenialDistributedזרימה כבדה קריטית לצורך זיהוי של התקפות מניעת שירות מבוזרות (
ofService(DDoS) .ברשת (
Traditional techniques for the detection of heavy flows were based on flow measurement [4, 33], but these are not scalable [49]. Therefore, given the growing amounts of traffic, more sophisticated methods have been developed, such as those presented in [28, 49, 124]. We continue this line of research on efficient methods for detecting heavy flows, and propose various solutions that are based on the family of streaming algorithms developed for the heavy hitters problem.
The heavy hitters problem is a well-studied problem that deals with finding the popular items in a stream of items. In the classical sense of the problem, given a stream of N items, a heavy hitter is an item that appears at least θN times, for some 0 < θ < 1 [83]. Solutions such as the Space-Saving algorithm of Metwally et al. [81] or the Sample and Hold algorithm of Estan and Varghese [49] identify the heavy hitters and provide an estimate of the number of times they appeared in the stream.
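To make the classical setting concrete, the following minimal Python sketch illustrates the Space-Saving idea cited above; it is an illustrative toy, not the implementation used in this thesis:

```python
def space_saving(stream, k):
    """Space-Saving (Metwally et al. [81]): track at most k counters.
    Each tracked item's counter overestimates its true frequency by at
    most N/k, so every theta-heavy hitter with theta > 1/k is tracked."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # evict the minimum counter; the newcomer inherits its count + 1
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters

# 'a' appears 6 times in a stream of N = 10, so with k = 2 counters
# (error bound N/k = 5) it is guaranteed to remain tracked.
sketch = space_saving(['a', 'b', 'a', 'c', 'a', 'a', 'd', 'a', 'e', 'a'], k=2)
```

The key point is that the memory is fixed at k counters regardless of the stream length, at the cost of overestimated counts for evicted-and-readmitted items.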
We broaden the original scope of the problem and study the problem of finding heavy hitters in different types of traffic and in different network architectures. As we show in this thesis, detecting different types of heavy hitters requires dedicated algorithms.
First, we extend the classical definition of the problem and incorporate aspects of time locality into it. We present new definitions for the problem that take the time dimension into account. We propose methods for detecting heavy flows in Software Defined Networks (SDN). In addition, we propose algorithms for detecting different types of heavy flows in a software defined network architecture, both for a single switch and for a distributed array of switches (Chapter 3).
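One natural time-local variant of the problem restricts attention to a recent window of packets. The toy below is an exact sliding-window version (the bounded-memory approximate algorithms of Chapter 3 are the actual contribution; this sketch only illustrates the definition):

```python
from collections import Counter, deque

def windowed_heavy_hitters(stream, window, theta):
    """Exact time-local heavy hitters: after each packet, yield the set of
    flows occupying at least a theta fraction of the last `window` packets."""
    counts, recent = Counter(), deque()
    for flow in stream:
        recent.append(flow)
        counts[flow] += 1
        if len(recent) > window:
            expired = recent.popleft()
            counts[expired] -= 1
            if counts[expired] == 0:
                del counts[expired]
        yield {f for f, c in counts.items() if c >= theta * len(recent)}

# Flow 'a' dominates the start of the stream, flow 'b' its end; with a
# window of 4 packets the reported heavy flow changes accordingly.
snapshots = list(windowed_heavy_hitters(['a'] * 5 + ['b'] * 5, 4, 0.5))
```

Unlike the whole-stream definition, here an item that was heavy early on stops being reported once it falls out of the window.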
Second, while the classical heavy hitters problem deals with a stream of numbers, we study the problem of detecting heavy hitters in other forms of data. In particular, we focus on strings, and on items that are composed of a pair of data elements. We examine the notion of heavy hitters in textual data (that is, in a stream of strings) and present efficient algorithms for detecting frequent substrings of varying lengths (String Heavy Hitters) in a large stream of strings (Chapter 4). In addition, we propose new algorithms for finding distinct heavy hitters in a stream of pairs of the form <key, subkey> (Chapter 6). The approach we present for finding heavy hitters in pairs is based on algorithms for the problem of approximate distinct counting. The problem is defined as follows: given a stream of items, how many distinct items have been observed up to a certain point in the stream. Several sketch-based solutions exist for this problem, such as those presented in [25, 37, 47, 58, 66, 96].
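As a toy illustration of approximate distinct counting, the following sketch uses a k-minimum-values (bottom-k) estimator, one simple member of the family of sketches cited above (a real implementation would keep only the k smallest hashes online rather than materializing the distinct set; the function names here are ours):

```python
import hashlib

def _hash01(x):
    """Map an item to a pseudo-uniform value in [0, 1) via a stable hash."""
    digest = hashlib.sha1(str(x).encode()).digest()
    return int.from_bytes(digest[:8], 'big') / 2.0**64

def kmv_estimate(stream, k=64):
    """k-minimum-values distinct-count estimate: keep the k smallest hash
    values; if the item hashes to fewer than k distinct values, the count
    is exact, otherwise estimate (k - 1) / (k-th smallest hash)."""
    mins = sorted({_hash01(x) for x in stream})[:k]
    if len(mins) < k:
        return len(mins)
    return (k - 1) / mins[-1]

# 100000 items, but only 1000 distinct values: the estimate should be
# close to 1000 even though each value repeats 100 times.
est = kmv_estimate((i % 1000 for i in range(100000)), k=64)
```

The intuition: k uniform hash values in [0, 1) drawn from n distinct items have their k-th minimum near k/n, so inverting that minimum recovers n approximately, with relative error shrinking as k grows.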
We show the usefulness of the above algorithms in detecting anomalous sequences of packets and data patterns, and demonstrate how they assist in the detection and mitigation of various types of DDoS attacks that have been observed in recent years. Our general approach, which is depicted in Figure 1 below, consists of a two-phase process. First, peacetime traffic is analyzed in order to create a baseline of the patterns found in the traffic during peacetime. At attack time, the traffic is analyzed to detect patterns, which are then checked against the baseline in order to determine whether they are unique to the attack or are also part of the peacetime traffic. The patterns or repetitions that are unique to the attack form the signatures of the attack, and can be used for the mitigation of the attack.
Figure 1: A general overview of a system for the mitigation of distributed denial of service attacks.
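As a schematic illustration only, the two-phase process above can be sketched as follows; here the "patterns" are whole messages for simplicity, whereas the thesis systems operate on substrings and flow keys:

```python
from collections import Counter

def extract_signatures(peacetime_msgs, attack_msgs, threshold):
    """Phase 1: build a baseline of patterns seen in peacetime traffic.
    Phase 2: patterns that are frequent in attack-time traffic but absent
    from the baseline become candidate attack signatures."""
    baseline = set(peacetime_msgs)
    attack_counts = Counter(attack_msgs)
    return {pattern for pattern, count in attack_counts.items()
            if count >= threshold and pattern not in baseline}

# The attack tool's telltale trailing "\r\n" never occurs in peacetime,
# so only that variant of the request is reported as a signature.
peace = ["GET / HTTP/1.1", "GET /index HTTP/1.1"]
attack = ["GET / HTTP/1.1\r\n"] * 5 + ["GET / HTTP/1.1"] * 3
signatures = extract_signatures(peace, attack, threshold=4)
```

Checking attack-time patterns against the peacetime baseline is what keeps legitimate traffic, which also appears during the attack, out of the signature set.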
In Chapter 5 we use the algorithm for detecting frequent substrings of varying lengths in a stream of strings to build a system for the mitigation of application-level distributed denial of service attacks [11, 12]. The packets that make up these attacks usually contain a small footprint left by the tools that generate the attack packets. These footprints can be very small; for example, an extra newline (carriage return) that does not usually appear in packets of this type. Our algorithm can find these footprints within the context of the packet's content. The content of the footprint can then be used to identify subsequent attack packets, and thus stop the attack.
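An exact but memory-hungry baseline for this problem can be written directly; the thesis algorithms achieve the same goal with bounded memory, which this naive sketch does not attempt:

```python
from collections import Counter

def string_heavy_hitters(messages, min_len, max_len, theta):
    """Exact varying-length substring heavy hitters: a substring is heavy
    if it appears (counted once per message) in >= theta * N messages."""
    n = len(messages)
    counts = Counter()
    for m in messages:
        # count each substring at most once per message
        subs = {m[i:i + length]
                for length in range(min_len, max_len + 1)
                for i in range(len(m) - length + 1)}
        counts.update(subs)
    heavy = {s for s, c in counts.items() if c >= theta * n}
    # report only maximal heavy substrings: drop any heavy substring that
    # is contained in a longer heavy substring with the same support
    return {s for s in heavy
            if not any(s != t and s in t and counts[t] == counts[s]
                       for t in heavy)}

# The injected footprint "ABC" is the only maximal substring appearing
# in at least 70% of the messages.
sigs = string_heavy_hitters(["xxABCyy", "zABCz", "ABCq", "qqq"], 2, 3, 0.7)
```

The maximality filter matters: every fragment of a frequent footprint is itself frequent, so without it the output would be flooded with redundant substrings.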
In Chapter 7 we use our algorithms for finding distinct heavy hitters in pairs to build a system for the mitigation of random subdomain attacks on the Domain Name System (DNS) [50]. In these attacks, a large number of unique requests are sent to some domain, each containing a pseudo-random subdomain. Our algorithms are able to detect these requests by identifying domains (keys) that appear in the request stream together with many different subdomains (subkeys), thereby identifying the attacked domain.
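A simplified sketch of this idea keeps, per key, a bottom-k sketch of subkey hashes and reports keys whose estimated number of distinct subkeys is large. Unlike the full algorithms of Chapter 7, this toy does not bound the number of tracked keys:

```python
import hashlib

def _hash01(x):
    """Map a string to a pseudo-uniform value in [0, 1) via a stable hash."""
    digest = hashlib.sha1(x.encode()).digest()
    return int.from_bytes(digest[:8], 'big') / 2.0**64

def distinct_heavy_hitters(pairs, k=32, threshold=100):
    """Keep only the k smallest subkey hash values per key (a bottom-k
    sketch) and report keys whose estimated number of distinct subkeys
    reaches the threshold."""
    sketches = {}
    for key, subkey in pairs:
        sketch = sketches.setdefault(key, set())
        sketch.add(_hash01(subkey))
        if len(sketch) > k:
            sketch.remove(max(sketch))  # retain only the k smallest values
    report = {}
    for key, sketch in sketches.items():
        if len(sketch) < k:
            estimate = len(sketch)  # exact: fewer than k distinct subkeys
        else:
            estimate = (k - 1) / max(sketch)
        if estimate >= threshold:
            report[key] = estimate
    return report

# "victim.com" receives 1000 pseudo-random subdomains; "normal.com" is
# queried 500 times but always for the same subdomain, so only the
# attacked domain is reported.
queries = ([("victim.com", "r%d" % i) for i in range(1000)]
           + [("normal.com", "www")] * 500)
attacked = distinct_heavy_hitters(queries, k=32, threshold=100)
```

Note that plain frequency-based heavy hitters would flag "normal.com" as well; it is the distinct subkey count, not the request count, that separates the attacked domain from a merely popular one.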