Accurately Measuring Global Risk of Amplification Attacks using AmpMap

Soo-Jin Moon†, Yucheng Yin†, Rahul Anand Sharma†, Yifei Yuan§∗, Jonathan M. Spring⋄, Vyas Sekar†

†Carnegie Mellon University, §Alibaba Group, ⋄CERT/CC, SEI, Carnegie Mellon University

October 13, 2020 · CMU-CyLab-19-004 · CyLab, Carnegie Mellon University, Pittsburgh, PA 15213

Abstract

Many recent DDoS attacks rely on amplification, where an attacker induces public servers to generate a large volume of network traffic to a victim. In this paper, we argue for a low-footprint Internet health monitoring service that can systematically and continuously quantify this risk to inform mitigation efforts. Unfortunately, the problem is challenging because amplification is a complex function of query (header) values and server instances. As such, existing techniques that enumerate the total number of servers or focus on a specific amplification-inducing query are fundamentally imprecise. In designing AmpMap, we leverage key structural insights to develop an efficient approach that searches across the space of protocol headers and servers. Using AmpMap, we scanned thousands of servers for 6 UDP-based protocols. We find that relying on prior recommendations to block or rate-limit specific queries still leaves substantial residual risk, as they miss many other amplification-inducing query patterns. We also observe significant variability across servers and protocols; thus, prior approaches that rely on a server census can substantially misestimate amplification risk.

1 Introduction

Many recent high-profile Distributed Denial-of-Service (DDoS) attacks rely on amplification [54, 57]. In an amplification attack, an attacker spoofs the victim's source IP address and sends queries to a public server (e.g., DNS, NTP, Memcached), which in turn sends large responses to the victim. If a source IP address can be spoofed, any stateless protocol in which the response is larger than the query can be abused. While there are various best practices to mitigate this situation given that spoofing is possible (e.g., [1–3]), they are unevenly applied. Spoofing the victim's IP may be avoidable in a future Internet (e.g., [26]), but it continues to be possible from a large number of ISPs [11, 23]. Finally, there continue to be many public-facing servers that can be exploited for amplification [57]; many servers do not apply best-practice mitigations (e.g., rate limiting, restricting access).

As networks evolve and server deployments change, the potential for amplification attacks changes over time. For instance, new avenues for amplification emerge (e.g., botnet, gaming protocols), and unexpected vectors for known protocols are discovered [16]. In light of the continued threat of amplification, we argue that we need an Internet-scale monitoring service that can systematically and continuously measure the empirical risk of amplification [7, 13]. We envision a service that periodically maps each server to query patterns yielding high amplification and quantifies these amplification factors (AF). Such a framework can serve as an empirical foundation for the cyber-risk quantification that many have argued for [5, 10]. Furthermore, this framework can inform remediation efforts such as throttling servers, generating signatures, informing protocol changes, and provisioning defenses.

∗Contributions by Yifei Yuan were made during the time he was a postdoctoral researcher at Carnegie Mellon University.

At first glance, it seems that we can use or extend existing scanning services that look for and enumerate open/public servers for different protocols: e.g., Censys [34], ZMap [35], and openresolver [9] monitor open DNS resolvers, and shadowserver [19] reports on open CharGen, LDAP, QOTD, and SNMP servers, among others. For instance, we can multiply the number of open servers by previously reported amplification factors (AF) [5, 57]. We can also extend these scanning services to probe servers using a set of known query patterns (e.g., send ANY requests to DNS servers) to account for per-server factors (rather than using a single global amplification factor for all servers). Unfortunately, these approaches have fundamental shortcomings (§2.2): they assume that the amplification that servers yield is homogeneous or that servers share an identical set of query patterns. In reality, we see significant and unpredictable variability across servers (including servers running the same software versions) in both the amplification and the query patterns that yield it. Thus, these approaches are inaccurate for estimating the empirical risk and for informing remediation efforts.

At the other extreme, we can envision a brute-force approach of sending all possible protocol-compliant queries to


Protocol  | Known pattern                      | AmpMap-discovered: new patterns                     | AmpMap-discovered: polymorphic variants
----------|------------------------------------|-----------------------------------------------------|----------------------------------------
DNS       | EDNS:0, ANY [1], TXT [18] lookups  | EDNS != 0; LOC, SRV, URI lookups; ...               | rd:0 (off); DNSSEC:0 (off); EDNS payload < 512; ...
NTP       | monlist [2, 57]                    | if stats; if reload; get restrict; peer list; ...   | None
SNMP v2   | GetBulk request [3, 57]            | GetNext request; Get request                        | Vary an object identifier (OID); vary the # of OIDs
Chargen   | Character generation request [57]  | None                                                | None
Memcached | Stats command [3]                  | None                                                | None
SSDP      | SEARCH request [3, 57]             | None                                                | ssdp:all; upnp:rootdevice; ...

Table 1: Summarizing known, unforeseen, and polymorphic query patterns found using AmpMap

servers for each protocol. Unfortunately, the search space of possible queries is large (e.g., NTP has multiple 32-bit fields). We can also consider simple fuzzing or existing heuristic-based optimization techniques, but they all have fundamental limitations, as the relationship between packet field values and amplification can be quite complex. This highlights a fundamental tension between the overhead of such an amplification-monitoring service and its utility.

In this paper, we present AmpMap, a framework for measuring the risk of amplification with a low network footprint that accounts for both server- and query-specific variability. Our approach builds on key structural insights. First, we observe that distinct amplification-inducing query patterns overlap in terms of values in protocol fields. This locality structure suggests that if we find one such pattern, we can potentially uncover other related patterns. Second, we observe that large fields (e.g., 16 or 32 bit) either do not affect amplification (e.g., the timestamp for NTP) or, when they do, have contiguous structure (e.g., the EDNS payload for DNS). This structure suggests that we can use smart sampling strategies to efficiently explore the search space of large fields. Finally, even though protocol server implementations are diverse, they share some similarities. This helps us further reduce overhead and improve fidelity by sharing insights across servers.1

Findings: We implemented AmpMap, validated our parameter settings in lab settings, and ran real-world measurements. Our key findings (§5) are:

• Uncovering new patterns and polymorphic variants: We discovered new patterns and polymorphic variants (of known ones), in addition to confirming findings from prior work (e.g., GetBulk for SNMP [3], ANY or TXT lookups for DNS [3, 57, 62]). Table 1 summarizes our findings. For DNS, we also uncover multiple patterns (e.g., URI, SRV, CNAME lookups) that collectively incur 21.9× more risk than a popularly known pattern (the ANY lookup). While some of these DNS patterns have been noted by (mostly) the operational community (e.g., A, RRSIG [58, 62, 64]), many have not been documented to the best of our knowledge. For NTP, apart from the monlist request, we discover that get restrict and if stats can also incur amplification factors (AF) above 500×. For SNMP, apart from GetBulk [3, 57], GetNext requests can incur amplification of up to a few hundred. We also discover polymorphic variants due to server diversity: for the GetBulk request, SNMP servers can yield orders-of-magnitude higher amplification when queried for certain object identifiers (OIDs) and for the right number of OIDs.

1 While we acknowledge that these insights may not be universal across protocols, they hold in practice for many protocols that have been popular targets.

• Variability across servers and protocols: We observe significant variability in the amplification that each server can yield; e.g., the amplification factor (AF) can vary from 0 to 1300 for NTP. This confirms that we cannot assess amplification risk by looking only at mega-amplifiers or by simply counting the number of servers. We also observe substantial variability in the AF distribution across protocols; e.g., 60.4% of Chargen servers can yield AF above 100, but only 0.02% of DNS servers can. Such variability across multiple dimensions calls for periodic measurements rather than one-time analysis.

• Empirical risk quantification: By analyzing our measurement data, we unfortunately find that just disabling the few known patterns (Table 1) is far from enough; e.g., blocking EDNS0 and ANY or TXT lookups for DNS still leaves 17.9× the residual risk from “other” patterns (Table 6). Further, using an additive risk metric (§2), we highlight the imprecision of the risk estimated by prior work. Even if we focus on the known patterns (e.g., GetBulk for SNMP), existing techniques underestimate SNMP risk by 3.5× and overestimate Memcached risk by 5.6K× and DNS risk by 1.9×. If we consider new patterns, the inaccuracy gets worse; e.g., DNS risk is underestimated by 11.9×.

Ethics and Disclosure: We carefully adhered to ethical principles in running our measurements (§6.1). We have also disclosed the newly discovered patterns to relevant stakeholders such as CERT, vendors, and IP address owners (§6.2). We also discuss countermeasures in light of our findings (§8).

2 Background and motivation

We start with background on amplification attacks. We then motivate the need for empirically measuring amplification risk and discuss why strawman solutions are insufficient.

Primer on amplification: In an amplification attack (Figure 1), the attacker spoofs a victim's source IP and sends a small query/request (e.g., 60 bytes) to one or more public servers that act as amplifiers. The amplifiers then send large responses to the victim. The amplification factor (AF) is the ratio of the response size to the query size, e.g., |r|/|q| = 100 in Figure 1. AF is also referred to as BAF (i.e., bandwidth AF) in prior work [5, 57]. (We do not report packet amplification factors, or PAF [57], for brevity.2) Amplification attacks are well known [54] and have been exploited at scale [16, 21, 22]. For example, one of the query patterns that induce high amplification for DNS is ⟨EDNS:0, EDNS payload:(1000, 65535), record type:ANY, ···⟩. Here, EDNS is set to version 0, allowing a DNS server to use a non-default payload size and send large responses (the default value is 512 bytes). The EDNS payload is set to greater than 1K to override the 512-byte default, and the record type is set to ANY to look up all records for a given domain.

[Figure 1: Primer on the amplification attack and the amplification factor (AF). The attacker sends a “spoofed” query q (|q| = 60 bytes) to a public server (amplifier), which sends the response r (|r| = 6000 bytes) to the victim; AF = |r|/|q| = 100.]
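To make this example pattern concrete, the sketch below assembles such a query with Python's standard library. This is an illustration, not the paper's tooling: the field layout follows RFC 1035/RFC 6891, and the domain `example.com` and transaction id are placeholders.

```python
import struct

def build_dns_any_query(domain: str, edns_payload: int = 4096) -> bytes:
    """Build a DNS query matching the high-AF pattern
    <EDNS:0, EDNS payload: large, record type: ANY>."""
    # Header: id, flags (rd=1 -> 0x0100), qdcount=1, ancount=0, nscount=0,
    # arcount=1 (for the OPT pseudo-record)
    header = struct.pack(">HHHHHH", 0x1234, 0x0100, 1, 0, 0, 1)
    # Question: QNAME as length-prefixed labels, QTYPE=255 (ANY), QCLASS=1 (IN)
    qname = b"".join(bytes([len(l)]) + l.encode() for l in domain.split(".")) + b"\x00"
    question = qname + struct.pack(">HH", 255, 1)
    # OPT pseudo-RR (EDNS version 0): root name, type=41, class=requested
    # payload size, ttl=0 (extended rcode/flags), rdlength=0
    opt = b"\x00" + struct.pack(">HHIH", 41, edns_payload, 0, 0)
    return header + question + opt

q = build_dns_any_query("example.com")
print(len(q))  # 40 -- a ~40-byte query can elicit a multi-kilobyte response
```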

2.1 Motivating use cases

We summarize two motivating use cases argued for by prior academic and policy efforts (e.g., [5, 10, 57]). For both use cases, two aspects of each server/amplifier are relevant: (1) which query patterns cause large amplification, and (2) how much amplification each query pattern induces.

U1) Assessing cyber risk: Network operators need to know whether, and by how much, their deployments are susceptible to amplification. Policy makers and Internet security experts need a risk assessment to focus their remediation efforts on the highest-priority risk. Given a query pattern p for a protocol Proto and a set of servers S, we define a simple additive risk metric as follows:

    RiskMetric(p, S) = Σ_{si ∈ S} AF(si, p)    (1)

Then, given a set of patterns P, the total risk is the sum of the risk for each pattern p ∈ P. Even though this does not consider other factors [5] (e.g., outbound link capacity), it is an instructive metric for quantifying risk.
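A minimal sketch of this metric (Eq. 1), assuming AF measurements are stored as a server-to-pattern-to-AF mapping; the server names and AF values below are illustrative, not from the paper's dataset.

```python
# af[server][pattern] = measured amplification factor (illustrative data)
af = {
    "s1": {"ANY": 50.0, "TXT": 30.0},
    "s2": {"ANY": 10.0, "TXT": 70.0},
}

def risk_metric(pattern, servers):
    """Eq. 1: RiskMetric(p, S) = sum over servers si in S of AF(si, p)."""
    return sum(af[s].get(pattern, 0.0) for s in servers)

def total_risk(patterns, servers):
    """Total risk: sum of the per-pattern risk over all p in P."""
    return sum(risk_metric(p, servers) for p in patterns)

print(risk_metric("ANY", ["s1", "s2"]))        # 60.0
print(total_risk(["ANY", "TXT"], ["s1", "s2"]))  # 160.0
```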

U2) Informing defense efforts: Operators need to know which query patterns induce high amplification in order to mount appropriate defenses (e.g., block or throttle responses). Similarly, protocol designers need to know these patterns to (1) guide the design of future protocols and (2) assess whether a particular remediation (e.g., disabling a feature) can reduce the risk. Lastly, ISPs need to know the degree to which servers are susceptible to amplification to inform capacity provisioning for defenses. Here, the per-pattern risk can also help prioritize remediation efforts to focus on the largest threats first.

2 PAF is the number of IP packets that an amplifier sends for a request.

2.2 A case for a measurement service

Given these use cases, we can consider some seemingly natural strategies derived (or extended) from prior work on amplification analysis (e.g., [5, 32, 57]):

• S1) Scan for open servers: Using a count of the number of open servers, we can multiply this number by a fixed, known AF (e.g., 556 for NTP [24]). For instance, if there are 1M open NTP servers, this approach would multiply 1M by an AF of 556; for a 50-byte request, this translates to 27.8 billion bytes. Such information can be used for risk quantification (U1) and for informing network operators about their servers (U2), akin to existing efforts (e.g., [5]).
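The S1 arithmetic can be checked in a few lines (the numbers are the ones used in the text):

```python
open_servers = 1_000_000   # 1M open NTP servers
known_af = 556             # fixed, known AF reported for NTP [24]
request_bytes = 50         # size of one request

# S1 estimate: every server amplifies one request by the same fixed factor
total_bytes = open_servers * known_af * request_bytes
print(total_bytes)  # 27800000000, i.e., 27.8 billion bytes
```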

• S2) Probe servers using fixed patterns: S1 assumes that servers have identical risk and does not account for multiple patterns. A more advanced strategy is to probe servers using previously known patterns and record their AFs (e.g., DNS [61], NTP [32]). We can then use this to assess risk (U1) and construct signatures (U2). However, there are different options for choosing which patterns to probe (e.g., taking the known patterns, or taking the top-K patterns from random sampling).

• S3) Customize S2 for different server software: S2 does not account for the variability of query patterns across servers. If servers with the same software setup have similar patterns, then we can run S2 once per software setup (e.g., Bind 9.3, Dnsmasq 2.76) and thereby reduce the number of probes we send.

To understand whether these strategies are effective, we ran a small-scale measurement study using DNS as an example. We use DNS because its amplification properties are seemingly well understood [24, 57]. We identify a set of 172 queries based on three fields (record type, EDNS, and recursion desired, or rd for short) that are known to affect amplification [1, 3, 57].3 (As we will see later, these three fields do not represent the full set of fields that affect amplification; rather, we use them as an illustrative set of query patterns to highlight why these strategies are imprecise.) We then pick a random sample of 1K DNS servers from Censys [34], send each of the 172 queries, and record the AF per query. We also obtained the version string (if available) for each server using Nmap.

In this dataset, we observe 94 unique patterns that incur AF ≥ δ, where δ = 10, with a total risk of 125.8K AF (using Eq. 1); if these servers are connected to a mere 10 Mbit/s connection, 125.8K translates to 918 Gbps across 1K servers.4 Using this “ground truth”, we evaluate the above strategies using two metrics: (1) the risk-estimation accuracy (for U1), and (2) the number of missed query patterns (for U2).

3 We generated 172 queries using combinations of 43 values of record type = {A, NS, CNAME, ···}, EDNS = {0, 1}, and rd = {0, 1}.

4 60 bytes/query × 128.5 avg AF/server × 1K servers × 8 bits/byte × 14,880 queries/sec (using 10 Mbps and a frame size of 84 bytes).
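Footnote 3's query set can be enumerated mechanically. A sketch of that enumeration is below; the record-type list is abbreviated with placeholder names, since the paper only states that 43 values were used.

```python
from itertools import product

# 43 record-type values; the first three are from the footnote, the rest are
# hypothetical stand-ins to reach the stated count of 43
record_types = ["A", "NS", "CNAME"] + [f"type{i}" for i in range(40)]
edns_values = [0, 1]
rd_values = [0, 1]

# Cartesian product over the three fields: 43 * 2 * 2 = 172 queries
queries = [
    {"record_type": rt, "EDNS": e, "rd": rd}
    for rt, e, rd in product(record_types, edns_values, rd_values)
]
print(len(queries))  # 172
```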


Strategies                        | % Error in Risk (U1) | # of Missed Patterns (U2)
----------------------------------|----------------------|--------------------------
S1: Scaling by number of servers  | 4.5× ↓               | N/A
S2: Using known patterns          | 5.7× ↓               | 90 (out of 94)
S2: Top-K from random samples     | 20× ↓                | 86 (out of 94)
S2: Top-K from ground-truth data  | 3.6× ↓               | 84 (out of 94)

Table 2: Effectiveness of S1 and S2 in enabling use cases (↓ = underestimate)

Table 2 summarizes these metrics for S1 and S2. For S1, which multiplies the number of servers by a known AF, we use an amplification factor of 28, as reported earlier [1]. For S2, we considered three possible instantiations: (1) using known query patterns from prior work (EDNS:0 and record type set to ANY or TXT [1, 62]); (2) using the top-10 queries across servers w.r.t. AF values after randomly sampling 20% of the possible values of the three-field space; and (3) using the global top-10 patterns from the ground-truth data. Note that (2) and (3) are extremely generous: in practice, we do not know the global top-10 a priori, and the actual space of queries is much larger than just 172 queries. We see that S1, scaling the server count, underestimates the risk by 4.5×; depending on the scaling factor, the risk may instead be significantly overestimated. S2 also underestimates the risk (U1), and it misses many query patterns (U2).

We also observe that this aggregate estimation error across 1K servers translates into large percentages of residual risk for each server had we used S2. Considering the cumulative distribution function (CDF) of the per-server residual risk, 50% of the servers would retain: (1) ≥ 68% residual risk if we had blocked the top-10 patterns from the ground truth (which is infeasible in practice); (2) ≥ 72% residual risk if we had blocked only the known patterns; and (3) ≥ 82% residual risk if we had blocked the top-10 patterns found by randomly sampling the header space. The trend does not really improve with other values of K (e.g., 20).

Finally, Table 3 shows the ineffectiveness of S3 for the top-5 versions (ranked by the number of servers that have at least one query inducing AF ≥ δ in the dataset). Here, we define servers as having an identical software setup if they share the same vendor and major version.

Each cell shows the % error in risk estimation for U1 (↓ = underestimate) and, in parentheses, (# of missed patterns / # of total patterns) for U2.

                          | Microsoft 6.1   | Dnsmasq 2.52   | Dnsmasq 2.40    | Dnsmasq 2.76    | Bind 9.9
--------------------------|-----------------|----------------|-----------------|-----------------|---------------
Using known patterns      | 14.4× ↓ (76/80) | 2.7× ↓ (27/31) | 6× ↓ (38/42)    | 3.8× ↓ (44/48)  | 8.8× ↓ (72/76)
Top-K from random samples | 8.7× ↓ (70/80)  | 3.6× ↓ (27/31) | 44.2× ↓ (41/42) | 31.6× ↓ (45/48) | 7× ↓ (66/76)
Top-K from ground truth   | 4.5× ↓ (70/80)  | 1.2× ↓ (21/31) | 3.8× ↓ (31/42)  | 1.7× ↓ (38/48)  | 6× ↓ (66/76)

Table 3: Effectiveness of S3, which does per-version analysis

To understand why these strategies are inaccurate, we analyzed this data further. To explain our analysis, we define some terms. Given a server si, let Qi be the set of queries that incur AF ≥ δ; Qi is the set of queries that elicit large responses. Given n servers, let Q be the union of Q1 ··· Qn; Q is the union of all amplification-inducing queries.
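These definitions translate directly into set computations. A sketch with illustrative (made-up) AF measurements:

```python
DELTA = 10  # AF threshold for "amplification-inducing"

# af[server][query] = measured AF (illustrative values, not the paper's data)
af = {
    "s1": {"q1": 50, "q2": 3, "q3": 12},
    "s2": {"q1": 2, "q2": 40, "q3": 15},
}

# Q_i: queries on server s_i with AF >= delta (elicit large responses)
Q = {s: {q for q, a in qs.items() if a >= DELTA} for s, qs in af.items()}

# Q: union of all amplification-inducing queries across servers
Q_union = set().union(*Q.values())
print(Q["s1"], Q_union)
```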

[Figure 2: Diversity of AF for a given query across servers (distribution of the amplification factor for the 10 highest-AF query patterns, by query pattern ID).]

Variability in magnitude across servers: Figure 2 shows the distribution of the AF value across servers. (Due to space, we show this only for the 10 queries that induce the highest AF across our dataset.) For a given q, the standard deviation ranges from 3.9 to 17. Looking beyond the global top-10 queries, if we consider the maximum AF for each server (across all 172 queries), there is significant variability, with a standard deviation of 16.7. This trend also holds for servers sharing the same software versions (not shown).

Variability in query patterns across servers: If only a small subset of patterns induced amplification on all servers (i.e., if the Qi were identical), then S2 and S3 would have been sufficient. To this end, we analyze the similarity (or lack thereof) of query patterns across servers in two ways. Let TopK(Qi) denote the set of top-K queries when Qi is sorted by the AF value. Then, we ask: (1) How similar are the high-amplification query patterns of every pair of servers (i.e., TopK(Qi) vs. TopK(Qj))? (2) How similar is a server-specific query pattern set, TopK(Qi), to the global set, TopK(Q)? We compare the top-K queries with K = 10. Note that we do not just look at the maximum query (K = 1), as we want to consider multiple patterns; the same trend holds for other values of K, such as 5 and 20 (not shown).

Looking at the histogram of similarity scores when K is 10, more than 60% of server pairs have low similarity scores (≤ 0.2), and only 4% of server pairs have similarity scores above 0.8. This trend is also similar for servers with identical software (Figure 3). For example, more than 45% of Microsoft 6.1 server pairs have similarity scores ≤ 0.1. For question (2), we find that for more than 70% of servers, TopK(Qi) has similarity ≤ 0.2 w.r.t. the global TopK(Q).
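The pairwise comparison can be sketched as follows; the per-server AF data is illustrative, and K = 2 is used here for brevity (the paper uses K = 10).

```python
def top_k(af_by_query, k):
    """Top-K query set for one server, ranked by AF (descending)."""
    ranked = sorted(af_by_query, key=af_by_query.get, reverse=True)
    return set(ranked[:k])

def jaccard(a, b):
    """Jaccard similarity of two query-pattern sets."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Illustrative per-server AF measurements (query pattern -> AF)
s1 = {"ANY": 50, "TXT": 30, "A": 5}
s2 = {"SRV": 45, "TXT": 40, "A": 4}

t1, t2 = top_k(s1, 2), top_k(s2, 2)
print(jaccard(t1, t2))  # |{TXT}| / |{ANY, TXT, SRV}| = 1/3
```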

Taken together, these results suggest that we cannot assume homogeneous risk per pattern or across servers. Furthermore, we cannot simply extrapolate the risk from one server instance (or one instance per software version) for our use cases. Given this empirical variability across servers, query patterns, and AF values, we argue that we need an active measurement framework to quantify the risk of, and inform defenses against, amplification attacks.

3 AmpMap Problem Overview

Having made the case for a measurement service, we formulate the goals for such a service, which we call AmpMap. We then discuss the challenges in realizing such a service.

Formulation: We consider S servers implementing a proto-


[Figure 3: Histograms of the Jaccard similarity scores between the Top-10 query patterns of pairwise servers, for (a) Microsoft 6.1, (b) Dnsmasq 2.52, (c) Dnsmasq 2.40, (d) Dnsmasq 2.76, and (e) Bind 9.9. Each histogram bins the pairwise similarity scores from [0.0, 0.1) through [0.9, 1.0] against the % of pairwise servers.]

col, Proto. For each server s ∈ S, our goal is to uncover as many distinct amplification-inducing query patterns as possible (say, with AF ≥ δ = 10) while keeping our network footprint low. These per-server patterns output by AmpMap can inform our use cases, such as assessing risk and informing defenses. Intuitively, each pattern is a template for describing protocol queries: in a given pattern, each field takes either (1) a single value or (2) a contiguous range. Queries in the same pattern trigger similar protocol behavior and, hence, have similar AFs (formal definitions appear in our extended technical report [?]).

We obtain the list of open servers implementing a given protocol from public services (Shodan [20], Censys [34]). We prune out inactive protocol servers and servers owned by the military or government. Each protocol is defined by a set of fields (F = {f1 ··· fn}) and a set of accepted values for each field (AV(f1) ··· AV(fn)). We obtain the protocol format from protocol specifications (e.g., RFCs). For instance, DNS defines fields such as DNSSEC and id, and their accepted values (e.g., DNSSEC takes a value from {0, 1}). A valid query of Proto assigns each field a value (fi = vi ∈ AV(fi)) and returns a response. To avoid malformed queries that may impact server operation, we only consider valid queries. We do not include derived fields (e.g., checksum, count-related fields). Some fields take a value from a set of strings (e.g., domain for DNS, OID for SNMP); for these, we sample values. For DNS domain fields, we take popular domains with different features (DNSSEC-enabled vs. not). We keep the set of values for these fields small (a few tens). For fields that take a list of values (e.g., the OID list for SNMP), we also specify the length of the list as an input (§4).

To keep our footprint and impact on servers low, we impose a total query budget for each server, Btotal (400–1500, §5). We also take additional precautions such as limiting the rate per server and avoiding malformed requests (§6.1).

Scope: We focus on stateless, unicast protocols (e.g., UDP-based) and stateless amplification strategies. Thus, stateful protocols (e.g., TCP-based [30, 49]) and broadcast or multicast protocols (e.g., [50]) are out of scope. Additionally, stateful attack strategies that seed entries on a server and subsequently launch a high-AF query are outside our scope; e.g., we do not consider an attacker who registers his own DNS domain with many records to amplify the attack.

Challenges: We now discuss three key challenges in achieving our goal. To illustrate these concretely, we consider a

Figure 4: Simplified protocol definition to highlight the challenges of uncovering amplification queries.
  Fields: F = {f1, f2, f3, f4, f5}. Accepted values for each field, AV(fi):
  1. f1 takes a value from 0 to 1; AV(f1) = [0, 1]
  2. f2 takes a value from 0 to 99; AV(f2) = [0, 99]
  3. f3 takes a value from 0 to 65535; AV(f3) = [0, 65535]
  4. f4 takes a value from 0 to 7; AV(f4) = [0, 7]
  5. f5 takes a value from 0 to 1; AV(f5) = [0, 1]

[Figure 5: Query space for one server, s1, shown as two AF heatmaps over f2 (0–99) and f3 (0–65535): AF heatmap 1 for f1 = 0 (containing high query patterns QP1–QP3) and AF heatmap 2 for f1 = 1 (containing QP4 and QP5). QPi refers to a query pattern; e.g., QP2 is v(f1) = 0, v(f2) = 48, v(f3) = [4000, 65535], v(f4) = 0, v(f5) = 1000.]

simplified protocol inspired by the structural properties of real protocols. The protocol, shown in Figure 4, consists of 5 fields with their accepted values. Figure 5 represents the structure of amplification-inducing query patterns for a single server, s1, varying two of these fields, f2 and f3, while fixing the other three fields' values: the left side is when f1 = 0, and the right side is when f1 = 1 (in both cases, f4 and f5 are 0 and 1000, respectively). Each “red” (darker) region in these heatmaps is a potential query pattern. Even this relatively simplified protocol highlights several key challenges.

We observe these challenges across the protocols we surveyed (especially for more complex protocols like DNS and NTP):

• C1: We observe a large query space of 2 × 100 × 65K × 8 × 2 > 200M values; i.e., it is infeasible to explore this space exhaustively.
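C1's count follows directly from the accepted-value ranges in Figure 4; a quick check:

```python
# Accepted-value ranges from Figure 4 (inclusive bounds)
av = {"f1": (0, 1), "f2": (0, 99), "f3": (0, 65535), "f4": (0, 7), "f5": (0, 1)}

space = 1
for lo, hi in av.values():
    space *= hi - lo + 1  # number of accepted values for this field

print(space)  # 2 * 100 * 65536 * 8 * 2 = 209715200, i.e., > 200M
```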

• C2: Even for a single server, the structure of amplification can be complex, as the fields in a query are interdependent and need to be set simultaneously. For instance, both f2 and f3 in QP2 (Figure 5) need to be set to 48 and [4K, 65535], respectively, to yield high AF. Intuitively, in real protocols such behavior occurs because certain flags need to be set to trigger the relevant behavior; for certain servers to yield large AF for DNS (§2.2), we need to set EDNS to 0 and rd to 1. Also, note that the relationship between the query and the AF does not necessarily have a nice continuous structure. Worse, our goal is to uncover as many patterns as possible in this complex, multi-field search space, making the problem even more challenging.

• C3: Servers exhibit a large degree of variability. As we saw in §2.2, the exact AF for a given query may differ across servers, and the set of query patterns may also differ. Figure 6 shows the structure for three servers (including s1) for the case when f1 is 0. In our simplified protocol, queries in QP1 for s1 also incur high AF for s2 (i.e., QP1′) but not for s3. Due to its server configuration and the view of data it has (e.g., the number of peers for an NTP server), s3 does not have any query patterns that cause high AF.

[Figure 6: Query space (f2 vs. f3) across multiple servers, showing only the case when f1 = 0. Server 1 has patterns QP1–QP3 (its heatmap is heatmap 1 in Figure 5), server 2 has QP1′ and QP2′, and server 3 has none.]

4 AmpMap Overview and Design

In this section, we discuss our key insights regarding the structural properties of amplification common to many protocols, which enable our practical design. We start with the single-server case (§4.1) and use it to build a multi-server solution (§4.2).

4.1 Single-Server Algorithm

Before we explain our insights, let us consider two seemingly natural baselines and see why they are not practical. (We empirically confirm this in §5.)

1. Random fuzzing: We can randomly pick a value for each field to construct a query. Unfortunately, achieving coverage across distinct patterns would be prohibitively expensive. For instance, if there are 10 patterns and the density of each pattern relative to the total query space is 0.1 (ε), we need at least 29K queries to discover all patterns. We present the analysis in §A.

2. Heuristic optimization techniques: Existing heuristic
optimization techniques (e.g., Simulated Annealing) may
find only a few patterns. These are ill-suited for achieving
coverage, as they get stuck in local optima.
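To make the random-fuzzing cost concrete, the coupon-collector bound can be computed directly. This is a sketch assuming k disjoint patterns, each covering a fraction ε of the query space; k = 10 and ε = 10⁻⁴ are illustrative values consistent with the ≈29K estimate above:

```python
def expected_queries(k, eps):
    """Expected number of uniform random queries needed to hit
    all k disjoint query patterns, each covering a fraction eps
    of the query space (coupon-collector argument)."""
    harmonic_k = sum(1.0 / i for i in range(1, k + 1))
    return harmonic_k / eps

# 10 patterns at density 1e-4 each -> roughly 29K queries.
print(round(expected_queries(10, 1e-4)))
```

The harmonic factor grows only logarithmically in k; the cost is dominated by 1/ε, which is why low-density patterns make blind fuzzing impractical.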

4.1.1 Single-Server Insights

Next, we present our insights to make the problem tractable.

At a high level, these insights were derived from a combi-

nation of simple analysis, local server experiments, and the

measurements we saw in §2.2.

Insight 1 (I1): Amplification-inducing query patterns

exhibit locality and overlap in their field values.

Intuitively, we observe that query patterns often share a sub-

set of specific field values. This structural property suggests

that given a query, q, in one of the amplification-inducing

query patterns, we may not need to change all N fields at

a time. Instead, we can discover other nearby patterns by

sweeping one field at a time. Conceptually, we can view the

query space as a logical graph and look for “neighboring”

queries that differ in the value of just one field to discover

other patterns. Figure 7 shows a logical graph representation

of the query space for the abstract protocol (Figure 5). In

this graph, each node is a query and an edge between two

queries, q and q′, indicates that they differ in only one field

value (e.g., f2). For instance, from a query in QP1, a simple

per-field search approach, as described above, can discover

queries in QP2 and QP3 by changing f2. To discover QP5,

we need to search f1 from a query in QP3.

[Figure 7 depicts queries as nodes, e.g., 〈f1:0, f2:19, f3:4K, · · ·〉 ∈ QP1, 〈f1:0, f2:48, f3:4K, · · ·〉 ∈ QP2, 〈f1:0, f2:99, f3:4K, · · ·〉 ∈ QP3, 〈f1:1, f2:33, f3:4K, · · ·〉 ∈ QP4, and 〈f1:1, f2:99, f3:4K, · · ·〉 ∈ QP5; an edge labeled fi connects two queries that differ only in the value of fi.]

Figure 7: Viewing the query space as a logical graph (for

the abstract protocol shown in Figure 5)

Insight 2 (I2): If the density of amplification-inducing

queries is > ε, then random sampling will likely find one

such query using an expected 1/ε queries.

This is a very simple probabilistic analysis insight. If the

overall density of the queries that give high AF is ε, then

the probability of picking one such query is ε. Then, the

expected budget to find one such query is 1/ε. For instance, if
the probability of picking an amplification-inducing query is
1/1000, then we need an expected budget of 1000 samples. This

analysis suggests a viable path to find at least one query in

one of the amplification-inducing query patterns, which can

subsequently be used to exploit the above locality structure.
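This expectation is just the mean of a geometric distribution; a quick empirical check (the density 1/1000 is the example from the text; the trial count and seed are arbitrary):

```python
import random

def mean_queries_until_hit(eps, trials=5000, seed=1):
    """Empirical mean of the number of independent random queries
    until the first amplification-inducing one, when each query
    hits with probability eps (a geometric random variable)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        n = 1
        while rng.random() >= eps:
            n += 1
        total += n
    return total / trials

# With eps = 1/1000, the empirical mean is close to 1/eps = 1000.
print(mean_queries_until_hit(1e-3))
```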

Insight 3 (I3): Fields with large accepted value ranges

either do not affect amplification or exhibit contiguous

range structure w.r.t. AF.

Even if we use I1 and only need to vary one field value at a

time, we still may require a high query budget as some fields

take a very large set of accepted values. Fortunately, many of

the large-range fields tend not to affect amplification. If they


[Figure 8 depicts the single-server workflow (Random Sampling, per Insight 2, producing {QtoAF} for the Brandom samples, which feeds Qstart and AFthresh into the Per-Field Search, per Insights 1 and 3) and the multi-server workflow, which adds a Probing Stage (Insight 4) that takes high-AF queries found on one server and probes servers 1..N before each server's Per-Field Search.]

Figure 8: AmpMap Workflow

do, we observe that there is a large contiguous range (e.g., f3

with [4K, 65535]) that exhibits similar behavior. For instance,

as long as the EDNS payload is set to a large value (i.e.,

4096), an EDNS feature will allow large responses. Thus,

instead of exhaustive sweeping, we can sample values for

large fields. Specifically, we use a logarithmically-spaced

sampling strategy to get at least one query from a contiguous

range if the ranges are sufficiently large.
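A minimal sketch of such log-spaced sampling (the per-decade granularity is an illustrative choice, not a parameter from the text):

```python
def log_sample(lo, hi, per_decade=4):
    """Logarithmically-spaced sample points over [lo, hi]: dense near
    lo, sparse near hi, so that any sufficiently wide contiguous
    high-AF range contains at least one sample point."""
    samples = {lo, hi}
    v = max(lo, 1)
    step = 10 ** (1.0 / per_decade)   # per_decade samples per decade
    while v <= hi:
        samples.add(int(v))
        v *= step
    return sorted(samples)

# A couple dozen probes suffice to cover the full 16-bit range of f3.
print(log_sample(0, 65535))
```

For a wide range like f3 ∈ [4K, 65535], several of these probes fall inside it, so the range is discovered with far fewer queries than an exhaustive sweep.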

Algorithm 1: AmpMap algorithm for a single server

Input: B: query budget; AV(fi) for i = 1, .., n: accepted values for
each packet header field
Output: QtoAF: maps each query to its corresponding AF

/* Step 1: Random Search */
1 QtoAF = RUNRANDOMUPDATEMAP(Brand)
2 Qstart = FINDTOPKQUERIES(QtoAF, K = 1)
3 AFthresh = COMPUTETHRESH(QtoAF)
/* Step 2: Local Search */
4 LOCALSEARCHUPDATEMAP(QtoAF, Qstart, AFthresh)

4.1.2 Single-Server Workflow

Putting the above insights together, we present our workflow

for a single server (left side of Figure 8 and pseudo code in

Algo. 1). Recall that we want to maximize coverage of distinct

query patterns given a fixed query budget, Btotal . Note that

in choosing a value for Btotal , we want to strike a balance

between coverage and network load. Our goal is not to find

optimal parameters, but to use reasonable ranges that work

well in practice. We empirically find that 1200-1500 is a

good operating range for relatively complex protocols like

DNS, as we see diminishing returns beyond this (Figure 18

in §5.7). For simple protocols (with a smaller search space),

this property still holds.

RandomSample Stage: Given a fixed Btotal , the algorithm

randomly samples Brand queries to discover an amplification-

inducing query (I2). The discovered queries are the starting

points to run the next phase, per-field search, to improve cover-

age. In choosing Brand, we empirically observe that choosing
10% to 45% of the total budget is sufficient (Figure 19a in
§5.7). Recall that to leverage the locality (I1), we just need to
find one (or a handful of) queries that induce amplification. As
we will see later, we use the multi-server workflow to make this
more robust to potential misestimation of the Brand needed
for a server, i.e., even when the RandomSample Stage fails to
find a feasible starting point (§4.2.2).

Per-field search: We then run the Per-field search (Algo. 2)

leveraging I1. It takes as input QtoAF, which maps each
query from the RandomSample Stage to its AF. We also need

to determine other relevant input parameters.

• Starting queries for the per-field search (Qstart): We pick

top K queries w.r.t. the AF values. Given the locality

structure, we find choosing one starting query is sufficient.

• The threshold to prune low AF queries (AFthresh): If

neighboring queries have AF below AFthresh, the per-field

search prunes them from further exploration. If the value

is too low, the search will degenerate into an exhaustive

search. If too high, the search terminates without explo-

ration. As a practical trade-off, if the maximum AF is

above 2δ, we set the threshold to δ (i.e., 10). If it is

below 2δ, we use a threshold equal to some fraction of the

maximum AF observed in the random stage (e.g., half).
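The AFthresh rule above can be written down directly; a sketch with δ = 10 and one-half as the fraction, both as given in the text:

```python
DELTA = 10  # the base amplification threshold delta from the text

def compute_thresh(max_af_random, delta=DELTA, fraction=0.5):
    """AFthresh rule: if the RandomSample Stage already saw an AF
    above 2*delta, prune at delta itself; otherwise prune at a
    fraction (here, half) of the best AF observed so far."""
    if max_af_random > 2 * delta:
        return delta
    return fraction * max_af_random

print(compute_thresh(100), compute_thresh(12))
```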

Using each query from Qstart, the per-field search searches

the neighboring queries by varying one field value

(SEARCHNEIGHBOR(...) referenced in Line 7; defined in

Line 13 of Algo. 2). It uses a log-sampling for large fields

and exhaustive search for other fields. Further, for fields that

take a set of strings as an input (e.g., domains for DNS), we

recommend providing a small accepted set (i.e., a few

tens). This is a conscious decision as such fields tend not to

have a “contiguous” structure w.r.t. the AF, and each concrete

value has a distinct semantic. Hence, we need to treat these

fields as small fields (where we do an exhaustive search). For

fields that take a list as an input (e.g., SNMP takes a list con-

sisting of object identifiers or OIDs), we search over both the

item (OID) and the size of the list. For this field type, it is

worthwhile to see how the AF changes when this list size is

large. Hence, we recommend putting a non-small value (i.e.,

≥ 256) to log sample the values.

Avoiding already-visited patterns: We have one more prac-

tical challenge as each query pattern consists of tens of thou-

sands of queries. Some fields take ranges (e.g., f3=[4000,

65535] in a pattern). If we naively explore, we may redun-

dantly explore other queries in the same query pattern, wast-

ing our query budget. To avoid this, we heuristically detect

if we have already explored a pattern to decide if we can

skip exploring this further. To do so, we infer the contigu-

ous range of a field that incur above-the-threshold AF as we

sweep each field (INFERRANGE(...), defined in Line 24 of

Algo. 2). When we need to explore a query, q’, we first check

whether q′ has already been visited (ISNEWPATTERN(· · ·), referenced in Line 5) and explore it only if it has not. We refine

the inferred pattern structure during the per-field search as


Algorithm 2: Per-Field Search and Helper Functions

1  Function PerFieldSearch(QtoAF, Qstart, AFthresh):
2      Qexplore = {Qstart}; PatternsFound = {}
3      while Qexplore is not empty do
4          q ← Extract from Qexplore
5          if ISNEWPATTERN(q.pattern, PatternsFound) then
               /* Search neighbors for a new pattern */
6              PatternsFound.insert(q.pattern)
7              tmpQtoAF = SEARCHNEIGHBOR(q, AFthresh)
8              QtoAF.insert(tmpQtoAF)
9              Qexplore = Qexplore ∪ tmpQtoAF.keys()
10         else
               /* if not new, skip exploration */
11             MERGEQUERIES(q.pattern, PatternsFound)
12     return QtoAF

13 Function SearchNeighbor(q, AFthresh):
14     NeighborQToAF = {}
15     foreach protocol field fi do
16         Qi = {q[fi ← vi], for vi ∈ Valuesi}
17         QtoAFi = SENDQUERY(q ∈ Qi)
           /* Merge queries into contiguous ranges with high AF */
18         HighRanges = INFERRANGE(q, Valuesi, QtoAFi, AFthresh)
           /* Find a representative sample from each range */
19         for 〈vl, vr〉 ∈ HighRanges do
20             patternid = q.pattern[fi ← (vl, vr)]
21             qn = q[fi ← rand([vl, vr])]
22             NeighborQToAF.append(qn → AFn)
23     return NeighborQToAF

24 Function InferRange(q, Valuesi, QtoAFi, AFthresh):
25     IsCurRangeActive = False; HighRanges = {}
26     CurStart = CurEnd = NULL
27     for v ∈ Valuesi sorted in ascending order do
28         if IsCurRangeActive then
29             if QtoAFi[v] ≥ AFthresh then
30                 CurEnd = v
31             else
32                 IsCurRangeActive = False
33                 HighRanges.append(〈CurStart, CurEnd〉)
34         else
               /* we encounter a new high range */
35             if QtoAFi[v] ≥ AFthresh then
36                 IsCurRangeActive = True; CurStart = CurEnd = v
       /* If still active, include the last entry */
37     if IsCurRangeActive then
38         HighRanges.append(〈CurStart, v〉)

we get a new range that contains the old range. The search

terminates if the budget is exhausted or there are no more

queries to explore.

Let us look at a concrete example using the abstract pro-

tocol presented in §3. Suppose we are currently exploring

a query q, 〈 f1:0, f2:48, f3:6000 · · · 〉, from a QP 2. When it

is a turn to explore f3, we log sample f3 to obtain the AFs

and find that [5K, 65535] has contiguously “high” AFs. Then,

we use this range to describe the pattern (i.e., 〈 f1:0, f2:48,

f3:[5K, 65535] · · · 〉). We first check whether this is contained

in already-visited patterns and only explore if not already

visited. We present the analysis for a single server in §A.
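The INFERRANGE step in this example boils down to a linear scan that merges consecutive above-threshold samples; a minimal Python sketch (the sampled values and AFs are illustrative):

```python
def infer_range(values, q_to_af, af_thresh):
    """Linear scan that merges consecutive sampled values whose AF is
    >= af_thresh into contiguous (start, end) ranges, mirroring
    INFERRANGE in Algo. 2."""
    high_ranges = []
    cur_start = cur_end = None
    for v in sorted(values):
        if q_to_af[v] >= af_thresh:
            if cur_start is None:      # a new high range begins
                cur_start = v
            cur_end = v
        elif cur_start is not None:    # the current high range just ended
            high_ranges.append((cur_start, cur_end))
            cur_start = cur_end = None
    if cur_start is not None:          # include a range still open at the end
        high_ranges.append((cur_start, cur_end))
    return high_ranges

# Toy f3 sweep where AF is high only from 5000 upward.
afs = {0: 1, 1000: 2, 5000: 40, 20000: 55, 65535: 60}
print(infer_range(afs.keys(), afs, af_thresh=10))
```

On this toy sweep the scan yields the single range (5000, 65535), which then describes the pattern's f3 component.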

4.2 Multi-Server Algorithm

We now discuss how we extend the insights and workflow

from a single-server case to handle the multi-server case.

4.2.1 Multi-Server Insights

Insight 4 (I4): While servers exhibit variability, some share

a subset of amplification-inducing queries.

Recall the abstract protocol on multiple servers in Figure 6.

In that example, the queries in QP1 for s1 also incur high

amplification for s2 but not for s3. While these servers are not

identical in all query patterns that induce amplification, some

of these servers can share a subset of query patterns (even if

the specific AF values may differ). We have also observed this

in our small-scale experiment in §2. Specifically, while the

similarity of query patterns between a pair of servers is low,

it is not always 0. This is natural as these servers implement

the same protocol. This property allows us to further reduce

overhead by sharing insights across servers. That is, we can

use already-found amplification-inducing queries (from the

RandomSample Stage) and probe other servers using these

queries. This probing increases the probability of having a

good starting point to run the per-field search for each server.

Note that our workflow still accounts for server heterogeneity

(while sharing insights across servers) as we still run the

per-field search for each server.

4.2.2 Multi-Server Workflow

We start with the RandomSample Stage per server as in the

single-server case. The key addition is a new stage called

the Probing Stage (Figure 8), which ensures that the in-

sights are shared across servers. Specifically, using the high-

amplification queries found for each server from the Random-

Sample Stage, we test them on other servers to increase the

chance of finding good starting queries for each server.

Probing Stage: Turning this idea into practice, we take all

queries that give high AFs across servers from the Random-

Sample Stage. Then, we pick a small number of queries to

probe other servers (say Bprobe queries). A relevant question is

how many queries to use for Bprobe. We observe that anywhere
between 5% and 30% of the total budget is sufficient; we chose
10% (validation in §5.7). We do not want to assign too large
a value, to ensure a sufficient budget remains available


Algorithm 3: AmpMap algorithm for multiple servers

Input: Btotal: query budget;
AV(fi) for i = 1, .., n: accepted values for each packet field;
S: a set of servers
Output: PerServerQToAF: maps each query to its corresponding AF, per server

1  PerServerQToAF = {}
   /* Step 1: Random Search */
2  for s ∈ S do
3      RUNRANDOMUPDATEMAP(Brand, PerServerQToAF[s])
   /* Step 2: Pick probes based on current observations */
4  Qprobe = PICKPROBES(PerServerQToAF, Bprobe)
   /* Run additional probes per server */
5  for s ∈ S do
6      ProbeQToAFs = SENDQUERY(Qprobe)
7      PerServerQToAF[s].insert(ProbeQToAFs)
   /* Step 3: Per-field search for each server */
8  for s ∈ S do
9      Qstarts = FINDTOPKQUERIES(PerServerQToAF[s], K)
10     AFthresh = COMPUTETHRESH(PerServerQToAF[s])
11     PERFIELDSEARCH(PerServerQToAF[s], Qstarts, AFthresh)
12 return PerServerQToAF

for other (critical) stages. Specifically, the Probing Stage is

designed to supplement the RandomSample Stage for specific
servers where the RandomSample Stage could not

discover amplification-inducing queries. The next relevant

question is how to pick these probing queries. Consider a

strategy where we pick the top-X queries w.r.t. the AF. This

strategy may “overfit” to a specific query pattern or certain

servers with many AF-inducing queries. We want to use a

diverse set of probing queries. To this end, we take all queries

with AF above the threshold, δ, and then run a simple K-

means clustering where we conservatively set the number of

clusters, K (e.g., 20).5 To achieve diversity of patterns, we

sample queries such that we have at least one query from

each cluster, and for the remaining ones, we uniformly sam-

ple queries proportional to the cluster size. Here, the key for

boosting the coverage is the fact that we use probing queries

(Figure 19b in §5.7); the number of clusters is less critical.
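The diversity-preserving sampling can be sketched as follows, assuming the K-means cluster assignments have already been computed (the cluster contents, budget, and seed here are illustrative):

```python
import random

def pick_probes(clusters, budget, seed=0):
    """Select `budget` probe queries from clustered high-AF queries:
    first one representative per cluster (for pattern diversity),
    then fill the remainder uniformly from the leftover pool, which
    naturally weights clusters proportionally to their size."""
    rng = random.Random(seed)
    probes = [rng.choice(c) for c in clusters if c]
    pool = [q for c in clusters for q in c if q not in probes]
    extra = max(0, budget - len(probes))
    probes += rng.sample(pool, min(extra, len(pool)))
    return probes[:budget]

clusters = [["q1", "q2", "q3"], ["q4"], ["q5", "q6"]]
print(pick_probes(clusters, budget=4))
```

Because every non-empty cluster contributes at least one probe, a single pattern with many high-AF queries cannot monopolize the probing budget.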

The rest of the algorithm mirrors the single-server approach

to pick starting points and run the per-field search. However,

the input parameters (i.e., Qstart, AFthresh) are server-specific

to account for server diversity. The only difference is that the

top-K starting points are based on the original set of random

queries and the new additional Bprobe queries. Note that for

fields that take a set of strings (e.g., domain for DNS), we

do not split the query budget across different field values

(e.g., different domains). However, given that the per-field

search does not favor queries with higher AF (as long as

AF ≥ AFthresh), our algorithm does not bias one particular

field value (e.g., a particular domain) over another. Further, as

we will see in §5.3, we combine the queries across all servers

to infer patterns. Combining data allows us to infer patterns

despite having a small per-server budget (e.g., 1500).

5 To run K-means clustering, we define a custom distance function: we
normalize the N fields and then bin the large fields.

5 Evaluation

In this section, we present findings from our Internet mea-

surements for 6 UDP-based protocols (DNS, NTP, SNMP,

Memcached, Chargen, SSDP) and local testing for 3 protocols

(QOTD, Quake, RPCbind). In contrast to a scoped experi-

ment in §2.2, the results here cover more protocols, servers

and search over the packet header space (as opposed to sending

a fixed set of queries). We also validate our design against

strawman solutions and parameter choices.

Protocol  | # IPs Scanned (a) | # Pruned IPs (b): Invalid Proto | Gov't/Mil. | # IPs Taken (c) = (a)+(b) | # IPs in DB (d) | % IPs Scanned (c)/(d)
DNS       | 10K   | 18,698  | 15 | 28,713  | 8.02M | 0.36
NTP OR    | 10K   | 4,317   | 5  | 14,322  | 8.4M  | 0.17
NTP AND   | 3,083 | 234,374 | 7  | 237,464 | 8.4M  | 0.28
SNMP OR   | 10K   | 4,933   | 3  | 14,936  | 2.16M | 0.69
SNMP AND  | 10K   | 60,187  | 9  | 70,196  | 2.16M | 0.33
Memcached | 10K   | 11,736  | 9  | 21,745  | 63K   | 3.5
Chargen   | 10K   | 68,065  | 6  | 78,071  | 83K   | 9.4
SSDP      | 10K   | 78,617  | 3  | 88,620  | 2.16M | 3.3

Table 4: Statistics on (a) the # of IPs we scanned per protocol, (b) the
# of pruned IPs, (c) the # of raw IPs we needed from the DB, (d) the # of
total public-facing IPs (Shodan and Censys), and the % of IPs we
scanned, (c)/(d)

Measurement setup: We use nodes from CloudLab [33],

where 1 node is used as a controller, and 30 as measurers.6 For

these 6 protocols, we scanned 10K sampled servers for each

protocol: DNS with OPT records for EDNS, NTP, SNMP,

Memcached, Chargen, SSDP. For DNS, we scan the servers

obtained from Censys and, hence, these are mostly open re-

solvers.7 As the protocol formats for SNMP’s Get, GetNext,

and GetBulk requests differ, we treated each as a separate

protocol and ran separately. Similarly, we ran separate runs

for NTP’s mode 7 (private), mode 6 (control), and mode 0-5

(normal). We obtained public server IPs from Censys [34]

and Shodan [20]. We randomly sampled IPs from these lists

and pruned out inactive servers (e.g., those that do not respond

to dig for DNS) or those owned by the military or government. For

certain protocols (SNMP, NTP) that have different modes

of operation with distinct formats, we consider two notions

of active server, whether the server responds to (1) “any” of

the modes (OR filter); or (2) “all” of them (AND filter). We

present results for both schemes, using AND/OR superscripts

to denote each (e.g., SNMP AND).

To finish our measurements in a few days and restrict the

number of (shared) nodes we use, we target 10K servers per

protocol.8 Table 4 shows: (1) the number of IPs we needed

from Shodan and Censys to get our final server lists,9 (2) the

6 We restricted our node usage to 31 per experiment, as CloudLab is a
shared platform across institutions.
7 We can easily extend AmpMap to handle authoritative servers.
8 We could not obtain 10K servers for NTP AND.
9 For DNS, we posit that many are inactive because the Censys DB was
from Jan 2020 while the measurements were conducted in May 2020.

Page 11: Accurately Measuring Global Risk of Amplification Attacks using … · 2020. 10. 13. · Accurately Measuring Global Risk of Amplification Attacks using AmpMap . Soo-Jin Moon, Yucheng

total number of public-facing IPs for each protocol (as of May

30, 2020) from Censys (for DNS) and Shodan (for others);

and (3) the % of IPs we scanned from the Internet. When

we refer to servers to present our results, we are referring to

sampled servers rather than the entire Internet servers.

In our experiments, each server is pinned to a measurer. We

do not spoof IP addresses, and we send legitimate queries and

listen to responses. We impose a limit of 1 query per 5 s for

each server with a timeout of 2 seconds (i.e., 7 seconds per

query). This rate gives approximately 3 days to complete for

10K servers as 30 measurers can handle 500 servers at a given

time.10 Our network load is low: 48 kbps (egress) across all

measurers and 1.6 kbps per measurer. If we assume an average

AF of 5, then we incur 240 kbps in ingress bandwidth.
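These load numbers follow from simple arithmetic. In this sketch, the average query size is an assumption back-derived to be consistent with the reported 48 kbps aggregate egress; the average AF of 5 is the text's assumption:

```python
MEASURERS = 30
SERVERS_AT_ONCE = 500      # 30 measurers handle ~500 servers at a time
SECS_PER_QUERY = 7         # 5 s rate limit + 2 s response timeout
QUERY_BYTES = 84           # assumed average query size (not stated in the text)
AVG_AF = 5                 # assumed average amplification factor

queries_per_sec = SERVERS_AT_ONCE / SECS_PER_QUERY     # ~71 queries/s overall
egress_kbps = queries_per_sec * QUERY_BYTES * 8 / 1000  # across all measurers
ingress_kbps = egress_kbps * AVG_AF                     # responses are AF x larger

print(round(egress_kbps), round(ingress_kbps), round(egress_kbps / MEASURERS, 1))
```

With these assumptions the arithmetic reproduces the reported 48 kbps egress (1.6 kbps per measurer) and 240 kbps ingress.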

Protocol specifics: For protocols with more than 10 fields

(DNS, NTP, RPCbind), we used a query budget of 1500

queries per server, setting 45% for RandomSample Stage and

10% for the Probing Stage. For simpler protocols, we used a

budget of 400 queries with the same budget split. For QOTD,

Quake, RPCbind, we set up a single CloudLab server running

the protocol. Some fields, such as domain fields for DNS,

took strings. As discussed in §4.1.2, we picked 10 popular

domains11 spanning different industry sectors, and enabled

features (e.g., DNSSEC supported vs. not). For SNMP, we

pick v2’s OIDs based on the RFC up to depth 4 (i.e., A.B.C.D).

For fields that take as input a list of values (e.g., an OID for

SNMP), we also search over the list’s length.

5.1 Protocol and server diversity

[Figure 9: boxplots of the per-server maximum AF (log-scale y-axis) for DNS, NTP OR, NTP AND, SNMP OR, SNMP AND, Chargen, SSDP, and Memcached, with annotated medians 10.44, 1.0, 5.11, 13.11, 32.49, 204.46, 4.08, and 1.68, respectively.]

Figure 9: Boxplot showing the distribution of the maxi-

mum AF achieved by each server given a protocol

Finding 1: There is significant variability in the maximum

amplification a server can yield across servers.

Figure 9, where the y-axis is on a log scale, shows the distribution

of the maximum AF achieved by each server for each protocol.

(For SNMP and NTP, we combine the results across different

modes.) For many protocols, we observe a long tail in the

distribution. For instance, while the median for SNMP OR is

13.01 AF, the maximum is 495. While the median is 1 AF

for NTP OR, the maximum is 860. For NTP AND, while the

median is 5.11 AF, the maximum is as large as 1300! This

10 Each run takes 3 hours (7 s × 1500 queries), and we need 69 hours to
handle 10K servers (not accounting for timeouts).
11 berkeley.edu, energy.gov, chase.com, aetna.com, google.com,
Nairaland.com, Alibaba.com, Cambridge.org, Alarabiya.net, Bnamericas.com

high variability confirms we cannot simply count the number

of open servers or attribute the same risk to each server.

[Figure 10: stacked bars per protocol (DNS, NTP AND, SNMP AND, SSDP, Chargen, Memcached) showing the % of servers whose maximum AF falls in < 10, [10,30), [30,50), [50,100), and ≥ 100, for (a) May-June 2020 and (b) May-June 2019.]

Figure 10: Summary across servers and protocols (from

2019 and 2020 runs)

Finding 2: There is substantial variability in the maximum

AF distribution across protocols.

Figure 10a shows the maximum AF distributions with vary-

ing AF ranges (e.g., 10-30) across protocols; these experi-

ments ran in May–June 2020. For SNMP and NTP, we only

show the results for AND schemes for brevity. First, protocols

vary in the percentage of potential amplifiers with AF≥ 10:

52% for DNS, 34% for NTP AND, 69% for SNMP AND, · · ·, and
0.6% for Memcached. Further, protocols differ in the most
common AF ranges (≥ 10) that servers can yield. The AF range
for DNS is concentrated in 10 to 30, but it is above 100 for Char-
gen. For NTP AND, 14% of servers give above 100 AF. These

results suggest that measuring the risk should take into ac-

count the AF distribution per protocol.

Finding 3: There is variability across time in the AF

distribution across servers for different protocols.

Figure 10b shows the maximum AF distribution from mea-

surements done in 2019, as opposed to 2020 for Figure 10a.

(Across two runs, there are minor differences in the AmpMap

parameters such as 53% budget for the RandomSample Stage

in 2019 vs. 45% in 2020, but they do not really affect the re-

sults.) These figures visually highlight the differences across

the two years. For instance, only 7% of NTP AND servers

yielded AF ≥ 100 in 2019 vs. 14% in 2020. The 90th percentile of

DNS servers induced above 30 AF in 2019 but above 59 AF

Page 12: Accurately Measuring Global Risk of Amplification Attacks using … · 2020. 10. 13. · Accurately Measuring Global Risk of Amplification Attacks using AmpMap . Soo-Jin Moon, Yucheng

in 2020 (almost doubled) using the identical domain lists. We

acknowledge that as we sample servers, we cannot attribute

the root cause of differences, i.e., the change in server list

vs. the actual attack landscape. However, such variability is
precisely what calls for continuous (periodic) measurements
rather than a one-time analysis.

5.2 Assessing amplification risks

Known PatternRisk Quantification

ResultsPrior Work AmpMap

DNSEDNS:0,ANY [1, 57] 287K 149K 1.9× ↑EDNS:0,ANY,TXT[57, 62]

Unknown 183K N/A

DNS

(domains w/oDNSSEC)

ANY,TXT [57, 62] Unknown 126K N/A

NTP OR monlist [2, 57] 5,569K 13K 427× ↑

NTP AND monlist [2, 57] 5,569K 635K12

8.8× ↑

SNMP OR GetBulk [3, 57] 64K 223K 3.5× ↓

SNMP AND GetBulk [3, 57] 64K 317K 5× ↓Chargen Request 3588K 1399K 2.9× ↑SSDP Search [3, 57] 308K 126K 2.7× ↑Memcached Stats [3, 17] 100M [3] 18K 5.6K × ↑

Table 5: Contrasting the risk extrapolated from prior

works and measured by AmpMap for 10K servers

Finding 4: Even for known patterns, extrapolations

(e.g., [32, 57]) mis-estimate amplification risk.

Table 5 summarizes the known patterns and their corre-

sponding risks assessed using AmpMap and prior works [1,

57] (same risk used in §2.2). For AmpMap, given a pattern

for each protocol (e.g., monlist for NTP), we calculate the

total risk across 10K servers using Eq. 1. We find that

the baseline techniques from prior work have significant mis-

estimation. For instance, these techniques overestimate NTP

by 427×, underestimate SNMP v2 by 3.5×, and overestimate

Chargen by 2.9×. The large inaccuracy of 427× overestima-

tion for NTP is because the previously reported AF of 556 [57]

does not generalize to most NTP servers. Our findings con-

firm a study of NTP amplification [32], which specifically

focuses on the monlist feature. Further, the underestimation

of 3.5× for SNMP is because the prior analysis (by assum-

ing a fixed query) does not account for polymorphic variants.

Specifically, we can achieve higher amplification using Get-

Bulk requests with varying OID fields and the number of OIDs

to request. While the previously reported average of the worst

10% servers for GetBulk requests (SNMP) is 11.3 AF [57],

the average of the worst 10% from our measurement dataset

is 90 for SNMP OR (7.9× larger than 11.3), and 97 AF for

SNMP AND (8.6× larger).

Finding 5: Prior recommendations (e.g., [32, 57]) miss

many query patterns and leave substantial residual risk.

We now quantify the risks from new patterns that will be

missed by prior analysis (Table 6). For DNS, there are other

combinations of EDNS and record type fields that yield large

Protocol | New Patterns | Risk Quantification
DNS | ¬(EDNS:0 ∧ ANY lookup) | 3,274K (21.9× known pattern)
DNS | ¬(EDNS:0 ∧ (ANY ∨ TXT) lookup) | 3,127K (17.1× known pattern)
NTP OR | req code ≠ monlist (20, 42) | 43K (3.3× known pattern)
NTP AND | req code ≠ monlist (20, 42) | 663K (1× known pattern)
SNMP OR | GetNext | 61K (0.27× known pattern)
SNMP OR | Get | 10K (0.04× known pattern)
SNMP AND | GetNext | 101K (0.32× known pattern)
SNMP AND | Get | 11K (0.03× known pattern)
SSDP | None | 0
Memcached | Get, Gets | 33K (1.9× known pattern)

Table 6: Amplification risk from new patterns whose

risks will be missed by prior analysis

(and considerable) amplification. The total risk from these

other patterns (e.g., record types: LOC, URI lookups) across

10K servers is 3,274K. This unforeseen risk is 21.9× larger

than the risk of known patterns (149K)! Figure 11 shows a

bird’s-eye view of the residual risk. We observe similar trends

for other protocols. For instance, for NTP, a collective risk

from other features (e.g., get restrict) is 276× higher
than the known risk. For simpler protocols like SSDP, our

measurements do not reveal new patterns.

Figure 11: Visualizing the DNS residual risk when known

patterns (EDNS:0 and record type:ANY |TXT) are blocked.

The size of the circle ∝ the max AF of each server. Red

circles denote when the delta is ≥ 20%.

[Figure 12: bars of the % of remaining vulnerable servers at AF ≥ 10, ≥ 30, ≥ 50, and ≥ 100 under three filters: 〈EDNS, ANY|TXT〉, 〈EDNS, *〉, and 〈*, ANY|TXT〉.]

Figure 12: % of DNS servers that remain susceptible to

amplification even if we use recommendations by prior

works to block query patterns; i.e., 〈 EDNS, ANY |TXT 〉 is a

filter that blocks queries EDNS:0 and ANY or TXT lookups.

Next, we conduct what-if analysis to analyze what percent-

age of servers are susceptible to amplification if we were to

block known patterns. Given that prior works do not provide

concrete signatures, we consider a few possible interpreta-

tions, i.e., a combination of EDNS:0 and record type:ANY or

TXT. Figure 12 shows that even with EDNS:0 and (ANY or TXT)

lookups blocked, more than 97% of servers still can yield AF

greater than 10. For NTP (mode 7), even with monlist as a

signature,13 30.5% of servers can still yield AF ≥ 10 and 4.8%

13A follow-up paper mentioned the possibility of other settings that induce


≥ 100! We observe similar trends for SNMP. However, prior

recommendations achieve high coverage for SSDP, Chargen,

and Memcached.

[Figure 13: the % of vulnerable servers (AF 10-30, 30-50, and 50-100) per DNS record type, over TXT, ANY, DNSKEY, RRSIG, DS, NAPTR, DNAME, SRV, URI, SIG, HIP, RP, NSEC, OPENPGPKEY, CERT, TA, LOC, KX, IPSECKEY, CNAME, TLSA, NS, NSEC3PARAM, CDS, CDNSKEY, DHCID, PTR, SSHFP, CAA, APL, DLV, SOA, KEY, AFSDB, NSEC3, MX, A, AAAA, TSIG, OPT, AXFR, TKEY, and IXFR.]

Figure 13: The variability of field values (for a spe-

cific field, record type) that contribute to high amplifica-

tion. Apart from known ones (record type:ANY, TXT), many

other record type values can lead to large AF.

5.3 In-depth analysis on DNS

The previous discussion suggests there are many patterns not

highlighted by prior work. We analyze this further, focusing

on DNS here and deferring other protocols to §5.4-§5.6.

We start with a record type field as this field determines

ANY vs. NS record lookups. Figure 13 shows the percentage

of servers that can induce considerable AF for each possible

value of this field. While the top-2 record types are TXT and

ANY (pointed by prior work), more than 20% of our sampled

servers can yield more than 10 AF with 19 other record type

values (e.g., URI, HIP, RP, LOC, CNAME). Some of these (e.g.,

NAPTR) incur very high AF, especially if used in conjunction

with the DNSSEC (DNSSEC-OK) set. While many DNSSEC-

related record type values (e.g., RRSIG, DNSKEY) can yield

high AF [61], we also observe many record type values “unrelated” to DNSSEC (e.g., NAPTR, SRV). This finding is significant: even if we block ANY and TXT queries, there are many

other types that can induce high amplification.
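To make the search space concrete, the sketch below (our own illustration, not AmpMap's implementation) builds a raw DNS-over-UDP query in which the fields discussed above, i.e., the record type, the RD bit, the EDNS0 payload size, and the DNSSEC-OK bit, can each be varied independently. The record-type numbers follow the IANA registry; the transaction id is arbitrary.

```python
import struct

# Record-type numbers from the IANA DNS parameters registry
QTYPES = {"TXT": 16, "ANY": 255, "RRSIG": 46, "NAPTR": 35, "SRV": 33, "URI": 256}

def build_dns_query(domain, qtype, rd=True, edns_payload=4096, dnssec_ok=True):
    """Build a raw DNS query; the fields varied in the text are parameters."""
    flags = 0x0100 if rd else 0x0000            # RD (recursion desired) bit
    header = struct.pack(">HHHHHH", 0x1234, flags, 1, 0, 0, 1)
    qname = b"".join(bytes([len(l)]) + l.encode() for l in domain.split(".")) + b"\x00"
    question = qname + struct.pack(">HH", qtype, 1)   # qclass IN
    # EDNS0 OPT pseudo-record: root name, type 41, class carries the UDP
    # payload size, and the top bit of the TTL field is the DNSSEC-OK (DO) bit.
    ttl = 0x00008000 if dnssec_ok else 0
    opt = b"\x00" + struct.pack(">HHIH", 41, edns_payload, ttl, 0)
    return header + question + opt

query = build_dns_query("example.com", QTYPES["RRSIG"])
print(len(query))  # 40: a 40-byte request that can elicit a far larger response
```

The amplification factor is then simply the response payload size divided by this request size; a 4 KB response to this 40-byte query would correspond to an AF of about 100.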

Summarizing and analyzing query patterns: The above

analysis only considers one field. In practice, many other com-

binations of fields are susceptible, and we want to understand

the structure of amplification-inducing query patterns (QPs).

For this summarization, we considered several standard data

mining techniques (i.e., hierarchical clustering, K-means clustering, decision trees) but found that none were suitable.[14]

Given this, we designed a custom heuristic (Figure 14).

Starting from AF-inducing queries across all servers, we gen-

erate a set of candidate patterns where some fields are set to

concrete values or ranges, and others are wildcarded. Specifi-

cally, for large fields (e.g., id, payload for DNS) we identify

candidate ranges by dividing the accepted values for a large

field into exponentially-spaced bins (e.g., {[0,10], [11,100], ...}). Then, for each server, we generate a bit vector (e.g., 1111)

[14] Clustering assumes that we know the number of clusters or the right distance metric/threshold. Given the large combinatorial space, decision trees produce uninterpretable outputs.

[Figure 14 diagram: Step 1 (Preprocessing): from Q→AF measurements for all servers, keep queries with AF > 10; for large fields, infer ranges (F_large → R); for other fields, prune if needed and get distinct values (F_other → V). Step 2 (Merge Queries): each field f_i takes a concrete value v or range r; the result is still very large (redundancies). Step 3 (Create a DAG): leaves are concrete QPs; parents wildcard one field at a time up to the all-wildcard root; prune based on max or median AF. Output 1: find a minimum set cover of QPs per level. Output 2: infer a tree.]

Figure 14: Steps to obtain query patterns to shed light on the patterns of amplification

to represent these bins; a bit is set to 1 if a server has a query

with AF≥10 using a field value that belongs to the bin range.

Finally, given a set of bit vectors for all servers, we take candidate vectors that are observed across at least 10% of servers.

We prune out fields that appear not to affect amplification; i.e., we check if wildcarding the field makes the histogram of AF values (over queries with AF ≥ 10) follow a uniform distribution. We then generate candidate patterns by

generating all combinations of values and ranges. From these

candidates, we prune out QPs with AF less than 10 based

on the maximum or the median AF. We represent the QPs

as a logical Directed Acyclic Graph (DAG), with these

QPs are leaf nodes (Step 3, Figure 14). We create a parent

node by taking one of the nodes in the current level and wild-

carding one field; the DAG root is a node where all fields are

wildcards. Given this DAG, we consider two analysis:

1. Minimum set cover per level (Output 1, Figure 14): We

compute the minimum set-cover of QPs at each level that

logically covers all leaf nodes; e.g., the set of QPs obtained

at level 10 represents the minimum set of QPs to describe

QPs using only 10 fields as concrete values or ranges.

2. Hierarchical analysis (Output 2, Figure 14): To see dependencies across fields, we create a tree where each edge is annotated with the field and the value that became concrete as we increase the level (an example is in Figure 16).
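The range inference and bit-vector construction in Step 1 can be sketched as follows (a simplified illustration; the exponential bin spacing follows the description above, while the helper names are ours):

```python
def exp_bins(max_value):
    """Exponentially-spaced bins over a large field, e.g., [0,10], [11,100], ..."""
    bins, lo, hi = [], 0, 10
    while lo <= max_value:
        bins.append((lo, min(hi, max_value)))
        lo, hi = hi + 1, hi * 10
    return bins

def bit_vector(observations, bins, threshold=10):
    """One bit per bin: set if the server had a query with AF >= threshold
    whose field value falls into that bin."""
    return tuple(
        any(lo <= value <= hi for value, af in observations if af >= threshold)
        for (lo, hi) in bins
    )

# Example: one server's (field value, AF) observations for a 16-bit field
bins = exp_bins(65535)
server_obs = [(5, 32.0), (700, 15.5), (40000, 3.0)]
print(bit_vector(server_obs, bins))  # (True, False, True, False, False)
```

Bit vectors shared by at least 10% of servers then become candidate patterns, as described above.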

We run the above procedure separately for (1) domains

with DNSSEC support, and (2) domains without support.
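For Output 1, computing a minimum set cover exactly is NP-hard, so a greedy approximation is a natural choice; the sketch below is our own illustration, under the assumption that a QP covers a leaf if every non-wildcard field of the QP matches the leaf:

```python
def covers(qp, leaf):
    """A pattern covers a leaf if all its non-wildcard fields match."""
    return all(v == "*" or leaf.get(f) == v for f, v in qp.items())

def greedy_set_cover(level_qps, leaves):
    """Greedy approximation: repeatedly pick the QP covering most uncovered leaves."""
    uncovered, chosen = list(leaves), []
    while uncovered:
        best = max(level_qps, key=lambda qp: sum(covers(qp, l) for l in uncovered))
        if not any(covers(best, l) for l in uncovered):
            break  # remaining leaves are not coverable at this level
        chosen.append(best)
        uncovered = [l for l in uncovered if not covers(best, l)]
    return chosen

# Hypothetical two-field example
leaves = [{"edns": 0, "rdatatype": "RRSIG"}, {"edns": 1, "rdatatype": "TXT"}]
qps = [{"edns": "*", "rdatatype": "RRSIG"}, {"edns": 1, "rdatatype": "*"}]
print(len(greedy_set_cover(qps, leaves)))  # 2
```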

Corollary 1: Many unexpected patterns lead to high AF,

e.g., with DNSSEC off and unrelated to ANY records.

DNSSEC-related patterns: Figure 15a shows a boxplot of

the top-10 QPs w.r.t. the median AF when 8 fields are left

concrete (level 8). QP 82 incurs the largest median AF of 30

with 〈 EDNS:0, payload:*, record type:RRSIG, rd:* · · · 〉. In

this pattern, rd need not be set to 1, which shows that RRSIG lookups can also cause high AF. The rank-2 QP

has EDNS set to 1 and not 0, which is a known pattern. In


[Figure 15a boxplot: query patterns QP82, QP59, QP67, QP20, QP32, QP45, QP21, QP104, QP55, QP23, ranked by median AF; y-axis: amplification factor (20-60).]

(a) Rank based on median AF

ID | Field Values
QP82 | 〈 EDNS:0, payload:*, record type:RRSIG, ad:1, rd:*, rcode:8 · · · 〉
QP20 | 〈 EDNS:1, payload:*, record type:*, ad:0, rd:1, rcode:* · · · 〉
QP32 | 〈 EDNS:1, payload:*, record type:TXT, ad:0, rd:1, rcode:* · · · 〉

(b) Describing query patterns (QPs)

Figure 15: DNS: Top 10 query patterns for a particular depth where 8 fields are left as concrete values or ranges

[Figure 16 tree (subset): root qr{0} → id(0,65536) → opcode{0} → rdataclass{1}, branching on edns{0}/edns{1}. One branch with edns{0}, payload(>370), rd{1} leads to rdatatype {NS, MX, TXT, SIG, KEY, DNSKEY, TLSA, ANY, URI}; other branches with edns{1}, payload(>776), ad{0}/ad{1}, rd{0}/rd{1}, rdataclass{255}, rcode{1} lead to rdatatypes {TXT, RRSIG, DNSKEY, ANY} or {TXT}.]

Figure 16: Tree showing how the query patterns change

across levels. An edge means a field value transitioned

from a wildcard (*) in level L to a concrete value or range

in the next level, L+1.

fact, several servers that yield high AF had EDNS not set to 0.

Further, as we find many record type values that lead to high

AF (also seen in Figure 13), this QP has a record type set to *.

Further, as a side note, when we were pruning out fields that

appear not to affect AF (Figure 14), the DNSSEC-OK field was pruned out. However, we observed that setting this bit to 1 in certain queries can induce high AF on some servers.

Non-DNSSEC patterns: For certain servers, domains with-

out DNSSEC support can yield high AFs. The median AF for

the top-1 QP is 21 with 〈 EDNS:1, record type:TXT, rd:1 · · · 〉. This confirms that TXT records can cause high AF [62]. We

also observe record type values such as DS appear among the

QPs; some are attributed to anomalous servers.

Corollary 2: There are many query patterns that, while

not maximum, provide high enough amplification. Hence,

focusing on only one or a handful of patterns can render

existing mitigations (e.g., [41]) ineffective.

At each level of the DAG, more QPs are concentrated at

AF between 10 and 20. At the leaf nodes, 699 query patterns

produce a median AF of 10 to 20, while only 47 have a median AF above 20. Purely focusing on one pattern or a handful to drive the mitigation plan will be insufficient.

Corollary 3: There are complex dependencies across fields; the field values inducing high AF change based on other fields.

The DAG output (Figure 14) shows complex dependencies

across field values that yield high AF. Specifically, Figure 16

shows a subset of a tree (for DNSSEC-related) where the

QPs are filtered based on the “median” AF. If we consider

a top branch with EDNS:0 and rd:1, with NS, MX, · · · TLSA,

URI record types cause high AF. Some other combinations

(i.e., blue edges) will cause different record type values to

induce high AF. Surprisingly, we find a non-trivial number of

servers that yield high AF even when rd (recursion desired)

is 0 (off)! These suggest that (1) there are many combinations

of multiple fields values that lead to high AF, and (2) this

finding generalizes to many servers (as QPs are kept only if the median AF across servers is ≥ 10). Further, if we consider

a tree where QPs are pruned based on the maximum AF (less

aggressive pruning), we see even more combinations leading

to high AF (e.g., OPENPGPKEY, SOA record types).

Further, we observe that not all servers behave according

to specifications, further adding to variability in QPs. For in-

stance, when EDNS:0 is used, the response should be truncated to the specified EDNS payload value. Unfortunately, for many

servers, this is not the case. For instance, 88 servers out of

10K yield AF above 50 with payload less than 512. During

our 2019 measurements, we saw an AF of 311 for one server (for an SRV record lookup), where we observed many IP fragments. This

server went offline shortly after the experiment. While DNS

over UDP does use IP fragmentation to deliver large payloads [15], this makes defenses more difficult, as non-initial fragments miss key fields such as port information [4].

Vendor | # of Total Servers | # of Servers (AF ≥ 10) | % of Servers (AF ≥ 10)
Bind | 946 | 236 | 24.9%
Dnsmasq | 917 | 819 | 89.3%
Version:recursive-main/* | 522 | 12 | 2.3%
Microsoft | 261 | 250 | 95.8%
PowerDNS | 78 | 50 | 64.1%
unbound | 40 | 26 | 65%

Table 7: Statistics on the affected DNS vendors

Corollary 4: Given the variability of query patterns, blocking the top-K percent of patterns still leaves significant residual risk; i.e., the 50th percentile of servers retains 80% or more residual risk even when blocking the top 20% of query patterns (infeasible in practice).

We now analyze the percentage (%) of the residual risk if

we had used the top-K percentage (%) of QPs to block these

queries. For this analysis, from the inferred QPs (Figure 14),

we do not prune them based on the maximum or median AF;

we need to know all QPs that lead to high AF for each server.

We take the top-5% and top-20% of these 11K QPs (sorted by their median AF) and use them to block amplification-inducing

queries from each server. Unfortunately, we observe that even

blocking the top-20% QPs (which is infeasible in practice)

still leaves 50% of the servers with an 80% risk or higher


[Figure 17 boxplot: query patterns QP10, QP4, QP8, QP6, QP7, QP9, QP3, ranked by median AF; y-axis: amplification factor on a log scale (10^1 to 10^3).]

Figure 17: NTP top query patterns, where the top-2 are

monlist patterns. Other top QPs have peer list, if reload,

peer list sum, and peer stats as req code.

(blocking the top-5% leaves a 96.7% risk or higher).
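The residual-risk computation above can be sketched as follows (an illustration with hypothetical patterns; we model a server's residual risk as the fraction of its AF-inducing queries not matched by the top-K% of QPs):

```python
def matches(qp, query):
    """A query is blocked by a QP if every non-wildcard QP field matches."""
    return all(v == "*" or query.get(f) == v for f, v in qp.items())

def residual_risk(server_queries, ranked_qps, top_frac):
    """Fraction of a server's AF-inducing queries NOT blocked by the
    top `top_frac` of QPs (ranked by median AF)."""
    blocklist = ranked_qps[: int(len(ranked_qps) * top_frac)]
    left = [q for q in server_queries
            if not any(matches(qp, q) for qp in blocklist)]
    return len(left) / len(server_queries)

# Hypothetical example: one server's AF-inducing queries and a ranked QP list
queries = [{"rdatatype": "TXT"}, {"rdatatype": "RRSIG"}, {"rdatatype": "SRV"}]
qps = [{"rdatatype": "TXT"}] + [{"rdatatype": t} for t in ("ANY", "NS", "MX")]
print(residual_risk(queries, qps, 0.25))  # 0.666...: only TXT is blocked
```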

Corollary 5: Many DNS vendors are affected.

Table 7 shows the affected vendors with servers that can

yield AF ≥ 10. We only show vendors with more than 20

servers. We discuss our efforts to notify these vendors of the

vulnerability in § 6.2.

5.4 Amplification patterns for NTP

We discuss amplification patterns for NTP. As we do not

discover new patterns for modes 0-6,[15] we focus on mode 7 (private mode). Recall that we need to prune candidate QPs

based on maximum or median AF (Figure 14). As we ob-

serve a high variance across AF achieved by different NTP

servers, we looked at the QPs pruned based on the maximum AF. Figure 17 shows these QPs ranked by the median AF. Apart

from monlist (QPs 10 and 4), we observe request codes of

peer list, if reload, peer list sum, and peer stats from NTP OR servers. Some of these other QPs can yield AFs as large as a few hundred, as seen in the long tails in Figure 17. From

NTP AND servers, we also observe mem stats, if stats,

and get restrict. Our findings again complement Corol-

lary 2. Furthermore, the software versions (with servers that

can yield AF ≥ 10) are 4.1.1-2, 4.2.4, 4.2.6-8, and 4.2.0. We

observe that the servers that can induce high AF with other

request codes (other than monlist) are not particularly tied

to one single version but span across multiple versions.

5.5 Amplification patterns for SNMP

We now discuss SNMP patterns, which have three modes of operation, i.e., GetBulk, GetNext, and Get. We start with GetBulk,

which is a known pattern [3] (reported average of 6.3 AF [57]).

However, our measurements revealed polymorphic variants

that lead to significantly higher AFs. For instance, we saw

an average of 22.4 AF for SNMP OR and 31.8 AF for SNMP AND, both higher than previously reported. Specifically, an attacker can modify the OID value and the number of OIDs to yield

higher AFs. We generally observe higher AF for query pat-

terns with (1) a single-digit OID (near the root) such as 2, 1,

0, and (2) a list containing multiple OIDs (i.e., 2-15, but also above 15). However, given server variability, there are exceptions.

[15] There was one packet that incurred high AF for mode 6, but this packet contained many ICMP redirects, so we do not report this.

For example, an OID of 1.3.6.1.2, and a list size of 1 appears

in one of the top-4 patterns. The top-1 QP from the SNMP

servers yields a median AF of 35 with 〈 community:public

· · · OID:2, numoid:(0,8)[16] 〉. From SNMP AND servers, the

top-1 QP yields 45 median AF with OID:0.

Vendors | # Total Servers | GetBulk # Servers (AF≥10) | GetBulk % Servers (AF≥10) | GetNext # Servers (AF≥10) | GetNext % Servers (AF≥10)
net-snmp | 5357 | 5044 | 94.2% | 3445 | 64.3%
cisco Systems | 594 | 96 | 16.2% | 60 | 10.1%
SonicWall | 220 | 217 | 98.6% | 27 | 12.3%
Broadcom Corp. | 205 | 193 | 94.1% | 81 | 39.5%

Table 8: Statistics on the affected SNMP vendors

We now discuss GetNext requests. While only GetBulk has

been highlighted in the prior analysis, AmpMap discovers

that a single GetNext request can also yield hundreds of AF

(similarly, by varying the OID and the number of OIDs). From

SNMP AND servers, 37% of servers can yield AF above 10 and

0.74% above 100 AF! From SNMP OR servers, 10% of servers yield above 10 AF and 0.14% above 100 AF. However, unlike GetBulk, we saw high AFs for various OIDs (e.g., 1.3.6.1.2,

0, 1); this is expected because GetNext just requests the next

variable in the tree, unlike a GetBulk request, which effectively issues several GetNext requests. Note that while we also replicated

that a local server can yield 15 AF with GetNext by varying

the list size, we posit that we see higher AF in the wild given

server variability. Table 8 shows the affected vendors for

servers using GetBulk or GetNext requests. We only show vendors with more than 200 servers, combining the results

from both SNMP AND and SNMP OR servers. Similar to DNS

and NTP, this amplification vulnerability affects multiple

vendors and not just one.

Lastly, measurements reveal that Get requests also can yield

tens of AF (but not as large as GetNext). From SNMP OR,

0.73% of servers have AF greater than 10. Unlike GetNext

patterns, we observe high AF for OID of 1.3, and 1.3.6.1.3-4.

5.6 Amplification patterns for other protocols

SSDP: Amplification risk is inherent with SSDP’s “discov-

ery” feature. Our inferred QPs are quite simple. For QPs

pruned based on the median AF, we see a discovery request

with one UUID of ssdp:all. This is expected as this feature

fetches “all” UUID information. However, for QPs based on

the maximum AF, we see many UUIDs leading to ≥ 10 AF.

Again, this confirms the presence of multiple query patterns.

Memcached: We did not find any QPs that lead to above 10

AF other than the “stats” request (a known pattern) from our

2020 run. If we use our runs from 2019, some of the QPs with

get and gets requests did induce above 10 AF. However, it

[16] A more accurate version is (2,8), but our range inference is a heuristic.


is still the case that “stats” is by far the dominant pattern, and the residual risk from get and gets requests is negligible. Further, while the known AF for Memcached is tens of

thousands [24], the maximum we find from our 2020 run is

35 AF (we believe many have been patched or taken offline).

Chargen: As Chargen servers respond to any UDP datagram,

the QPs learned at the leaf nodes contain all possible charac-

ters and lengths. We represented the search space as a list of

hex strings, where we search over the hex character and the length of the hex string.

We validate the existence of amplification-inducing query

patterns for three protocols in a lab setting. For these, we

confirm the known patterns but do not find additional ones.

Quake: “Get status” message induces AF of 10 in our setting.

QOTD: As this server responds with random quotes, we see

higher AF with smaller list sizes and larger quote size.

RPCbind: The request for the process number running on

the server with a correct version ID incurs high AF (i.e., 10).

5.7 Parameters and Validation

Given the lack of ground-truth for all servers, we use a com-

bination of local-server experiments, a large-scale simulation,

and example measurements for validation. In the local ex-

periment, we randomly sampled 2M queries on a local DNS

server and measured the AFs to infer the signatures (§5.3).

Our simulator models an amplification function that maps a

query to AF based on (1) field types, (2) the # of servers, (3) the # of pattern structures across servers, and (4) the # of patterns for each structure in (3). For (3), indicating 100 pattern types instantiates 100 graph structures, where each server gets mapped to one type; (3) thus simulates the pattern variability across servers.
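A minimal version of such a simulator might look as follows; this is our own sketch, in which a "pattern structure" is a set of high-AF field combinations, servers are randomly assigned one structure, and the uniform AF draws are arbitrary modeling choices:

```python
import random

def make_structures(n_structures, fields, n_patterns_each, rng):
    """Each structure is a set of (field, value) combinations that yield high AF."""
    return [
        {tuple(sorted((f, rng.randrange(4)) for f in rng.sample(fields, 2)))
         for _ in range(n_patterns_each)}
        for _ in range(n_structures)
    ]

def make_servers(n_servers, structures, frac_benign, rng):
    """Assign each server one structure; a fraction induce no amplification."""
    return [None if rng.random() < frac_benign else rng.choice(structures)
            for _ in range(n_servers)]

def af(server_structure, query, rng):
    """Map a query to an AF: high only if it hits one of the structure's patterns."""
    if server_structure is None:
        return 1.0
    hit = any(all(query.get(f) == v for f, v in pat) for pat in server_structure)
    return rng.uniform(10, 100) if hit else rng.uniform(0, 5)

rng = random.Random(0)
structures = make_structures(100, ["f%d" % i for i in range(8)], 5, rng)
servers = make_servers(1000, structures, frac_benign=0.3, rng=rng)
```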

[Figure 18 bar chart: x-axis shows three pattern structures (Original; Structure A: halving the density; Structure B: disabling TXT/ANY/RRSIG); y-axis shows the % of ground-truth patterns found (0-90%); bars correspond to total budgets of 100, 200, 500, 800, 1000, 1200, 1500, and 2000.]

Figure 18: Validating the choice of total budget (Btotal)

Validating parameters: There are three key parameters:

(1) per-server total budget, Btotal, (2) allotting Btotal across

different stages (e.g., Probing Stage), and (3) the number of

clusters for K-means.

To see the impact of the total budget (Btotal), we use the

local DNS server experiment. Fixing other parameters (50%

for Brand), we varied Btotal from 100 to 2000 (Figure 18). To

show the robustness across multiple pattern structures, we

“emulated” different pattern structures given one setup. We

emulated the effect of (A) reducing the % of AF-inducing

queries by half (emulating this by adding “dummy” field

[Figure 19: two bar charts, each with x-axis = # of pattern structures across servers (100, 400) and y-axis = % of ground-truth patterns found (0-100%). (a) varies the % of budget for Brand (1, 15, 25, 35, 45, 55, 65); (b) varies the % of budget for Bprobe (0, 5, 10, 20, 30, 40, 50, 60).]

Figure 19: Validating the choice of budget allocation

entries that yield 0 AF), and (B) disabling certain patterns

(TXT, RRSIG, ANY lookups). Clearly, using only a few hundred queries achieves low coverage, but we start to see diminishing returns at 1200 to 1500. We chose 1500 for complex protocols (e.g.,

DNS). This experiment shows that our chosen Btotal is in a

sufficiently good operating region.

To see the impact of the budget across stages, we use our

simulator with 1K servers. We configured 30% of servers not

to induce high amplification (similar to the real-world). To

analyze the robustness w.r.t. different levels of diversity, we

test against 100 to 400 pattern structures. First, using 50%

for Brand, we vary the Bprobe from 0 to 40% (Figure 19b).

Using 0% for probing hurts coverage, but values between 5% and 30% are robust across settings. We chose 10% (the lower end of the range) to spare the budget for other (more critical) stages.

Similarly, we vary the Brand from 0 to 70% (Figure 19a). We

observe robustness across 5% to 45%. As it is crucial for this

RandomSample Stage to discover at least one AF-inducing

query (for most servers), we chose 45% (the higher end). This leaves the remaining 45% for the per-field search.

To validate the number of clusters, we use the same simulator and evaluate the % of servers for which the chosen Bprobe discovered at least one high-AF query. Then, we vary the number of clusters from 2 to 200 and observe robustness across these values; i.e., this is not a crucial factor.

[Figure 20 boxplot: for AmpMap, Random, and Sim. Ann., the distribution of the % of patterns found for each server (0-100%).]

Figure 20: Validation of coverage of AmpMap and alternate solutions using 1K server measurements

Comparing alternatives: We compare AmpMap vs. two

baselines: 1) Simulated Annealing (SA), and 2) pure random

search. Our success metric is pattern coverage across a set of

servers. We compared these solutions using small-scale 1K

measurements. As we lack the ground-truth for each server,

we compare the relative performance across these solutions

rather than to claim optimality or completeness. Using a query


budget of 1500, we inferred the signatures combining the out-

put across all solutions. Then, we analyze the coverage for

each server. For a given server, we take all queries with AF ≥ 10 across the three solutions, which serve as the basis of comparison for this server. Then, for each strategy, we compute the

% of patterns discovered for each server. Figure 20 shows the

coverage across 1K servers. While SA performs better than the pure random strategy (a median coverage of 16.7% vs. 11.9%), AmpMap achieves 97% coverage in this relative comparison.
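The relative-coverage metric used here can be sketched as follows; since per-server ground truth is unavailable, coverage is computed against the union of patterns found by all strategies (the function and variable names are ours):

```python
def coverage_per_server(found_by_strategy):
    """found_by_strategy: dict strategy -> set of AF-inducing patterns found
    on one server. Coverage is relative to the union across all strategies,
    since per-server ground truth is unavailable."""
    union = set().union(*found_by_strategy.values())
    if not union:
        return {s: 1.0 for s in found_by_strategy}
    return {s: len(f & union) / len(union) for s, f in found_by_strategy.items()}

# Hypothetical results for one server
one_server = {
    "AmpMap": {"p1", "p2", "p3"},
    "SimAnn": {"p1"},
    "Random": {"p2"},
}
print(coverage_per_server(one_server))  # AmpMap: 1.0, SimAnn: ~0.33, Random: ~0.33
```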

6 Precautions and Disclosure

We carefully considered the impact of our measurements

and the disclosure of our findings. We followed the ethical

principles (Menlo Report [27]) and the scanning guidelines suggested by prior efforts (ZMap [35]). At a high level, we

adhered to these principles of (1) minimizing the harm by tak-

ing multiple measurement precautions (§6.1), and (2) being

transparent in our method and results by informing relevant

stakeholders of our findings and explaining the purpose of our

scanning (e.g., when we send out email notifications) (§6.2).

6.1 Scanning precautions

We took precautions to ensure that there was no harm to the

servers and the network. Our study was approved by IRB

under non-human subject criteria. We took care to ensure that

our measurements do not burden servers or the Internet.

• We send at most one query per 5 seconds, do not send

malformed requests, and cap overall budget per server.

• We do not scan the IPv4 network space but only known

public servers obtained from Censys [34] and Shodan [20].

• We do not spoof the source IPs to induce responses to

others. Our measurers explicitly receive the responses.

Abuse complaints: We worked closely with the Cloud-

Lab [33] administrators whom we notified of our measure-

ments and the purpose of AmpMap. We only received one

abuse complaint from running back-to-back SNMP small-

scale experiments (500 servers) on June 3, 2020. This com-

plaint came from a third-party monitoring framework called

greynoise.io [12]; their goal is to report probing activities on the Internet, and mass scanners (e.g., Censys [34], Shodan [20]) are also likely to be flagged by them [12]. We resolved this abuse complaint by discussing it with the CloudLab admins. We did not receive any other abuse complaints from

our 10K server measurements. Across all 6 protocols, we

also ran small-scale runs (300 servers) from our public-facing

server. We are not aware of the campus network operators receiving any abuse complaints from these measurements.

6.2 Disclosure

Next, we discuss our steps for responsible disclosure to rele-

vant stakeholders.

Protocol | # Sent | # Resp
DNS | 4335 | 49
NTP OR (priv) | 112 | 0
NTP OR (normal) | 2 | 0
NTP AND (priv) | 915 | 4
SNMP OR (bulk) | 4007 | 30
SNMP OR (next) | 1670 | 11
SNMP AND (bulk) | 4433 | 36
SNMP AND (next) | 2387 | 34
SNMP AND (get) | 26 | 2
SSDP | 3563 | 6
Chargen | 6008 | 9
Memcached | 51 | 0

Table 9: Statistics on the # of notification emails we sent and the responses we got from system owners

SUBJECT: Vulnerable DDoS Amplifier

BODY: Security researchers at Carnegie Mellon University have been

conducting Internet measurements to quantify the risk of amplification

distributed denial-of-service (DDoS) attacks. Our team has noticed your

system, $IP$ with $PORT$ running $PROTOCOL$, can be abused to

create an amplification attack (US-CERT). That means certain network

queries can induce large responses (i.e., amplification factor as defined by

US-CERT). Note that this may or may not be a result of mis-configuration

of the server. An example of a network packet that can cause an amplification factor greater than 10 is: $PACKET INFO$.

Please feel free to contact us at [email protected] should you

have any questions and/or concerns. The details and motivation of our

project can be found in $OUR WEB$.

Figure 21: A sample notification email to IP owners

Notifying IP owners: We notified the IP owners whose

servers can induce AF greater than 10. Following best

practices, we obtained the abuse and/or contact email from

WHOIS [51]. We include an example notification, sent from the project's email ([email protected]), in Figure 21. Ta-

ble 9 shows the number of emails we sent and human (not

automated) responses we got; e.g., for DNS, we sent 4335

emails and received 49 responses. Example responses include

“Thanks · · · service detected on ADDR has been shutdown the

time to install necessary mitigation” and “We were not even

aware this was the case. we have disabled SNMP.” We also

received detailed responses such as “The server is operated

by one of our downstream sites ... this server gives an upward

referral instead of returning SERVFAIL or REFUSED. This

is consistent with particular implementation of DNS server

(and IMO, it’s wrong, for exactly the reasons you state ...)”

Vulnerability reporting: We have initiated a process of dis-

closing our findings to the affected parties mediated by the

CERT® Coordination Center (CERT/CC). CERT/CC has ac-

cepted our coordination request and is in the process of iden-

tifying and notifying the affected parties. Our findings require

multi-party coordination because unexpected amplification is

potentially a protocol issue, and so all relevant vendors need

to be notified in a consistent manner. Further, we have tested

the effectiveness of the Response Rate Limiting (RRL) [41],

a mitigation feature for DNS amplification attacks. We in-

formed the vendor that having multiple patterns can partially

degrade the performance (more details in §8).

Notifying the vendors: Our vulnerability reports to

CERT/CC specify affected vendors for DNS, SNMP, and

NTP. CERT/CC is initiating the conversation with the vendors so that we can share the packet captures and commands that elicit large amplification.

7 Related Work

Amplification attack and mitigation: Many network pro-

tocols have amplification vulnerabilities [54]. Rossow [57]

discovered amplification vulnerabilities in 14 UDP-based pro-

tocols by manually analyzing the code and the binary. Follow

up research also analyzed detailed amplification vector in

specific protocols by focusing on a specific set of features

(e.g., analyzing DNSSEC in DNS [61], monlist in NTP [32]).

However, using AmpMap, we found many other record type

values that can incur high AF. Some have looked at TCP-

based amplification [30, 49], which is outside the scope of

AmpMap. There is also an active discussion on the mitigation

of amplification attacks (e.g., [6, 8, 14]). Jonker et al. [44] have done a measurement study on the adoption of DDoS protection services. Further, some orthogonal efforts focus on monitoring (e.g., [43, 47]) and linking DDoS

services (e.g., [48]). For instance, prior work [43] leverages

data from multiple Internet infrastructures (e.g., backscat-

ter traffic, honeypots) to macroscopically characterize DDoS

attacks (including amplification attacks), attack targets, and

mitigation behaviors. Our work is inspired by these prior ef-

forts. Specifically, our goal is not to characterize or link attacks that are happening in the wild. Instead, to

the best of our knowledge, AmpMap is the first to study the

problem of automatically mapping Internet-wide amplifica-

tion vulnerabilities by precisely identifying query patterns

that can induce large amplification.

Protocol implementation testing and verification: There

is a rich literature on testing and verification of protocol imple-

mentations. Bishop et al. [29] develop a practical specification-

based testing technique for both TCP and UDP based network

protocols; PIC [55] applies symbolic execution for check-

ing interoperability in protocol implementations; Kothari et

al. [46] apply symbolic execution for manipulation attacks.

Recent work [45, 53] also applied model checking techniques

for protocol implementations. Our work is different from this

line of work because of our specific focus on uncovering

amplification vectors rather than protocol bugs.

Existing machine learning techniques: The problem that

AmpMap tackles can also be viewed as a black-box optimization problem. Hence, one interesting direction for future work is to

leverage and customize these techniques for AmpMap’s pur-

pose, e.g., derivative-free optimization [36,42,56] or Bayesian

Optimization that can optimize for a black-box function. For

instance, we would need to customize these algorithms to

achieve coverage rather than finding the maximum value and

also handle server diversity. These efforts can benefit from

our observations and insights. Further, the current AmpMap

algorithm can also benefit from parameter tuning, e.g., automatically deciding the % spent on the RandomSample Stage based on the density observed so far.

Fuzz testing: Our technique is closely related to a large

body of work on fuzz testing of software [52]; some well-

known tools are DART [38], SAGE [40], grammar-based

fuzzing [39], mutational fuzzing [65], among many others

(see [59]). Some have been applied for testing protocol im-

plementations; e.g., [25, 28] focus on finding security flaws

in the SIP protocol, and [60] focuses on security protocols.

However, these approaches focus more on safety bugs (e.g.,

memory). While our technique is a form of fuzzing, we tackle

a different application domain that will benefit from a differ-

ent set of domain-specific insights.

Message format extraction: AmpMap currently assumes

that the protocol formats are known. As such, our work can

benefit from prior work on message format extraction and

protocol model inference (e.g., [31, 63]).

8 Countermeasures

In this section, we discuss countermeasures against ampli-

fication DDoS attacks in light of our findings in §5. More

extensive countermeasures are discussed by Rossow [57] and

we omit them for brevity.

Response rate limiting: As a response to UDP-based am-

plification attacks, authoritative name servers should, and mostly do, use response rate limiting (RRL) [1]. The idea of RRL is to limit the number of responses that a server sends to a client, so the server cannot be used to reflect an attack on

the client [57]. Popular DNS servers already support this fea-

ture [41]. In light of our findings that revealed multiple query

patterns, we revisit the effectiveness of the RRL mitigation.

Given that the implementation of RRL focuses on identical

responses and client identity, it calls into question RRL's effectiveness if an attacker rotates among multiple patterns. To test,

we set up a local authoritative DNS BIND server (9.16) and obtained amplification-inducing queries using AmpMap. Then,

we varied (1) the number of distinct queries to rotate (37 vs.

2111), and (2) the inter-query time (0 vs. 0.05 s). We com-

pared the total response bytes (within a window of 15 s) and

the average AF when the RRL feature is on vs. off. Our results

reveal that using multiple query patterns and carefully con-

trolling the inter-query time can degrade the performance of

RRL and give an adversary power. Specifically, if an attacker

uses more patterns (2,111 instead of 37) and an appropriate

inter-query time (0.05 s), the average AF with RRL on is 92% of that with RRL off. However,

by using a larger inter-query time, an attacker consequently

generates less attack traffic. That is, an adversary will need

to trade off the efficiency of an attack against its total bandwidth. Understanding this trade-off is an

exciting research direction to explore. In light of our findings, more advanced RRL mechanisms are needed going forward. Given the diversity of patterns, it is unclear whether keying on the exact query or the exact response is the right mechanism.
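To make the rotation idea concrete, the following is a minimal sketch, not the actual BIND RRL implementation, of a rate limiter keyed on the (client, response) pair, together with an attacker that rotates among distinct amplification-inducing patterns. The victim address, pattern counts, and per-key limit are illustrative assumptions.

```python
from collections import defaultdict

def simulate_rrl(queries, limit_per_key):
    """Count responses that pass a simplified RRL that caps
    identical (client, response) pairs at limit_per_key per window."""
    seen = defaultdict(int)
    passed = 0
    for client, response_key in queries:
        seen[(client, response_key)] += 1
        if seen[(client, response_key)] <= limit_per_key:
            passed += 1
    return passed

VICTIM = "198.51.100.7"   # spoofed source address (illustrative)
BUDGET = 1000             # queries sent within one RRL window
LIMIT = 5                 # responses allowed per (client, response) key

# One pattern: every query elicits the identical response.
single = [(VICTIM, "pattern-0")] * BUDGET
# Rotation: cycle through 200 distinct amplification-inducing patterns,
# so each (client, response) key stays at or below the limit.
rotated = [(VICTIM, "pattern-%d" % (i % 200)) for i in range(BUDGET)]

print(simulate_rrl(single, LIMIT))   # only LIMIT responses get through
print(simulate_rrl(rotated, LIMIT))  # all BUDGET responses get through
```

Under this simplified model, rotating patterns keeps every per-key counter at or below the limit, which mirrors why RRL keyed on the exact response degrades against the multi-pattern queries AmpMap uncovers.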


Secure configuration and setups: Network operators and

device vendors can help mitigate some of these threats by

either taking the server offline (for legacy protocols) or chang-

ing configurations. For instance, certain network devices (e.g.,

network-enabled printers) have SNMP on by default, and

fixing these configurations could help mitigate these threats.

Our experience informing IP owners revealed multiple cases in which operators were unaware that their devices were publicly accessible. Furthermore, the suggested best practice

for public-facing DNS servers is to restrict access to only

authorized clients. While we also advocate following the best

practices, mitigating these attacks is unfortunately not as sim-

ple. Even in the perfect scenarios where all the servers are

correctly configured, our measurements uncovered valid fea-

tures within a protocol exploitable for attacks. Therefore, a

long-term solution is to carefully consider the protocol design

choices or design protocols that are correct-by-construction.

9 Conclusions

Given the constant evolution of protocols and server implementations, we need a systematic approach to map the DDoS

amplification threat. This paper bridges this gap by synthe-

sizing structural insights with careful measurement design to

realize a low-overhead service called AmpMap. AmpMap can

systematically confirm prior observations and also uncover

new, possibly hidden, amplification patterns that are ripe for

abuse. As future work, we plan to add support for more pro-

tocols and expand the scale of measurement to make this a

continuous “health monitoring” service for the Internet.

Acknowledgements

We thank the anonymous reviewers, Nicolas Christin, and

Min Suk Kang for their helpful suggestions. We thank the

artifact evaluation committee for their efforts and suggestions,

and Devdeep Ray and Ankit Jena for their help with an earlier

version of AmpMap. This work was also supported in part by:

NSF awards CNS-1440065 and CNS-1552481; the CONIX

Research Center, one of six centers in JUMP, a Semiconductor

Research Corporation (SRC) program sponsored by DARPA;

and by the U.S. Army Combat Capabilities Development

Command Army Research Laboratory and was accomplished

under Cooperative Agreement Number W911NF-13-2-0045

(ARL Cyber Security CRA). The views and conclusions con-

tained in this document are those of the authors and should not

be interpreted as representing the official policies, either ex-

pressed or implied, of the Combat Capabilities Development

Command Army Research Laboratory or the U.S. Govern-

ment. The U.S. Government is authorized to reproduce and

distribute reprints for Government purposes notwithstanding

any copyright notation here on.

References

[1] Alert (TA13-088A) UDP-Based Amplification Attacks. https://www.us-cert.gov/ncas/alerts/TA13-088A.

[2] Alert (TA14-013A) NTP Amplification Attacks Using CVE-2013-5211. https://www.us-cert.gov/ncas/alerts/TA14-013A.

[3] Alert (TA14-017A) UDP-Based Amplification Attacks. https://www.us-cert.gov/ncas/alerts/TA14-017A.

[4] Broken packets: IP fragmentation is flawed. https://blog.cloudflare.com/ip-fragmentation-is-broken/.

[5] CyberGreen. https://stats.cybergreen.net/.

[6] DDoS and security resource center. https://tinyurl.com/y8h7o9vw.

[7] DDoS Attacks Get Bigger, Smarter and More Diverse. https://tinyurl.com/ydzdnfur.

[8] DNS reflection defense. https://tinyurl.com/lbffebt.

[9] DNS Survey: Open Resolvers. http://dns.measurement-factory.com/surveys/openresolvers.html.

[10] Executive Order 13800 - Strengthening the Cybersecurity of Federal Networks and Critical Infrastructure. https://www.govinfo.gov/content/pkg/DCPD-201700327/pdf/DCPD-201700327.pdf.

[11] Flooding the web: The internet’s epic attack amplification problem. https://tinyurl.com/ycjnqg9n.

[12] Grey Noise. https://greynoise.io/about.

[13] Here’s how much money a business should expect to lose if they’re hit with a DDoS attack. https://tinyurl.com/y7s45ls3.

[14] How to defend against amplification attacks. https://tinyurl.com/yb5gotte.

[15] IPv6, Large UDP Packets and the DNS. http://www.potaroo.net/ispcol/2017-08/xtn-hdrs.html.

[16] Memcrashed - Major amplification attacks from UDP port 11211. https://tinyurl.com/yatp4649.

[17] Open Memcached Key-Value Store Scanning Project. https://memcachedscan.shadowserver.org/.

[18] Security Bulletin: Crafted DNS Text Attack. https://tinyurl.com/y9zpevuy.

[19] ShadowServer. https://www.shadowserver.org/.

[20] SHODAN. https://www.shodan.io/.

[21] Technical Details Behind a 400Gbps NTP Amplification DDoS Attack. https://tinyurl.com/mcf32xg.

[22] The DDoS That Almost Broke the Internet. https://tinyurl.com/pl26tw3.

[23] The Spoofer Project. http://spoofer.cmand.org.

[24] UDP-Based Amplification Attacks. https://www.us-cert.gov/ncas/alerts/TA14-017A.

[25] H. J. Abdelnur, R. State, and O. Festor. KiF: A stateful SIP fuzzer. In Proc. IPTComm, 2007.

[26] D. G. Andersen, H. Balakrishnan, N. Feamster, T. Koponen, D. Moon, and S. Shenker. Accountable internet protocol (AIP). In Proc. ACM SIGCOMM, 2008.

[27] M. Bailey, D. Dittrich, E. Kenneally, and D. Maughan. The Menlo Report. IEEE Security Privacy, 10(2):71–75, 2012.

[27] M. Bailey, D. Dittrich, E. Kenneally, and D. Maughan. The menlo report. IEEE

Security Privacy, 10(2):71–75, 2012.

[28] G. Banks, M. Cova, V. Felmetsger, K. Almeroth, R. Kemmerer, and G. Vigna. Snooze: Toward a stateful network protocol fuzzer. In Proc. ISC, 2006.

[29] S. Bishop, M. Fairbairn, M. Norrish, P. Sewell, M. Smith, and K. Wansbrough. Rigorous specification and conformance testing techniques for network protocols, as applied to TCP, UDP, and sockets. In Proc. ACM SIGCOMM, 2005.

[30] K. Bock, A. Alaraj, Y. Fax, K. Hurley, E. Wustrow, and D. Levin. Co-opting Firewalls for TCP Reflected Amplification. In Proc. USENIX Security, 2021.

[31] W. Cui, J. Kannan, and H. J. Wang. Discoverer: Automatic protocol reverse engineering from network traces. In Proc. USENIX Security, 2007.

[32] J. Czyz, M. Kallitsis, M. Gharaibeh, C. Papadopoulos, M. Bailey, and M. Karir. Taming the 800 pound gorilla: The rise and decline of NTP DDoS attacks. In Proc. IMC, 2014.

[33] D. Duplyakin, R. Ricci, A. Maricq, G. Wong, J. Duerig, E. Eide, L. Stoller, M. Hibler, D. Johnson, K. Webb, A. Akella, K. Wang, G. Ricart, L. Landweber, C. Elliott, M. Zink, E. Cecchet, S. Kar, and P. Mishra. The design and operation of CloudLab. In Proc. ATC, 2019.

[34] Z. Durumeric, D. Adrian, A. Mirian, M. Bailey, and J. A. Halderman. A search engine backed by internet-wide scanning. In Proc. CCS, 2015.

[35] Z. Durumeric, E. Wustrow, and J. A. Halderman. ZMap: Fast internet-wide scanning and its security applications. In Proc. USENIX Security, 2013.


[36] D. E. Finkel. DIRECT optimization algorithm user guide. 2003.

[37] P. Flajolet, D. Gardy, and L. Thimonier. Birthday paradox, coupon collectors, caching algorithms and self-organizing search. Discrete Appl. Math., 39(3):207–229, Nov. 1992.

[38] P. Godefroid et al. DART: Directed automated random testing. In Proc. PLDI, 2005.

[39] P. Godefroid, A. Kiezun, and M. Y. Levin. Grammar-based whitebox fuzzing. SIGPLAN Not., 43(6):206–215, June 2008.

[40] P. Godefroid, M. Y. Levin, D. A. Molnar, et al. Automated whitebox fuzz testing. In Proc. NDSS, volume 8, pages 151–166, 2008.

[41] Internet Systems Consortium. Using the response rate limiting feature. https://kb.isc.org/docs/aa-00994, Sept. 2018.

[42] K. G. Jamieson et al. Query complexity of derivative-free optimization. In Proc. NIPS, pages 2672–2680, 2012.

[43] M. Jonker, A. King, J. Krupp, C. Rossow, A. Sperotto, and A. Dainotti. Millions of Targets Under Attack: a Macroscopic Characterization of the DoS Ecosystem. In Proc. IMC, 2017.

[44] M. Jonker, A. Sperotto, R. van Rijswijk-Deij, R. Sadre, and A. Pras. Measuring the adoption of DDoS protection services. In Proc. IMC, 2016.

[45] C. Killian, J. W. Anderson, R. Jhala, and A. Vahdat. Life, death, and the critical transition: Finding liveness bugs in systems code. In Proc. NSDI, 2007.

[46] N. Kothari, R. Mahajan, T. Millstein, R. Govindan, and M. Musuvathi. Finding protocol manipulation attacks. In Proc. ACM SIGCOMM, 2011.

[47] L. Krämer, J. Krupp, D. Makita, T. Nishizoe, T. Koide, K. Yoshioka, and C. Rossow. AmpPot: Monitoring and defending against amplification DDoS attacks. In Proc. RAID, 2015.

[48] J. Krupp, M. Karami, C. Rossow, D. McCoy, and M. Backes. Linking amplification DDoS attacks to booter services. In Proc. RAID, 2017.

[49] M. Kührer, T. Hupperich, C. Rossow, and T. Holz. Exit from Hell? Reducing the Impact of Amplification DDoS Attacks. In Proc. USENIX Security, 2014.

[50] S. Kumar. Smurf-based distributed denial of service (DDoS) attack amplification in internet. In Proc. ICIMP, 2007.

[51] F. Li, Z. Durumeric, J. Czyz, M. Karami, M. Bailey, D. McCoy, S. Savage, and V. Paxson. You’ve got vulnerability: Exploring effective vulnerability notifications. In Proc. USENIX Security, 2016.

[52] B. P. Miller, L. Fredriksen, and B. So. An empirical study of the reliability of UNIX utilities. Commun. ACM, 33(12):32–44, Dec. 1990.

[53] M. Musuvathi and D. R. Engler. Model checking large network protocol implementations. In Proc. NSDI, 2004.

[54] V. Paxson. An analysis of using reflectors for distributed denial-of-service attacks. SIGCOMM CCR, 31(3):38–47, July 2001.

[55] L. Pedrosa, A. Fogel, N. Kothari, R. Govindan, R. Mahajan, and T. Millstein. Analyzing Protocol Implementations for Interoperability. In Proc. NSDI, 2015.

[56] L. M. Rios and N. V. Sahinidis. Derivative-free optimization: a review of algorithms and comparison of software implementations. Journal of Global Optimization, 56(3):1247–1293, 2013.

[57] C. Rossow. Amplification Hell: Revisiting Network Protocols for DDoS Abuse. In Proc. NDSS, 2014.

[58] T. Rozekrans and J. de Koning. Defending against DNS reflection amplification attacks. https://tinyurl.com/bvw3d85.

[59] M. Sutton, A. Greene, and P. Amini. Fuzzing: Brute Force Vulnerability Discovery. Addison-Wesley Professional, 2007.

[60] P. Tsankov, M. T. Dashti, and D. Basin. SecFuzz: Fuzz-testing security protocols. In 2012 7th International Workshop on Automation of Software Test (AST), 2012.

[61] R. van Rijswijk-Deij, A. Sperotto, and A. Pras. DNSSEC and its potential for DDoS attacks: A comprehensive measurement study. In Proc. IMC, 2014.

[62] R. Vaughn and G. Evron. DNS amplification attacks preliminary release. 2006.

[63] Y. Wang, X. Yun, M. Z. Shafiq, L. Wang, A. X. Liu, Z. Zhang, D. Yao, Y. Zhang, and L. Guo. A semantics aware approach to automated reverse engineering unknown protocols. In Proc. ICNP, 2012.

[64] R. Weber. Better than Best Practices for DNS Amplification Attacks. https://tinyurl.com/y75u32ju.

[65] M. Woo, S. K. Cha, S. Gottlieb, and D. Brumley. Scheduling black-box mutational fuzzing. In Proc. CCS, 2013.

A Formal Analysis

We present an analysis sketch of why AmpMap can discover medium-to-high modes and compare it with other strawman solutions. To make the analysis easier, we make two simplifying assumptions: (1) we only consider the single-server case (§4.1.2); and (2) the ratio of the number of high-AF queries to the total number of possible queries, $d$, is known.

Definitions: We first give necessary definitions for the for-

mal analysis. We use query ranges to denote a set of queries.

Particularly, we write a query range $QR$ as $\langle f_1 : [v^l_1, v^r_1], f_2 : [v^l_2, v^r_2], \ldots, f_n : [v^l_n, v^r_n] \rangle$, where $v^l_i, v^r_i \in AV(f_i)$ and $v^l_i < v^r_i$ for $i = 1, \ldots, n$. A query range represents a set of queries in a natural way. A query $q = \langle f_1 = v_1, \ldots, f_n = v_n \rangle$ is in $QR$ (written $q \in QR$) iff $v^l_i \le v_i \le v^r_i$ for $i = 1, \ldots, n$.

Given a constant $\delta$, a $\delta$-high query pattern (or simply high query pattern if $\delta$ is clear from the context) $QP$ is a query range $\langle f_1 : [v^l_1, v^r_1], f_2 : [v^l_2, v^r_2], \ldots, f_n : [v^l_n, v^r_n] \rangle$ satisfying the following two conditions: (1) all queries in the query range induce high AF, i.e., $\forall q \in QP$, $AF(q) \ge \delta$; (2) the specified range of each field in $QP$ is maximal in terms of inducing high AF. That is, $\forall i = 1, \ldots, n$ and $\forall v'^l_i, v'^r_i$, if $v'^l_i < v^l_i \le v^r_i \le v'^r_i$ or $v'^l_i \le v^l_i \le v^r_i < v'^r_i$, then $\exists$ a query $q \in \langle f_1 : [v^l_1, v^r_1], \ldots, f_i : [v'^l_i, v'^r_i], \ldots, f_n : [v^l_n, v^r_n] \rangle$ such that $AF(q) < \delta$.

Given a protocol, Proto, we assume that the set of all high

query patterns of Proto is unique. We denote the set of all

high query patterns as $P_{Proto}$.

Given a Proto and a total budget of queries $Q$, the high query patterns covered by $Q$, denoted $co(Q)$, is the set of high query patterns

of Proto where each high query pattern shares at least one

query with $Q$. That is, $co(Q) = \{QP \in P_{Proto} \mid Q \cap QP \neq \emptyset\}$. Based on this definition, we can now formally state the goal of

AmpMap. Given a server s running protocol Proto, AmpMap

seeks to maximize the size of co(Q).
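These definitions translate directly into code. The sketch below checks query-range membership and computes $co(Q)$; the two-field protocol, field names, and pattern ranges are hypothetical stand-ins, not an actual protocol.

```python
def in_range(query, query_range):
    """q is in QR iff every field value lies within that field's range."""
    return all(lo <= query[f] <= hi for f, (lo, hi) in query_range.items())

def covered_patterns(queries, patterns):
    """co(Q): the set of high query patterns sharing >= 1 query with Q."""
    return {name for name, qp in patterns.items()
            if any(in_range(q, qp) for q in queries)}

# Hypothetical protocol with two fields f1, f2 and two high query patterns.
patterns = {
    "QP1": {"f1": (0, 3), "f2": (10, 20)},
    "QP2": {"f1": (7, 9), "f2": (0, 5)},
}
Q = [{"f1": 2, "f2": 15},   # falls inside QP1
     {"f1": 5, "f2": 50}]   # falls inside no pattern

print(covered_patterns(Q, patterns))  # {'QP1'}
```

AmpMap's goal, in these terms, is to choose the budgeted query set $Q$ so that `covered_patterns` returns as many patterns as possible.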

A.1 Analysis of strawman approaches

Here, we analyze the expected budget for different strategies

for the one-server case.

Exhaustive Search: An exhaustive search enumerates valid

queries of the protocol. While this can discover all patterns,

the budget is prohibitively large: $E(B) = \prod_{i=1}^{N} |AV(f_i)|$, where $N$ is the number of fields.

Random Search: For pure random search, the expected num-

ber of queries to cover all high query patterns is: $E(B) = \int_0^{\infty} \big(1 - \prod_{i=1}^{|P|} (1 - e^{-p_i t})\big)\, dt$.

Here, $p_i$ is the probability of picking a query in the $i$-th high query pattern [37]. The expected budget increases exponentially as $|P|$ increases.
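As a sanity check, this coupon-collector integral can be evaluated numerically; the probabilities below are illustrative, and for equal $p_i$ the integral reduces to the harmonic-number closed form.

```python
import math

def expected_budget(ps, t_max=1000.0, dt=0.01):
    """Numerically evaluate E(B) = integral over t of
    (1 - prod_i (1 - e^{-p_i * t})) via a Riemann sum truncated at t_max."""
    total, t = 0.0, 0.0
    while t < t_max:
        prod = 1.0
        for p in ps:
            prod *= 1.0 - math.exp(-p * t)
        total += (1.0 - prod) * dt
        t += dt
    return total

# Three patterns, each hit with probability 0.1 per random query.
approx = expected_budget([0.1, 0.1, 0.1])
exact = (1 + 1/2 + 1/3) / 0.1  # harmonic-number closed form for equal p_i
print(round(approx, 1), round(exact, 1))
```

The agreement between the numerical integral and the closed form illustrates how quickly the expected budget grows once the per-pattern probabilities shrink.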

A.2 Analysis of AmpMap approach

Under some simplifying assumptions we can analyze the

expected budget to discover all patterns. To make the analysis easier to present, we make three simplifying assumptions: (1) each field $f_i$ is of homogeneous size $F$; (2) each distinct pattern has just one query; and (3) we know the number of distinct patterns, $NumPatterns$.

In reality, our goal is to discover as many patterns as possible. At a high level, we can show that our worst-case run time is linear in $NumPatterns \times F$. First, note that given $d$, the density of queries that give high AF, the expected budget to find one query in one of the patterns is $1/d$. Second, note that the number of queries required to sweep all neighboring queries from a given query is $F \times NumField$.

Given these preliminaries and our assumptions on the “lo-

cality” structure, we can consider the best-case and worst-case

analysis to discover all patterns. The best case is when all patterns form a fully connected clique, where two queries in two distinct patterns are neighbors. This means that when we start from a query in $QP_1$, we will discover all other $NumPatterns - 1$ patterns in just one sweep. The worst case is when the distinct patterns (e.g., $QP_1, \ldots, QP_4$) form a chain. That is, we need to do one sweep to discover each additional pattern. Note that we are guaranteed to find another pattern (Observation 1) because all patterns can be reached by sweeping each field. Hence, we need to do $NumPatterns - 1$ sweeps. Since we assume we know $NumPatterns$, our search will terminate when we discover all patterns. Taken together, the best-case runtime is $1/d + F \times NumField$, and the worst-case runtime is $1/d + (NumPatterns - 1) \times F \times NumField$.
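The chain worst case can be illustrated with a toy simulation; the two-field protocol, field size, and chained single-query patterns below are fabricated solely to exercise the sweep argument.

```python
# Toy per-field sweep search over a 2-field protocol with F values per field.
F, FIELDS = 10, 2
# Three single-query "patterns" forming a chain: each is a neighbor
# (differs in exactly one field) of the previous one only.
PATTERNS = {(0, 0), (0, 5), (3, 5)}

def sweep(query):
    """All neighbors of `query`: vary one field at a time over its F values."""
    for i in range(FIELDS):
        for v in range(F):
            yield query[:i] + (v,) + query[i + 1:]

found, frontier, budget = {(0, 0)}, [(0, 0)], 0  # assume one seed is known
while frontier:
    q = frontier.pop()
    for neighbor in sweep(q):          # one sweep costs F * FIELDS queries
        budget += 1
        if neighbor in PATTERNS and neighbor not in found:
            found.add(neighbor)
            frontier.append(neighbor)

print(len(found), budget)  # all 3 patterns; budget grows linearly in sweeps
```

Because this sketch does not assume $NumPatterns$ is known, it sweeps every discovered pattern (3 sweeps of 20 queries each); with $NumPatterns$ known, the search could stop one sweep earlier, matching the $(NumPatterns - 1)$ bound in the analysis above.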

