Post on 07-Feb-2016
Reverse Hashing for High-speed Network Monitoring: Algorithms, Evaluation, and Applications
Robert Schweller1, Zhichun Li1, Yan Chen1, Yan Gao1, Ashish Gupta1, Yin Zhang2, Peter Dinda1, Ming-Yang Kao1, Gokhan Memik1
1 Lab for Internet and Security Technology (LIST), Northwestern Univ. 2 University of Texas at Austin
The Spread of Sapphire/Slammer Worms
Motivation (online change detection)
• Online network anomaly/intrusion detection over high speed links
– Small memory usage
– Small # of memory accesses per packet
– Scalable to large key space size
• Primitives for online anomaly detection
– Heavy hitters (lots of prior work)
– Heavy changes: enabler for aggregate queries over multiple data streams
  • Asymmetric routing demands spatial aggregation
  • Time Series Analysis (TSA) needs temporal aggregation
Outline
• Background on k-ary sketch
• Reversible sketch problem
• Modular hashing
• IP mangling
• Reverse hashing
• Evaluation
• Conclusion
K-ary sketch [Krishnamurthy, Sen, Zhang, Chen, 2003]
First to detect flow-level heavy changes in massive data streams at network traffic speeds
[Figure: k-ary sketch with H hash tables (rows 1 … j … H), each with K buckets (0, 1, …, K-1); key k is hashed to bucket h1(k), …, hj(k), …, hH(k), one per table]
APIs:
Update (k, u): Tj[hj(k)] += u (for all j)
Estimate v(S, k): sum of updates for key k
$$v^{est}(S, k) = \operatorname{median}_{j} \left\{ \frac{T_j[h_j(k)] - \mathrm{sum}(S)/K}{1 - 1/K} \right\}$$
S = COMBINE(α, S1, β, S2): linear combination of sketches [Figure: sketch + sketch = sketch]
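The three k-ary sketch operations can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `KarySketch`, the salted use of Python's built-in `hash` as a stand-in for the H hash functions, and the `combine` helper are all assumptions made for the example, and keys are assumed to be integers.

```python
import random
from statistics import median


class KarySketch:
    """H hash tables with K buckets each (illustrative stand-in)."""

    def __init__(self, H=5, K=4096, seed=1):
        rng = random.Random(seed)
        self.H, self.K = H, K
        # One salt per table stands in for H independent hash functions.
        self.salts = [rng.getrandbits(64) for _ in range(H)]
        self.tables = [[0.0] * K for _ in range(H)]
        self.total = 0.0  # sum of all updates, i.e. sum(S)

    def _bucket(self, j, key):
        return hash((self.salts[j], key)) % self.K

    def update(self, key, u):
        """UPDATE: add u to the bucket of `key` in every table."""
        for j in range(self.H):
            self.tables[j][self._bucket(j, key)] += u
        self.total += u

    def estimate(self, key):
        """ESTIMATE: median over tables of the bias-corrected bucket value."""
        K = self.K
        per_table = [
            (self.tables[j][self._bucket(j, key)] - self.total / K) / (1 - 1 / K)
            for j in range(self.H)
        ]
        return median(per_table)


def combine(a, s1, b, s2):
    """COMBINE: return a*s1 + b*s2; both sketches must share H, K and hashes."""
    out = KarySketch(s1.H, s1.K)
    out.salts = list(s1.salts)
    out.tables = [
        [a * x + b * y for x, y in zip(row1, row2)]
        for row1, row2 in zip(s1.tables, s2.tables)
    ]
    out.total = a * s1.total + b * s2.total
    return out


observed, forecast = KarySketch(), KarySketch()
observed.update(0x81693817, 100)   # key is an IPv4 address as an integer
forecast.update(0x81693817, 40)
change = combine(1, observed, -1, forecast)
print(change.estimate(0x81693817))
```

Because the sketch is linear, subtracting a forecast sketch from an observed sketch gives a sketch of the change, which is what heavy change detection queries.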
Reverse Sketch Problem
• Main problem
– Cannot efficiently report keys with heavy change: INFERENCE(S,t)
– Important function for anomaly detection!
• Our Contribution
– Determine set of keys that have “large” estimates in a sketch
Reversible sketch framework
[Figure: two-stage pipeline.
Streaming data recording: key → IP mangling → modular hashing → reversible k-ary sketch (value stored).
Heavy change detection: reversible k-ary sketch + change threshold → reverse hashing → reverse IP mangling → heavy change keys.]
Taking Intersections
• Intersect A1, A2, A3, A4, A5
H = 5, $K = 2^{12}$, #keys = $2^{32}$ (IP addresses)
E[false positives] << 1
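A rough back-of-the-envelope check of that bound, assuming (these assumptions are not stated on the slide) that the H hash functions are uniform and independent and that a single heavy bucket is intersected per table: a non-heavy key lands in the heavy bucket of all H tables with probability $K^{-H}$, so over $n$ keys

$$E[\text{false positives}] \approx n \cdot K^{-H} = 2^{32} \cdot \left(2^{12}\right)^{-5} = 2^{32-60} = 2^{-28} \ll 1.$$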
The problem with simple intersection
• Each set Ai can be very large!
H = 5, $K = 2^{12}$, #keys = $2^{32}$ (IP addresses)
$|A_1| = 2^{32} / 2^{12} = 2^{20}$
• Solution: Modular hashing
Modular hashing reduces the set size
[Figure: the 32-bit key 10010100 10101011 10010101 10100011 is split into four 8-bit words. Instead of one hash h() over the whole key producing a 12-bit index, each word is hashed separately by h1(), h2(), h3(), h4() to 3 bits (010, 110, 001, 101), and the outputs are concatenated into the 12-bit bucket index 010 110 001 101.]
Greatly reduces size of reverse mapped sets
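A minimal sketch of modular hashing for the 32-bit to 12-bit case shown above. The per-word hash functions here are random lookup tables, and they are deliberately balanced so that every 3-bit output has exactly $2^8 / 2^3 = 32$ preimages; both choices are assumptions made for the illustration, as are the function names.

```python
import random

WORDS, WORD_BITS, OUT_BITS = 4, 8, 3  # 32-bit key -> 4 words -> 4 x 3 = 12 bits


def make_word_hashes(seed=7):
    """One lookup table per word: 256 inputs -> 3-bit output, 32 inputs per output."""
    rng = random.Random(seed)
    tables = []
    for _ in range(WORDS):
        outputs = [v % (1 << OUT_BITS) for v in range(1 << WORD_BITS)]
        rng.shuffle(outputs)
        tables.append(outputs)
    return tables


def modular_hash(key, tables):
    """Hash each 8-bit word separately and concatenate the 3-bit results."""
    index = 0
    for w in range(WORDS):
        shift = WORD_BITS * (WORDS - 1 - w)
        word = (key >> shift) & 0xFF
        index = (index << OUT_BITS) | tables[w][word]
    return index


def reverse_word_sets(bucket, tables):
    """For a 12-bit bucket index, return the set of possible values of each word."""
    sets = []
    for w in range(WORDS):
        shift = OUT_BITS * (WORDS - 1 - w)
        out = (bucket >> shift) & ((1 << OUT_BITS) - 1)
        sets.append({v for v in range(1 << WORD_BITS) if tables[w][v] == out})
    return sets


tables = make_word_hashes()
key = 0b10010100_10101011_10010101_10100011  # the key from the slide
bucket = modular_hash(key, tables)
word_sets = reverse_word_sets(bucket, tables)
print([len(s) for s in word_sets])
```

The reverse-mapped set of a bucket is then stored as four small word sets rather than as one set of $2^{20}$ full keys.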
Modular hashing reduces the set size
[Figure: hash tables 1–5 with one heavy bucket each (b1 … b5) and the reverse-mapped sets of those buckets]
A1: $2^5 \times 2^5 \times 2^5 \times 2^5$
A2: $2^5 \times 2^5 \times 2^5 \times 2^5$
Intersection: only 32 elements per word set
Problem: Too many collisions
[Figure: neighboring addresses 129.105.56.23, 129.105.56.28, 129.105.56.109, 129.105.56.35, 129.105.56.98, ... share their first three words, so modular hashing (32 bits → 12 bits) maps all of them to buckets of the form 7 . 4 . 0 . *]
Solution: IP Mangling with GF (Galois Extension Field)
IP Mangling: a bijective mapping function for breaking the key space continuity
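To make the idea of a bijective mangling function concrete, here is a small sketch. It does not use the GF-based function named on the slide; it substitutes a simpler bijection, multiplication by an odd constant modulo $2^{32}$, which is invertible because every odd number has a multiplicative inverse modulo a power of two. The constant `A` and all function names are hypothetical.

```python
MASK32 = (1 << 32) - 1
A = 0x9E3779B1                # hypothetical odd multiplier (any odd value works)
A_INV = pow(A, -1, 1 << 32)   # modular inverse, requires Python 3.8+


def ip_to_int(dotted):
    a, b, c, d = (int(x) for x in dotted.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d


def mangle(ip):
    """Bijective map on 32-bit keys, applied before modular hashing."""
    return (ip * A) & MASK32


def unmangle(m):
    """Inverse map, applied to keys recovered by reverse hashing."""
    return (m * A_INV) & MASK32


ips = ["129.105.56.23", "129.105.56.28", "129.105.56.109"]
for s in ips:
    m = mangle(ip_to_int(s))
    # Neighboring addresses no longer share their three leading words.
    print(s, "->", f"{m:08x}", "->", f"{unmangle(m):08x}")
```

With this substitute the low-order bits of the output depend only on the low-order bits of the input, which is one reason a stronger function such as the GF-based one may be preferred when attack resilience matters.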
Handling Multiple Intersections…
[Figure: hash tables 1–5, each now with more than one heavy bucket (b1 … b5 appear twice)]
$2^H$ different intersections
Much more difficult
– Solution: Reverse Hashing algorithms
• Step 1: Reverse hashing for each module
• Step 2: Infer the whole key through bucket index matching among candidates from each module
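Reverse Hashing for Each Module
H=5, r=1, $K=2^{12}$ (r: tolerance level)
[Figure: hash tables 1–5 with their heavy buckets]
Take the first word as an example
$A_{12} = A_{12}^1 \times A_{12}^2 \times A_{12}^3 \times A_{12}^4$, with $|A_{ij}^w| = 32$
$G_i^1$: candidate set of the first word in Hash table i
$I_1 = \bigcap_i^r G_i^1$: all possible values of the first word in the sketch
$G_1^1 = A_{11}^1 \cup A_{12}^1 = \{2,3\} \cup \{2,5\}$
$G_2^1 = A_{21}^1 \cup A_{22}^1 = \{2,6\} \cup \{9,10\}$
$G_3^1 = A_{31}^1 \cup A_{32}^1 = \{0,3\} \cup \{2,3\}$
$G_4^1 = A_{41}^1 \cup A_{42}^1 = \{3,10\} \cup \{2,8\}$
$G_5^1 = A_{51}^1 \cup A_{52}^1 = \{3,7\} \cup \{6,9\}$
Sets intersected: {2,3,5}, {2,6,9,10}, {0,2,3}, {2,3,8,10}, {3,6,7,9}
Sets shown for the result: {2} and {2,3}. With r = 1, both 2 and 3 appear in at least H − r = 4 of the five sets, so $I_1 = \{2,3\}$.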
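The r-tolerant intersection used for each word can be sketched as follows, using the five first-word candidate sets from the slide. The function name `r_intersection` is an assumption made for the example; the rule it implements (keep a value if it appears in at least H − r of the H sets) is the tolerance rule described for r.

```python
from collections import Counter


def r_intersection(sets, r):
    """Values that appear in at least len(sets) - r of the given sets."""
    counts = Counter(v for s in sets for v in s)
    need = len(sets) - r
    return {v for v, c in counts.items() if c >= need}


# First-word candidate sets for hash tables 1..5, taken from the slide.
G = [{2, 3, 5}, {2, 6, 9, 10}, {0, 2, 3}, {2, 3, 8, 10}, {3, 6, 7, 9}]

print(r_intersection(G, 1))  # values present in at least 4 of the 5 sets
print(r_intersection(G, 0))  # strict intersection of all 5 sets
```

Tolerating r misses keeps a true heavy key whose bucket in one table happened not to cross the threshold, at the cost of admitting a few more candidates.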
Bucket Index Matrix of Candidates
H=5, r=1, $K=2^{12}$
For each x in $I_1$, we can get $B_1(x)$, a vector of the heavy bucket sets which x hashes to.
[Figure: hash tables 1–5 with heavy buckets b11, b12, b21, b22, b31, b32, b41, b42, b51, b52; example keys 192.168.0.1 and 192.123.47.62]
192.*.*.* hash to the red heavy buckets
$$B_1(192) = \begin{bmatrix} b_{11}, b_{12} \\ b_{21} \\ b_{32} \\ b_{41}, b_{42} \\ b_{51}, b_{52} \end{bmatrix}$$
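Prefix Extension Algorithm
[Figure: first-word candidates $I_1$ = {150, 47, 236} with bucket index vectors $B_1$ are combined with second-word candidates $I_2$ = {72, 104} with vectors $B_2$ by intersecting the vectors row by row, e.g. {1,2} ∩ {2,5}, {1} ∩ {1,5}, {1,4,9} ∩ {4,9}; * marks an empty intersection]
<150.72>: (2, 1, {4,9}, *, 3)
<47.72>: (*, 5, *, *, *), * more than r=1: Ignore!
<236.104>: (2, 2, 2, 1, 3)
Other combinations: Ignore!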
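Prefix Extension Algorithm
Path discovery algorithm
[Figure: the surviving prefixes are extended with third-word candidates $I_3$ (including 182, 32 and 49, with vectors $B_3$) and then with the fourth-word candidate $I_4$ = {75} (vector $B_4$), intersecting bucket index vectors row by row as before]
<150.72>: (2, 1, {4,9}, *, 3) + $I_3$, $B_3$ = <150.72.182>: (2, 1, 4, *, 3) and <150.72.32>: (2, 1, 9, *, 3)
<236.104>: (2, 2, 2, 1, 3) + $I_3$, $B_3$ = <236.104.49>: (2, 2, 2, 1, 3)
+ $I_4$, $B_4$ = <150.72.182.75>: (2, 1, 4, *, 3) and <236.104.49.75>: (2, 2, *, 1, 3)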
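One extension step of this procedure can be sketched as below. The data are illustrative: the vectors for 150 and 72 are chosen to reproduce the <150.72> result on the slide, while the vector for 47 is a made-up stand-in that yields four empty rows. The function name `extend` and the dictionary layout are assumptions for the example.

```python
def extend(prefixes, candidates, r):
    """Extend each prefix by each candidate word.

    prefixes:   dict mapping a tuple of words to its bucket index vector,
                a list of H sets (one per hash table).
    candidates: dict mapping a candidate word to its bucket index vector.
    A combination is kept only if at most r rows have an empty intersection.
    """
    out = {}
    for prefix, pvec in prefixes.items():
        for word, wvec in candidates.items():
            merged = [a & b for a, b in zip(pvec, wvec)]
            empties = sum(1 for s in merged if not s)
            if empties <= r:
                out[prefix + (word,)] = merged
    return out


# Illustrative bucket index vectors (H = 5 rows each).
B1 = {
    (150,): [{1, 2}, {1}, {1, 4, 9}, {5}, {1, 3, 6}],
    (47,): [{3}, {5}, {2}, {1}, {4}],          # hypothetical values
}
B2 = {72: [{2, 5}, {1, 5}, {4, 9}, {3}, {3, 7, 8}]}

for key, vec in extend(B1, B2, r=1).items():
    print(key, vec)
```

Applying `extend` once per remaining word, and pruning at every step, keeps the number of partial keys small compared with enumerating all combinations of per-word candidates.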
Recap:
[Figure: two-stage pipeline.
Streaming data recording: key → IP mangling → modular hashing → reversible k-ary sketch (value stored).
Heavy change detection: reversible k-ary sketch + change threshold → reverse hashing → reverse IP mangling → heavy change keys.]
Complexity terms annotated on the figure: $n^{1/\log\log n}$ and $\frac{\log n}{\log\log n}$
n is the size of key space
Evaluation
• Dataset
– A large US ISP (330M Netflow records)
– NU (19M Netflow records)
• Efficient data recording (for the worst case traffic, all 40-byte packets)
– Software: 526 Mbps on a P4 3.2 GHz PC
– Hardware: 16 Gbps on a single FPGA board
– Only a few hundred KB to a couple of MB of memory used
– Only 15 memory accesses per packet for 48-bit reversible sketches and 16 per packet for 64-bit reversible sketches
• Efficient heavy change detection and key inference
– 0.34 seconds for 100 changes, 13.33 seconds for 1000 changes
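For scale, the worst-case recording rates can be converted into packet rates. This is simple arithmetic on the figures above, assuming 40-byte packets (320 bits) with no framing overhead counted:

$$\frac{16 \times 10^{9}\ \text{bit/s}}{320\ \text{bit/packet}} = 5 \times 10^{7}\ \text{packets/s} \quad (\text{20 ns per packet})$$

$$\frac{526 \times 10^{6}\ \text{bit/s}}{320\ \text{bit/packet}} \approx 1.64 \times 10^{6}\ \text{packets/s} \quad (\approx 608\ \text{ns per packet})$$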
Key Inference Accuracy
• True positives and false positives of 16bit reversible sketches for 32bit IP addresses
[Figure: True Positive Percentage (y-axis, 88 to 100) vs. Number of heavy changes (x-axis, 50 to 2000); series: H=6, r=1; H=6, r=2; H=5, r=1; Deltoids]
[Figure: False Positive Percentage (y-axis, 0.2 to 1) vs. Number of heavy changes (x-axis, 50 to 2000); series: H=6, r=1; H=6, r=2; H=5, r=1; Deltoids]
[Deltoids]: S. Muthukrishnan and Graham Cormode, What's New: Finding Significant Differences in Network Data Streams. Infocom 2004
More Results
• Stress test with larger dataset still accurate
• Scalable to larger key space size: similar results for 64bit IP pairs
• Built anomaly/intrusion detection system to detect, e.g., SYN flooding and port scans [ICDCS 2006]
Conclusions
Proposed the first reversible sketches which
• Record high speed network streams online
• Detect the heavy changes and infer the keys online
• Small memory usage, small # of memory accesses per packet
• Scalable to large key space size
Backup Slides
Related work
• Compare with [deltoids]
– Better accuracy
– Better scalability to large key space
– Fewer memory accesses
• [PCF, IMC2004]: not reversible
• [Q. Zhao et al, IMC2005] [S.Venkataraman, NDSS2005]: unique fan-out (fan-in) estimation.
[Figure panels: Modular Hashing; Optimal Hashing]
However… Not reversible
Lack of an inference API: INFERENCE(S,t)
• Important function for anomaly detection!
• Decouple the recording stage of sketches from the detection stage to enable efficient combine and inference.
• Given a threshold t, report keys whose corresponding sum of updates is larger than the threshold.
Our contribution: an efficient algorithm for inference
Reversible sketch problem
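Problem: Too many collisions
[Figure: neighboring addresses 129.105.56.23, 129.105.56.28, 129.105.56.109, 129.105.56.35, 129.105.56.98, ... share their first three words, so modular hashing (32 bits → 12 bits) maps all of them to buckets of the form 7 . 4 . 0 . *]
Solution: IP-mangling
• Use GF (Galois Extension Field) function for attack resilience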
[Figure panels: Modular Hashing; Modular Hashing with IP Mangling; Optimal Hashing]
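Reverse Hashing for Each Module
H=5, r=1, $K=2^{12}$
[Figure: hash tables 1–5 with heavy buckets b11, b12, b21, b22, b31, b32, b41, b42, b51, b52]
Take the first word as an example
$A_{11} = A_{11}^1 \times A_{11}^2 \times A_{11}^3 \times A_{11}^4$
$A_{12} = A_{12}^1 \times A_{12}^2 \times A_{12}^3 \times A_{12}^4$, with $|A_{ij}^w| = 32$
$A_{ij}^1$: all possible values of the first word for the No. j heavy bucket in Hash table i
$G_i^1$: all possible values of the first word in Hash table i
$I_1$: all possible values of the first word in the sketch
$G_1^1 = A_{11}^1 \cup A_{12}^1 = \{*\}$
$G_2^1 = A_{21}^1 \cup A_{22}^1 = \{*\}$
$G_3^1 = A_{31}^1 \cup A_{32}^1 = \{*\}$
$G_4^1 = A_{41}^1 \cup A_{42}^1 = \{*\}$
$G_5^1 = A_{51}^1 \cup A_{52}^1 = \{*\}$
$$I_1 = \bigcap_i^r G_i^1 = \left\{ v \;\middle|\; v \in \bigcup_i G_i^1,\ v \text{ is mapped to heavy buckets in at least } (H-r) \text{ hash tables} \right\}$$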
False positive reduction by original sketch verifying
[Figure: candidate key <150.72.182.75> → Estimate on the original k-ary sketch → (<150.72.182.75>, 180) → compare with Threshold 150 → Verified final result (<150.72.182.75>, 180)]
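The verification step amounts to a filter over the inferred keys. In the sketch below the original k-ary sketch is replaced by a plain dictionary of estimates so the example stays self-contained; the second key's estimate of 90 is a made-up value, and `verify` is a hypothetical helper name.

```python
def verify(candidates, estimate, threshold):
    """Keep (key, value) pairs whose estimate exceeds the threshold.

    `estimate` is any callable returning the original sketch's estimate
    for a key; here it stands in for ESTIMATE on the original k-ary sketch.
    """
    results = []
    for key in candidates:
        value = estimate(key)
        if value > threshold:
            results.append((key, value))
    return results


# Stand-in for the original sketch: 180 comes from the slide, 90 is made up.
estimates = {"150.72.182.75": 180, "236.104.49.75": 90}

print(verify(["150.72.182.75", "236.104.49.75"], estimates.get, threshold=150))
```

Since the original sketch uses ordinary (non-modular) hashing, a candidate that only appears heavy because of modular-hash collisions is unlikely to pass this check.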
K-ary sketch [Krishnamurthy, Sen, Zhang, Chen, 2003]
• First to detect flow-level heavy changes in massive data streams at network traffic speeds
• APIs
– UPDATE(S,k,u): Tj[hj(k)] += u (for all j)
– ESTIMATE(S, k): sum of updates for key k
– Linear combination: S = COMBINE(α, S1, β, S2) [Figure: sketch + sketch = sketch]