1
Applying PCA for Traffic Anomaly Detection:
Problems and Solutions
Daniela Brauckhoff (ETH Zurich, CH)
Kave Salamatian (Lancaster University, FR)
Martin May (Thomson, CH)
IEEE INFOCOM (April, 2009)
2010/3/2
2
Agenda
• Before Introduction
• Objective
• A Signal Processing View on PCA
• Extension of PCA to Stochastic Processes
• Validation
• Conclusion
3
What is PCA?
• PCA
  – Principal Component Analysis
• PCA's usage
  – reduce the dimensionality of the feature space
  – e.g., a picture of size 1024 × 768
    • its feature dimension is its length × width
    • i.e., 786,432 feature values
    • use PCA to reduce this feature dimension
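The dimensionality reduction described above can be sketched in a few lines of NumPy. This is only an illustration on synthetic data, not code from the talk: the function name `pca_reduce`, the sample sizes, and the random data are all assumptions made for the example.

```python
import numpy as np

def pca_reduce(data, n_components):
    """Project the rows of `data` (samples x features) onto the top principal components."""
    mean = data.mean(axis=0)
    centered = data - mean
    # SVD of the centered data: the rows of vt are the principal directions.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    explained_variance = singular_values[:n_components] ** 2 / (len(data) - 1)
    return centered @ components.T, components, mean, explained_variance

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples with 50 features driven by only 3 latent factors.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 50))
samples = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

reduced, components, mean, variance = pca_reduce(samples, 3)
reconstructed = reduced @ components + mean
print(reduced.shape, float(np.abs(reconstructed - samples).max()))
```

Here 50 feature values per sample are replaced by 3 coordinates, and the reconstruction error stays at the level of the added noise because the synthetic data really has only three degrees of freedom.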
4
What is PCA? (cont.1)
Ref. site: http://blog.finalevil.com/2008/07/pca.html
5
What is PCA? (cont.2)
7
Problems and Solutions
• Take the temporal correlation of the data into account
• Extend PCA
  – replace it with the Karhunen-Loeve transform
9
Two different interpretations
1. As an efficient representation that transforms the data to a new coordinate system
  • The projection on the first coordinate contains the greatest variance
2. As a modeling technique
  • using a finite number of terms of an orthogonal series expansion of the signal with uncorrelated coefficients
10
Background
• Suppose that we have a column vector of correlated random variables: $X = (X_1, \ldots, X_K)^T \in \mathbb{R}^K$
  – Matrix => $X$
  – Each random variable has its own observation vector through N dependent realization vectors: $x^i = (x^i_1, \ldots, x^i_K)^T$
  – Note:
    • "Random variables" means the data you collected from the network
11
Background (cont.1)
• In order to find the characteristics of the above data collected from the network, $X = (X_1, \ldots, X_K)^T \in \mathbb{R}^K$
  – i.e., the most suitable basis: $(\phi_1, \ldots, \phi_K)$,
    • where $\phi_i$ is an eigenvector of the covariance matrix $\Sigma$ defined as $\Sigma = E\{(X - \mu)(X - \mu)^T\}$, estimated by $\hat{\Sigma} = \frac{1}{N-1}\, x x^T$
    • where $\mu$ is a column vector containing the means of $X_i$
12
Background (cont.2)
• The most suitable basis: $(\phi_1, \ldots, \phi_K)$
• How to find each $\phi_i$?
  – i.e., solve the following linear equation: $\Sigma \phi_i = \lambda_i \phi_i$, where $\Sigma = E\{(X - \mu)(X - \mu)^T\}$
  – Method: SVD (Singular Value Decomposition)
    • Note: basis change matrix $U = [\phi_1, \ldots, \phi_K]$
13
Background (cont.3)
• But $U = [\phi_1, \ldots, \phi_K]$ is a basis change matrix only when $X$ is zero mean
• Meanwhile, $X$ must be replaced by $\tilde{X} = X - \mu$
  – i.e., $\tilde{y} = U \tilde{x}$
  – not taking care of it could lead to large errors when using PCA
• Rewrite the initial vector of random variables: $X = \sum_{i=1}^{K} Y_i \phi_i$
  – $Y_i$ is the essential property!
  – i.e., suitable for PCA representation
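The background steps (estimate the covariance, obtain the eigenvector basis by SVD, remove the mean before changing basis) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code; the sizes `K` and `N`, the offset of 100, and all variable names are assumptions. The eigenvectors are stored as the columns of `U`, so the projection is written `U.T @ x_tilde` in this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
K, N = 4, 500
# Hypothetical correlated metrics with a large non-zero mean (rows = variables).
mixing = rng.normal(size=(K, K))
x = mixing @ rng.normal(size=(K, N)) + 100.0

mu = x.mean(axis=1, keepdims=True)
x_tilde = x - mu                          # centered observations
sigma_hat = x_tilde @ x_tilde.T / (N - 1)

# For a symmetric positive semi-definite matrix, the SVD returns the
# eigenvectors (columns of U) and the eigenvalues in decreasing order.
U, eigenvalues, _ = np.linalg.svd(sigma_hat)

y_tilde = U.T @ x_tilde                   # coordinates in the eigenvector basis
cov_y = y_tilde @ y_tilde.T / (N - 1)     # diagonal: the coefficients are uncorrelated

# Skipping the centering step: the leading direction mostly tracks the mean.
U_raw, _, _ = np.linalg.svd(x @ x.T / (N - 1))
mean_direction = (mu / np.linalg.norm(mu)).ravel()
alignment = abs(float(U_raw[:, 0] @ mean_direction))
print(alignment)
```

With centering, the covariance of the new coordinates is diagonal with the eigenvalues on the diagonal. Without it, the first "principal" direction is almost parallel to the mean vector, which illustrates why ignoring the zero-mean requirement can lead to large errors.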
15
Stochastic Process
• Extension of PCA to stochastic processes that have temporal as well as spatial correlations
• Assume we have a K-vector of zero mean stationary stochastic processes $X(t) = (X_1(t), \ldots, X_K(t))^T$
  – with a covariance function $\Sigma_{i,j}(\tau) = E\{X_i(t)\, X_j(t + \tau)\}$
16
Stochastic Process (cont.1)
• The multi-dimensional Karhunen-Loeve theorem states that one can rewrite this vector as a series expansion (named KL expansion): $X_l(t) = \sum_{i=1}^{K} \sum_{j=1}^{\infty} Y^l_{i,j}\, \phi_{i,j}(t)$
  – Compared (PCA): $X = \sum_{i=1}^{K} Y_i \phi_i$
17
Stochastic Process (cont.2)
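• How to get the basis functions $\phi_{i,j}(t)$?
  – Solve the linear integral equations: $\sum_{i=1}^{K} \int_a^b \Sigma_{i,l}(t - s)\, \phi_{i,j}(s)\, ds = \lambda_{l,j}\, \phi_{l,j}(t)$
  – Compared (PCA): $\Sigma \phi_i = \lambda_i \phi_i$
• Then the coefficients $Y^l_{i,j}$ of the KL expansion $X_l(t) = \sum_{i=1}^{K} \sum_{j=1}^{\infty} Y^l_{i,j}\, \phi_{i,j}(t)$ can be obtained by: $Y^l_{i,j} = \int_a^b X_l(s)\, \phi_{i,j}(s)\, ds$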
18
Stochastic Process (cont.3)
• The Galerkin method transforms the above integral equations into a matrix problem that can be solved by applying the SVD technique
• It is possible to derive the KL expansion using only a finite number of samples
  – Time-sampled version => $\phi_{i,j}[k] = \phi_{i,j}(kT)$
  – Finally, we obtain a discrete version of the KL expansion as: $X_l[k] = \sum_{i=1}^{K} \sum_{j=1}^{N} Y^l_{i,j}\, \phi_{i,j}[k]$
19
Stochastic Process (cont.4)
• Construct a KN × (n − N) observation matrix
• With KN eigenvectors:
  $X_l[k] = \sum_{i=1}^{K} \sum_{j=1}^{N} Y^l_{i,j}\, \phi_{i,j}[k]$
20
Stochastic Process (cont.5)
• Use $\hat{\Sigma} = \frac{1}{n - N - 1}\, x x^T$ to estimate all the needed spatio-temporal covariances
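One way to read the last two slides is as a stacking of N consecutive samples of all K metrics into each column of the observation matrix. The sketch below follows that reading under stated assumptions: the column layout, the explicit mean removal, the function names, and the synthetic input are all hypothetical, and the number of columns is taken as n − N as on the slide.

```python
import numpy as np

def lagged_observation_matrix(metrics, N):
    """Stack N consecutive samples of a K x n metric matrix into KN x (n - N) columns."""
    K, n = metrics.shape
    # Column k holds the samples k, k+1, ..., k+N-1 of all K metrics.
    columns = [metrics[:, k:k + N].T.reshape(-1) for k in range(n - N)]
    return np.array(columns).T

def spatio_temporal_basis(metrics, N):
    """Estimate the spatio-temporal covariance and its eigenvector basis."""
    x = lagged_observation_matrix(metrics, N)
    x = x - x.mean(axis=1, keepdims=True)        # enforce the zero-mean assumption
    sigma_hat = x @ x.T / (x.shape[1] - 1)       # 1 / (n - N - 1) * x x^T
    U, eigenvalues, _ = np.linalg.svd(sigma_hat) # KN eigenvectors as columns of U
    return x, sigma_hat, U, eigenvalues

rng = np.random.default_rng(2)
K, n, N = 28, 192, 3   # e.g. 28 metrics, two days of 96 intervals, temporal range 3
metrics = rng.normal(size=(K, n)).cumsum(axis=1)  # synthetic, temporally correlated
x, sigma_hat, U, eigenvalues = spatio_temporal_basis(metrics, N)
print(x.shape, sigma_hat.shape, U.shape)
```

With N = 1 each column is a single time sample, so the construction falls back to the purely spatial covariance used by standard PCA.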
22
Data Set and Metrics
• Three weeks of NetFlow data were collected
  – from one of the peering links of a medium-sized ISP (SWITCH, AS559)
• Recorded in August 2007
  – the data comprise a variety of traffic anomalies
  – happening in daily operation, such as network scans, denial of service attacks, alpha flows, etc.
23
Data Set and Metrics (cont.1)
• Computing the detection metrics:
  – distinguish between incoming and outgoing traffic, as well as UDP and TCP flows
  – for each of these four categories, compute seven commonly used traffic features:
    • byte count
    • packet count
    • flow count
    • source and destination IP address entropy
    • source and destination IP address counts
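The seven features for one traffic category and one time interval could be computed along these lines. This is a hedged sketch: the flow record layout (dicts with `src`, `dst`, `bytes`, `packets`), the function names, and the choice of Shannon entropy in bits are assumptions, since the slides do not specify them.

```python
import math
from collections import Counter

def sample_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def interval_features(flows):
    """Seven traffic features for one category and one interval."""
    src = [f["src"] for f in flows]
    dst = [f["dst"] for f in flows]
    return {
        "bytes": sum(f["bytes"] for f in flows),
        "packets": sum(f["packets"] for f in flows),
        "flows": len(flows),
        "src_entropy": sample_entropy(src),
        "dst_entropy": sample_entropy(dst),
        "src_count": len(set(src)),
        "dst_count": len(set(dst)),
    }

# Hypothetical flow records for a single interval.
flows = [
    {"src": "10.0.0.1", "dst": "10.0.1.1", "bytes": 1200, "packets": 3},
    {"src": "10.0.0.1", "dst": "10.0.1.2", "bytes": 400, "packets": 1},
    {"src": "10.0.0.2", "dst": "10.0.1.1", "bytes": 800, "packets": 2},
    {"src": "10.0.0.3", "dst": "10.0.1.1", "bytes": 64, "packets": 1},
]
features = interval_features(flows)
print(features)
```

Repeating this for the four categories (incoming/outgoing × UDP/TCP) gives 4 × 7 = 28 metrics per interval.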
24
Data Set and Metrics (cont.2)
• All metrics are obtained by aggregating the traffic in 15-minute intervals, resulting in a 28 × 96 matrix per measurement day
• Anomalies were identified by visual inspection
• This resulted in 28 detected anomalous events in UDP traffic and 73 in TCP traffic
25
Data Set and Metrics (cont.3)
• Use the vector of metrics containing the first two days of measurements for building the model
• Derive a spatio-temporal correlation matrix with the temporal correlation range set to N = 1, ..., 5
  – Note that setting N = 1 gives the standard PCA approach
  – apply SVD to the data, resulting in a basis change matrix
26
ROC curves
• The Receiver Operating Characteristic (ROC) curve combines the two parameters in one curve and captures this essential trade-off
  – false positive rate and true positive rate
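As an illustration of how such a curve is obtained, the sketch below sweeps a detection threshold over anomaly scores and records the false positive and true positive rates at each step. The scores and labels are made up for the example, and the function names are hypothetical; nothing here reproduces the results of the talk.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lowering the threshold flags more intervals as anomalous.
    for threshold in sorted(set(scores), reverse=True):
        flagged = [s >= threshold for s in scores]
        tp = sum(1 for f, y in zip(flagged, labels) if f and y)
        fp = sum(1 for f, y in zip(flagged, labels) if f and not y)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a ROC curve given as ordered (fpr, tpr) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical anomaly scores; label 1 marks an interval with a known anomaly.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
print(curve, area_under_curve(curve))
```

A detector whose curve lies closer to the top-left corner (larger area) achieves a better trade-off between the two rates.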
27
ROC curves (cont.1)
28
ROC curves (cont.2)
29
ROC curves (cont.3)
• Comparing the ROC curves shows a considerable improvement in anomaly detection performance when the KL expansion is used with N = 2, 3, consistently for UDP and TCP traffic, followed by a decrease for N ≥ 4
30
Effect of non-stationarity
• Stationarity issue:
  – for N ≥ 4 the performance decreases
  – when N increases, the model contains more parameters and becomes more sensitive to the stationarity of the traffic metrics
32
Conclusion
• Direct application of the PCA method results in poor performance in terms of ROC curves
• The correct framework is not the classical PCA but rather the Karhunen-Loeve expansion
• A Galerkin method is provided for developing a predictive model; an important improvement is attained when temporal correlation is considered
33
Q & A
Thank you!