Introduction to Anomaly Detection
Chao Lan
Presented at the summer camp of RAMPE II: Cybersecurity and Internet of Things, University of Wyoming, 2018.
Outline
Background
Learning-based Detection Approaches
Evaluation Metrics
Challenges
Outline
Background
- what is anomaly detection and what are its applications?
- why do we need computers to help with anomaly detection?
- why do we want machine learning to help design detection rules?
Learning-based Detection Approaches
Evaluation Metrics
Challenges
What is anomaly detection?
“Anomaly detection refers to the problem of finding patterns
in data that do not conform to expected behavior.”
Chandola et al. Anomaly detection: A survey. ACM Computing Surveys, 2009.
http://www.svcl.ucsd.edu/projects/anomaly/
Network Anomaly Detection – Do We Know What to Detect? 2013
Fraud Prevention with Neo4j: A 5-Minute Overview, 2017
Fujitsu Develops Traffic-Video-Analysis Technology Based on Image Recognition and Machine Learning, 2016.
Early detection of at-risk students using machine learning based on LMS log data. 2017.
Exercise: how to teach a computer to detect spam?
An Example Spam Email on Google Lottery
Let me design & program some “rules” into the computer!
Rule 1: An email containing “lottery” is spam.
What about this warning email?
Rule 2: An email containing “million” is spam.
What about this UW email?
ML Solution: learn detection rules from example emails.
spam
normal
Quiz
Q1: what are the applications of anomaly detection?
Q2: why do we need computers to help detect anomalies?
Q3: what’s wrong with handcrafted detection rules?
Quiz
Q1: what are the applications of anomaly detection?
A1: surveillance, cyber-security, fraudulent transactions, health care, education, etc.
Q2: why do we need computers to help detect anomalies?
A2: massive amount of data makes manual detection inefficient (or, impossible)
Q3: what’s wrong with handcrafted detection rules?
A3: hard to design (requires domain knowledge) and hard to generalize
Outline
Background
Learning-based Detection Approaches
- preliminary: data representation and visualization
- six common anomaly detection approaches
Evaluation Metrics
Challenges
Preliminary 1: Data Representation
An example email is often represented by a vector (a feature vector).
x = (google, lottery, cat, email, transport, panda, million, ...) = (1, 1, 0, 1, 0, 0, 1, ...)
The vector above is called the “bag-of-words” feature representation of a document.
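As a minimal sketch of the idea, a binary bag-of-words vector can be built in a few lines. The seven-word vocabulary here only mirrors the slide’s example; a real system would learn a much larger vocabulary from training emails.

```python
import re

# A toy vocabulary matching the slide's example vector; a real system
# would learn a much larger vocabulary from training emails.
VOCAB = ["google", "lottery", "cat", "email", "transport", "panda", "million"]

def bag_of_words(text, vocab=VOCAB):
    """Binary bag-of-words: 1 if the word occurs in the text, else 0."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [1 if w in words else 0 for w in vocab]

x = bag_of_words("Google lottery email: you won a million dollars")
print(x)  # [1, 1, 0, 1, 0, 0, 1]
```

Counting occurrences (instead of the 0/1 indicator) is another common variant.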
Concepts: Feature, Label, Instance
Each element in the vector is a feature (or attribute).
x = (google, lottery, cat, email, transport, panda, million, ...) = (1, 1, 0, 1, 0, 0, 1, ...)
The target variable we want to detect is the label. (different tasks have different labels)
spam
normal
Concepts: Feature, Label, Instance
In summary, an example (or instance) is a pair of feature vector and label:
x1 = (1, 1, 0, 1, 0, 0, 1, ...), spam
x2 = (1, 0, 1, 1, 0, 1, 0, ...), ham
This is the most common representation of an example. There are, of course, more complicated representations.
Other Examples of Feature Vector Representation
Image data represented as a vector.
Other Examples of Feature Vector Representation
Student data represented as a vector:
x = (# Steal, # Lie/Cheat, # Behavior Pro, # Peer Rej, ...) = (0, 1, 2, 1, ...)
We will repeatedly see example & label notations.
Preliminary 2: Data Visualization
An example is a vector in a high-dimensional space (the feature space).
For easier interpretation, we often visualize examples in a 2D space.
x = (google, lottery, cat, email, transport, panda, million, ...) = (1, 1, 0, 1, 0, 0, 1, ...)
(visualized as a point in a plane with axes “feature 1” and “feature 2”)
Two Common Strategies to Get a 2D Space
1. Select two features from the pool (feature selection)
2. Project all features onto two new features (feature transformation)
Feature Projection
We can project all features onto a new feature using a projective vector w.
The projection onto the new feature is the inner product of w and x:
wT * x = (0.3, -1.2, 0.8, 0.23) * (1, 1, 0, 1)T = 0.3 - 1.2 + 0 + 0.23 = -0.67 (the new feature)
Feature Projection
To get two new features, we need two projective vectors w1 and w2.
feature 1 = w1T * x
feature 2 = w2T * x
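The inner-product projection is one line of code. The projective vector w is the slide’s; x is assumed to be the binary vector (1, 1, 0, 1), which reproduces the slide’s arithmetic.

```python
# Projective vector w from the slide; x assumed to be (1, 1, 0, 1),
# which reproduces the slide's arithmetic.
def project(w, x):
    """New feature value = inner product of w and x."""
    return sum(wi * xi for wi, xi in zip(w, x))

w = [0.3, -1.2, 0.8, 0.23]
x = [1, 1, 0, 1]
print(project(w, x))  # 0.3 - 1.2 + 0 + 0.23, i.e. about -0.67
```

Applying two different projective vectors w1 and w2 gives the two new features used for 2D plotting.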
Get Projective Vectors using PCA
Principal Component Analysis (PCA) is commonly used to get projective vectors (e.g. w1 and w2).
https://qiita.com/bmj0114/items/db9145a707cb6ed13201
We will repeatedly see data distribution in 2D feature space (by PCA).
Quiz
Recap: to design a spam email detection model, we can define the label as
- y = 1 for spam, y = 0 for ham
Q1: to design a fraud transaction detection model, how do we define the label?
A1: y = 1 for fraud, y = 0 for a normal transaction
Q2: to design an at-risk student detection model, how do we define the label?
A2: y = 1 for an at-risk student, y = 0 for a normal student
Outline
Background
Learning-based Detection Approaches
- preliminary: data representation and visualization
- six common anomaly detection approaches
Evaluation Metrics
Challenges
Learning-based Anomaly Detection Approaches
Classification-based
Clustering-based
Support Vector Data Descriptor (SVDD)
Statistics-based
Neighborhood-based
Spectral-based
1. Classification-based Approach
Learn a detection model f to classify emails into spam and ham (i.e. normal email).
How to learn the model f?
Step 1. Construct a model f with some unknown parameters.
Step 2. Estimate the parameters from data.
Example: learn a linear regression model
Step 1. Construct a linear regression model
f(x) = w0 + w1 * x·1 + w2 * x·2
- x·1 and x·2 are two features of example x (e.g. the words “google” and “cat”)
- w0, w1, w2 are unknown parameters (w0 is called the bias)
Example: learn a linear regression model
Step 2. Estimate w0, w1, w2 from examples x1, x2, x3, ..., xn by solving
minimize over w0, w1, w2: Σi (f(xi) - yi)²
- xi is the i-th example (e.g. the i-th email)
- yi is the label of xi, and yi = 0 (ham) or 1 (spam)
Example: learn a linear regression model
The solution is w = (XT X)^-1 XT y, where X stacks the examples as rows (with a leading 1 for the bias) and y stacks the labels.
Example: apply the model to classify an email
A new email x = [x·1, x·2] is first input to the model.
The result is then thresholded (by a proper value such as 0.5).
Many models can directly output 0 and 1, so we do not need to threshold their outputs.
Recall: detection rule is
y=1 for spam, y=0 for ham.
Quiz
If the model has
- w0=0.5, w1=−0.1, w2=0.1
Are the following emails spam or ham?
- x1 = [x·1,x·2]T = [1, 0]T
- x2 = [x·1,x·2]T = [0, 1]T
- x3 = [x·1,x·2]T = [1, 1]T
- x4 = [x·1,x·2]T = [0, 0]T
Quiz
If the model has
- w0=0.5, w1=−0.1, w2=0.1
Are the following emails spam or ham?
- x1 = [x·1,x·2]T = [1, 0]T is ham, because f(x) = 0.5 - 0.1*1 + 0.1*0 = 0.4 < 0.5
- x2 = [x·1,x·2]T = [0, 1]T is spam, because f(x) = 0.5 - 0.1*0 + 0.1*1 = 0.6 > 0.5
- x3 = [x·1,x·2]T = [1, 1]T is ham, because f(x) = 0.5 - 0.1*1 + 0.1*1 = 0.5 ≤ 0.5
- x4 = [x·1,x·2]T = [0, 0]T is ham, because f(x) = 0.5 - 0.1*0 + 0.1*0 = 0.5 ≤ 0.5
Recall: detection rule is
y=1 for spam, y=0 for ham.
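The quiz model and its 0.5 threshold can be evaluated directly in code. The weights are the quiz’s given values, not learned from data; rounding before the comparison sidesteps floating-point noise for inputs that land exactly on the threshold.

```python
# The quiz model: f(x) = w0 + w1*x1 + w2*x2 with the given weights,
# thresholded at 0.5. Rounding sidesteps floating-point noise at the boundary.
W0, W1, W2 = 0.5, -0.1, 0.1

def f(x):
    return W0 + W1 * x[0] + W2 * x[1]

def classify(x, threshold=0.5):
    """Detection rule: y = 1 (spam) only when f(x) exceeds the threshold."""
    return "spam" if round(f(x), 9) > threshold else "ham"

for x in ([1, 0], [0, 1], [1, 1], [0, 0]):
    print(x, classify(x))  # ham, spam, ham, ham
```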
2. Clustering-based Approach
Group examples into clusters. Assume those far from their cluster centers are more likely to be anomalies.
Detection based on Anomalous Score
The algorithm outputs an anomalous score for each example, indicating how likely the example is an anomaly. We can then threshold the scores to get the final detection.
a.s. = 0.8
a.s. = 0.4
a.s. = 0.1
How to cluster examples?
K-means is the most common clustering algorithm.
- choose a number of clusters k (e.g. k=3)
- initialize k cluster centers (randomly)
- repeat until convergence
- assign every example to its nearest cluster (nearest to cluster center)
- update each cluster center to the mean of its member examples
A Demo of K-means Clustering Algorithm
Quiz (apply K-means clustering with k=2)
Which of x1, x2, x3 has the highest anomalous score? Which has the lowest?
Quiz (apply K-means clustering with k=2)
A: first, the k-means clustering result is roughly as shown.
Quiz (apply K-means clustering with k=2)
A: based on the clustering result, x1 is the farthest from its center, so it has the highest anomalous score; x3 is the closest to its center, so it has the lowest score.
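The steps above can be sketched in plain Python. The 2D data points are hypothetical (the stray point at index 6 plays the role of the anomaly); scoring is distance to the assigned cluster center, as in the slides.

```python
import math
import random

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: pick k initial centers, then alternate assign/update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # assign every example to its nearest cluster center
        assign = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        # update each cluster center to the mean of its member examples
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, assign

# Hypothetical 2D data: two tight groups, plus a stray point at index 6.
data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [6, 6]]
centers, assign = kmeans(data, k=2)
# Anomalous score = distance from an example to its own cluster center.
scores = [dist(p, centers[a]) for p, a in zip(data, assign)]
most_anomalous = max(range(len(data)), key=lambda i: scores[i])
print(most_anomalous)  # index 6: the stray point
```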
3. Support Vector Data Descriptor (SVDD)
Learn the (smallest) normal region that encompasses all normal examples. Assume whatever falls outside the region is an anomaly.
Mathematical Model of One-Class SVM
First, assume a sphere with center c and radius R encompasses all normal examples.
- the distance from every normal example to the region center is at most R: ‖xi - c‖ ≤ R
Mathematical Model of One-Class SVM
Then, find the center and the smallest radius of such a sphere:
minimize R² over c and R, subject to ‖xi - c‖² ≤ R² for all normal examples xi
Quiz
If the points shown are normal examples, which of A, B, and C will be detected as anomalies by SVDD?
Quiz
A: the normal region looks roughly as shown; B and C fall outside the region, so they are anomalies.
4. Statistics-based Approach
Estimate a distribution over examples. Assume those drawn from the distribution with lower probability are more likely to be anomalies.
Exercise
Student # Attendance
John 3
Nancy 2
Sam 2
Richard 1
Lily 3
What are the probabilities that a student attends class 1, 2, or 3 times?
We can estimate probabilities by counting frequencies:
p(x=3) = 2 / 5 = 0.4
p(x=2) = 2 / 5 = 0.4
p(x=1) = 1 / 5 = 0.2
Richard is most likely to be an abnormal (at-risk) student: he attended class only once, and p(x=1) = 0.2 is much smaller than the other probabilities.
Quiz
Which student is most likely at-risk according to the statistics-based approach?
- let x be # peer rejections
Student John Lily Sam Nancy Green Susan Peter Rose Jack Lucy
x 0 1 0 2 1 0 3 1 2 0
A:
p(x=0) = 4/10 = 0.4
p(x=1) = 3/10 = 0.3
p(x=2) = 2/10 = 0.2
p(x=3) = 1/10 = 0.1, the lowest probability; Peter has x=3, so he is most likely at-risk
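The frequency counting in this quiz can be scripted directly; the names and counts are the quiz’s own table.

```python
from collections import Counter

# Peer-rejection counts from the quiz table.
students = ["John", "Lily", "Sam", "Nancy", "Green",
            "Susan", "Peter", "Rose", "Jack", "Lucy"]
x = [0, 1, 0, 2, 1, 0, 3, 1, 2, 0]

# Estimate p(x) by counting frequencies.
p = {value: count / len(x) for value, count in Counter(x).items()}
print(p)  # {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}

# The student whose x value has the lowest probability is most likely at-risk.
at_risk = min(zip(students, x), key=lambda pair: p[pair[1]])[0]
print(at_risk)  # Peter (x = 3, p(x=3) = 0.1)
```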
5. Neighborhood-based Approach
Assume examples far from their neighbors are more likely to be anomalies.
Example: 2-nearest neighbor based approach
Only consider the two nearest neighbors of each example. In the figure, A, B, and C are mutual neighbors at distance 1, and D is at distance 2 from its two nearest neighbors.
- Total distance from A to its two nearest neighbors (B, C): 1 + 1 = 2
- Total distance from B to its two nearest neighbors (A, C): 1 + 1 = 2
- Total distance from C to its two nearest neighbors (A, B): 1 + 1 = 2
- Total distance from D to its two nearest neighbors (A, C): 2 + 2 = 4
Example Distance
A 2
B 2
C 2
D 4
D is most likely to be an anomaly because it has the largest total distance to its neighbors.
Quiz
Which example is most likely an anomaly based on the 2-nearest neighbor approach? (pairwise distances in the figure: 1, 0.5, 1, 1, 1.5)
Quiz
A: B is most likely an anomaly, since it has the largest total distance to its two nearest neighbors.
Example Distance
A 2
B 2.5
C 1.5
D 1.5
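A small sketch of the 2-nearest-neighbor score; the coordinates below are hypothetical and only mimic the earlier figure’s layout (A, B, C close together, D far away).

```python
import math

def knn_anomaly_scores(points, k=2):
    """Anomalous score of each point = total distance to its k nearest neighbors."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    scores = []
    for i, p in enumerate(points):
        neighbor_dists = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(neighbor_dists[:k]))
    return scores

# Hypothetical coordinates: A, B, C close together, D far away.
points = {"A": [0, 0], "B": [1, 0], "C": [0.5, 0.87], "D": [2.3, 1.5]}
scores = dict(zip(points, knn_anomaly_scores(list(points.values()))))
print(max(scores, key=scores.get))  # D: largest total distance, most likely anomaly
```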
6. Spectral-based Approach
Assume normal examples lie in a low-dimensional feature space, so they can be well reconstructed from that space; anomalies cannot.
Example
Project the original feature vector into a 2D space and reconstruct it:
x = (1, 1, 0, 1) → 2D projection (3.2, 0.2) → reconstruction x̂ = (0.9, 1.1, 0.1, 0.9)
Projection can be done by taking the inner product between the feature vector and a projective vector.
Example
The reconstruction error can be used as an anomalous score:
x - x̂ = (1, 1, 0, 1) - (0.9, 1.1, 0.1, 0.9) = (0.1, -0.1, -0.1, 0.1)
error = 0.1² + (-0.1)² + (-0.1)² + 0.1² = 0.04
Find the Low-Dimensional Space using PCA
Principal Component Analysis (PCA) is commonly used to get the projective vectors.
Example Result of PCA-based Approach
Abnormal network traffic flows have higher reconstruction errors.
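The error computation from the example slide, as code; x and x̂ are the slide’s vectors.

```python
# The slide's example: x = (1, 1, 0, 1) reconstructed as xhat = (0.9, 1.1, 0.1, 0.9).
def reconstruction_error(x, xhat):
    """Squared reconstruction error, usable as an anomalous score."""
    return sum((a - b) ** 2 for a, b in zip(x, xhat))

x = [1, 1, 0, 1]
xhat = [0.9, 1.1, 0.1, 0.9]
print(reconstruction_error(x, xhat))  # 0.1^2 + (-0.1)^2 + (-0.1)^2 + 0.1^2, about 0.04
```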
Outline
Background
Learning-based Detection Approaches
Evaluation Metrics
- detection error
- f1-score and AUC score
Challenges
Detection Error
The detection error of a model is the fraction of mis-detected examples:
- e.g. mis-detecting a normal example as an anomaly
- e.g. mis-detecting an anomaly as normal
Example: if there are 100 testing examples and 10 of them are mis-detected, the detection error is 10/100 = 0.1.
Exercise
A test set has 10 spam emails and 990 ham emails, and the model detects every email as normal (ham).
What is the detection error of this model?
Confusion Matrix
                      actual positive (spam)   actual negative (ham)
predicted positive    True Positive (TP)       False Positive (FP)
predicted negative    False Negative (FN)      True Negative (TN)
Precision, Recall, F1-Score
Precision: how many predicted positives are truly positive, TP / (TP + FP)
Recall: how many actual positives are predicted positive, TP / (TP + FN)
F1-Score: the harmonic mean of precision and recall, 2 * Precision * Recall / (Precision + Recall)
Exercise
What is the confusion matrix of the detection model? (10 spam emails, 990 ham emails, all detected as ham)

                      actual pos (spam)   actual neg (ham)
predicted pos (spam)  TP = 0              FP = 0
predicted neg (ham)   FN = 10             TN = 990
Exercise
What are the precision, recall and f1-score?
Precision = TP / (TP + FP) = 0 / 0, undefined: the model never predicts spam
Recall = TP / (TP + FN) = 0 / (0 + 10) = 0
F1-Score = undefined (commonly reported as 0), even though the detection error 10/1000 = 0.01 looks small
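A small helper that computes the three metrics from this confusion matrix. Reporting 0 for the undefined 0/0 cases is one common convention, not the only one.

```python
# Metrics for the exercise's confusion matrix (TP=0, FP=0, FN=10, TN=990).
# Precision is 0/0 here; reporting 0 in that case is one common convention.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1(tp=0, fp=0, fn=10))  # (0.0, 0.0, 0.0)
```

Note the contrast: the detection error 10/1000 = 0.01 looks excellent, while all three metrics are 0.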
Detection by Thresholding the Anomalous Score
Many anomaly detection models output anomalous scores; detection results are obtained by thresholding these scores.
Detection results with threshold 0.5 (1 = anomaly, 0 = normal):
Example   A. Score   Compare     Detection Result
A         0.8        0.8 > 0.5   1
B         0.3        0.3 < 0.5   0
C         0.6        0.6 > 0.5   1
D         0.2        0.2 < 0.5   0
Exercise
What are the detection results under the following thresholds? Different thresholds give different detection results, and thus different TP & FP.
Example   A. Score   Result (Threshold 0.5)   Result (Threshold 0.7)   Result (Threshold 0.25)
A         0.8        1                        1                        1
B         0.3        0                        0                        1
C         0.6        1                        0                        1
D         0.2        0                        0                        0
ROC Curve
The ROC curve of a model shows its performance under different thresholds; each point is the result of one threshold.
Area Under Curve (AUC) Score
The AUC score is the area under the ROC curve. A good model has a higher AUC score.
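A sketch of AUC computed as the fraction of (anomaly, normal) pairs where the anomaly gets the higher score (ties count half), which equals the area under the ROC curve. The scores are the slides’ A-D examples; the ground-truth labels are assumed here for illustration.

```python
# AUC = probability that a random anomaly is scored above a random normal
# example (ties count half). Labels below are assumed for illustration.
def auc_score(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.8, 0.3, 0.6, 0.2]  # A, B, C, D from the slides
labels = [1, 0, 1, 0]          # assumed: A and C are true anomalies
print(auc_score(scores, labels))  # 1.0: every anomaly outranks every normal
```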
Summary
There are many metrics to evaluate the detection performance of a model.
- Detection error is the most common but has many flaws.
- The confusion matrix gives four numbers but is hard to compare across models.
- F1-score is a more robust measure but is based on a single threshold.
- AUC score is the most robust measure; it integrates results over many thresholds.
Outline
Background
Learning-based Detection Approaches
Evaluation Metrics
Challenges
Challenges in Anomaly Detection
Contextual Anomaly Detection
Collective Anomaly Detection
Other Technical Challenges
Contextual Anomaly
Collective Anomaly
Exercise: Any Anomaly?
A customer is shopping on Amazon
- object 1: steel ball bearings
- object 2: black powder/charcoal
- object 3: battery connectors
- …
A customer who buys the above items together could be a bomb-maker!
Other Technical Challenges
Hard to find a normal region.
Attackers may disguise anomalies.
Normal behavior may evolve over time.
Notion of anomaly is problem-dependent.
Not enough labeled data (especially, anomalous data).
Q & A?