
Introduction to Anomaly Detection

Chao Lan

Presented at the summer camp of RAMPE II: Cybersecurity and Internet of Things, University of Wyoming, 2018.

Outline

Background

Learning-based Detection Approaches

Evaluation Metrics

Challenges

Outline

Background
- what is anomaly detection and what are its applications?
- why do we need computers to help with anomaly detection?
- why do we want machine learning to help design detection rules?

Learning-based Detection Approaches

Evaluation Metrics

Challenges

What is anomaly detection?

“Anomaly detection refers to the problem of finding patterns

in data that do not conform to expected behavior.”

Chandola et al. Anomaly detection: A survey. ACM Computing Surveys, 2009.

http://www.svcl.ucsd.edu/projects/anomaly/

Network Anomaly Detection – Do We Know What to Detect? 2013

Fraud Prevention with Neo4j: A 5-Minute Overview, 2017

Fujitsu Develops Traffic-Video-Analysis Technology Based on Image Recognition and Machine Learning, 2016.

Early detection of at-risk students using machine learning based on LMS log data. 2017.

Exercise: how to teach a computer to detect spam?

An Example Spam Email on Google Lottery

Let me design & program some “rules” into the computer!

Rule 1: An email containing “lottery” is spam.

What about this warning email?

Rule 2: An email containing “million” is spam.

What about this UW email?

ML Solution: learn detection rules from example emails.

spam

normal

Quiz Q1: what are the applications of anomaly detection?

Q2: why do we need computers to help detect anomalies?

Q3: what’s wrong with handcrafted detection rules?

Quiz Q1: what are the applications of anomaly detection?

A1: surveillance, cyber-security, fraudulent transactions, health care, education, etc.

Q2: why do we need computers to help detect anomalies?

A2: the massive amount of data makes manual detection inefficient (or even impossible)

Q3: what’s wrong with handcrafted detection rules?

A3: they are hard to design (they require domain knowledge) and hard to generalize

Outline

Background

Learning-based Detection Approaches
- preliminary: data representation and visualization
- six common anomaly detection approaches

Evaluation Metrics

Challenges

Preliminary 1: Data Representation An example email is often represented by a vector (feature vector).

x = [google, lottery, cat, email, transport, panda, million, ...] = [1, 1, 0, 1, 0, 0, 1, ...]T

The example vector above is called the “bag-of-words” feature representation of a document.
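To make this concrete, here is a minimal Python sketch of building such a binary bag-of-words vector; the vocabulary and the email text below are made up for illustration:

vocabulary = ["google", "lottery", "cat", "email", "transport", "panda", "million"]

def bag_of_words(text, vocab):
    # 1 if the vocabulary word appears in the text, 0 otherwise (binary bag-of-words)
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocab]

email = "you won the google lottery email prize of one million dollars"
x = bag_of_words(email, vocabulary)
print(x)   # [1, 1, 0, 1, 0, 0, 1]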

Concepts: Feature, Label, Instance Each element in the vector is a feature/attribute.

x = [google, lottery, cat, email, transport, panda, million, ...] = [1, 1, 0, 1, 0, 0, 1, ...]T

The target variable we want to detect is the label (different tasks have different labels), e.g. spam vs. normal.

Concepts: Feature, Label, Instance

In summary, an example email (or, an instance) is a pair of feature vector & label.

x1 = [1, 1, 0, 1, 0, 0, 1, ...]T, spam        x2 = [1, 0, 1, 1, 0, 1, 0, ...]T, ham

This is the most common representation of an example. There are, of course, more complicated representations.


Other Examples of Feature Vector Representation Image data represented as a vector.

[Figure: the pixels of an image flattened into one long feature vector.]

Other Examples of Feature Vector Representation Student data represented as a vector.

x = [# Steal, # Lie/Cheat, # Behavior Pro, # Peer Rej, ...] = [0, 1, 2, 1, ...]T

We will repeatedly see example & label notations.

Preliminary 2: Data Visualization An example is a vector in a high dimensional space (feature space).

For easier interpretation, we often visualize examples in a 2D space.

x = [google, lottery, cat, email, transport, panda, million, ...] = [1, 1, 0, 1, 0, 0, 1, ...]T

[Figure: the examples plotted in a 2D space whose axes are feature 1 and feature 2.]

Two Common Strategies to Get 2D Space 1. Select two features from the pool (feature selection)

2. Project all features onto two new features (feature transformation)

[Figure: the full feature vector x = [1, 1, 0, 1, 0, 0, 1, ...]T is reduced to two coordinates, feature 1 and feature 2, for plotting.]

Feature Projection We can project all features onto a new feature using a projective vector w. The projection onto the new feature is obtained by the inner product between w and x:

wT * x = [0.3, -1.2, 0.8, 0.23] * [1, 1, 0, 1]T = 0.3 - 1.2 + 0 + 0.23 = -0.67   (the new feature value)

Feature Projection To get two new features, we need two projective vectors w1 and w2.

feature 1 = w1T * x

feature 2 = w2T * x

Get Projective Vectors using PCA Principal Component Analysis (PCA) is commonly used to get projective vectors.

https://qiita.com/bmj0114/items/db9145a707cb6ed13201

[Figure: scatter plot of examples with the two principal directions w1 and w2 drawn as arrows.]

We will repeatedly see data distribution in 2D feature space (by PCA).
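As a concrete illustration, here is a minimal sketch of getting the two projective vectors and the 2D coordinates with scikit-learn’s PCA; the data is random and only used to show the shapes involved:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((100, 7))            # 100 examples with 7 bag-of-words-style features

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)         # row i = (feature 1, feature 2) of example i (after centering)
w1, w2 = pca.components_            # the two projective vectors found by PCA

print(X_2d.shape)                   # (100, 2)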

Quiz Recap: to design a spam email detection model, we can define the label as

- y = 1 for spam, y = 0 for ham

Q1: to design a fraud transaction detection model, how to design label?

Q2: to design an at-risk student detection model, how to design label?

Quiz Recap: to design a spam email detection model, we can define the label as

- y = 1 for spam, y = 0 for ham

Q1: to design a fraud transaction detection model, how to design label?

A1: y = 1 for fraud, y = 0 for normal transaction

Q2: to design an at-risk student detection model, how to design label?

A2: y = 1 for at-risk student, y = 0 for normal student

Outline

Background

Learning-based Detection Approaches
- preliminary: data representation and visualization
- six common anomaly detection approaches

Evaluation Metrics

Open Challenges

Learning-based Anomaly Detection Approaches Classification-based

Clustering-based

Support Vector Data Descriptor (SVDD)

Statistics-based

Neighborhood-based

Spectral-based

1. Classification-based Approach Learn a detection model to classify emails into spam and ham (i.e. normal email).

[Figure: an email is fed into model f, which classifies it as spam or ham.]

How to learn model f ?

Step 1. Construct a model f with some unknown parameters.

Step 2. Estimate the parameters from data.

Example: learn a linear regression model Step 1. Construct a linear regression model f(x) = w0 + w1*x·1 + w2*x·2

- x·1 and x·2 are two features of example x (e.g. words “google” and “cat”)

- w0, w1, w2 are unknown parameters (w0 is called bias)

Example: learn a linear regression model Step 2. Estimate w0, w1, w2 from examples x1, x2, x3, …, xn by solving the least-squares problem: minimize Σi (f(xi) - yi)² over w0, w1, w2

- xi is the ith example (e.g. the ith email)

- yi is the label of xi, and yi= 0 (ham) or 1 (spam)

Example: learn a linear regression model The solution is w = (XTX)⁻¹XTy, where each row of X is an example [1, xi·1, xi·2] and y is the vector of their labels yi.

Example: apply model to classify email A new email x = [x·1, x·2] is first input to the model; the result f(x) is then thresholded by a proper value such as 0.5 (predict spam if f(x) > 0.5, ham otherwise).

Many models can directly output 0 and 1, so we do not need to threshold their outputs.

Recall: detection rule is

y=1 for spam, y=0 for ham.
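Putting the two steps and the threshold together, here is a minimal numpy sketch of this classification-based detector; the two features and the tiny training set are made up for illustration:

import numpy as np

# Toy training set: features x.1, x.2 (e.g. "lottery" and "million" present or not), label y.
X_feat = np.array([[1, 1], [1, 0], [0, 1], [0, 0]], dtype=float)
y = np.array([1, 1, 1, 0], dtype=float)            # 1 = spam, 0 = ham

# Step 1-2: add a column of 1s for the bias w0 and solve the least-squares problem.
X = np.hstack([np.ones((len(X_feat), 1)), X_feat])
w = np.linalg.lstsq(X, y, rcond=None)[0]           # w = [w0, w1, w2]

def classify(x_feat):
    f = w[0] + w[1] * x_feat[0] + w[2] * x_feat[1]  # f(x) = w0 + w1*x.1 + w2*x.2
    return 1 if f > 0.5 else 0                      # threshold at 0.5: 1 = spam, 0 = ham

print(classify([1, 0]), classify([0, 0]))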

Quiz If the linear regression model has

- w0=0.5, w1=−0.1, w2=0.1

Are the following emails spam or ham?

- x1 = [x·1,x·2]T = [1, 0]T

- x2 = [x·1,x·2]T = [0, 1]T

- x3 = [x·1,x·2]T = [1, 1]T

- x4 = [x·1,x·2]T = [0, 0]T


Quiz If the linear regression model has

- w0=0.5, w1=−0.1, w2=0.1

Are the following emails spam or ham?

- x1 = [x·1,x·2]T = [1, 0]T is ham, because f(x) = 0.5 - 0.1*1 + 0.1*0 = 0.4 < 0.5

- x2 = [x·1,x·2]T = [0, 1]T is spam, because f(x) = 0.5 - 0.1*0 + 0.1*1 = 0.6 > 0.5

- x3 = [x·1,x·2]T = [1, 1]T is ham, because f(x) = 0.5 - 0.1*1 + 0.1*1 = 0.5 ≤ 0.5

- x4 = [x·1,x·2]T = [0, 0]T is ham, because f(x) = 0.5 - 0.1*0 + 0.1*0 = 0.5 ≤ 0.5

Recall: detection rule is

y=1 for spam, y=0 for ham.


2. Clustering-based Approach Group examples into clusters. Assume those far from their cluster centers are more likely to be anomalies.

Detection based on Anomalous Score The algorithm outputs an anomalous score for each example, which indicates how likely the example is to be an anomaly. We can then threshold the scores to get the final detection.

[Figure: three example points with anomalous scores 0.8, 0.4 and 0.1.]

How to cluster examples? K-means is one of the most common clustering algorithms (a code sketch follows the demo below):

- choose the number of clusters k (e.g. k = 3)
- initialize the k cluster centers (randomly)
- repeat until convergence:
  - assign every example to its nearest cluster (nearest cluster center)
  - update each cluster center to the mean of its member examples

A Demo of K-means Clustering Algorithm
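Here is a minimal sketch of the clustering-based score (distance to the assigned cluster center) using scikit-learn’s KMeans; the synthetic data below is not from the demo:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),     # cluster 1
               rng.normal(6, 1, (50, 2)),     # cluster 2
               [[12.0, 12.0]]])               # one obvious outlier

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist_to_centers = km.transform(X)             # distance from every example to every cluster center
scores = dist_to_centers.min(axis=1)          # anomalous score = distance to the assigned center

print(np.argmax(scores))                      # index 100: the outlier gets the highest score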

Quiz (apply K-means Clustering with k=2) Which example has the highest anomalous score? Which has the lowest?

[Figure: a scatter of examples with three marked points x1, x2, x3.]

Quiz (apply K-means Clustering with k=2) A: first, the k-means clustering result roughly looks as follows.

[Figure: the examples grouped into two clusters, with cluster centers marked.]

Quiz (apply K-means Clustering with k=2) A: based on the clustering result, x1 is the farthest from its cluster center, so it has the highest anomalous score; x3 is the closest to its center, so it has the lowest score.


3. Support Vector Data Descriptor (SVDD) Learn a (smallest) normal region that encompasses all normal examples. Assume whatever falls outside the region is an anomaly.

Mathematical Model of One-Class SVM First, assume a sphere with center c and radius R encompasses all normal examples.

- the distance from each normal example xi to the center c is less than R: ||xi - c||² ≤ R²

Mathematical Model of One-Class SVM Then, find the center and the smallest radius of such a sphere:

minimize R² over the center c and radius R,   s.t. ||xi - c||² ≤ R² for all normal examples xi
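scikit-learn ships a closely related model, OneClassSVM (a kernelized one-class SVM rather than the plain sphere above); a minimal sketch on synthetic data:

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(0, 1, (200, 2))         # training data: normal examples only

model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_normal)

X_test = np.array([[0.1, -0.2],               # inside the normal region
                   [6.0, 6.0]])               # far outside it
print(model.predict(X_test))                  # +1 = normal, -1 = anomaly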

Quiz Suppose the plotted points are normal examples. Which of the marked examples A, B, C will be detected as anomalies by SVDD?

[Figure: a cluster of normal points; A lies inside the cluster, while B and C lie far outside it.]

Quiz A: the learned normal region roughly encloses the cluster of normal points. B and C fall outside the region, so they are detected as anomalies; A falls inside, so it is normal.


4. Statistics-based Approach Estimate a distribution over examples. Assume those drawn from the distribution with lower probability are more likely to be anomalies.

Exercise

Student # Attendance

John 3

Nancy 2

Sam 2

Richard 1

Lily 3

p(x=3) =

p(x=2) =

p(x=1) =

What are the probabilities that a student attends class 1, 2, or 3 times?


Exercise

Student # Attendance

John 3

Nancy 2

Sam 2

Richard 1

Lily 3

p(x=3) = 2 / 5 = 0.4

p(x=2) = 2 / 5 = 0.4

p(x=1) = 1 / 5 = 0.2

We can estimate these probabilities by counting frequencies. Richard is most likely to be an abnormal (at-risk) student because he attended class only once, and p(x=1) = 0.2 is much smaller than the other probabilities.
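A minimal Python sketch of this frequency-counting approach, using the attendance numbers from the table above:

from collections import Counter

attendance = {"John": 3, "Nancy": 2, "Sam": 2, "Richard": 1, "Lily": 3}

counts = Counter(attendance.values())
n = len(attendance)
prob = {value: c / n for value, c in counts.items()}    # p(x = value), estimated by counting

# Lower probability of a student's attendance value -> more likely at-risk.
most_anomalous = min(attendance, key=lambda s: prob[attendance[s]])
print(prob)             # {3: 0.4, 2: 0.4, 1: 0.2}
print(most_anomalous)   # Richard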

Quiz Which student is most likely at-risk according to statistics-based approach?

- let x be # peer rejection

Student John Lily Sam Nancy Green Susan Peter Rose Jack Lucy

x 0 1 0 2 1 0 3 1 2 0

Quiz Which student is most likely at-risk according to statistics-based approach?

- let x be # peer rejection

Student John Lily Sam Nancy Green Susan Peter Rose Jack Lucy

x 0 1 0 2 1 0 3 1 2 0

p(x=0) = 4/10 = 0.4

p(x=1) = 3/10 = 0.3

p(x=2) = 2/10 = 0.2

p(x=3) = 1/10 = 0.1 is the lowest probability; Peter has x = 3, so he is most likely at-risk


5. Neighborhood-based Approach Assume examples far from their neighbors are more likely to be anomalies.

Example: 2-nearest neighbor based approach Only consider the two nearest neighbors of each example.

[Figure: four points A, B, C, D; A, B and C are mutually 1 apart, and D is 2 away from A and C.]

Example: 2-nearest neighbor based approach The total distance from A to its two nearest neighbors (B, C) is 1 + 1 = 2. Similarly, the total distance from B to its neighbors (A, C) is 1 + 1 = 2, from C to its neighbors (A, B) is 1 + 1 = 2, and from D to its neighbors (A, C) is 2 + 2 = 4.

Example  Distance
A        2
B        2
C        2
D        4

D is more likely to be an anomaly because it has the largest total distance to its neighbors.
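A minimal sketch of this 2-nearest-neighbor score with scikit-learn’s NearestNeighbors, on four made-up 2D points (not the A/B/C/D figure):

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.9], [3.0, 3.0]])   # the last point is isolated

# Ask for 3 neighbors because the nearest neighbor of each training point is itself (distance 0).
nn = NearestNeighbors(n_neighbors=3).fit(X)
dist, _ = nn.kneighbors(X)
scores = dist[:, 1:].sum(axis=1)   # drop the self-distance, sum the two nearest-neighbor distances

print(scores)                      # the isolated point gets the largest total distance
print(np.argmax(scores))           # 3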

Quiz Which example is most likely an anomaly based on the 2-nearest neighbor approach?

[Figure: four points A, B, C, D with pairwise distances labeled 1, 0.5, 1, 1 and 1.5.]

Example  Distance
A        ?
B        ?
C        ?
D        ?

Quiz A: B is most likely an anomaly.

Example  Distance
A        2
B        2.5
C        1.5
D        1.5


6. Spectral-based Approach Assume normal examples lie in a low-dimensional feature space, so they can be well reconstructed from that space; anomalies cannot.

Example Project the original feature vector into a 2D space and reconstruct it.

x = [1, 1, 0, 1]T  →  2D projection [3.2, 0.2]T  →  reconstruction x̂ = [0.9, 1.1, 0.1, 0.9]T

Projection can be done by taking the inner product between the feature vector and a projective vector.

Example Reconstruction error can be used as an anomalous score.

x - x̂ = [1, 1, 0, 1]T - [0.9, 1.1, 0.1, 0.9]T = [0.1, -0.1, -0.1, 0.1]T

error = 0.1² + (-0.1)² + (-0.1)² + 0.1² = 0.04

Find Low-Dimensional Space using PCA Principal Component Analysis (PCA) is commonly used to get projective vectors.

https://qiita.com/bmj0114/items/db9145a707cb6ed13201

[Figure: scatter plot of examples with the two principal directions w1 and w2 drawn as arrows.]

Example Result of PCA-based Approach Abnormal network traffic flows have higher reconstruction errors.
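A minimal sketch of this PCA reconstruction-error scoring with scikit-learn, on synthetic data (not the network-traffic experiment above):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
basis = rng.normal(size=(2, 5))                         # normal data lies near a 2D subspace of 5D
X = rng.normal(size=(200, 2)) @ basis + 0.01 * rng.normal(size=(200, 5))
X = np.vstack([X, 3.0 * rng.normal(size=(1, 5))])       # one point off the subspace

pca = PCA(n_components=2).fit(X)
X_hat = pca.inverse_transform(pca.transform(X))         # project to 2D, then reconstruct
errors = ((X - X_hat) ** 2).sum(axis=1)                 # reconstruction error = anomalous score

print(np.argmax(errors))                                # almost surely 200, the off-subspace point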

Outline

Background

Learning-based Detection Approaches

Evaluation Metrics
- detection error
- f1-score and AUC score

Challenges

Detection Error Detection error of a model is the fraction of its mis-detected examples

- e.g. mis-detect a normal example as anomaly

- e.g. mis-detect an anomaly as normal

Example: if there are 100 testing examples, and 10 of them are mis-detected, the detection error is 10/100 = 0.1.

[Figure: a detection model receives 10 spam emails and 990 ham emails and classifies all of them as normal (ham).]

What is the detection error of this model?

Confusion Matrix

                        actual positive (spam)    actual negative (ham)
predicted positive      True Positive (TP)        False Positive (FP)
predicted negative      False Negative (FN)       True Negative (TN)

Precision, Recall, F1-Score

Precision: how many predicted positives are truly positive, Precision = TP / (TP + FP)

Recall: how many actual positives are predicted positive, Recall = TP / (TP + FN)

F1-Score: harmonic mean of precision and recall, F1 = 2 · Precision · Recall / (Precision + Recall)

Exercise What is the confusion matrix of the detection model (10 spam emails, 990 ham emails, every email predicted as ham)?

                        actual pos (spam)    actual neg (ham)
predicted pos (spam)    TP = ?               FP = ?
predicted neg (ham)     FN = ?               TN = ?

Answer:

                        actual pos (spam)    actual neg (ham)
predicted pos (spam)    TP = 0               FP = 0
predicted neg (ham)     FN = 10              TN = 990

Exercise What are the precision, recall and F1-score?

                        actual pos (spam)    actual neg (ham)
predicted pos (spam)    TP = 0               FP = 0
predicted neg (ham)     FN = 10              TN = 990

Precision = TP / (TP + FP) = 0 / 0, which is undefined (commonly reported as 0)

Recall = TP / (TP + FN) = 0 / (0 + 10) = 0

F1-Score = 2 · Precision · Recall / (Precision + Recall), which is also undefined (commonly reported as 0), even though the detection error is only 10 / 1000 = 0.01
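These numbers can also be computed with scikit-learn (a small sketch reproducing the 10-spam / 990-ham example; sklearn’s zero_division option makes the undefined precision and F1 come out as 0):

import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = np.array([1] * 10 + [0] * 990)   # 1 = spam (positive), 0 = ham (negative)
y_pred = np.zeros(1000, dtype=int)        # this model predicts every email as ham

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)                                      # 0 0 10 990
print(precision_score(y_true, y_pred, zero_division=0))    # 0.0 (the 0/0 case reported as 0)
print(recall_score(y_true, y_pred))                        # 0.0
print(f1_score(y_true, y_pred, zero_division=0))           # 0.0
print("detection error:", np.mean(y_true != y_pred))       # 0.01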

Detection by Thresholding Anomalous Score Many anomaly detection models output anomalous scores, and detection results are obtained by thresholding these scores.

Example    A. Score    Threshold 0.5    Detection Result (1 = anomaly, 0 = normal)
A          0.8         0.8 > 0.5        1
B          0.3         0.3 < 0.5        0
C          0.6         0.6 > 0.5        1
D          0.2         0.2 < 0.5        0

From the detection results we can then compute TP, FP, FN, TN and the F1 score.

Exercise What are the detection results based on the following thresholds?

Example    A. Score    Result (Threshold 0.5)    Result (Threshold 0.7)    Result (Threshold 0.25)
A          0.8         1                         ?                         ?
B          0.3         0                         ?                         ?
C          0.6         1                         ?                         ?
D          0.2         0                         ?                         ?

Exercise Different thresholds can give different detection results, and thus different TP & FP.

Example    A. Score    Result (Threshold 0.5)    Result (Threshold 0.7)    Result (Threshold 0.25)
A          0.8         1                         1                         1
B          0.3         0                         0                         1
C          0.6         1                         0                         1
D          0.2         0                         0                         0

ROC Curve The ROC curve of a model shows its performance (true positive rate vs. false positive rate) under different thresholds. Each point on the curve is the result of one threshold.

Area Under Curve (AUC) Score The AUC score is the area under the ROC curve. A good model has a higher AUC score.
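A minimal sketch of thresholding the A-D scores from the earlier table and computing the AUC with scikit-learn; the true labels used here are assumed for illustration (the slides do not give them):

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

scores = np.array([0.8, 0.3, 0.6, 0.2])   # anomalous scores of A, B, C, D
y_true = np.array([1, 0, 1, 0])           # assumed ground truth: A and C are true anomalies

for t in (0.5, 0.7, 0.25):
    print(t, (scores > t).astype(int))    # detection result at each threshold

fpr, tpr, thresholds = roc_curve(y_true, scores)   # the points of the ROC curve
print("AUC:", roc_auc_score(y_true, scores))       # 1.0 for this toy example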

Summary There are many metrics to evaluate the detection performance of a model.

Detection error is the most common but has many flaws.

The confusion matrix gives four numbers but is hard to compare across models.

The F1-score is a more robust measure but is based on a single threshold.

The AUC score is the most robust measure; it integrates results over many thresholds.

Outline

Background

Learning-based Detection Approaches

Evaluation Metrics

Challenges

Challenges in Anomaly Detection Contextual Anomaly Detection

Collective Anomaly Detection

Other Technical Challenges

Contextual Anomaly

Collective Anomaly



Exercise: Any Anomaly? A customer is shopping on Amazon

- object 1: steel ball bearings
- object 2: black powder/charcoal
- object 3: battery connectors
- ...

A customer who bought the above items together could be a bomb-maker!

Other Technical Challenges

Hard to find a normal region.

Attackers may disguise anomalies.

Normal behavior may evolve over time.

Notion of anomaly is problem-dependent.

Not enough labeled data (especially, anomalous data).

Q & A?