12/9/08, Sandia National Labs

Anomaly Detection Using Data Mining Techniques

Margaret H. Dunham, Yu Meng, Donya Quick, Jie Huang, Charlie Isaksson

CSE Department

Southern Methodist University

Dallas, Texas 75275

[email protected]

This material is based upon work supported by the National Science Foundation under Grant No. IIS-0208741

Objectives/Outline

Develop modeling techniques which can “learn/forget” past behavior of spatiotemporal stream events. Apply to prediction of anomalous events.

• Introduction
• EMM Overview
• EMM Applications to Anomaly Detection
• EMM Applications to Nuclear Testing
• Future Work

Outline

• Introduction
  • Motivation
  • What is an anomaly?
  • Spatiotemporal Data
  • Modeling Spatiotemporal Data
• EMM Overview
• EMM Applications to Anomaly Detection
• EMM Applications to Nuclear Testing
• Future Work

Motivation

A growing number of applications generate streams of data.

• Computer network monitoring data
• Call detail records in telecommunications
• Highway transportation traffic data
• Online web purchase log records
• Sensor network data
• Stock exchange, transactions in retail chains, ATM operations in banks, credit card transactions.

Data mining techniques play a key role in modeling and analyzing this data.

What is an Anomaly?

• Event that is unusual
• Event that doesn’t occur frequently
• Predefined event
• What is unusual?
• What is deviation?

What is an Anomaly in Stream Data?

• Rare, anomalous, surprising
• Out of the ordinary
• Not outlier detection
  • No knowledge of data distribution
  • Data is not static
  • Must take temporal and spatial values into account
  • May be interested in sequence of events
• Ex: Snow in upstate New York is not an anomaly; snow in upstate New York in June is rare
• Rare events may change over time

Statistical View of Anomaly

• Outlier: data item that is outside the normal distribution of the data
• Identify by box plot

Image from Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002.
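
A box plot flags outliers with a fixed fence around the quartiles. The sketch below assumes the common 1.5 × IQR whisker convention, which the slide does not state; `boxplot_outliers` and the sample values are illustrative only.

```python
import statistics


def boxplot_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical static sample: one value is far from the rest.
data = [10, 12, 11, 13, 12, 11, 14, 12, 95]
print(boxplot_outliers(data))
```

The fence is computed from the whole sample at once, so the approach presumes a static data set whose distribution can be summarized up front.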

Statistical View of Anomaly

Image from www.wikipedia.org, Normal distribution.

Identify by looking at distribution

THIS DOES NOT WORK with stream data

Data Mining View of Anomaly

• Classification problem
  • Build classifier from training data
  • Problem is that training data shows what is NOT an anomaly
  • Thus an anomaly is anything that is not viewed as normal by the classification technique
  • MUST build dynamic classifier
• Identify anomalous behavior
  • Signatures of what anomalous behavior looks like
  • Input data is identified as anomaly if it is similar enough to one of these signatures
• Mixed: classification and signature

Visualizing Anomalies

• Temporal Heat Map (THM) is a visualization technique for streaming data derived from multiple sensors.
• Two-dimensional structure similar to an infinite table.
• Each row of the table is associated with one sensor value.
• Each column of the table is associated with a point in time.
• Each cell within the THM is a color representation of the sensor value.
• Colors normalized (in our examples):
  • 0: White
  • 0.5: Blue
  • 1.0: Red
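
One way to produce the cell colors is to min-max normalize each sensor's values to [0, 1] and interpolate between the three anchor colors. Both the normalization and the linear interpolation are assumptions; the slide gives only the anchors (0 white, 0.5 blue, 1.0 red), and the function names are hypothetical.

```python
WHITE, BLUE, RED = (255, 255, 255), (0, 0, 255), (255, 0, 0)


def normalize(values):
    """Min-max scale one sensor's readings to [0, 1] (assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def thm_color(x):
    """Map a normalized value to an RGB triple: white -> blue -> red."""
    x = min(1.0, max(0.0, x))
    if x <= 0.5:
        a, b, t = WHITE, BLUE, x / 0.5
    else:
        a, b, t = BLUE, RED, (x - 0.5) / 0.5
    return tuple(round(a[i] + (b[i] - a[i]) * t) for i in range(3))


# One THM row: a sensor's readings over successive time points.
row = [thm_color(x) for x in normalize([2, 4, 6])]
print(row)
```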

THM of VoIP Data

VoIP traffic data was provided by Cisco Systems and represents logged VoIP traffic in their Richardson, Texas facility from Mon Sep 22 12:17:32 2003 to Mon Nov 17 11:29:11 2003.

Spatiotemporal Stream Data

• Records may arrive at a rapid rate
• High volume (possibly infinite) of continuous data
• Concept drifts: data distribution changes on the fly
• Data does not necessarily fit any distribution pattern
• Multidimensional
• Temporal
• Spatial
• Data are collected in discrete time intervals
• Data are in structured format, <a1, a2, …>
• Data hold an approximation of the Markov property

Spatiotemporal Environment

• Events arriving in a stream
• At any time, t, we can view the state of the problem as represented by a vector of n numeric values:

Vt = <S1t, S2t, ..., Snt>

      V1    V2    …    Vq
S1    S11   S12   …    S1q
S2    S21   S22   …    S2q
…     …     …     …    …
Sn    Sn1   Sn2   …    Snq

(Time runs across the columns V1 … Vq.)

Data Stream Modeling

• Single pass: each record is examined at most once
• Bounded storage: limited memory for storing synopsis
• Real-time: per-record processing time must be low
• Summarization (synopsis) of data
• Use data, NOT SAMPLE
• Temporal and spatial
• Dynamic
• Continuous (infinite stream)
• Learn
• Forget
• Sublinear growth rate: clustering

MM

A first order Markov Chain is a finite or countably infinite sequence of events {E1, E2, …} over discrete time points, where Pij = P(Ej | Ei), and at any time the future behavior of the process is based solely on the current state.

A Markov Model (MM) is a graph with m vertices or states, S, and directed arcs, A, such that:

• S = {N1, N2, …, Nm}, and
• A = {Lij | i ∈ 1, 2, …, m, j ∈ 1, 2, …, m}, and
• Each arc, Lij = <Ni, Nj>, is labeled with a transition probability Pij = P(Nj | Ni).
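
The transition probabilities Pij can be estimated from an observed state sequence by counting. A minimal sketch, assuming maximum-likelihood estimates (count of transitions from Ni to Nj divided by the number of departures from Ni); the function name and the sample sequence are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(states):
    """Estimate P(next | current) from one observed sequence of states."""
    counts = defaultdict(Counter)
    # Each consecutive pair (current, next) is one observed arc traversal.
    for cur, nxt in zip(states, states[1:]):
        counts[cur][nxt] += 1
    probs = {}
    for cur, nxts in counts.items():
        total = sum(nxts.values())
        probs[cur] = {nxt: c / total for nxt, c in nxts.items()}
    return probs


seq = ["N1", "N2", "N1", "N3", "N1", "N2"]
P = transition_probabilities(seq)
print(P["N1"])
```

Because only the current state is used as the conditioning key, the sketch is first order: no earlier history influences the estimate.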

Problem with Markov Chains

• The required structure of the MC may not be certain at the model construction time.
• As the real world being modeled by the MC changes, so should the structure of the MC.
• Not scalable: grows linearly with the number of events.
• Our solution:
  • Extensible Markov Model (EMM)
  • Cluster real world events
  • Allow Markov chain to grow and shrink dynamically

Outline

• Introduction
• EMM Overview
• EMM Applications to Anomaly Detection
• EMM Applications to Nuclear Testing
• Future Work

Extensible Markov Model (EMM)

• Time Varying Discrete First Order Markov Model
• Nodes are clusters of real world states.
• Learning continues during application phase.
• Learning:
  • Transition probabilities between nodes
  • Node labels (centroid/medoid of cluster)
  • Nodes are added and removed as data arrives

Related Work

• Splitting Nodes in HMMs
  • Create new states by splitting an existing state
  • M.J. Black and Y. Yacoob, “Recognizing facial expressions in image sequences using local parameterized models of image motion”, Int. Journal of Computer Vision, 25(1), 1997, 23-48.
• Dynamic Markov Modeling
  • States and transitions are cloned
  • G. V. Cormack, R. N. S. Horspool. “Data compression using dynamic Markov Modeling,” The Computer Journal, Vol. 30, No. 6, 1987.
• Augmented Markov Model (AMM)
  • Creates new states if the input data has never been seen in the model, and transition probabilities are adjusted
  • Dani Goldberg, Maja J Mataric. “Coordinating mobile robot group behavior using a model of interaction dynamics,” Proceedings, the Third International Conference on Autonomous Agents (Agents ’99), Seattle, Washington

EMM vs AMM

• Our proposed EMM model is similar to AMM, but is more flexible:
  • EMM continues to learn during the application phase.
  • State matching is determined using a clustering technique.
  • EMM not only allows the creation of new nodes, but deletion (or merging) of existing nodes. This allows the EMM model to “forget” old information which may not be relevant in the future. It also allows the EMM to adapt to any main memory constraints for large scale datasets.
  • EMM performs one scan of data and therefore is suitable for online data processing.

EMM

Extensible Markov Model (EMM): at any time t, EMM consists of an MM and algorithms to modify it, where algorithms include:

EMMSim, which defines a technique for matching between input data at time t + 1 and existing states in the MM at time t.

EMMIncrement algorithm, which updates MM at time t + 1 given the MM at time t and classification measure result at time t + 1.

Additional algorithms may be added to modify the model or for applications.

EMMSim

• Find closest node to incoming event.
• If none “close”, create new node
• Labeling of cluster is centroid/medoid of members in cluster
• Problem
  • Nearest Neighbor: O(n)
  • BIRCH: O(lg n)
    • Requires second phase to recluster initial
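
The nearest-neighbor variant of EMMSim can be sketched as a linear scan over the node labels. This assumes Euclidean distance and a fixed distance threshold for "close"; `emm_sim` and its parameters are illustrative names, not the authors' code.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def emm_sim(centroids, x, threshold):
    """Return the index of the closest node label, or None if none is close.

    The scan visits every node once, hence O(n) in the number of nodes.
    """
    best_idx, best_dist = None, math.inf
    for idx, c in enumerate(centroids):
        d = euclidean(c, x)
        if d < best_dist:
            best_idx, best_dist = idx, d
    return best_idx if best_dist <= threshold else None


# Hypothetical node labels and inputs.
centroids = [(18, 10, 3, 3, 1, 0, 0), (14, 8, 2, 3, 1, 0, 0)]
print(emm_sim(centroids, (17, 10, 2, 3, 1, 0, 0), threshold=2.0))
print(emm_sim(centroids, (30, 30, 9, 9, 9, 9, 9), threshold=2.0))
```

A result of `None` is the signal to create a new node for the input.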

EMMIncrement

Input vectors:

<18,10,3,3,1,0,0>
<17,10,2,3,1,0,0>
<16,9,2,3,1,0,0>
<14,8,2,3,1,0,0>
<14,8,2,3,0,0,0>
<18,10,3,3,1,1,0>

[Figure: EMM graphs over nodes N1, N2, and N3, with arcs labeled by transition probabilities (1/1, 1/2, 1/3, 2/3).]
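
A minimal sketch of one way EMMIncrement could work: match the input to a node (or create one), update the node's label and count, then count the transition from the previous node. Euclidean nearest-neighbor matching, the running-mean centroid, the threshold of 1.5, and the arrival order of the vectors are all assumptions, so the result is not meant to reproduce the figure.

```python
import math
from collections import defaultdict


class EMM:
    """Toy Extensible Markov Model: nodes are clusters, arcs carry counts."""

    def __init__(self, threshold):
        self.threshold = threshold      # max distance still counted as close
        self.centroids = []             # node labels (running mean of members)
        self.cn = []                    # CN_i: inputs assigned to node i
        self.cl = defaultdict(int)      # CL_ij: observed transitions i -> j
        self.current = None             # index of the current node

    def _closest(self, x):
        """Index of the nearest node within the threshold, else None."""
        best, best_d = None, math.inf
        for i, c in enumerate(self.centroids):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(c, x)))
            if d < best_d:
                best, best_d = i, d
        return best if best_d <= self.threshold else None

    def increment(self, x):
        """Absorb one input vector and return the node it was assigned to."""
        i = self._closest(x)
        if i is None:                   # nothing close: create a new node
            self.centroids.append([float(v) for v in x])
            self.cn.append(0)
            i = len(self.centroids) - 1
        self.cn[i] += 1
        c = self.centroids[i]
        for k, v in enumerate(x):       # move the centroid toward the input
            c[k] += (v - c[k]) / self.cn[i]
        if self.current is not None:    # count the arc previous -> i
            self.cl[(self.current, i)] += 1
        self.current = i
        return i

    def transition_probability(self, i, j):
        """P(j | i) = CL_ij divided by all transitions leaving i."""
        leaving = sum(n for (a, _), n in self.cl.items() if a == i)
        return self.cl.get((i, j), 0) / leaving if leaving else 0.0


emm = EMM(threshold=1.5)
for v in [(18, 10, 3, 3, 1, 0, 0), (17, 10, 2, 3, 1, 0, 0),
          (16, 9, 2, 3, 1, 0, 0), (14, 8, 2, 3, 1, 0, 0),
          (14, 8, 2, 3, 0, 0, 0), (18, 10, 3, 3, 1, 1, 0)]:
    emm.increment(v)
print(len(emm.centroids), emm.cn)
```

Each input is touched once and only counts and centroids are stored, which matches the single-pass, bounded-storage requirements listed for stream models.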

EMMDecrement

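[Figure: an EMM with nodes N1, N2, N3, N5, and N6 and arcs labeled by transition probabilities, shown before and after the step "Delete N2"; after deletion only N1, N3, N5, and N6 remain.]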
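
Deleting a node can be sketched as dropping its count together with every arc that enters or leaves it, so that probabilities derived from the remaining counts renormalize on their own. This bookkeeping is an assumption; the function name and the counts below are hypothetical.

```python
def emm_decrement(cn, cl, node):
    """Return copies of the node and link counts with `node` removed."""
    new_cn = {n: c for n, c in cn.items() if n != node}
    # Drop every arc whose source or target is the deleted node.
    new_cl = {(a, b): c for (a, b), c in cl.items() if node not in (a, b)}
    return new_cn, new_cl


# Hypothetical counts: node occurrences and transition occurrences.
cn = {"N1": 3, "N2": 2, "N3": 1}
cl = {("N1", "N2"): 2, ("N2", "N3"): 1, ("N1", "N3"): 1, ("N3", "N1"): 1}
cn2, cl2 = emm_decrement(cn, cl, "N2")
print(cn2, cl2)
```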

EMM Advantages

• Dynamic
• Adaptable
• Use of clustering
• Learns rare event
• Scalable:
  • Growth of EMM is not linear on size of data.
  • Hierarchical feature of EMM
• Creation/evaluation quasi-real time
• Distributed / Hierarchical extensions

Growth of EMM

[Figure: number of states in the model (0 to 800) versus number of input data (total 1574) for the Serwent data, with one curve per threshold: 0.994, 0.995, 0.996, 0.997, 0.998.]

EMM Performance – Growth Rate

Data      Sim       Threshold
                    0.99    0.992   0.994   0.996   0.998
Serwent   Jaccard   156     190     268     389     667
          Dice      72      92      123     191     389
          Cosine    11      14      19      31      61
          Overlap   2       2       3       3       4
Ouse      Jaccard   56      66      81      105     162
          Dice      40      43      52      66      105
          Cosine    6       8       10      13      24
          Overlap   1       1       1       1       1
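
The four similarity measures named in the table can be written for real-valued vectors as follows. These are the usual vector forms of Jaccard, Dice, cosine, and overlap; the slides do not give the formulas, so treat them as an assumption about what was used.

```python
import math


def _dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def jaccard(x, y):
    xy = _dot(x, y)
    return xy / (_dot(x, x) + _dot(y, y) - xy)


def dice(x, y):
    return 2 * _dot(x, y) / (_dot(x, x) + _dot(y, y))


def cosine(x, y):
    return _dot(x, y) / math.sqrt(_dot(x, x) * _dot(y, y))


def overlap(x, y):
    return _dot(x, y) / min(_dot(x, x), _dot(y, y))


# Hypothetical pair of vectors.
x, y = (1.0, 0.0), (1.0, 1.0)
for f in (jaccard, dice, cosine, overlap):
    print(f.__name__, f(x, y))
```

For any pair of vectors with a non-negative dot product these satisfy Jaccard ≤ Dice ≤ Cosine ≤ Overlap. That ordering is consistent with the table: at the same threshold, a measure that scores pairs lower rejects more matches and so creates more states.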

EMM Performance – Growth Rate

Minnesota Traffic Data

Outline

• Introduction
• EMM Overview
• EMM Applications to Anomaly Detection
• EMM Applications to Nuclear Testing
• Future Work

Datasets/Anomalies

• MnDot – Minnesota Department of Transportation
  • Automobile accident
• Ouse and Serwent – River flow data from England
  • Flood
  • Drought
• KDD Cup ’99 – http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
  • Intrusion
• Cisco VoIP – VoIP traffic data obtained at Cisco
  • Unusual phone call

Rare Event Detection

[Figure: Minnesota DOT traffic data, weekdays versus weekend.]

Detected unusual weekend traffic pattern

Our Approach to Detect Anomalies

• By learning what is normal, the model can predict what is not
• Normal is based on likelihood of occurrence
• Use EMM to build model of behavior
• We view a rare event as:
  • Unusual event
  • Transition between event states which does not frequently occur.
• Base rare event detection on determining events or transitions between events that do not frequently occur.
• Continue learning

EMMRare

• EMMRare algorithm indicates if the current input event is rare. Using a threshold occurrence percentage, the input event is determined to be rare if either of the following occurs:
  • The frequency of the node at time t+1 is below this threshold
  • The updated transition probability of the MC transition from node at time t to the node at t+1 is below the threshold

Determining Rare

Occurrence Frequency (OFc) of a node Nc:

$$OF_c = \frac{CN_c}{\sum_i CN_i}$$

Normalized Transition Probability (NTPmn), from one state, Nm, to another, Nn:

$$NTP_{mn} = \frac{CL_{mn}}{\sum_i CN_i}$$
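
Both quantities divide a count by the total of the node counts. The sketch below reads CN_i as the number of inputs assigned to node N_i and CL_mn as the number of observed transitions from N_m to N_n; that reading of the symbols, the function names, and the counts are assumptions.

```python
def occurrence_frequency(cn, c):
    """OF_c: share of all node occurrences that belong to node c."""
    return cn[c] / sum(cn.values())


def normalized_transition_probability(cn, cl, m, n):
    """NTP_mn: count of the arc m -> n over the total of the node counts."""
    return cl.get((m, n), 0) / sum(cn.values())


# Hypothetical counts for a three-node model.
cn = {"N1": 3, "N2": 1, "N3": 2}
cl = {("N1", "N1"): 1, ("N1", "N2"): 1, ("N2", "N3"): 1,
      ("N3", "N3"): 1, ("N3", "N1"): 1}
print(occurrence_frequency(cn, "N2"))
print(normalized_transition_probability(cn, cl, "N1", "N2"))
```

Unlike an ordinary transition probability, which divides by the departures from N_m alone, this normalization uses the total over all nodes, so an arc leaving a rarely visited node also gets a small value.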

EMMRare

Given:

• Rule#1: CNi <= thCN
• Rule#2: CLij <= thCL
• Rule#3: OFc <= thOF
• Rule#4: NTPmn <= thNTP

Input:
  Gt: EMM at time t
  i: Current state at time t
  R = {R1, R2, …, RN}: A set of rules

Output:
  At: Boolean alarm at time t

Algorithm:

$$A_t = \begin{cases} 1 & \text{if } R_i = \text{True} \\ 0 & \text{if } R_i = \text{False} \end{cases}$$
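
The rule check can be sketched as a function over the model's counts. It assumes the alarm is raised when any enabled rule holds, that a rule is disabled by leaving its threshold as `None`, and that CN and CL are occurrence counts of nodes and transitions; all names and numbers are hypothetical.

```python
def emm_rare(cn, cl, prev, cur, th_cn=None, th_cl=None, th_of=None, th_ntp=None):
    """Return True (alarm) if any enabled rule fires for the move prev -> cur."""
    total = sum(cn.values())
    link = cl.get((prev, cur), 0)
    checks = [
        (th_cn, cn[cur]),            # Rule 1: node count
        (th_cl, link),               # Rule 2: link count
        (th_of, cn[cur] / total),    # Rule 3: occurrence frequency
        (th_ntp, link / total),      # Rule 4: normalized transition probability
    ]
    return any(th is not None and value <= th for th, value in checks)


# Hypothetical model: N3 has been seen only twice in 100 inputs.
cn = {"N1": 50, "N2": 48, "N3": 2}
cl = {("N1", "N2"): 40, ("N2", "N1"): 39, ("N1", "N3"): 2, ("N3", "N1"): 2}
print(emm_rare(cn, cl, "N1", "N3", th_of=0.05))
print(emm_rare(cn, cl, "N1", "N2", th_of=0.05, th_ntp=0.05))
```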

Rare Event in Cisco Data

Risk Assessment

• Problem: Mitigate false alarm rate while maintaining a high detection rate.
• Methodology:
  • Historic feedback can be used as a free resource to take out some possibly safe anomalies
  • Combine anomaly detection model and user’s feedback.
  • Risk level index
• Evaluation metrics: detection rate, false alarm rate.
  • Detection rate
  • False alarm rate
  • Operational curve

Detection rate = TP/(TP+FN)
False alarm rate = FP/(TP+FP)

Reducing False Alarms

•Calculate Risk using historical feedback

•Historical Feedback:

•Count of true alarms:

Detection Rate Experiments

[Figure: detection rate (0 to 1) of the anomaly detection and risk assessment models, in four panels: (a) Euclidean threshold for clustering (th), 16 to 32; (b) risk assessment weight factor (alpha), 0 to 1; (c) EMM state cardinality threshold (thNode), 0 to 400; (d) EMM transition cardinality threshold (thLink), 0 to 300. Each panel compares the anomaly detection and risk assessment curves.]

False Alarm Rate

[Figure: false alarm rate (0 to 1) of the anomaly detection and risk assessment models, in four panels: (a) Euclidean threshold for clustering (th), 16 to 32; (b) risk assessment weight factor (alpha), 0 to 1; (c) EMM state cardinality threshold (thNode), 0 to 400; (d) EMM transition cardinality threshold (thLink), 0 to 300. Each panel compares the anomaly detection and risk assessment curves.]

Outline

• Introduction
• EMM Overview
• EMM Applications to Anomaly Detection
• EMM Applications to Nuclear Testing
• Future Work

Research Objectives

• Apply proven spatiotemporal modeling technique to seismic data
• Construct EMM to model sensor data
  • Local EMM at location or area
  • Hierarchical EMM to summarize lower level models
  • Represent all data in one vector of values
  • EMM learns normal behavior
• Develop new similarity metrics to include all sensor data types (Fusion)
• Apply anomaly detection algorithms

Input Data Representation

• Vector of sensor values (numeric) at precise time points or aggregated over time intervals.
• Need not come from same sensor types.
• Similarity/distance between vectors used to determine creation of new nodes in EMM.

EMM with Seismic Data

• Input: wave arrivals (all or one per sensor)
• Identify states and changes of states in seismic data
• Wave form would first have to be converted into a series of vectors representing the activity at various points in time.
• Initial testing with RDG data
• Use amplitude, period, and wave type

New Distance Measure

• Data = <amplitude, period, wave type>
• Different wave type = 100% difference
• For events of same wave type:
  • 50% weight given to the difference in amplitude.
  • 50% weight given to the difference in period.
• If the distance is greater than the threshold, a state change is required.

$$\text{difference in amplitude} = \frac{|\text{amplitude}_{\text{new}} - \text{amplitude}_{\text{average}}|}{\text{amplitude}_{\text{average}}}$$

$$\text{difference in period} = \frac{|\text{period}_{\text{new}} - \text{period}_{\text{average}}|}{\text{period}_{\text{average}}}$$
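
The distance described on this slide can be sketched directly. The sketch assumes the "average" values are those of the current state, that they are nonzero, and that 100% difference means a distance of 1.0; the function name, the wave-type labels, the threshold, and the numbers are hypothetical.

```python
def seismic_distance(new, average):
    """Distance between a new event and a state's average event.

    Both arguments are (amplitude, period, wave_type) triples.
    """
    amp_new, per_new, type_new = new
    amp_avg, per_avg, type_avg = average
    if type_new != type_avg:
        return 1.0                      # different wave type: 100% difference
    d_amp = abs(amp_new - amp_avg) / amp_avg    # relative amplitude change
    d_per = abs(per_new - per_avg) / per_avg    # relative period change
    return 0.5 * d_amp + 0.5 * d_per            # equal 50% weights


state_average = (10.0, 2.0, "P")
threshold = 0.25                        # assumed value for illustration
for event in [(12.0, 2.0, "P"), (20.0, 3.0, "P"), (10.0, 2.0, "S")]:
    d = seismic_distance(event, state_average)
    print(event, d, "state change" if d > threshold else "same state")
```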

EMM with Seismic Data

States 1, 2, and 3 correspond to Noise, Wave A, and Wave B respectively.

Preliminary Testing

• RDG data February 1, 1981 – 6 earthquakes
• Find transition times close to known earthquakes
• 9 total nodes
• 652 total transitions
• Found all quakes

Outline

• Introduction
• EMM Overview
• EMM Applications to Anomaly Detection
• EMM Applications to Nuclear Testing
• Future Work

Ongoing/Future Work

• Extend to Emerging Patterns
• Extend to Hierarchical/Distributed
• Yu Su
• Test with more data – KDD Cup
• Compare to other approaches
• Charlie Isaksson
• Apply to nuclear testing

Hierarchical EMM

References

• Zhigang Li and Margaret H. Dunham, “STIFF: A Forecasting Framework for Spatio-Temporal Data,” Proceedings of the First International Workshop on Knowledge Discovery in Multimedia and Complex Data, May 2002, pp 1-9.
• Zhigang Li, Liangang Liu, and Margaret H. Dunham, “Considering Correlation Between Variables to Improve Spatiotemporal Forecasting,” Proceedings of the PAKDD Conference, May 2003, pp 519-531.
• Jie Huang, Yu Meng, and Margaret H. Dunham, “Extensible Markov Model,” Proceedings IEEE ICDM Conference, November 2004, pp 371-374.
• Yu Meng and Margaret H. Dunham, “Efficient Mining of Emerging Events in a Dynamic Spatiotemporal,” Proceedings of the IEEE PAKDD Conference, April 2006, Singapore. (Also in Lecture Notes in Computer Science, Vol 3918, 2006, Springer Berlin/Heidelberg, pp 750-754.)
• Yu Meng and Margaret H. Dunham, “Mining Developing Trends of Dynamic Spatiotemporal Data Streams,” Journal of Computers, Vol 1, No 3, June 2006, pp 43-50.
• Charlie Isaksson, Yu Meng, and Margaret H. Dunham, “Risk Leveling of Network Traffic Anomalies,” International Journal of Computer Science and Network Security, Vol 6, No 6, June 2006, pp 258-265.
• Margaret H. Dunham and Vijay Kumar, “Stream Hierarchy Data Mining for Sensor Data,” Innovations and Real-Time Applications of Distributed Sensor Networks (DSN) Symposium, November 26, 2007, Shreveport Louisiana.

