
Data Min Knowl Disc
DOI 10.1007/s10618-013-0315-0

A reference based analysis framework for understanding anomaly detection techniques for symbolic sequences

Varun Chandola · Varun Mithal · Vipin Kumar

Received: 25 April 2012 / Accepted: 1 April 2013
© The Author(s) 2013

Abstract Anomaly detection for symbolic sequence data is a highly important area of research and is relevant in many application domains. While several techniques have been proposed within different domains, understanding of their relative strengths and weaknesses is limited. The key factor for this is that the nature of sequence data varies significantly across domains, and hence while a technique might perform well in its original domain, its performance is not guaranteed in a different domain. In this paper, we aim at establishing this understanding for a wide variety of anomaly detection techniques for symbolic sequences. We present a comparative evaluation of a large number of anomaly detection techniques on a variety of publicly available as well as artificially generated data sets. Many of these are existing techniques while some are slight variants and/or adaptations of traditional anomaly detection techniques to sequence data. The analysis presented in this paper allows relative comparison of the different anomaly detection techniques and highlights their strengths and weaknesses. We extend the reference based analysis (RBA) framework, which was originally proposed to analyze multivariate categorical data, to analyze symbolic sequence data sets.

Responsible editor: Eamonn Keogh.

V. Chandola (B)
Geographic Information Science and Technology, Oak Ridge National Laboratory, Oak Ridge, TN, USA
e-mail: [email protected]

V. Mithal · V. Kumar
Department of Computer Science, University of Minnesota, Minneapolis, MN, USA

V. Mithal
e-mail: [email protected]

V. Kumar
e-mail: [email protected]


We visualize the symbolic sequences using the characteristics provided by the RBA framework and use the visualization to understand various aspects of the sequence data. We then use the characterization done by RBA to understand the performance of the different techniques. Using the RBA framework, we propose two anomaly detection techniques for symbolic sequences, which show consistently superior performance over the existing techniques across the different data sets.

Keywords Sequences · Anomaly detection · Data mining · Reference based analysis

1 Introduction

Data occurs naturally as sequences in a wide variety of applications, such as system call logs in a computer, biological sequences, operational logs of an aircraft's flight, etc. In several such domains, anomaly detection is required to detect events of interest as anomalies. There has been extensive research done on anomaly detection techniques (Chandola et al. 2009; Hodge and Austin 2004; Lazarevic et al. 2003), but most of these techniques assign an anomaly score to individual data instances with respect to the normal data instances, without accounting for the sequence aspect of the data. For example, consider the set of user command sequences shown in Table 1. Clearly the sequence S5 is anomalous, corresponding to a hacker breaking into a computer after multiple failed attempts, even though each command in the sequence by itself is normal.

A large number of anomaly detection techniques for symbolic sequences have been proposed in the literature (Chandola et al. 2012) (see Table 2). For example, Sun et al. (2006) proposed a probabilistic suffix tree (PST) based technique to detect anomalous sequences in a database of protein sequences. Forrest et al. (1996, 1999) proposed several techniques to detect anomalous sequences in a database of operating system call sequences (Hofmeyr et al. 1998). While these techniques were proposed and evaluated in specific domains, no systematic evaluation is available regarding their relative performance. In particular, it is unclear whether a technique is the best one for the domain it was proposed for, or whether another technique, originally proposed for an entirely different domain, might perform better.

In our previous work (Chandola et al. 2008), we provided an experimental evaluation of several such techniques on a variety of publicly available data sets, collected from different application domains, as well as artificial data sets, created using an artificial data generator. The results on different data sets reveal that no one technique is

Table 1 Sequences of user commands

S1 login, pwd, mail, ssh, . . . , mail, web, logout

S2 login, pwd, mail, web, . . . , web, web, web, logout

S3 login, pwd, mail, ssh, . . . , mail, web, web, logout

S4 login, pwd, web, mail, ssh, . . . , web, mail, logout

S5 login, pwd, login, pwd, login, pwd, . . . , logout


Table 2 Anomaly detection techniques for symbolic sequences

Intrusion detection
  Window-based: Forrest et al. (1996, 1999); Hofmeyr et al. (1998); Gonzalez and Dasgupta (2003); Gao et al. (2002); Qiao et al. (2002)
  Markovian (fixed): Lee et al. (1997); Lee and Stolfo (1998); Michael and Ghosh (2000)
  Markovian (sparse): Eskin et al. (2001); Forrest et al. (1999)
Proteomics
  Markovian (variable): Sun et al. (2006)
Flight safety
  Kernel-based: Budalakoti et al. (2007)
  Markovian (sparse): Srivastava (2005)

clearly superior to others. Most techniques show consistent performance on public data sets belonging to one domain but show different performance on data sets from a different domain. This indicates a relationship between the techniques and the nature of the data. These observations motivate a deeper study of the relationship between a technique and a data set.

In this paper we build upon our previous work (Chandola et al. 2008) to further understand the relationship between the anomaly detection techniques and the nature of data. To gain this understanding we use an analysis methodology, called reference based analysis (RBA) (Chandola et al. 2009). RBA was originally proposed to characterize and visualize multivariate categorical data sets. In this paper we show how the same framework can be adapted to characterize symbolic sequence data. We visualize the symbolic sequences using these characteristics and demonstrate their utility in understanding various aspects of the sequence data, such as how different the normal sequences are from the anomalous sequences and how similar the normal sequences are to each other. We then show how the different anomaly detection techniques listed in Table 2 rely on one or more of such characteristics to detect anomalies. Using these characteristics, we propose two novel anomaly detection techniques for symbolic sequences, called WIN1D and WIN2D, which show consistently superior performance over the existing techniques across the different data sets.

1.1 Our contributions

The specific contributions of this paper are as follows:

– We provide an experimental evaluation of the anomaly detection techniques listed in Table 2 on a variety of sequence data sets. The analysis presented in this paper allows relative comparison of the different anomaly detection techniques and highlights their strengths and weaknesses. A preliminary version of the results was published in Chandola et al. (2008), in which the techniques were evaluated using the precision on the anomaly class metric. In this paper we also present results using the area under the ROC curve metric.

– We extend the RBA framework (Chandola et al. 2009), which was originally proposed to analyze multivariate categorical data, to analyze symbolic sequence data sets. We visualize the symbolic sequences using the characteristics provided by the RBA framework and use the visualization to understand various aspects of the sequence data. We then use the characterization done by RBA to understand the performance of the different techniques listed in Table 2. A preliminary version of this work has been presented in the context of data sets collected from the domain of system call intrusion detection (Chandola et al. 2010).

– Using the RBA framework, we propose two anomaly detection techniques for symbolic sequences, called WIN1D and WIN2D, which show consistently superior performance over the existing techniques across the different data sets.

1.2 Organization

The rest of this paper is organized in two parts. In the first part (Sects. 2–6), we provide a brief description of the different anomaly detection techniques and an experimental comparison of their performance on a variety of publicly available as well as synthetic data sets. In the second part (Sects. 7–11), we show how the RBA framework can be used to analyze symbolic sequences. In Sect. 8 we illustrate the framework on the sequence data sets; in Sects. 9 and 10 we use it to understand the relationship between the different anomaly detection techniques and the nature of sequence data; and in Sect. 11 we present two novel RBA based anomaly detection techniques for symbolic sequences.

2 Problem statement

The objective of the techniques evaluated in this paper can be stated as follows:

Definition 1 Given a set of n training sequences, T, and a set of m test sequences, S, find the anomaly score A(Si) for each test sequence Si ∈ S, with respect to T.

All sequences consist of events that correspond to a finite alphabet, Σ. Sequences in T and S need not be of equal length. The training database T is assumed to contain only normal sequences, and hence the techniques operate in a semi-supervised setting (Chandola et al. 2012).

2.1 Notation

In this paper we will denote training sequences as T or Tj and test sequences as S or Si. Consistent with Definition 1, the size of the training data set is denoted by n and the size of the test data set by m.

123

Page 5: A reference based analysis framework for understanding anomaly detection techniques for symbolic sequences

A reference based analysis framework

3 Anomaly detection techniques for sequences

As mentioned earlier, there have been several techniques proposed for detecting anomalies in symbolic sequences (Chandola et al. 2012), specifically for the problem defined in Sect. 2. The techniques can be classified into the following three broad categories (see Table 2) based on the underlying approach. Kernel-based techniques assign an anomaly score to a test sequence based on its similarity to the normal sequences. Window-based techniques calculate the probability of occurrence of every fixed-length window in the test sequence. Markovian techniques calculate the probability of occurrence of each symbol in the test sequence conditioned on the preceding few symbols in the test sequence.

In this paper we study the performance of several techniques that belong to the above three categories. Each technique assigns an anomaly score A(Si) to a given test sequence Si.

1. CLUSTER (Budalakoti et al. 2007): The training sequences in T are clustered into c clusters using the k-means clustering algorithm (MacQueen 1967), with the normalized longest common subsequence length (nLCS) as the similarity measure. The anomaly score A(Si) is set equal to the "distance" of Si from the closest cluster medoid.

2. kNN: The anomaly score A(Si) is equal to the distance of Si to its kth nearest neighbor in the training set T, where the distance is measured as the inverse of the nLCS similarity measure.

3. tSTIDE (Forrest et al. 1999): This technique (threshold Sequence Time-Delay Embedding) extracts all k-length sliding windows from Si and assigns a score to each window proportional to the number of times it occurs in the sequences in T. The anomaly score A(Si) is equal to the inverse of the average score of all windows for Si.

4. FSA (Michael and Ghosh 2000): This technique computes a likelihood score for each symbol in Si equal to the conditional probability of observing the symbol given the previous k symbols (its k-length history) in all sequences in T. The symbols which never occur in conjunction with their k-length history are ignored. The anomaly score A(Si) is equal to the inverse of the average likelihood score over the symbols in Si. Note that this approach is similar to using a D-Markov machine (Ray 2004).

5. FSAz: We propose a variant of FSA where, instead of ignoring the symbols which never occur in conjunction with their k-length history, we assign them a likelihood score of 0.

6. PST (Sun et al. 2006): This is a variant of the FSA technique, but instead of using the conditional probability of each symbol given its fixed k-length history, the PST technique uses a variable-length history of length p, upper bounded by k, such that p is the largest value for which the frequency of occurrence of the p-length history is greater than a pre-defined threshold.

7. RIPPER (Lee et al. 1997): This technique first learns a RIPPER classifier (Cohen 1995) to predict a symbol, given its k-length history, using the training sequences in T. Each symbol in the test sequence is assigned a score based on the probability of the classifier predicting that symbol, given its k-length history. The anomaly score A(Si) is equal to the inverse of the average scores over the symbols in Si.

8. HMM (Forrest et al. 1999): This technique learns a Hidden Markov Model (HMM) with σ hidden states from the normal sequences in T using the Baum-Welch algorithm (Baum et al. 1970). In the testing phase, the optimal hidden state sequence for the test sequence Si is determined using the Viterbi algorithm (Forney 1973). Each symbol is assigned a score equal to the probability of transitioning from the corresponding hidden state to the next state.
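Since CLUSTER and kNN (items 1 and 2 above) both rest on the nLCS similarity, a small sketch may help make the scoring concrete. This is an illustrative implementation under the common definition nLCS(X, Y) = |LCS(X, Y)| / sqrt(|X| |Y|) (as in Budalakoti et al. 2007); the function names and the default k are ours, not the paper's:

```python
from math import sqrt

def lcs_length(a, b):
    # Classic O(|a||b|) dynamic program for the longest common subsequence.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def nlcs(a, b):
    # Length-normalized LCS similarity in [0, 1].
    return lcs_length(a, b) / sqrt(len(a) * len(b))

def knn_anomaly_score(s, training, k=4):
    # A(S_i) = inverse of the similarity to the k-th most similar
    # training sequence (distance = 1 / nLCS, per item 2).
    sims = sorted((nlcs(s, t) for t in training), reverse=True)
    kth = sims[min(k, len(sims)) - 1]
    return float('inf') if kth == 0 else 1.0 / kth
```

A sequence sharing no symbols with the training set receives similarity 0 and hence an infinite anomaly score.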
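The window scoring of tSTIDE (item 3) can be sketched as follows. The relative-frequency likelihood and the plain averaging used here are simplifying assumptions; the experiments later in this paper combine per-window scores with an average log score:

```python
from collections import Counter

def window_counts(training, k):
    # Frequency of every k-length sliding window across all training sequences.
    counts = Counter()
    for t in training:
        for i in range(len(t) - k + 1):
            counts[tuple(t[i:i + k])] += 1
    return counts

def tstide_score(s, counts, k, total):
    # Per-window likelihood = relative frequency in the training data;
    # anomaly score = inverse of the average window likelihood.
    windows = [tuple(s[i:i + k]) for i in range(len(s) - k + 1)]
    avg = sum(counts[w] / total for w in windows) / len(windows)
    return float('inf') if avg == 0 else 1.0 / avg
```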
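Similarly, the fixed Markovian techniques FSA and FSAz (items 4 and 5) admit a compact sketch, with a `zero_unseen` flag switching between the two variants; the helper names and the plain-average combination are our assumptions:

```python
from collections import Counter

def fit_markov(training, k):
    # Count each k-length history and each (history, next symbol) pair.
    hist, pair = Counter(), Counter()
    for t in training:
        for i in range(len(t) - k):
            h = tuple(t[i:i + k])
            hist[h] += 1
            pair[(h, t[i + k])] += 1
    return hist, pair

def fsa_score(s, hist, pair, k, zero_unseen=False):
    # Likelihood of each symbol given its k-length history; FSA skips
    # symbols whose history never occurs in training, FSAz scores them 0.
    probs = []
    for i in range(len(s) - k):
        h = tuple(s[i:i + k])
        if hist[h] == 0:
            if zero_unseen:           # FSAz
                probs.append(0.0)
            continue                  # FSA: ignore
        probs.append(pair[(h, s[i + k])] / hist[h])
    avg = sum(probs) / len(probs) if probs else 0.0
    return float('inf') if avg == 0 else 1.0 / avg
```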

4 Data sets used

In this section we describe the various public as well as artificially generated data sets that we used to evaluate the different anomaly detection techniques. We used public data sets that have been used earlier to evaluate sequence anomaly detection techniques. To further illustrate certain aspects of different techniques, we constructed different artificial data sets. The artificial data sets were constructed such that we can control the nature of normal as well as anomalous sequences and hence learn the relationship between the various techniques and the nature of the data.

For every data set, we first constructed a set of normal sequences and a set of anomalous sequences. A sample of the normal sequences was used as training data for the different techniques. A disjoint sample of normal sequences and a sample of anomalous sequences were added together to form the test data. The relative proportion of normal and anomalous sequences in the test data determined the "difficulty level" for that data set. We experimented with different ratios of normal to anomalous sequences, such as 1:1, 10:1 and 20:1. Results on data sets with other ratios are consistent in relative terms, although most techniques perform much better for the simplest data sets, which use a 1:1 ratio. Since anomalies are rare in real sequence data, we report results for test data in which normal and anomalous sequences appear in a 20:1 ratio. In reality, the ratio of normal to anomalous sequences can be even larger than 20:1, but we were unable to try more skewed distributions due to the limited number of normal samples available in some of the data sets.

4.1 Public data sets

Table 3 summarizes the various statistics of the data sets used in our experiments. All data sets are available from our web site.1 The distribution of the symbols for normal and anomalous sequences is illustrated in Fig. 1a-c for the RVP, snd-cert, and bsm-week2 data sets, respectively. The difference in the distribution of symbols for normal and anomalous data in the RVP and snd-cert data sets is relatively higher than that in the bsm-week2 data. This hints that differentiating between normal and anomalous sequences for the bsm-week2 data is more challenging than for the first two data sets.

1 http://www.cs.umn.edu/~chandola/ICDM2008


Table 3 Public data sets used for experimental evaluation

Source  Data set   |Σ|  l    |SN|  |SA|  |T|   |S|
PFAM    HCV        44   87   2423  50    1423  1050
        NAD        42   160  2685  50    1685  1050
        TET        42   52   1952  50    952   1050
        RUB        42   182  1059  50    559   525
        RVP        46   95   1935  50    935   1050
UNM     snd-cert   56   803  1811  172   811   1050
        snd-unm    53   839  2030  130   1030  1050
DARPA   bsm-week1  67   149  1000  800   10    210
        bsm-week2  73   141  2000  1000  113   1050
        bsm-week3  78   143  2000  1000  67    1050

l average length of sequences, SN normal data, SA anomalous data, T training data, S test data

Fig. 1 Distribution of symbols in training data sets of different types. a RVP, b snd-cert, c bsm-week2


4.1.1 Protein data sets

The first set of public data sets was obtained from the PFAM database (Release 17.0) (Bateman et al. 2000), which contains sequences belonging to 7,868 protein families. Sequences belonging to one family are structurally different from sequences belonging to another family. We chose five families, viz., HCV, NAD, TET, RVP, RUB. For each family we constructed a normal data set by choosing a sample from the set of sequences belonging to that family. We then sampled 50 sequences from the other four families to construct an anomaly data set. Similar data was used by Sun et al. (2006) to evaluate the PST technique. The difference is that the authors constructed a test data set for each pair of protein families, such that samples from one family were used as normal and samples from the other were used as anomalies. The PST results on PFAM data sets reported in this paper appear to be worse than those reported in Sun et al. (2006).

4.1.2 Intrusion detection data sets

The second set of public data sets was collected from two repositories of benchmark data generated for evaluation of intrusion detection algorithms. One repository was generated at the University of New Mexico.2 The normal sequences consisted of sequences of system calls generated in an operating system during the normal operation of a computer program, such as sendmail, ftp, lpr, etc. The anomalous sequences consisted of sequences of system calls generated when the program is run in an abnormal mode, corresponding to the operation of a hacked computer. We report results on two data sets, viz., snd-unm and snd-cert. Other data sets were not used due to insufficient anomalous sequences to attain a 20:1 imbalance. For each of the two data sets, the number of sequences in the normal as well as anomaly data was small (less than 200), making it difficult to construct significant test and training data sets. To increase the size of the data sets, we extracted subsequences of length 100 by sliding a window of length 100 with a sliding step of 50. The subsequences extracted from the original normal sequences were treated as normal sequences, and the subsequences extracted from the original anomalous sequences were treated as anomalous sequences if they did not occur in the normal sequences.
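The windowing step described above is straightforward to sketch (window length 100, step 50, and the rule that an extracted anomalous subsequence is kept only if it never occurs among the normal subsequences); the function names are illustrative, not from the paper:

```python
def extract_subsequences(seq, width=100, step=50):
    # Slide a window of `width` over the sequence with the given step;
    # sequences shorter than `width` yield no subsequences.
    return [seq[i:i + width] for i in range(0, len(seq) - width + 1, step)]

def label_anomalous(anom_subs, normal_subs):
    # Keep only those anomalous subsequences that never occur among the
    # normal subsequences, as described in the text.
    normal = {tuple(s) for s in normal_subs}
    return [s for s in anom_subs if tuple(s) not in normal]
```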

The other intrusion detection data repository was the Basic Security Module (BSM) audit data, collected from a victim Solaris machine, in the DARPA Lincoln Labs 1998 network simulation data sets (Lippmann et al. 2000). The repository contains labeled training and testing DARPA data for multiple weeks, collected on a single machine. For each week we constructed the normal data set using the sequences labeled as normal from all days of the week. The anomaly data set was constructed in a similar fashion. The data is similar to the system call data described above, with a similar (though larger) alphabet.

2 http://www.cs.unm.edu/~immsec/systemcalls.htm


4.2 Altered RVP data set

To better understand how the performance of the anomaly detection techniques relates to the nature of anomalies in the test data, we created a data set from the original RVP data from the PFAM repository. A test data set was constructed by sampling 800 normal sequences not present in the training data. Anomalies were injected in 50 of the test sequences by randomly replacing a symbols in each sequence with the least frequent symbol in the data set. The parameter a controls the deviation of the anomalous sequences from the normal sequences. The objective of this experiment was to evaluate how the performance of a technique varies with a.

4.3 Artificial data sets

As mentioned in the introduction, two types of anomalous sequences can exist: one type is arguably generated from a different generative mechanism than the normal sequences, while the other results from a normal sequence deviating for a short span from its expected normal behavior. To study the relationship between these two types of anomalous sequences and the performance of different techniques, we designed an artificial data generator which allows us to generate validation data sets with different types of anomalies.

We used a generic HMM, as shown in Fig. 2, to model normal as well as anomalous data. The HMM shown in Fig. 2 has two sets of states, {S1, S2, ..., S6} and {S7, S8, ..., S12}.

Within each set, the transitions corresponding to the solid arrows shown in Fig. 2 were assigned a transition probability of 1 − 5β, while all other transitions were assigned transition probability β. No transition is possible between states belonging to different sets, the only exceptions being S2 → S8, with transition probability λ, and S7 → S1, with transition probability 1 − λ. The transition probabilities S2 → S3 and S7 → S8 are adjusted accordingly so that the transition probabilities of each state sum to 1.

Fig. 2 HMM used to generate artificial data


The observation alphabet is of size 6. Each state emits one symbol with a high probability (1 − 5α) and all other symbols with a low probability (α). Figure 2 depicts the most likely symbol for each state.

The initial probability vector π of the HMM is constructed such that either π1 = π2 = ... = π6 = 1/6 and π7 = π8 = ... = π12 = 0, or vice-versa. Normal sequences are generated by setting λ to a low value and π such that the first 6 states have initial probability 1/6 and the rest 0. If λ = β = α = 0, the normal sequences will consist of the subsequence a1a2a3a4a5a6 repeated multiple times. By increasing λ, β, or α, anomalies can be induced in the normal sequences.

This generic HMM can be tuned to generate two types of anomalous sequences. For the first type, λ is set to a high value and π is such that the last 6 states have initial probability 1/6 and the rest 0. The resulting HMM is directly opposite to the HMM constructed for generating normal sequences; hence the anomalous sequences generated by this HMM are completely different from the normal sequences.

To generate the second type of anomalous sequences, the HMM used to generate the normal sequences is used, with the only difference that λ is increased above 0. Thus the anomalous sequences generated by this HMM will be similar to the normal sequences, except that there will be short spans during which the symbols are generated by the second set of states.

By varying λ, β, and α, we generated several evaluation data sets (with the two different types of anomalous sequences). We will present the results of our experiments on these artificial data sets in the next section.
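A simplified sketch of this generator follows. Several details are assumptions on our part: the second set of states is taken to emit the symbols in reverse order (the exact emission pattern of Fig. 2 is not recoverable from the text), and the within-set noise β is folded into a single random within-cycle move:

```python
import random

ALPHABET = ['a1', 'a2', 'a3', 'a4', 'a5', 'a6']

def generate_sequence(length, lam, beta, alpha, rng=None):
    # States 0-5 form the "normal" cycle and 6-11 the second cycle.
    # State 1 (S2) crosses to state 7 (S8) with probability lam;
    # state 6 (S7) returns to state 0 (S1) with probability 1 - lam.
    rng = rng or random.Random(0)
    state, out = 0, []
    for _ in range(length):
        # Emission: the state's characteristic symbol with prob. 1 - 5*alpha,
        # otherwise a uniformly random symbol (assumed emission pattern).
        if rng.random() < 5 * alpha:
            out.append(rng.choice(ALPHABET))
        elif state < 6:
            out.append(ALPHABET[state])
        else:
            out.append(ALPHABET[5 - (state - 6)])
        # Transition.
        r = rng.random()
        if state == 1 and r < lam:
            state = 7                                     # S2 -> S8 jump
        elif state == 6 and r < 1 - lam:
            state = 0                                     # S7 -> S1 return
        elif rng.random() < 5 * beta:
            state = (state // 6) * 6 + rng.randrange(6)   # noisy within-cycle move
        else:
            state = (state // 6) * 6 + (state % 6 + 1) % 6  # next state in cycle
    return out
```

With λ = β = α = 0 this reproduces the repeated a1a2a3a4a5a6 pattern described above; raising λ introduces short excursions into the second cycle.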

5 Evaluation methodology

The techniques investigated in this paper assign an anomaly score to each test sequence Si ∈ S, such that the sequence with the highest anomaly score is considered most anomalous, and so on. To compare the performance of different techniques in such a scenario, we first convert the ranked output for each sequence in the test data set into a binary label (normal or anomalous) in the following manner:

1. Rank the test sequences in decreasing order of anomaly score.
2. Label the sequences in the top p portion of the sorted test sequences as anomalous, and the rest as normal, where 1 ≤ p ≤ |S|.

We evaluate the techniques using two different metrics. The first metric uses a fixed value of p. Let there be t true anomalous sequences in the top p ranked sequences. We measure the accuracy of a technique as:

Accuracy = t / p

We report the accuracy for p = q, where q is the number of true anomalous sequences in the test data set. The accuracies for other values of p also showed consistent results.


One drawback of the above metric is that it is highly dependent on the choice of p. A technique might show 100% accuracy for a particular value of p but show 50% accuracy for 2p. To overcome this drawback we use a second evaluation metric.

The second metric used to evaluate the anomaly detection techniques is the area under the ROC curve (AUC), obtained by varying p from 1 to |S|. The advantage of AUC is that it is not dependent on the choice of p.
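Both metrics can be sketched directly from the definitions above; the AUC here uses the rank-statistic (Wilcoxon) form, which equals the area under the ROC curve swept over p, with ties counted one half:

```python
def accuracy_at_p(scores, labels, p):
    # Rank by decreasing anomaly score; accuracy = fraction of true
    # anomalies among the top-p ranked test sequences (t / p).
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    t = sum(labels[i] for i in order[:p])
    return t / p

def auc(scores, labels):
    # Probability that a randomly chosen anomaly outscores a randomly
    # chosen normal sequence; equivalent to the area under the ROC curve.
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((a > n) + 0.5 * (a == n) for a in pos for n in neg)
    return wins / (len(pos) * len(neg))
```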

6 Experimental results

The experiments were conducted on the variety of data sets discussed in Sect. 4. The various parameter settings associated with each technique were explored. The results presented here are for the parameter setting which gave the best results across all data sets, for each technique.

6.1 Sensitivity to parameters

The performance of CLUSTER improved as c was increased from 2 onwards, but stabilized for values greater than 32. The best overall performance was observed for c = 32. For kNN, the performance was comparable for a wide range of k (2 ≤ k ≤ 32) but deteriorated for higher values of k. The best overall performance was observed for k = 4. For tSTIDE as well as the Markovian techniques (FSA, FSAz, PST, RIPPER), the performance was sensitive to the choice of window length or the length of the history. For low values of this length (less than 5) or for values higher than 10, the performance was generally poor. This arises from the fact that for very small and very large values of k the probabilities of observing a k-length subsequence in the training data converge, and hence the subsequences lose the ability to distinguish between normal and anomalous sequences. The best performing setting was a window size of 6 for tSTIDE and a history length of 5 for the Markovian techniques. For PST, an additional parameter is Pmin, which controls the threshold under which the counts for a given subsequence are considered insignificant. We observed that the performance of PST was highly sensitive to this parameter. If Pmin was set very low (≈ 0), PST performed similarly to FSAz, while if Pmin was set higher than 0.1, the performance was poor. The best performance of PST was observed for Pmin = 0.01. For HMM, the number of hidden states σ is a critical parameter. We experimented with values ranging from 2 to |Σ|. Our experiments reveal that the performance of HMM does not vary significantly for different values of σ. The best overall performance of HMM was observed for σ = 4 for the public data sets and σ = 12 for the artificial data sets.

We experimented with various combination functions for the different techniques, and found that the average log score function has the best performance across all data sets. Hence, results are reported for the average log score function. Results with other combination techniques are available in our technical report (Chandola et al. 2008).


Table 4 Accuracy results for public data sets

       PFAM                           UNM                DARPA
       hcv   nad   tet   rvp   rub   snd-unm  snd-cert  bsm-week1  bsm-week2  bsm-week3  Avg
cls    0.54  0.46  0.84  0.86  0.76  0.76     0.94      0.20       0.36       0.52       0.62
knn    0.88  0.64  0.86  0.90  0.72  0.84     0.94      0.20       0.52       0.48       0.70
tstd   0.90  0.74  0.50  0.90  0.88  0.58     0.64      0.20       0.36       0.60       0.63
fsa    0.88  0.66  0.48  0.90  0.80  0.82     0.88      0.40       0.52       0.64       0.70
fsaz   0.92  0.72  0.50  0.90  0.88  0.80     0.88      0.50       0.56       0.66       0.73
pst    0.74  0.10  0.66  0.50  0.28  0.28     0.10      0.00       0.10       0.34       0.31
rip    0.52  0.20  0.36  0.66  0.72  0.72     0.70      0.20       0.18       0.50       0.48
hmm    0.10  0.06  0.20  0.10  0.00  0.00     0.00      0.00       0.02       0.20       0.07
Avg    0.69  0.45  0.55  0.72  0.63  0.60     0.64      0.21       0.33       0.49

Bold values indicate the best performing method for the corresponding data set

6.2 Accuracy versus AUC

We evaluated the different techniques using the two evaluation metrics described in Sect. 5, accuracy and AUC. Both metrics show similar relative performance for the different techniques. We will compare the performance using the accuracy metric.

6.3 Results on public data sets

Tables 4 and 5 summarize the accuracy and AUC results on the 10 public data sets. CLUSTER and kNN show good performance for the PFAM and UNM data sets but perform moderately on the DARPA data sets. FSA and FSAz show consistently good performance for all public data sets. tSTIDE performs well for the PFAM data sets but its performance degrades for both the UNM and DARPA data sets. PST performs average to poor for all data sets, including the PFAM data sets for which it was originally used. The HMM technique performs poorly for all public data sets. The reason for the poor performance is that the HMM technique assumes that the normal sequences can be represented with σ hidden states, which might not be true for the public data sets.

Overall, one can observe that the performance of the techniques is in general better for the PFAM and UNM data sets, while the DARPA data sets are more challenging.

6.4 Results on altered RVP data set

Figure 3 shows the performance of the different techniques on the altered RVP data set, for values of a from 1 to 10. We observe that FSAz performs remarkably well for these values of a. CLUSTER, tSTIDE, FSA, PST, and RIPPER exhibit moderate performance, though for values of a closer to 10, RIPPER performs better than the


Table 5 AUC results for public data sets
(PFAM: hcv–rub; UNM: snd-unm, snd-cert; DARPA: bsm-week1–bsm-week3)

      hcv   nad   tet   rvp   rub   snd-unm  snd-cert  bsm-week1  bsm-week2  bsm-week3  Avg
cls   0.98  0.96  1.00  1.00  0.99  0.99     1.00      0.74       0.90       0.91       0.94
knn   1.00  0.98  1.00  1.00  0.99  1.00     1.00      0.75       0.92       0.91       0.95
tstd  0.99  0.97  0.98  1.00  1.00  0.97     0.92      0.62       0.73       0.80       0.90
fsa   0.98  0.97  0.92  0.99  0.99  0.99     0.96      0.88       0.90       0.97       0.96
fsaz  1.00  0.98  0.98  1.00  1.00  0.97     0.96      0.88       0.91       0.97       0.96
pst   0.99  0.54  0.98  0.97  0.91  0.93     0.88      0.35       0.42       0.54       0.75
rip   0.70  0.45  0.37  0.97  0.96  0.98     0.94      0.79       0.70       0.84       0.77
hmm   0.58  0.50  0.71  0.55  0.24  0.04     0.03      0.43       0.50       0.77       0.43
Avg   0.90  0.79  0.87  0.93  0.88  0.86     0.84      0.68       0.75       0.84

[Figure: accuracy (%) on the y-axis against the number of anomalous symbols inserted (a = 1–10) on the x-axis, with one curve each for CLUSTER, kNN, t-STIDE, FSA, FSA-z, PST, RIPPER, and HMM.]

Fig. 3 Results for altered RVP data sets

other four techniques. For a > 10, all techniques show better than 90% accuracy: the anomalous sequences become very distinct from the normal sequences, and hence all techniques perform comparably well.

6.5 Results on artificial data sets

Tables 6 and 7 summarize the accuracy and AUC results on the 6 (d1–d6) artificial data sets. The normal sequences in data set d1 were generated with λ = 0.01, β = 0.01, α = 0.01. The anomalous sequences were generated using the first setting discussed in Sect. 4.3, such that the sequences were primarily generated from the second set of states.


Table 6 Accuracy results for artificial data sets

      d1    d2    d3    d4    d5    d6    Avg
cls   1.00  0.80  0.74  0.74  0.58  0.64  0.75
knn   1.00  0.88  0.76  0.76  0.60  0.68  0.78
tstd  1.00  0.82  0.64  0.64  0.48  0.50  0.68
fsa   1.00  0.88  0.50  0.52  0.24  0.28  0.57
fsaz  1.00  0.92  0.60  0.52  0.32  0.38  0.62
pst   1.00  0.84  0.82  0.76  0.68  0.68  0.80
rip   1.00  0.78  0.64  0.66  0.52  0.44  0.67
hmm   1.00  0.50  0.34  0.42  0.16  0.66  0.51
Avg   1.00  0.80  0.63  0.63  0.45  0.53

Bold values indicate the best performing method for the corresponding data set

Table 7 AUC results for artificial data sets

      d1    d2    d3    d4    d5    d6    Avg
cls   1.00  0.95  0.97  0.98  0.94  0.95  0.97
knn   1.00  0.96  0.98  0.98  0.96  0.95  0.97
tstd  1.00  0.96  0.98  0.98  0.96  0.95  0.97
fsa   1.00  0.96  0.98  0.98  0.96  0.95  0.97
fsaz  1.00  0.96  0.98  0.98  0.96  0.95  0.97
pst   1.00  0.96  0.98  0.98  0.96  0.95  0.97
rip   1.00  0.96  0.98  0.98  0.96  0.95  0.97
hmm   1.00  0.96  0.98  0.98  0.96  0.95  0.97
Avg   1.00  0.96  0.98  0.98  0.96  0.95

For data sets d2–d6, the HMM used to generate normal sequences was tuned with β = 0.01, α = 0.01. The value of λ was increased from 0.002 to 0.01 in increments of 0.002. The anomalous sequences for data sets d2–d6 were generated using the second setting, in which λ is set to 0.1.

From Table 6, we observe that PST is the most stable technique across the artificial data sets, while the deterioration is most pronounced for FSA and FSAz. Both kNN and CLUSTER are also negatively impacted as λ increases, but the trend is more gradual than for FSAz. The performance of HMM on the artificial data sets is better than on the public data sets, since the training data was actually generated by a 12-state HMM and the HMM technique was trained with σ = 12; thus the HMM model effectively captures the normal sequences.

6.6 Relative performance of different techniques

Kernel based techniques are found to perform well for data sets in which the anomalous sequences are significantly different from the normal sequences, but perform poorly when the difference between the two is small. This is due to the nature of the normalized


LCS similarity measure used in the kernel based techniques. Our experiments show that the kNN technique is somewhat better suited than CLUSTER for anomaly detection, which is expected, since kNN is optimized to detect anomalies while the clustering algorithm in CLUSTER is optimized to obtain clusters in the data.

FSAz is consistently superior among all techniques, especially for data sets in which the anomalous sequences are minor deviations from normal sequences. The performance of FSAz is poor when the normal sequences contain rare patterns. The performance of tSTIDE is comparable to FSAz when the anomalous sequences are significantly different from the normal sequences, but is inferior to FSAz when the difference is small. tSTIDE is less affected than FSAz by the presence of rare patterns in the normal sequences; it performs well for all PFAM data sets but is relatively poor on the DARPA and UNM data sets, and performs significantly better on the artificial data sets. PST performs relatively worse than the other techniques, except for cases where the normal sequences themselves contain many rare patterns. RIPPER is also an average performer on most of the data sets, and is relatively better than PST, indicating that using a sparse history model is better than a variable history model.

For the public data sets, we found the HMM technique to perform poorly. The reasons for the poor performance of HMM are twofold. The first reason is that the HMM technique assumes that the normal sequences can be represented with σ hidden states. Often, this assumption does not hold true, and hence the HMM model learnt from the training sequences cannot emit the normal sequences with high confidence. Thus all test sequences (normal and anomalous) are assigned a low probability score. There are approaches, like the Causal State Splitting Reconstruction algorithm Shalizi and Klinkner (2004), that can infer the appropriate architecture of the HMM model from the data and can be applied to address this issue. The second reason for the poor performance is the manner in which a score is assigned to a test sequence. The test sequence is first converted to a hidden state sequence, and then the FSA technique with k = 1 is applied to the transformed sequence. We have observed from our experiments with FSA that setting k = 1 does not perform well for anomaly detection. The performance of HMM on the artificial data sets (see Table 6) illustrates this argument. Since the training data was actually generated by a 12-state HMM and the HMM technique was trained with σ = 12, the HMM model effectively captures the normal sequences; when the normal sequences are generated using an HMM, the performance thus improves significantly. The results of HMM for the artificial data sets are therefore better than for the public data sets, but still slightly worse than other techniques because of the poor performance of FSA with k = 1. The hidden state sequences, obtained as an intermediate transformation of the data, can actually be used as input to any other technique discussed here. The performance of such an approach will be investigated as a future direction of research.

7 Reference based analysis framework for symbolic sequences

The results on the different data sets in Sect. 6 reveal that no one technique is clearly superior to the others. Most techniques show consistent performance on public data sets belonging to one domain but perform differently on data sets from a


different domain. This indicates a relationship between the techniques and the nature of the data. In the artificial data sets generated from the data generator, as well as the altered RVP data sets, we further studied this relationship by modifying the nature of the data using one or more tunable parameters. These observations motivate a deeper study of the relationship between a technique and a data set.

From this section onwards, we study the relationship between the anomaly detection techniques and the nature of the data. Using the RBA framework Chandola et al. (2009), we characterize symbolic sequence data. We visualize the symbolic sequences using these characteristics, which is useful for understanding various aspects of the sequence data, such as how different the normal sequences are from the anomalous sequences and how similar the normal sequences are to each other. We then show how the different anomaly detection techniques evaluated in Sect. 6 rely on one or more of such characteristics to detect anomalies. Using these characteristics, we propose two novel anomaly detection techniques for symbolic sequences, called WIN1D and WIN2D, which show consistently superior performance over the existing techniques across the different data sets.

8 Characterizing sequence data

To facilitate the study motivated in Sect. 7, we first characterize a data set and then identify the relationship between a technique and the data characteristics.

We characterize a test sequence data set, containing normal and anomalous sequences, with respect to a base data set containing only normal sequences.3 We describe different alternatives for characterizing sequence data that are used by one or more of the techniques discussed in this paper, as will be discussed in Sects. 9 and 10.

8.1 1-D frequency profiles

The first characterization is motivated by window based techniques that rely on the frequency of a k-length window in a given sequence for anomaly detection. In this section we refer to a k-length window as a k-window for brevity. Each k-window is associated with a frequency (denoted as fk), i.e., the number of times it occurs in the training sequences.

3 This framework to characterize a given data set with respect to a base or reference data set was originally proposed for characterizing categorical data Chandola et al. (2009).


A 1-D frequency profile for a test sequence can be constructed as follows. First, all k-windows from the test sequence are extracted and their frequencies fk are computed from the training sequences. The frequencies are "binned" into a fixed number (p) of bins. Since windows with fk = 0 are of special interest, the first bin stores the windows with exactly fk = 0. The other p − 1 bins divide the range 1 : max into equal width intervals, where max is the maximum frequency of any window in the given data set. The values in each bin are normalized to lie between 0 and 1 by dividing them by the total number of windows in the given sequence. Thus each test sequence can be mapped into an ℝ^p space.
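The construction above can be sketched as follows (a minimal sketch: the function name is ours, the equal-width binning of 1..max is approximated with integer arithmetic, and a non-empty training set is assumed):

```python
from collections import Counter

def freq_profile_1d(test_seq, train_seqs, k=6, p=4):
    """Map a test sequence to a p-bin 1-D frequency profile.
    Bin 0 holds windows with f_k = 0; the remaining p-1 bins split
    the range 1..max into (approximately) equal-width intervals.
    Values are normalised by the number of windows in test_seq."""
    counts = Counter()
    for s in train_seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    fmax = max(counts.values())  # assumes non-empty training data
    wins = [test_seq[i:i + k] for i in range(len(test_seq) - k + 1)]
    bins = [0.0] * p
    for w in wins:
        f = counts.get(w, 0)
        if f == 0:
            bins[0] += 1
        else:
            # map 1..fmax onto bins 1..p-1
            bins[1 + min(p - 2, (f - 1) * (p - 1) // fmax)] += 1
    return [b / len(wins) for b in bins]
```

The bins sum to 1, so the profile is a distribution over frequency levels for the test sequence's windows.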

8.1.1 Average 1-D frequency profiles

To characterize a given test data set, we aggregate the 1-D frequency profiles. We construct the average 1-D frequency profiles for the normal test sequences and the anomalous test sequences separately. It should be noted that the average profile might not be the best representation of the profiles. For example, let the test set contain 4 anomalous sequences. Using four bins (p = 4), let the frequency profiles for the four anomalous sequences be (1.00, 0, 0, 0), (0, 1.00, 0, 0), (0, 0, 1.00, 0), and (0, 0, 0, 1.00). The average frequency profile for the anomalous sequences will be (0.25, 0.25, 0.25, 0.25), which does not provide an accurate representation of the actual profiles. But if the individual frequency profiles are similar to each other, the average profile will be representative.

A test sequence data set can be characterized with respect to a normal data set by taking the difference between the average 1-D frequency profiles for the normal and anomalous test sequences. We will describe how this characteristic can be used to explain the behavior of window based techniques in Sect. 9.
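This characterization can be sketched as follows (function name ours; we take Δ, the value reported alongside the figures in the paper, to be the summed absolute difference between the two average profiles — an assumption on our part):

```python
def profile_delta(normal_profiles, anomalous_profiles):
    """Average the profiles of normal and anomalous test sequences
    separately, then return (difference vector, Delta), where Delta
    is the summed absolute difference: one number summarising how
    separable the two groups are under this characterization."""
    def avg(profiles):
        n = len(profiles)
        return [sum(col) / n for col in zip(*profiles)]
    an, aa = avg(normal_profiles), avg(anomalous_profiles)
    diff = [x - y for x, y in zip(an, aa)]
    return diff, sum(abs(d) for d in diff)
```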

8.2 2-D frequency profiles

The second characterization is motivated by Markovian techniques that rely on the frequency of a k-length window, as well as the frequency of the (k − 1)-length prefix of the window, in a given sequence for anomaly detection. Thus, each k-window is associated with a tuple (fk, fk−1), where fk is the frequency of occurrence of the k-window and fk−1 is the frequency of occurrence of the (k − 1)-length prefix of the given k-window in the training sequences.

A 2-D frequency profile for a test sequence can be constructed as follows. First, all k-windows from the test sequence are extracted and the associated tuples (fk, fk−1) are computed from the training sequences. The fk frequencies are binned into p bins in the same manner as for the 1-D frequency profiles. Similarly, the fk−1 frequencies are binned into p bins. Thus, every tuple (fk, fk−1) is assigned to a "cell" on a p × p grid. The values in each cell are normalized to lie between 0 and 1 by dividing them by the total number of windows in the given sequence. Thus each test sequence can be mapped into an ℝ^{p×p} space.

Note that the column aggregation of the 2-D frequency profile for a test sequence gives the 1-D frequency profile for that test sequence.
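A corresponding sketch for the 2-D profile, with fk−1 taken as the frequency of the window's (k − 1)-length prefix as defined above (function name and integer-arithmetic binning are ours):

```python
from collections import Counter

def freq_profile_2d(test_seq, train_seqs, k=6, p=4):
    """Map a test sequence to a p x p grid over (f_k, f_{k-1}):
    f_k is the training frequency of each k-window, f_{k-1} that of
    its (k-1)-length prefix. Cell (0, 0) collects windows whose
    window and prefix were never seen in training."""
    def count(n):
        c = Counter()
        for s in train_seqs:
            for i in range(len(s) - n + 1):
                c[s[i:i + n]] += 1
        return c
    ck, cp = count(k), count(k - 1)

    def to_bin(f, fmax):
        # bin 0 for f = 0; bins 1..p-1 split 1..fmax
        return 0 if f == 0 else 1 + min(p - 2, (f - 1) * (p - 1) // fmax)

    fmax_k, fmax_p = max(ck.values()), max(cp.values())
    grid = [[0.0] * p for _ in range(p)]
    wins = [test_seq[i:i + k] for i in range(len(test_seq) - k + 1)]
    for w in wins:
        r = to_bin(ck.get(w, 0), fmax_k)       # f_k bin
        c = to_bin(cp.get(w[:-1], 0), fmax_p)  # f_{k-1} bin
        grid[r][c] += 1
    return [[v / len(wins) for v in row] for row in grid]
```

Summing each row of the grid over the fk−1 axis recovers the 1-D profile, matching the column-aggregation remark above.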


8.2.1 Average 2-D frequency profiles

To characterize a given sequence data set using the 2-D frequency profiles, we follow the same procedure as for the 1-D frequency profiles. The frequency profiles for normal and anomalous sequences are aggregated separately to obtain an average normal 2-D frequency profile and an average anomalous 2-D frequency profile, respectively.

A test sequence data set can be characterized with respect to a normal data set by taking the difference between the average 2-D frequency profiles for the normal and anomalous test sequences. We will describe how this characteristic can be used to explain the behavior of Markovian techniques in Sect. 9.

9 Relationship between performance of techniques and frequency profiles

In this section we relate the performance of the window based (tSTIDE) and Markovian techniques (FSA, FSAz, PST, and RIPPER) to the 1-D and 2-D frequency profiles defined in Sects. 8.1 and 8.2.

9.1 tSTIDE

The performance of tSTIDE can be explained using the 1-D frequency profiles described in the previous section. The anomaly score assigned by tSTIDE is inversely proportional to the frequency of the k-windows in a given sequence. Hence the difference in the 1-D frequency profiles for normal and anomalous test sequences determines the relative performance of tSTIDE on a given test data set.
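This style of window-frequency scoring can be sketched as follows (a simplification, not the original t-STIDE implementation: each window's likelihood is its relative frequency among training windows, combined with the average log score; the function name is ours):

```python
import math
from collections import Counter

def tstide_score(test_seq, train_seqs, k=6, eps=1e-6):
    """t-STIDE-style sketch: score a test sequence by the negated
    average log of its k-windows' relative frequencies in the
    training data. Higher score = more anomalous; unseen windows
    are floored at eps."""
    counts, total = Counter(), 0
    for s in train_seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
            total += 1
    wins = [test_seq[i:i + k] for i in range(len(test_seq) - k + 1)]
    return -sum(math.log(max(counts.get(w, 0) / total, eps))
                for w in wins) / len(wins)
```

A sequence built from rare or unseen windows receives a much larger score than one built from frequent windows, which is exactly the behavior the 1-D profiles capture.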

For example, the average 1-D frequency profiles for the rvp data set in Fig. 4a are significantly different, and hence the accuracy of tSTIDE is 90% (see Table 4). For the bsm-week1 data set in Fig. 4b, the difference is not significant, and hence the performance of tSTIDE is relatively poor (20%).

Fig. 4 Average 1-D frequency profiles for 6-windows. a rvp, b bsm-week1


9.2 FSA

The tSTIDE technique distinguishes between normal and anomalous test sequences in terms of the frequency of the k-windows, fk. Often, fk alone is not distinguishing enough (see Fig. 4b). The FSA technique addresses this issue by considering the frequency of a k-window as well as the frequency of the (k − 1)-length suffix of the k-window.

The performance of FSA can be explained using the 2-D frequency profiles described in the previous section. FSA assigns an anomaly score to a sequence using the values fk and fk−1 for every k-window. Hence the difference in the average 2-D frequency profiles for normal and anomalous sequences determines its relative performance on the given data set.

For example, the average 2-D frequency profiles for the bsm-week1 data set are shown in Fig. 5a, b for normal and anomalous sequences, respectively. The grayscale intensity of each cell represents the magnitude of the relative proportion of k-windows falling in that cell. We compare the two profiles with the 1-D frequency profiles shown in Fig. 4b. The absolute difference between the normal and anomalous frequency profiles is shown in Fig. 5c, with one marker indicating that normal test sequences had a higher value for that cell than the anomalous test sequences, and the other marker indicating that normal test sequences had a lower value for that cell. Figures 6 and 7 show the plots (differences only) for the other public data sets. Note that if the 2-D profiles are collapsed onto the y-axis, we get the corresponding 1-D profiles. We note that even though the normal and anomalous sequences are not differentiable when only fk is considered, the difference is significant when both fk and fk−1 are considered. This is the reason why FSA performs better than tSTIDE on the bsm-week1 data set.
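The FSA-style scoring can be sketched as follows (a simplification, not the original implementation: following the definition in Sect. 8.2 we take fk−1 to be the frequency of the window's (k − 1)-length prefix; the `zeros` flag switches between FSA behavior, which skips windows whose prefix was never seen, and the FSAz variant discussed later, which scores them 0):

```python
import math
from collections import Counter

def fsa_score(test_seq, train_seqs, k=6, zeros=False, eps=1e-6):
    """FSA / FSAz-style sketch. Each k-window gets likelihood
    f_k / f_{k-1}: the probability of its last symbol given its
    (k-1)-length prefix. FSA (zeros=False) ignores windows whose
    prefix frequency is 0; FSAz (zeros=True) scores them 0."""
    ck, cp = Counter(), Counter()
    for s in train_seqs:
        for i in range(len(s) - k + 1):
            ck[s[i:i + k]] += 1
        for i in range(len(s) - k + 2):
            cp[s[i:i + k - 1]] += 1
    logs = []
    for i in range(len(test_seq) - k + 1):
        w = test_seq[i:i + k]
        fk, fkm1 = ck.get(w, 0), cp.get(w[:-1], 0)
        if fkm1 == 0:
            if zeros:
                logs.append(math.log(eps))  # FSAz: likelihood ~0
            continue                        # FSA: ignore the window
        logs.append(math.log(max(fk / fkm1, eps)))
    return -sum(logs) / max(len(logs), 1)
```

Unlike the t-STIDE sketch, a window that is rare in absolute terms can still receive likelihood 1 here if its prefix is equally rare.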

9.3 Comparing tSTIDE and FSA

The key distinction between tSTIDE and FSA is that the former makes use of the frequencies of k-windows, while the latter makes use of the frequencies of k-windows together with the frequencies of their (k − 1)-length suffixes. This distinction is illustrated in Fig. 8, which shows the scores assigned by tSTIDE and FSA to windows w(fk, fk−1). These scores are also referred to as likelihood scores and are the inverse of the anomaly scores of the windows. Since fk ≤ fk−1, the entries above the lower diagonal are ignored.

FSA ignores the k-windows for which fk−1 = 0, i.e., the bottom left corner of Fig. 8b. It is clear that the scores assigned by tSTIDE are independent of fk−1 and are linearly proportional to fk. Thus only high frequency windows will be assigned a high likelihood score by tSTIDE. But for FSA, k-windows with low fk can still be assigned a high score if the corresponding value of fk−1 is also low. This difference accounts for a key strength and weakness of each of tSTIDE and FSA.

Consider a scenario in which the training data set is not pure but contains one anomalous sequence, such that most of its k-windows (for a given value of k) do not occur in any other training sequence. Let there be a truly anomalous sequence


Fig. 5 2-D average frequency profiles for the bsm-week1 data set (k = 6). a Normal test sequences, b anomalous test sequences, c difference


Fig. 6 Absolute difference in 2-D frequency profiles for public data sets for k = 6 (hcv – snd-unm). a hcv (Δ = 1.56), b nad (Δ = 0.64), c tet (Δ = 0.73), d rvp (Δ = 1.71), e rub (Δ = 1.62), f snd-unm (Δ = 1.68)

in the test data set which is similar to the one anomalous training sequence. Most of the k-windows extracted from this test sequence will have fk = 1. tSTIDE will assign a high anomaly score to this test sequence. Though the value of fk−1 cannot be guaranteed, it is likely that fk−1 ≈ 1. Thus FSA will assign a high likelihood score to the k-windows of the anomalous test sequence, and hence assign it a low anomaly score. Thus tSTIDE is the better technique in this scenario.


Fig. 7 Absolute difference in 2-D frequency profiles for public data sets for k = 6 (snd-cert – bsm-week3). a snd-cert (Δ = 1.54), b bsm-week1 (Δ = 0.23), c bsm-week2 (Δ = 0.27), d bsm-week3 (Δ = 0.61)

Now consider a different scenario, in which the training data set contains a sequence that consists of k-windows that do not occur in any other training sequence, but are normal. Let the test data contain one truly normal sequence similar to this training sequence. tSTIDE will assign a high anomaly score to this test sequence because the windows extracted from it will have fk = 1. But, by the argument used for the previous scenario, FSA will assign it a low anomaly score. Thus FSA is the better technique in this scenario.

To summarize, tSTIDE is more robust when the training data might not be pure, i.e., when it might contain anomalous sequences. FSAz is a better choice when the training data contains rare but normal patterns (windows) that have to be learnt.

9.4 FSAz

One issue with FSA is that it ignores the k-windows for which fk = fk−1 = 0. But often, such windows can differentiate between normal and anomalous sequences. The plots of the differences between the 2-D average frequency profiles for normal and anomalous sequences, shown in Figs. 6 and 7, show that for several data sets the anomalous test sequences have a higher proportion of such windows than the normal test sequences.


Fig. 8 Likelihood scores L(fk, fk−1) assigned by different techniques. a tSTIDE, b FSAz

Our proposed technique, FSAz, utilizes this information by assigning a likelihood score of 0 to such windows, instead of ignoring them. This makes FSAz perform better than FSA for most data sets.

9.5 PST

An issue with FSA (and FSAz), as noted earlier, is that they estimate the conditional probability of a symbol based on its fixed length history, even if the history occurs only once in the training sequences. Such estimates can be unreliable, and hence make the techniques highly susceptible to the presence of anomalies in the training set. PST


addresses this issue by conditioning the probability of a symbol on its k-length history only if the history occurs a significant number of times in the training sequences. If the frequency of the history is low, i.e., the conditional probability estimate is unreliable, it uses the longest suffix of the history which satisfies the reliability threshold.

The score assigned by PST to a k-window is lower bounded by the score assigned by FSAz. The actual score assigned by PST depends not only on fk and fk−1, but also on δ, which is a threshold on fk−1. The value of δ is determined using user-defined parameters and the training sequences. For a given k-window, if fk−1 ≥ δ the score assigned by PST is the same as that of FSAz. Otherwise, PST chooses the longest suffix of the k-window of length j (2 ≤ j ≤ k) such that fj−1 ≥ δ. If f1 < δ, PST assigns a score equal to the probability of observing the last (kth) symbol of the given window.
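The suffix-fallback rule can be sketched as follows (a simplification of PST, not the original algorithm: δ stands in for the reliability threshold derived from the user-defined parameters, and the helper names are ours):

```python
from collections import Counter

def ngram_counts(train_seqs, k):
    """Training frequencies of all n-grams for n = 1..k."""
    counts = {}
    for n in range(1, k + 1):
        c = Counter()
        for s in train_seqs:
            for i in range(len(s) - n + 1):
                c[s[i:i + n]] += 1
        counts[n] = c
    return counts

def pst_likelihood(window, counts, delta=2):
    """PST-style sketch: P(last symbol | longest suffix of its
    history whose training frequency is at least delta). Assumes
    len(window) <= the k used to build counts."""
    last, hist = window[-1], window[:-1]
    while hist:
        f_hist = counts[len(hist)].get(hist, 0)
        if f_hist >= delta:
            return counts[len(hist) + 1].get(hist + last, 0) / f_hist
        hist = hist[1:]  # fall back to the next shorter suffix
    # no reliable history at all: marginal probability of the symbol
    total = sum(counts[1].values())
    return counts[1].get(last, 0) / total if total else 0.0
```

When the full history is frequent enough, this coincides with the FSAz estimate fk/fk−1; otherwise it backs off to progressively shorter suffixes, which is exactly the behavior analysed in the example below.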

For example, Fig. 9a–e show the difference in the frequency profiles for normal and anomalous test sequences for the rub data set for different values of k. To assign a score to a k-window for k = 6, the PST technique will first consider the frequency profile for k = 6. Let us assume that the k-window to be scored has fk−1 < δ. In this case PST will substitute the score with the score of a window of length k − 1 in the frequency profile for (k − 1)-length windows. If for the (k − 1)-window fk−2 ≥ δ, the corresponding score will be used; otherwise the frequency profile for k − 2 is considered, and so on.

Using this understanding of PST, we can explain why PST performs significantly worse than FSAz for most of the public data sets. Let us consider the rub data set. Figure 9a shows the difference in the frequency profiles of normal and anomalous test sequences for k = 6. The distinguishing cells in the profile are mostly located in the bottom left corner, and there is a single distinguishing cell in the upper right corner. Both PST and FSAz will assign similar scores to the k-windows belonging to the cell in the upper right corner. For the cell in the bottom leftmost corner, FSAz will assign a 0 score, and for the other cells FSAz will assign a higher score. Thus FSAz will be able to distinguish between normal and anomalous test sequences, which supports our experimental finding that the accuracy of FSAz on this data set is 0.88. For PST, all k-windows belonging to the cells in the bottom left corner will have fk−1 < δ, and hence their scores will be substituted with scores for shorter suffixes of the k-windows. Thus the windows in the bottom leftmost cell in Fig. 9a will be scored the same as the shorter windows (of length j < k) in Fig. 9b–e for which fj−1 ≥ δ. But it is evident from the plots that normal and anomalous test sequences are not significantly distinguishable for higher values of fj−1. This is the reason why PST performs poorly for this data set (0.28).

While the above mentioned behavior of PST is an obvious disadvantage for most of the data sets, it can also favor PST in certain cases. For example, PST performs well in comparison to FSAz on the artificial data set d6. The frequency profiles of normal and anomalous test sequences for d6 are shown for different values of k in Fig. 10a–e. We observe that the frequency profiles for normal and anomalous test sequences are not distinguishable for k = 6, and hence FSAz performs poorly (0.38). But when frequency profiles for lower values of k (j ≤ 3) are considered by PST, the profiles for normal and anomalous sequences are relatively more distinguishable (even for larger values of fj−1), and hence PST performs better (0.68).


Fig. 9 Absolute difference in frequency profiles for the rub data set. a k = 6 (Δ = 1.62), b k = 5 (Δ = 1.62), c k = 4 (Δ = 1.37), d k = 3 (Δ = 0.32), e k = 2 (Δ = 0.13)

9.6 RIPPER

The motivation behind RIPPER is the same as for PST: if the fixed length history of a symbol in a test sequence does not have a reliable frequency in the training sequences, the symbol is conditioned on a subset of the history. The difference is that the subset is not a suffix of the history (as is the case with PST), but a subsequence of the history.


Fig. 10 Absolute difference in frequency profiles for the d6 data set. a k = 6 (Δ = 0.19), b k = 5 (Δ = 0.19), c k = 4 (Δ = 0.20), d k = 3 (Δ = 0.19), e k = 2 (Δ = 0.16)

RIPPER, like PST, assigns each k-window a score which is lower bounded by the score assigned by FSAz. The actual score assigned by RIPPER depends on the RIPPER rule that is "fired" for the (k − 1)-length prefix of the given window. If the target of the fired rule matches the kth symbol of the given window, the likelihood score is 1; otherwise the likelihood score is the inverse of the confidence associated with the rule. It is difficult


[Figure: histograms of the average similarity of test sequences to the training sequences, with separate distributions for normal and anomalous sequences; panel a shows data set d1, panel b data set d6.]

Fig. 11 Histogram of average similarities of normal and anomalous test sequences to training sequences. a Data set d1, b data set d6

to analytically estimate the actual scores assigned by RIPPER, but generally the scores assigned by RIPPER are higher than those of FSAz and lower than those of PST.

As mentioned earlier, the scores assigned by RIPPER are the same as those of FSAz for higher values of fk. For lower values of fk, the scores depend on the distribution of k-windows in the training data set, as well as on how the underlying classifier (RIPPER) learns the rules and the order in which the rules are applied. Generally speaking, the score assigned by RIPPER to such windows is greater than 0 but lower than the score assigned by PST.

The above mentioned behavior of RIPPER results in its poor performance in cases in which the cells with lower values of fk are distinguishing and the anomalous test sequences have a higher proportion of windows in those cells than the normal test sequences. RIPPER assigns a higher overall likelihood score to the anomalous test sequences and hence is not able to distinguish them from normal sequences. For all PFAM data sets the distinguishing cells have lower fk values, resulting in the poor performance of RIPPER. For the UNM data sets, the distinguishing cells have higher values of fk, and hence the performance of RIPPER is very close to that of FSAz.

10 Impact of nature of similarity measure on performance of anomaly detection techniques

Kernel based techniques (kNN and CLUSTER) are distinct from the window based and Markovian techniques because they rely on the similarity between a test sequence and the training sequences to assign an anomaly score to the test sequence. Thus their performance can be explained using the average similarity to training sequences characteristic.

One distinction between normal and anomalous sequences is that normal test sequences are expected to be more similar (under a given similarity measure) to the training sequences than anomalous test sequences. If the difference in similarity is not large, this characteristic will not be able to accurately distinguish between normal


Table 8 Values of sn, sa for the public data sets

         hcv   nad   tet   rvp   rub   snd-unm  snd-cert  bsm-week1  bsm-week2  bsm-week3
sn       0.53  0.48  0.67  0.82  0.75  0.99     0.99      0.97       0.98       0.97
sa       0.38  0.38  0.37  0.36  0.37  0.50     0.38      0.88       0.81       0.73
sn − sa  0.15  0.10  0.30  0.46  0.38  0.49     0.61      0.09       0.17       0.24

Table 9 Values of sn, sa for the artificial data sets

          d1    d2    d3    d4    d5    d6
sn        0.87  0.87  0.86  0.86  0.86  0.86
sa        0.45  0.63  0.63  0.73  0.76  0.78
sn − sa   0.42  0.24  0.23  0.13  0.10  0.08

and anomalous sequences. This characteristic is utilized by the kernel based techniques (kNN and CLUSTER) to distinguish between normal and anomalous sequences.

For example, Fig. 11a shows the histogram of the average (nLCS) similarities of test sequences in the artificial data set d1 to the training sequences. The normal test sequences are more similar to the training sequences than the anomalous test sequences. This indicates that techniques that use similarity between sequences to distinguish between anomalous and normal sequences will perform well for this data set. From Table 6, we can observe that the performance of CLUSTER as well as kNN is 100 % on d1. A similar histogram for data set d6 is shown in Fig. 11b, which shows that the average similarities of normal test sequences and the average similarities of anomalous test sequences are very close to each other. This confirms the observation in Table 6 that CLUSTER and kNN perform poorly for this data set.

We quantify the above characteristic by computing the average sequence similarity for each test sequence. Let the average of the average similarities for normal test sequences be denoted as sn, and the average of the average similarities for anomalous test sequences be denoted as sa. If, for a given data set, the difference sn − sa is large, kNN and CLUSTER are expected to perform well on that data set, and vice-versa.

Tables 8 and 9 show the values of sn, sa, and sn − sa for the real and artificial data sets, respectively. The performance of both kNN and CLUSTER is highly correlated with the difference sn − sa.
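A minimal sketch of how sn and sa could be computed is given below. It assumes that nLCS is the longest common subsequence length normalized by the geometric mean of the two sequence lengths; the toy training and test sequences are purely illustrative.

```python
from math import sqrt

def lcs_length(x, y):
    # Classic O(|x||y|) dynamic program for the LCS length.
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def nlcs(x, y):
    # Length-normalized LCS similarity, in [0, 1].
    return lcs_length(x, y) / sqrt(len(x) * len(y))

def avg_sim(seq, training):
    # Average similarity of one test sequence to the training set.
    return sum(nlcs(seq, t) for t in training) / len(training)

def sn_sa(normal_test, anomalous_test, training):
    # sn (sa): mean over normal (anomalous) test sequences of their
    # average similarity to the training sequences.
    s_n = sum(avg_sim(s, training) for s in normal_test) / len(normal_test)
    s_a = sum(avg_sim(s, training) for s in anomalous_test) / len(anomalous_test)
    return s_n, s_a

training = ["abcabc", "abcabd", "abcbbc"]
s_n, s_a = sn_sa(["abcabe"], ["xyzxyz"], training)
```

A large gap sn − sa, as for d1 in Table 9, predicts good performance for kNN and CLUSTER; a small gap, as for d6, predicts poor performance.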

11 Using RBA features for anomaly detection

A key aspect of the RBA framework is that it maps data instances into a multivariate continuous space, where normal and anomalous instances can be distinguished from each other. More formally, the mapping can be expressed as a function, fθ : S → R^d, that transforms a discrete sequence S into a d-dimensional space, where θ denotes the parameters associated with the mapping function.


Input: k, p, nn, S, T
Output: A
1:  T ← [ ]
2:  foreach T ∈ T do
3:      T ← T ∪ f_{k,p}(T)
4:  end
5:  A ← Array[|S|];  i ← 0
6:  foreach S ∈ S do
7:      S ← f_{k,p}(S)
8:      A[i] ← kNN anomaly score for S with respect to T using nn neighbors
9:      i ← i + 1
10: end
11: return A

Algorithm 1: RBA Based Anomaly Detection

One particular instantiation of the above mapping was sketched in Sect. 8.1. A 1-D frequency profile using p bins and window length k is a p-dimensional vector F = f^1D_{k,p}(S), such that F[i] denotes the proportion of k-length windows with frequency falling in the i-th bin of the histogram obtained from the training sequences in T.

Another instantiation can be derived from the 2-D frequency profiles discussed in Sect. 8.1. The p × p 2-D frequency profile using p bins and window length k can be converted into a p(p+1)/2-dimensional vector by "flattening" the lower triangular portion (including the diagonal entries) of the frequency profile. We denote this mapping by f^2D_{k,p}.
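The 1-D frequency-profile mapping can be sketched as follows. The binning scheme used here (p equal-width bins over the observed range of training window frequencies, with unseen windows falling in the lowest bin) is an illustrative assumption; the paper constructs the histogram from the training sequences.

```python
from collections import Counter

def k_windows(seq, k):
    # All overlapping windows of length k.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def train_frequencies(training, k):
    # Relative frequency of each k-window across the training set.
    counts = Counter(w for t in training for w in k_windows(t, k))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def f1d(seq, k, p, freq, max_freq):
    # F[i] = proportion of the sequence's k-windows whose training
    # frequency falls in the i-th of p equal-width bins.
    wins = k_windows(seq, k)
    F = [0.0] * p
    for w in wins:
        f = freq.get(w, 0.0)  # unseen windows have frequency 0
        b = min(int(p * f / max_freq), p - 1) if max_freq > 0 else 0
        F[b] += 1.0 / len(wins)
    return F

training = ["abcabcabc", "abcabdabc"]
freq = train_frequencies(training, k=3)
F = f1d("abcabcabc", k=3, p=3, freq=freq, max_freq=max(freq.values()))
```

The 2-D mapping would analogously flatten the lower triangle of a p × p joint histogram into p(p+1)/2 features.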

Algorithm 1 is a sketch of an RBA feature based anomaly detection method which relies on the above described functional mapping. The key idea is to map the training and test sequences to a d-dimensional continuous space and apply a traditional distance based anomaly detection algorithm (such as the nearest neighbor based method of Ramaswamy et al. (2000)) to assign an anomaly score to each test sequence. The algorithm takes as input the set of training sequences, T, the set of test sequences, S, the number of bins for the histograms, p, the window length parameter, k, and the number of nearest neighbors to consider for anomaly detection, nn. The output of the algorithm is a vector A where each entry is the anomaly score assigned to the corresponding test sequence in S.
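Algorithm 1 can be sketched in Python as follows, assuming Euclidean distance in the mapped space and the distance-to-nn-th-neighbor score of Ramaswamy et al. (2000); the function toy_map is a hypothetical stand-in for the frequency-profile mappings.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def rba_anomaly_scores(train_seqs, test_seqs, mapping, nn):
    # Map every training sequence into the continuous feature space.
    train_vecs = [mapping(t) for t in train_seqs]
    scores = []
    for s in test_seqs:
        v = mapping(s)  # Step 7: map the test sequence
        # Step 8: anomaly score = distance to the nn-th nearest
        # training vector (Ramaswamy et al. 2000).
        nearest = sorted(dist(v, t) for t in train_vecs)
        scores.append(nearest[nn - 1])
    return scores

# Hypothetical mapping: proportions of symbols 'a' and 'b' in the
# sequence, standing in for f^1D/f^2D from the text.
def toy_map(seq):
    return [seq.count("a") / len(seq), seq.count("b") / len(seq)]

train = ["aabb", "abab", "abba", "baab"]
scores = rba_anomaly_scores(train, ["abab", "bbbb"], toy_map, nn=2)
```

Here the test sequence that looks like the training data receives score 0, while the all-'b' sequence receives a strictly larger score.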

By plugging the above two functional mappings (f^1D_{k,p} and f^2D_{k,p}) into Step 7 of Algorithm 1, we obtain two variants, named WIN1D and WIN2D, respectively.

11.1 Results on public and artificial data sets

We evaluate the performance of the proposed techniques on the public and artificial data sets described in Sect. 4.

11.1.1 Sensitivity to parameters

We first investigate the sensitivity of the performance of WIN1D and WIN2D to the different parameters. The techniques gave the best overall performance for window size


Fig. 12 Comparison of average accuracies for WIN1D and WIN2D, and existing anomaly detection techniques (CLUSTER, kNN, tSTIDE, FSA, FSAz, PST, RIPPER, HMM). a Public data sets, b artificial data sets

k = 6, which was also the best performing window size for the existing window based and Markovian techniques. The performance of both techniques was not sensitive to the number of nearest neighbors. The techniques gave the best performance when the number of bins used to construct the profile, p, was low (≈ 3). For larger values of p, the dimensionality of the mapped data increased, and hence the performance of the distance based anomaly detection technique deteriorated.

For the results provided in the subsequent section, the optimal parameter settings were found by testing on a validation set with different combinations of the parameters (p, k, nn) and using the combination that provides the best average results across all data sets. The results are shown for p = 5, k = 6, and nn = 5.

11.1.2 Comparison with all existing techniques

The first set of results shows how WIN1D and WIN2D compare against the state of the art techniques, discussed in Sect. 3, on the different public and artificial data sets. The comparison of the average performance of the proposed techniques with the existing techniques is shown in Fig. 12. Notably, WIN1D, which is based on tSTIDE, shows better performance than tSTIDE, and WIN2D, which is based on FSAz, shows better performance than FSAz on both public and artificial data sets. Overall, WIN2D performs significantly better than all existing techniques on average across all public and artificial data sets.

The reason the proposed techniques perform better than the existing techniques is the way the windows are utilized by the proposed techniques. For example, let us consider tSTIDE and WIN1D. Both of these techniques use the frequency of k-windows to distinguish between the normal and anomalous test sequences. But tSTIDE weights the windows in a test sequence by their frequencies and then sums the weights to get an inverse of the anomaly score. On the other hand, WIN1D bins the windows based on their frequency, and then uses the normalized bin counts as features. By using a nearest neighbor approach, WIN1D "learns" weights on different windows to achieve the best separability between the normal and anomalous test sequences. The same holds true for FSAz and WIN2D.
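The contrast can be made concrete with a toy sketch: a tSTIDE-style score averages the raw training frequencies of a test sequence's windows (an inverse anomaly score), while a WIN1D-style mapping only records the fraction of windows per frequency bin and leaves the weighting to the downstream nearest-neighbor detector. The two-bin threshold below is an illustrative simplification of the p-bin histogram.

```python
from collections import Counter

def windows(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def tstide_style_score(seq, k, freq):
    # Average training frequency of the sequence's windows; low
    # values suggest an anomaly (inverse anomaly score).
    ws = windows(seq, k)
    return sum(freq.get(w, 0.0) for w in ws) / len(ws)

def win1d_style_features(seq, k, freq, threshold):
    # Two-bin frequency profile: fraction of windows whose training
    # frequency is below / at-or-above the threshold. How much each
    # bin matters is decided later by the kNN detector.
    ws = windows(seq, k)
    low = sum(1 for w in ws if freq.get(w, 0.0) < threshold) / len(ws)
    return [low, 1.0 - low]

train = ["abcabcabc", "abcabdabc"]
counts = Counter(w for t in train for w in windows(t, 3))
total = sum(counts.values())
freq = {w: c / total for w, c in counts.items()}

inv_score = tstide_style_score("abdabd", 3, freq)
feats = win1d_style_features("abdabd", 3, freq, threshold=0.1)
```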


Table 10 Comparing accuracy of WIN1D and WIN2D against best existing technique for public data sets

               PFAM                          UNM                 DARPA
               hcv   nad   tet   rvp   rub   snd-unm  snd-cert  bsm-week1  bsm-week2  bsm-week3  Avg
WIN1D          0.92  0.74  0.52  0.90  0.88  0.82     0.88      0.30       0.60       0.66       0.72
WIN2D          0.92  0.76  0.82  0.92  0.92  0.84     0.88      0.50       0.60       0.66       0.78
Existing best  0.92  0.74  0.50  0.90  0.88  0.82     0.88      0.50       0.56       0.66       0.74

Table 11 Comparing AUC of WIN1D and WIN2D against best existing technique for public data sets

               PFAM                          UNM                 DARPA
               hcv   nad   tet   rvp   rub   snd-unm  snd-cert  bsm-week1  bsm-week2  bsm-week3  Avg
WIN1D          1.00  0.98  0.98  1.00  1.00  0.99     0.98      0.75       0.92       0.92       0.95
WIN2D          1.00  0.99  0.99  1.00  1.00  0.99     0.98      0.91       0.93       0.92       0.97
Existing best  1.00  0.98  0.98  1.00  1.00  0.99     0.96      0.88       0.91       0.97       0.97

Table 12 Comparing accuracy of WIN1D and WIN2D against best existing technique for artificial data sets

               d1    d2    d3    d4    d5    d6    Avg
WIN1D          1.00  0.92  0.58  0.52  0.34  0.64  0.67
WIN2D          1.00  0.96  0.81  0.76  0.71  0.74  0.83
Existing best  1.00  0.84  0.82  0.76  0.68  0.68  0.80

Table 13 Comparing AUC of WIN1D and WIN2D against best existing technique for artificial data sets

               d1    d2    d3    d4    d5    d6    Avg
WIN1D          1.00  1.00  0.91  0.89  0.76  0.87  0.90
WIN2D          1.00  1.00  0.98  0.98  0.96  0.98  0.98
Existing best  1.00  0.96  0.98  0.98  0.96  0.95  0.97

11.1.3 Comparison with best existing technique

The strength of the RBA based techniques, WIN1D and WIN2D, is that they distinguish between normal and anomalous test sequences in a multi-dimensional space, while most of the existing techniques operate along one or a limited subset of the dimensions. In the second set of results we assess whether this strength allows the RBA based techniques to outperform the existing techniques.

In Table 10, we compare the accuracy results for WIN1D and WIN2D against the best existing technique for each public data set. The results show that both WIN1D and WIN2D are strictly better than or comparable with the best existing technique for almost


all of the public data sets. The same inference can be drawn from the AUC results for the public data sets in Table 11.

For the artificial data sets, the performance of WIN2D is still significantly better than the best existing technique for each data set, as shown in Tables 12 and 13. Notably, for the artificial data sets, the performance of WIN1D is relatively worse than the best existing technique.

For the artificial data sets, PST was found to be the best technique, while both FSA and FSAz were found to perform poorly on many artificial data sets. The reason is that the artificial data sets were designed to break FSA and FSAz, while PST, which utilizes the frequencies of varying length suffixes of the k-length windows, was able to distinguish between the normal and anomalous test sequences. WIN2D captures and improves on this behavior of PST, and hence WIN2D outperforms PST on the artificial data sets.

12 Conclusions and future work

In this paper we have shown how the RBA framework can be used in the context of anomaly detection for symbolic sequences. Visualizing symbolic sequences is challenging, especially when the sequences are of varying length. Using the RBA framework we provide a visualization scheme for symbolic sequences.

The RBA based mapping for symbolic sequences is motivated by the existing techniques that use fixed length windows as the unit of analysis. In this paper we have shown how, using the RBA based features, one can understand the performance of the different existing techniques. Moreover, the framework can also be used to identify the fundamental differences between techniques. For example, tSTIDE and FSA are shown to be highly different from each other since they handle k-windows in distinct manners. The framework also allows one to identify the weaknesses of each technique. For example, the poor performance of PST on most of the real data sets could be explained using the framework. The same framework can also be used to construct scenarios in which a given technique would perform well or poorly.

The analysis of the various distinguishing characteristics can also aid in choosing optimal parameter values for the different techniques. For example, Fig. 9 shows the magnitude of the difference between normal and anomalous sequences in the rub data set for different values of the window size k. The maximum difference occurs when k = 5 or 6. Our results indicate that all techniques that depend on the window size as a parameter give optimal performance for these values of k. Similarly, for kNN and CLUSTER, the difference in the corresponding characteristic for normal and anomalous test sequences can be calculated for different values of the parameter k. The value of k that results in the maximum difference in terms of the characteristic is likely to give the best performance on that data set. One could argue that, given a labeled validation data set, a technique can be evaluated for different parameter values to obtain the optimal value. But using the proposed framework, the analysis needs to be done only for the characteristic, without having to test every technique that depends on that characteristic.

The most significant outcome of applying the RBA framework to symbolic sequences is that the features obtained from the mapping can be used to develop


powerful anomaly detection techniques, which outperform the existing techniques. Moreover, the RBA based techniques are shown to be better than the best existing technique for most of the data sets. Thus, instead of using different existing techniques that are optimal for different data sets, RBA provides one best technique across a variety of data sets. This is a significant step towards the ultimate goal of anomaly detection research, which is to find a technique that can perform well across all application domains.

Acknowledgments This work was supported by NASA under award NNX08AC36A, NSF Grant CNS-0551551 and NSF Grant IIS-0713227. Access to computing facilities was provided by the Digital Technology Consortium.

References

Bateman A, Birney E, Durbin R, Eddy SR, Howe KL, Sonnhammer EL (2000) The Pfam protein families database. Nucleic Acids Res 28:263–266
Baum LE, Petrie T, Soules G, Weiss N (1970) A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann Math Stat 41(1):164–171
Budalakoti S, Srivastava A, Otey M (2007) Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. In: Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, vol 37
Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41(3):1–58
Chandola V, Banerjee A, Kumar V (2012) Anomaly detection for discrete sequences: a survey. IEEE Trans Knowl Data Eng 24:823–839
Chandola V, Boriah S, Kumar V (2009) A framework for exploring categorical data. In: Proceedings of the ninth SIAM International Conference on Data Mining
Chandola V, Boriah S, Kumar V (2010) A reference based analysis framework for analyzing system call traces. In: CSIIRW '10: Proceedings of the 6th Annual Workshop on Cyber Security and Information Intelligence Research, New York, NY, USA. ACM
Chandola V, Mithal V, Kumar V (2008) A comparative evaluation of anomaly detection techniques for sequence data. In: Proceedings of the International Conference on Data Mining
Chandola V, Mithal V, Kumar V (2008) Comparing anomaly detection techniques for sequence data. Technical Report 08-021, University of Minnesota, Computer Science Department, July 2008
Cohen WW (1995) Fast effective rule induction. In: Prieditis A, Russell S (eds) Proceedings of the 12th International Conference on Machine Learning. Morgan Kaufmann, Tahoe City, pp 115–123
Eskin E, Lee W, Stolfo S (2001) Modeling system calls for intrusion detection using dynamic window sizes. In: Proceedings of DISCEX
Forney GD Jr (1973) The Viterbi algorithm. Proc IEEE 61(3):268–278
Forrest S, Hofmeyr SA, Somayaji A, Longstaff TA (1996) A sense of self for Unix processes. In: Proceedings of the ISRSP96, pp 120–128
Forrest S, Warrender C, Pearlmutter B (1999) Detecting intrusions using system calls: alternate data models. In: Proceedings of the 1999 IEEE ISRSP, Washington, DC, USA, pp 133–145. IEEE Computer Society
Gao B, Ma H-Y, Yang Y-H (2002) HMMs (hidden Markov models) based anomaly intrusion detection method. In: Proceedings of the International Conference on Machine Learning and Cybernetics, pp 381–385. IEEE
Gonzalez FA, Dasgupta D (2003) Anomaly detection using real-valued negative selection. Genet Program Evolvable Mach 4(4):383–403
Hodge V, Austin J (2004) A survey of outlier detection methodologies. Artif Intell Rev 22(2):85–126
Hofmeyr SA, Forrest S, Somayaji A (1998) Intrusion detection using sequences of system calls. J Comput Secur 6(3):151–180
Lazarevic A, Ertoz L, Kumar V, Ozgur A, Srivastava J (2003) A comparative study of anomaly detection schemes in network intrusion detection. In: Proceedings of the SIAM International Conference on Data Mining. SIAM, May 2003


Lee W, Stolfo S (1998) Data mining approaches for intrusion detection. In: Proceedings of the 7th USENIX Security Symposium, San Antonio, TX
Lee W, Stolfo S, Chan P (1997) Learning patterns from Unix process execution traces for intrusion detection. In: Proceedings of the AAAI 97 Workshop on AI Methods in Fraud and Risk Management
Lippmann RP, et al. (2000) Evaluating intrusion detection systems: the 1998 DARPA off-line intrusion detection evaluation. In: DARPA Information Survivability Conference and Exposition (DISCEX), vol 2, pp 12–26. IEEE Computer Society Press
MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: Cam LM, Neyman J (eds) Proceedings of the fifth Berkeley Symposium on Mathematical Statistics and Probability, vol 1. University of California Press, Berkeley, pp 281–297
Michael CC, Ghosh A (2000) Two state-based approaches to program-based anomaly detection. In: Proceedings of the 16th Annual Computer Security Applications Conference, p 21. IEEE Computer Society
Qiao Y, Xin XW, Bin Y, Ge S (2002) Anomaly intrusion detection method based on HMM. Electron Lett 38(13):663–664
Ramaswamy S, Rastogi R, Shim K (2000) Efficient algorithms for mining outliers from large data sets. In: Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM
Ray A (2004) Symbolic dynamic analysis of complex systems for anomaly detection. Signal Process 84(7):1115–1130
Shalizi CR, Klinkner KL (2004) Blind construction of optimal nonlinear recursive predictors for discrete sequences. In: Chickering M, Halpern JY (eds) Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference (UAI 2004). AUAI Press, Arlington, Virginia, pp 504–511
Srivastava AN (2005) Discovering system health anomalies using data mining techniques. In: Proceedings of the 2005 Joint Army Navy NASA Airforce Conference on Propulsion
Sun P, Chawla S, Arunasalam B (2006) Mining for outliers in sequential databases. In: SIAM International Conference on Data Mining
