
MSc Artificial Intelligence
Track: Machine Learning

Master Thesis

Alarming the Intrusion Expert
Alarm mining for a high performance intrusion detection system

by

Jos van der Velde
100852573

October 26, 2016

42 ECTS
January 11, 2016 - October 26, 2016

Supervisors:
Dr. M. W. van Someren (UvA)
Drs. T. Matselyukh (OPT/NET)

Assessor:
Dr. J. M. Mooij (UvA)

OPT/NET BV


Alarming the Intrusion Expert:
Alarm mining for a high performance intrusion detection system

October 26, 2016

Abstract

Nowadays, intrusion detection systems are indispensable to reveal infiltrators and misconfigurations in networks. On large streams of network data, these systems will raise unmanageable numbers of alarms. This problem can be alleviated by grouping alarms together, so that experts have to check only a single instance of each alarm group instead of every alarm. Various alarm mining approaches have been proposed, using either unsupervised clusterers, supervised classifiers, or a combination. Although combined approaches show strong performance while requiring only a small labelled dataset, existing studies have only applied them to detectors that are based on machine learning. The present study employs an unsupervised algorithm to cluster the alarms of an existing detector that is based on manually created rules. Subsequently, a supervised classifier allows experts to refine the alarm groups. The results promise manageable numbers of homogeneous alarm groups. This approach is helpful because it makes the analysis of alarms feasible for large networks, where intrusion detectors based on machine learning would require too many resources.


Acknowledgements

Firstly, I would like to express my sincere gratitude to both my supervisors: Dr. Maarten van Someren from the University of Amsterdam and Drs. Taras Matselyukh from OPT/NET, for their excellent support. During the project, Dr. Van Someren was a fantastic guide and motivator, and provided precise and swift feedback. Drs. Matselyukh proved to be an outstanding source of domain knowledge and fruitful discussions, provided access to OPT/NET's systems and software, and even allowed me the opportunity to present the work at the Toulouse Space Show. Besides my supervisors, I would like to thank Dr. Joris Mooij for agreeing to be a part of my defense committee. Finally, I would like to thank Hilko van der Leij, Joshua Snell and especially Anne Martens for leaving no stone unturned while proofreading my thesis.


1 Introduction

1.1 Motivation

Networks are constantly bombarded by various types of attacks, including intrusions by masqueraders who obtained a password, data spying or modification, and denial of service attacks. Since the networks of most organisations are nowadays connected to the internet, even those containing highly sensitive information, vulnerability to such attacks is a problem of utmost significance.

When prevention of attacks fails, as it inevitably will, organisations rely on intrusion detection to ensure that (automatic) measures can be taken promptly. Two approaches to intrusion detection can be distinguished: signature-based and anomaly-based detection[29]. The former detects attacks by comparing the network activity with a database of known attack signatures. This ensures fast and reliable detection as long as the signature database stays up to date. The main drawback of signature-based intrusion detection is therefore its vulnerability to zero-day attacks (i.e. attacks exploiting a software vulnerability of which the vendor is not aware). This problem is severe, since software updates continually leave new loopholes and the creativity to find new attacks never seems to dwindle.

A solution for patching this vulnerability is given by anomaly-based intrusion detection. This approach forms a profile of normal network activity instead of focusing directly on misuses. For the detection phase, it assumes that new attacks will lead to new patterns of activity. Such patterns arise when the attacker tries to obtain access, or afterwards while exploiting this access. For instance, a change of configuration might signify a malicious user corrupting the network. Given enough normal network activity, features and time to train, this approach will detect any abnormality.

The main drawback of anomaly-based intrusion detection is a high number of false alarms, resulting in an imposing burden of manual labour. These false alarms are inherent to the fundamental assumption of anomaly-based systems: that any new activity pattern equals an intrusion. In reality, researchers estimate that only one out of a hundred new activity patterns denotes an intrusion[4, p.1, p.3][28, p.444][45, p.2][23, p.2]. The other ninety-nine of the unseen activity patterns will raise a false alarm. Another cause of false alarms are patterns of benign activity that cannot be distinguished from malicious patterns. The latter may be an unavoidable sacrifice when opting for near real-time performance on large streams of data, when not all information can be processed in time.

This thesis focuses on reducing the number of false alarms of an anomaly-based intrusion detection system. The problem of reducing false alarms is significant because false alarms require costly human intervention and may lead to human errors due to complacency.1

Moreover, we use an anomaly-based intrusion detection system that is based on manually created rules. Such a detector does not require a computationally expensive training phase, in contrast to detectors that are based on machine learning, making it capable of handling larger streams of network data. The drawback, of course, is the reliance on experts to perform knowledge acquisition.

1An extra motivation for reducing false alarms in anomaly-based intrusion detection is the genericity of the problem. When reduced to the extensive problem of finding patterns that do not comply to expected behavior, i.e. anomaly detection, similar software can be found in fraud detection, medical diagnosis, damage detection and image processing.[11, p. 5]

The motivation behind this thesis can thus be summarized as “minimizing human interventions when detecting intrusions on the largest streams of network data.”

1.2 Problem statement

The problem studied in this thesis regards the reduction of false alarms in the OPTOSS intrusion detection system. We will apply alarm mining to this problem. The goal is to group similar alarms together, so that the human expert needs to assess only a single instance of each alarm group, instead of every single alarm. Although this is not a literal reduction of the false alarms, it reduces the number of false alarms that need to be checked. For example, the human expert might have assessed an alarm regarding a person that tries to log in to the system. The expert might decide to ignore the alarm, or to create an automatic script that should be executed every time such an alarm is encountered. Thereafter, new instances of such alarms should not be shown to the expert, but should be automatically handled. This way, the expert checks only one instance of each alarm group, knowing that the other alarms inside that group are similar. The number of (false) alarms that are shown to the expert is reduced.

We will thus not try to improve the accuracy of the intrusion detector, or indeed change the detector in any way, but only focus on grouping the alarms.

To give a clear grasp of the problem, the OPTOSS will be briefly introduced, whereafter the alarms will be characterized, including a discussion of the concept of similarity, culminating in an explicit problem statement.

OPTOSS is a proprietary intrusion detection system and stands for OPerator Time Optimized decision Support System for ICT infrastructures[1]. This system was conceived a decade ago, and performs the intrusion detection in five steps:

1. Collecting raw data, originating from logs of network activity. Each line of the log denotes an event. An event contains a description, a facility (from what kind of activity it originates, for example “ssh” or “snmp”) and a time-stamp, but no duration: long-lasting activity might produce a single event, as summary, or multiple consecutive events.

2. Each event receives a severity score which signifies the likelihood that the event is part of an attack.

3. The events are grouped together into event groups, containing consecutive events of the same device (e.g. a router). The events are split into two groups when the summed severity of all events in a time-span is low.

4. The detector raises an alarm over an event group if the severity per second (i.e. the summed severity of all events in that second) fluctuates significantly over time.

5. Similar alarms are grouped together in alarm groups, based on the severity scores of the events inside the alarms.
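Steps 2 to 4 can be sketched roughly as follows. This is a minimal illustration, not OPTOSS internals: the substring-based rule format, the window length, and the fluctuation threshold are all assumptions made for the sketch.

```python
from dataclasses import dataclass

@dataclass
class Event:
    timestamp: int        # seconds; events have no duration
    facility: str         # e.g. "ssh", "snmp"
    description: str
    severity: float = 0.0

def assign_severity(event, rules):
    # Step 2: manually created rules map an event to a severity score
    # (a hypothetical substring-match rule format, for illustration).
    for pattern, score in rules:
        if pattern in event.description:
            event.severity = score
            return event
    event.severity = 0.0
    return event

def split_event_groups(events, window=2, threshold=1.0):
    # Step 3: split consecutive events of one device into event groups
    # wherever the summed severity over a recent time window drops low.
    groups, current = [], []
    for ev in events:
        current.append(ev)
        window_sum = sum(e.severity for e in current
                         if e.timestamp > ev.timestamp - window)
        if window_sum < threshold and len(current) > 1:
            groups.append(current[:-1])
            current = [ev]
    if current:
        groups.append(current)
    return groups

def raises_alarm(group, fluctuation_threshold=2.0):
    # Step 4: raise an alarm when the severity per second fluctuates
    # significantly over the event group.
    per_second = {}
    for ev in group:
        per_second[ev.timestamp] = per_second.get(ev.timestamp, 0.0) + ev.severity
    values = list(per_second.values())
    return max(values) - min(values) >= fluctuation_threshold
```

A real deployment would replace the substring rules with the expert-crafted OPTOSS rule base and tune the window and thresholds per device.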

The detector contains two distinguishing features: it is based on manually created rules, and has a bag-of-events based alarm representation. The rules are made by human experts and are used to assign a severity score to each event, in the second step of the detection process. The detection is a direct consequence of these manually created rules, which contrasts this detector with systems that are based on machine learning. Besides, the alarms are raised over event groups instead of over single events, giving the detector a unique, bag-of-events based alarm representation. An alarm thus contains all consecutive events inside an event group. The group starts and ends at places where the summed severity of the events is low, and originates from a single device. Typically, it ranges over a couple of seconds, containing a few dozen events, although this depends on the rules and the thresholds of the system. Alarms can be classified as true positives (containing intrusive events) and false positives (only harmless events). The lack of an alarm can be justified, in which case it is a true negative, or a result of missing intrusive events, in which case it is a false negative.

A deficiency of the current OPTOSS is that the alarms are not grouped correctly. First of all, there are too many alarm groups to consider it manageable for the human operator. Secondly, many alarm groups are heterogeneous. Whereas an alarm contains a heterogeneous group of events, alarm groups should consist of similar alarms, in order to save time for the human expert: if the alarms are homogeneous, the expert needs to analyze only one alarm in each alarm group.

As concept of similarity, the most practical metric would be to declare two alarms as similar when the same action should be taken when the alarm is encountered. Since the action which needs to be taken typically does not follow from the type of events, we used the root cause of an alarm to form a notion about the similarity. Examples of root causes can be “attempt to obtain root access on device A by method B” or “misconfiguration of program C on device D.” This root cause can be established by an expert, based on an analysis of the events, so that the similarity within alarm groups can be evaluated. Two alarms are thus similar when they share the same root cause. In practice, the lines dividing two alarm groups may be cloudy: two Denial-of-Service attacks could differ in the size or ingenuity of the attack, requiring different countermeasures, and justifying the allocation of two different root causes. A different notion of similarity differentiates alarms based on them being true or false positives. This notion is more transparent, although it might sometimes be ambiguous still. In this thesis, we will use both definitions of similarity (sharing the same root cause, or both being true or both being false positives) to evaluate the similarity within alarm groups. These considerations lead us to the following problem statement:

Problem statement: To group alarms that are raised by the OPTOSS detector into alarm groups, in such a way that each alarm within an alarm group has the same root cause. In this problem statement:

• The OPTOSS detector is an existing intrusion detection component. We willnot modify this component.

• An alarm is a group of events (log entries having a time-stamp but no duration). An alarm contains all events that were encountered at a single device, over a time-span of typically a few seconds.

• An alarm group is a group of alarms that were raised at a single device. If the grouping is performed correctly, an alarm group can be identified with a single root cause, that is, a root cause that is shared by all alarms within the alarm group. These alarm groups form the output of our new components, in combination with an assignment of each alarm to a single alarm group.

• The root cause of an alarm is the activity that generated most of the events that led to the alarm. An example is an “attempt to obtain root access on device A by method B.” The root cause of each event is determined by experts. Whenever the dividing line between two root causes was vague, we used the rule of thumb that the causes were seen as distinct if and only if they trigger the expert to take a different action. This means, for instance, that an alarm that is caused by a successful attack always belongs to a different alarm group than one that is caused by an unsuccessful attack. The root cause of an alarm is typically based on a subset of the events inside the alarm, the rest of the events being noise. While the OPTOSS runs, the root cause of each alarm is unknown to the algorithms. Instead, hints about the root cause need to be distilled out of the information that is available about each alarm.

• The result will be groups of alarms that have similar root causes. This is helpful, because the expert then has to assess only a single alarm from each alarm group, knowing that the rest of the alarms do not need to be checked, because they have a known cause. The expert can even go on to set up automatic scripts, defending against new alarms that belong to known root causes.

For this research, an abundance of raw network data was available. A labelled dataset, consisting of alarms grouped correctly together, was not available.

1.3 Our Contributions

We propose to combine the existing, anomaly-based OPTOSS detector with alarm mining components. Since a labelled dataset for our test network was not available, which is the case for most applications of the OPTOSS, we primarily relied on an unsupervised K-means clusterer. Like any unsupervised component that works on a sufficiently complex problem, the clusterer makes mistakes. To correct such mistakes we included the possibility to refine the alarm groups. This way, the human expert can use his domain knowledge to combine two groups that were seen as separate by the clusterer, or to divide an alarm group into new alarm groups. The manual feedback should then be exploited by the system, by using it as a labelled dataset, to correctly group future alarms. This is the task of the supervised component, a Random Forest classifier. The requirement of human feedback is thereby reduced in two ways: firstly, experts need to cluster only some of the alarm groups, and not the individual events. Secondly, the clusterer already performs a grouping of the alarms. Given that the clusterer works well, the expert needs to make only small changes.
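The clusterer-plus-classifier pipeline can be illustrated with scikit-learn. Everything below is a hypothetical stand-in for the real OPTOSS data: the bag-of-events feature matrix is random, the number of clusters is assumed, and the expert feedback is simulated by merging two clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

# Hypothetical bag-of-events feature matrix: one row per alarm, one
# column per event type, values are counts of that event type.
rng = np.random.default_rng(0)
alarms = rng.integers(0, 10, size=(200, 12)).astype(float)

# Unsupervised stage: K-means proposes initial alarm groups
# (k = 5 is an assumption here; in practice it would be tuned).
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
proposed_groups = kmeans.fit_predict(alarms)

# Expert refinement: simulate the expert merging two clusters that
# turn out to share a root cause (cluster 4 is folded into cluster 3).
refined_groups = np.where(proposed_groups == 4, 3, proposed_groups)

# Supervised stage: a Random Forest learns the refined grouping so that
# future alarms are assigned directly to the expert-approved groups.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(alarms, refined_groups)
predictions = clf.predict(alarms)
```

The point of the supervised stage is that, once trained, it replaces the clusterer's assignment for alarms resembling the ones the expert corrected.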

The benefit of this approach is a system enabling real-time detection of previously unseen intrusions in large data streams, employing an unlabelled dataset. The drawbacks are a reliance on high quality rules for the detector and a reliance on manual feedback for the classifier. The proposed approach thus comes into its own for problems with large networks and high quality (expensive) experts, when labelled data is not available.

Of course we are not the first to propose combining a clusterer and a classifier in sequential order to improve intrusion detection. But to the best of our knowledge, the architecture of our solution makes it unique by combining a rule-based anomaly detector with a clusterer and a classifier. The 2016 work of MIT research scientist Veeramachaneni and his colleagues from the company PatternEx, aptly titled “AI2: training a big data machine to defend”[56], is most closely related to this thesis. The main difference is that Veeramachaneni et al. proposed a detector which implements


outlier detection by an ensemble of clustering techniques, whereas we used an existing rule-based (and bag-of-events based) detector which we combine with a separate clusterer. Since our clusterer works on a significantly smaller subset of the data, namely the alarms and not every individual event, the computational complexity of training our system is lower than that of Veeramachaneni et al. Since the OPTOSS detector itself is efficient, this results in our solution using less processing power - an important factor when working with large streams of data - although it needs more expert knowledge when setting the system up.

Other related works in alarm mining apply only a clusterer to an existing detector. The main difference with our system is that we refine the results of the clusterer with a supervised classifier, so that domain knowledge can be exploited to form better alarm groups.

1.4 Metrics

To evaluate the proposed system, we need to evaluate both the clusterer and the classifier. Firstly, we need to make sure that the computational requirements of the system are acceptable. This can be analyzed theoretically, and verified by measuring the performance of the system on a live network.

To evaluate the alarm clusterer, we will assess the number and the quality of the alarm groups. The reduction ratio can be used to evaluate the former, describing the number of alarm groups relative to the number of unique alarms:

Reduction ratio = (# unique alarms − # alarm groups) / # unique alarms

The reduction ratio should be high, signifying a relatively low number of alarm groups. Evaluating the quality (i.e. the homogeneity) of the alarm groups is less straightforward. We will use the following metrics, the first two introduced by Pietraszek[46] and the last by Zhang[2]:

• The average fraction of true positives covered by alarm groups that contain more true positives than false positives. Ideally, this should be 1.

• The average fraction of false positives covered by alarm groups that contain more false positives than true positives. Ideally, this should be 1.

• The ratio of groups with only true positives, the ratio with only false positives, and the ratio of mixed groups. The ratio of mixed groups should ideally be 0.

These metrics indicate the similarity within alarm groups, differentiating only between true and false positives. Since we are also interested in differentiating alarms based on the root causes, as described in subsection 1.2, we also included the following metrics:

• The average group similarity: the ratio of alarms inside each group that originate from the same root cause, averaged over all alarm groups. Ideally, this should be 1.

• The ratio of mixed groups (containing alarms with different root causes). This ratio should ideally be 0.
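These root-cause metrics, and the reduction ratio defined earlier, are straightforward to compute. A sketch, assuming each alarm group is represented as a list of expert-assigned root-cause labels (a hypothetical representation for illustration):

```python
def group_quality_metrics(groups):
    """Compute homogeneity metrics over alarm groups.

    Each group is a list of root-cause labels, one per alarm.
    """
    similarities = []
    mixed = 0
    for g in groups:
        # Fraction of alarms sharing the group's most common root cause.
        most_common = max(set(g), key=g.count)
        similarities.append(g.count(most_common) / len(g))
        # A group is mixed when it contains more than one root cause.
        if len(set(g)) > 1:
            mixed += 1
    return {
        "average_group_similarity": sum(similarities) / len(groups),
        "mixed_group_ratio": mixed / len(groups),
    }

def reduction_ratio(n_unique_alarms, n_groups):
    # (# unique alarms - # alarm groups) / # unique alarms
    return (n_unique_alarms - n_groups) / n_unique_alarms
```

For example, grouping 100 unique alarms into 10 groups gives a reduction ratio of 0.9.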


As baseline, we use the existing OPTOSS system. This system already has a clusterer, forming alarm groups based on the severity score. Our goal is to improve both its reduction ratio and its group quality.

To evaluate the alarm classifier, we will simply measure its accuracy. The accuracy is the ratio of alarms that are assigned the correct alarm group. In contrast with the evaluation of the clusterer, we do know the correct assignments for the classifier, since it has to mimic the assignments of the human expert and the clusterer. As baseline, we will use the accuracy of the clusterer, which will be excellent except for any alarm that is similar to the alarms that are changed by the human expert.

Comparison to other approaches is not straightforward, mainly because the quality of our approach depends highly upon the alarms that are output by a detector that has not been used before in the alarm mining literature. Comparing these alarms with the alarms of other detectors proves difficult, because OPTOSS uses a bag-of-events based representation. Consequently, using the same alarms is impossible, since the representation differs. Moreover, we would need to use the same dataset as other approaches. Although many researchers use the outdated 1999 DARPA[36] or the arguably better 2012 UNB ISCX[52] dataset, the most related work, of Veeramachaneni et al., uses their own data only. With these considerations in mind, we have used our own dataset as well. We are therefore not able to directly compare the performance of our system with other approaches.

1.5 Thesis outline

The rest of this thesis will be structured as follows. First of all, the necessary background of intrusion detection (section 2) and machine learning (section 3) will be covered quite thoroughly. These sections are covered in depth to allow novices to get up to speed, and, since this is a thesis, to display the fruits of UvA’s machine learning education. Once the necessary background is in place, the related works section (4) will list other approaches to solve similar problems. The current OPTOSS will be treated in section 5, whereafter the proposed changes are described in section 6. Following a famous statistician, who was quoted saying “in God we trust. All others must bring data,”2 we will do just that, describing the experiments in section 7 and the results in section 8. Finally, this thesis will be summarized in the conclusion (section 9).

2 Background intrusion detection

This section is aimed at those who are not familiar with the intrusion detection domain. In this section, intrusions will be characterized, intrusion detection techniques will be described, and approaches to reduce the false alarms will be listed. We suggest seasoned intrusion detection specialists skip this section, and continue with the machine learning background (section 3).

2.1 Intrusions

Intrusions can be classified in a multitude of ways. Here we will mention the five types as defined for the 1999 DARPA dataset[36]:

2This quote is commonly attributed to the statistician W. Edwards Deming.


1. Remote to Local: attempt to obtain user privileges.

2. User to Root: attempt to obtain root access.

3. Data compromise: attempt to read or modify data.

4. Denial-of-Service: attempt to disrupt the network, e.g. by consuming the network’s resources.

5. Probe: scanning of a network, for instance to look for vulnerabilities.

To determine which attributes of each event we want to process, we need to account for these different attack types. To detect Remote to Local, User to Root and Data compromise attacks, the system needs so-called content-based features, which simply give clues about the actions the events correspond to. Denial-of-Service and Probe attacks can be recognized by time-based and connection-based features, which show, respectively, the number of events over time and the number of events originating from the same source.[34][33, p.6]
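As an illustration, the three feature families could be extracted along these lines. The event schema and the specific features are assumptions made for this sketch, not the features of any particular detector:

```python
from collections import Counter

def extract_features(events, window=60):
    """Sketch of content-, time- and connection-based features, assuming
    each event is a dict with 'timestamp', 'source' and 'description'
    keys (a hypothetical schema)."""
    features = {}
    # Content-based: clues about the action, e.g. suspicious keywords,
    # useful against Remote to Local / User to Root / Data compromise.
    text = " ".join(e["description"].lower() for e in events)
    features["mentions_root"] = int("root" in text)
    features["mentions_failed"] = int("failed" in text)
    # Time-based: event count in the most recent window, which rises
    # sharply during Denial-of-Service attacks.
    latest = max(e["timestamp"] for e in events)
    features["events_in_window"] = sum(
        1 for e in events if e["timestamp"] > latest - window)
    # Connection-based: events per source, which exposes probes coming
    # from a single scanning host.
    per_source = Counter(e["source"] for e in events)
    features["max_events_per_source"] = max(per_source.values())
    return features
```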

2.2 Detectors

Intrusion detection techniques are either signature-based (also known as misuse detection) or anomaly-based, a distinction made by Kumar in 1994[29]. On the one hand, misuse detection forms the most straightforward technique and is based on a database of known malicious events (signatures). The system simply compares the incoming events to the database, triggering on known intrusions. Anomaly-based techniques, on the other hand, as proposed by Denning[15], create a model of normal behavior, regarding every deviation from this model as an intrusion. This enables anomaly-based systems to detect unfamiliar events, but generally results in higher false-positive rates compared to their misuse-based counterparts.

Following Northcut[42], we can also distinguish intrusion detection systems based on their placement, being either on the network or on a host. In the former case, the so-called Network-based Intrusion Detection Systems (NIDSs) directly monitor the packets of network traffic, enabling them to interfere with possible attacks before they have even reached a device. Host-based Intrusion Detection Systems (HIDSs), on the other hand, analyse system logs, making them capable of detecting intrusions once they have reached the system, even when the device is offline (in contrast to NIDSs). In many ways, NIDS and HIDS complement each other, and a combination of both is recommended in most cases, in which the HIDS can be regarded as the last line of defence.

Another distinction can be made between the types of information the system uses, separating systems based on machine learning from systems based on manually crafted rules. These manually generated rules can either specify normal behavior or malicious behavior (in the latter case, the rules are also referred to as “signatures”). Machine learning based systems, on the other hand, train to infer rules from a dataset. Many different machine learning techniques are applied in the intrusion detection domain: for instance neural networks, support vector machines, K-nearest neighbours, genetic algorithms and techniques based on association rules or fuzzy logic.[37]

As a final distinction, some systems are capable of autonomous actions after a known intrusion is detected. Such systems are known as intrusion prevention systems instead of mere intrusion detection systems. Typical actions of such intrusion prevention systems include terminating the connection, blocking the attacker’s IP address and changing the content of the attack.[41, pp. 2-2 and 2-3]

The OPTOSS detector can be labeled as a host-based intrusion prevention system that relies on manually crafted rules. It is neither completely signature-based nor completely anomaly-based, exhibiting characteristics from both approaches. This will become clear in section 5, where the architecture of the current system will be explained.

2.3 Reducing false alarms

Different approaches for reducing the number of false alarms can be found in the vast literature on anomaly-based intrusion detection: enhancement of the detector, alarm verification, alarm prioritization, alarm correlation, hybrid methods, and - the approach we used - alarm mining.3 All but the first rely on placing an extra component after the detector for further processing of the alarms. Let us first identify the benefits of alarm mining, followed by a discussion of why the other approaches by themselves are not sufficiently capable of reducing the number of false alarms, although they might form interesting additions.

Using alarm mining, we present the human expert with a significantly reduced number of alarms by grouping the alarms together based on attributes such as protocol (e.g. syslog, snmp), type of job (e.g. traffic, sshd, cron), and message (e.g. “IP spoofing! From # to #, proto #. Occurred # times.”). Now the expert only needs to assess each type of attack once, reducing the number of seen alarms easily by two orders of magnitude.4 To implement such an alarm mining based approach, an extra component is put in place to reduce the output of the anomaly detector. Once we realize that the problem of grouping alarms together is reducible to finding similarities in a dataset, we can bring forward the extensive toolbox of machine learning. Algorithms can be borrowed from unsupervised learning (alarm clustering without expert feedback) and supervised learning (alarm classification based on expert feedback). The advantage of alarm mining, then, is that it allows the expert to assess each type of attack only once. Disadvantages include the sensitivity to network changes (e.g. reconfiguring the network might make new alarms unrecognizable[23, p. 14]), the inability to classify alarms as harmless or intrusive - it is not a panacea - and, in case of supervised learning, problems in obtaining a large and consistent dataset.
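Such attribute-based grouping can be sketched as follows. The message normalisation rule (masking IP addresses and numbers with “#”) and the tuple representation of alarms are assumptions made for this illustration:

```python
import re
from collections import defaultdict

def normalise_message(msg):
    # Replace variable fields (IP addresses first, then remaining
    # numbers) with "#" so that alarms differing only in those
    # fields share one message template.
    msg = re.sub(r"\d{1,3}(?:\.\d{1,3}){3}", "#", msg)
    msg = re.sub(r"\d+", "#", msg)
    return msg

def group_alarms(alarms):
    """Group alarms on (protocol, job, message template); each alarm
    is a (protocol, job, message) tuple in this sketch."""
    groups = defaultdict(list)
    for protocol, job, message in alarms:
        groups[(protocol, job, normalise_message(message))].append(message)
    return groups

alarms = [
    ("syslog", "sshd", "Failed password for root from 10.0.0.7 port 4022"),
    ("syslog", "sshd", "Failed password for root from 10.0.0.9 port 4410"),
    ("snmp", "traffic", "IP spoofing! From 10.0.0.3 to 10.0.0.4, proto 6. Occurred 12 times."),
]
groups = group_alarms(alarms)
```

Here the two sshd alarms collapse into a single group, so the expert assesses that attack type once.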

The other approaches are by themselves not sufficiently capable of reducing the number of false alarms. First of all, straightforward enhancement of the detector is not a feasible solution for reasons already stated in the introduction: there are performance limits when crunching large streams of data in real time.

The second option, alarm verification, might be the most ambitious. It attempts to identify whether an attack was successful. A distinction can be made between active and passive alarm verification, the former monitoring the network in real time, whereas the latter relies on a database of attack successes. Alarm verification brings a new advantage to the table: it makes it possible to distinguish between harmless activity and intrusions, promising to reduce the number of alerts drastically. Currently, however, such methods are not reliable: mimicry attacks can trick the network into believing the attack failed.[55][23, p. 12]

3This enumeration is similar to the one found in the signature-based survey [23].
4Based on our own experiments.

Alarm prioritization takes an alternative path by assigning a priority score to alerts, whereafter the number of alarms is reduced by simply ignoring all but the most important alarms. To compute the priority score, the system is given (or learns) rules about the importance of the target entity, about how well that part of the network is configured, and about the correlation to similar alerts. Although some priority indication may be a helpful addition to any intrusion detection system, the drawback of using it to ignore attacks seems obvious: the prioritization algorithm has essentially the same task, and thus the same limitations, as the detector - not prioritizing (detecting) harmless events - and thus we cannot expect it to perform better than the detector itself.

The literature does not agree on a single definition for the last unmixed approach: alarm correlation. Some use the term for grouping together alarms generated by various sensors [49], some for grouping together alarms from multiple intrusion detection systems [62]. We follow [23, p. 15] and find a common ground by extending the definition of alarm correlation to “any attempt to group non-similar alarms together in attack scenarios”. Here the adjective non-similar is of vital importance, differentiating this approach from alarm mining. But the two techniques are closely related: without clustering similar attacks first, the relation between occurrences of non-similar alarms might be less clear. Therefore, we see alarm correlation as a next step, performed after alarm mining. The result of alarm correlation is a number of attack scenarios, grouping together non-similar alarms that stem from a single process or a single attacker, possibly scattered over a longer time and over multiple devices. The (dis)advantages of this approach are similar to those of alarm mining, although we speculate that alarm correlation may require more (error-prone) fine-tuning, since measures of similarity are usually more obvious than the parameters determining the correlation of non-similar entities (e.g. the influence of time on the probability of belonging to the same attack).

Finally, the listed techniques are not mutually exclusive and lend themselves well to being combined in hybrid systems.5 Any anomaly-based intrusion detection system will benefit from alarm mining and a prioritization score, after which the number of alarms can be reduced even further by applying alarm correlation. Moreover, alarm verification might help to ignore alarms (a risky procedure) or to enhance the prioritization score.

3 Background machine learning

This section is meant for those who do not have sufficient machine learning experience. It brings the reader up to speed on feature extraction, feature preprocessing, and unsupervised, supervised and semi-supervised machine learning techniques. We recommend machine learning enthusiasts to skip this section and to continue with the related work (section 4).

Before we start with the machine learning concepts, let us give a quick preview of section 6 in order to get familiar with the overall framework of the machine learning

5 As a word of warning: the usage of the term ‘hybrid’ is ambiguous in intrusion detection literature. The term ‘hybrid anomaly-based intrusion detection system’ should not be confused with the mere ‘hybrid intrusion detection system’, which can denote a combination of an anomaly-based system with a signature-based counterpart (capable of increasing the accuracy, not of decreasing the number of false alarms), or, depending on the author, any combination of intrusion detection techniques.


solution of this thesis. Our goal is to group similar alarms together. Our dataset therefore consists of alarms. Each alarm consists of features that give clues about the similarity between alarms. An example of a feature is the most important description (the text of the log entry) of an alarm. We cannot simply use string attributes (i.e. textual data), because the machine learning algorithms expect numerical values, so we perform feature engineering. Once we have obtained a numerical dataset, we can cluster those alarms together that have similar feature values. This is done using unsupervised machine learning, and results in a number of alarm groups. Subsequently, we will allow the human expert to change some alarm groups (the expert might, for instance, correct mistakes of the clusterer). The resulting alarms, labeled with their alarm group, will form the dataset for our supervised machine learning algorithm. This algorithm trains on the labeled dataset, and should learn to predict the correct alarm group for all new alarms.

The rest of this section will explain the basics of machine learning, and will therefore not directly relate the mentioned techniques to the work of this thesis. Of course, we mainly treat techniques that we used or considered using, the exception being section 3.4 about semi-supervised learning, which is added for completeness.

3.1 Features

It all starts with a dataset. Before any pattern can be recognized, before any cluster can be discovered, we need to establish and optimize our dataset. The dataset is established by deciding which attributes to use (feature selection) and, if desired, by creating features using a function over one or more attributes (feature engineering). The features can then be polished by different preprocessing techniques.

The importance of using the right features cannot be overstressed. People tend to underestimate this importance, focusing solely on the machine learning algorithms. In practice, deciding which features to use, and how, is a vital part of any applied machine learning research, and demands a significant share of time. In the words of Pedro Domingos, writer of a paper titled A Few Useful Things to Know about Machine Learning: “some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.” [17, p. 82] With that being said, we will give a brief introduction to feature selection, feature engineering and feature preprocessing, highlighting the main considerations.

First of all, feature selection is concerned with choosing the attributes that supply the best information for the machine learner. The straightforward approach of using all attributes is usually not optimal: some attributes will worsen the performance. The reason might be that they contain redundant or irrelevant information. Redundant information may be caused by correlated attributes (e.g. the number of seconds and the number of milliseconds of an event), which will bias the learner to overvalue this information, resulting in worse performance. Irrelevant information can worsen the performance when the machine learning algorithm does find a pattern in the training data, a pattern that is not likely to exist in new data instances. Choosing the right attributes might be done by expert intuition, although in practice it is often better to try different combinations of attributes, because machine learning is usually not applied to problems that are simple enough to enable good human intuition.

Secondly, when the right attributes are chosen, some might not be in a numeric format, and might need a transformation step to make it possible to feed them into


the machine learning algorithms. This transformation of non-numeric into numeric attributes is denoted as feature engineering. The two most commonly encountered types of non-numeric data are categorical attributes (e.g. the protocol of an event, such as SNMP or SSH) and string attributes. We will describe feature engineering steps for both.

Regarding categorical data, most machine learning algorithms will assume that the distance between feature values is meaningful. That means that assigning a number to each category will jeopardize the system. When, for instance, the colors are encoded as (red: 1), (blue: 2) and (green: 3), many machine learning algorithms will incorrectly assume that red and blue are closer to each other than red and green. To fix this problem, one-hot encoding can be applied: for each category a binary feature is made, being one if it is the category of the data instance, and zero otherwise.
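As an illustration, here is a minimal one-hot encoder in plain Python; the color categories simply mirror the example above:

```python
def one_hot(value, categories):
    """Encode a categorical value as a binary vector: 1 at its own category, 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]

colors = ["red", "blue", "green"]
print(one_hot("red", colors))    # [1, 0, 0]
print(one_hot("green", colors))  # [0, 0, 1]
```

Note that with this encoding every pair of distinct colors ends up at the same distance, so no spurious ordering is introduced.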

Other non-numeric attributes, including string attributes, are less easily transformed into numeric values. A direct translation into categorical data is possible, for example when “authentication failed for user”, “failed password for user” and “server unreachable” are denoted as three completely separate categories. But this direct translation is usually not preferred, because the degree of similarity between separate strings is often informative, as the previous example illustrates: the first two strings describe events that are interchangeable for an intrusion detector. Multiple approaches are possible to establish a notion of similarity between non-numeric values: amongst others string metrics and Bag-Of-Words models.

When using the first option, we rely on string metrics to give similar strings a similar numerical value. The most popular string metric, the Levenshtein distance [35], is defined as the minimum number of edits (insertion, deletion or substitution of a character) needed to transform one string into the other. Another example is the Jaro distance [26], which uses the ratio of matching characters (same characters which are at a similar position) and the number of transpositions needed to move these matching characters between the strings to transform one string into the other. Such a metric can be used to compute the distance between each string and a set of example strings, which can then serve as features. To obtain strong features, a representative subset of the vocabulary should be used for the example strings.
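To make this concrete, here is a sketch of the Levenshtein distance using the standard two-row dynamic-programming recurrence (illustrative, not optimized):

```python
def levenshtein(a, b):
    """Minimum number of single-character edits (insert, delete, substitute) turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a to every prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution (free on a match)
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```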

The second option, the Bag-Of-Words model, counts the occurrence of each word in each string, creating a high-dimensional matrix that can be used as features. When the word order should be taken into account, other approaches are possible as well, but they require more computational power. For most feature engineering problems, therefore, string metrics or, perhaps, Bag-Of-Words models are the only feasible solutions for transforming string attributes into numeric values. Both approaches have their own applications. If two strings are similar when they contain many of the same words, and it is computationally feasible, the usage of a Bag-Of-Words model is recommended. Otherwise, string metrics are the way to go.
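A minimal Bag-Of-Words sketch in plain Python, reusing the log messages from the example above (in practice a library implementation would be used):

```python
from collections import Counter

def bag_of_words(strings):
    """Build a sorted vocabulary and a word-count matrix (one row per string)."""
    counts = [Counter(s.lower().split()) for s in strings]
    vocab = sorted({word for c in counts for word in c})
    matrix = [[c[word] for word in vocab] for c in counts]  # Counter returns 0 for absent words
    return vocab, matrix

logs = ["authentication failed for user",
        "failed password for user",
        "server unreachable"]
vocab, matrix = bag_of_words(logs)
```

The first two rows share the counts for "failed", "for" and "user", so they end up close together in feature space, while the third row shares nothing with them.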

After the features are selected or created, they need to be polished. This stage is called feature preprocessing. It starts with feature cleaning, which should ensure that missing values are fixed, or that those data instances are deleted. Afterwards, data transformations such as whitening (a transformation to obtain zero mean and unit variance) are often needed to ensure that the machine learning algorithm can perform well. Dimensionality reduction is also frequently used, to decrease the storage space, to speed up later algorithms and even to make them perform better by accentuating


the most important information. Multiple methods for dimensionality reduction may be applied, most notably Principal Component Analysis (PCA), auto-encoders and Fisher’s linear discriminant. We will briefly introduce PCA, since we applied this method, after which we will indicate the differences between PCA, auto-encoders and Fisher’s linear discriminant.

Principal Component Analysis [21][22] (PCA) performs an orthogonal projection into a lower dimensional linear space. This lower dimensional space is spanned by vectors called the principal components. The algorithm first chooses the direction of the first principal component. It does this in such a way as to maximize the variance of the data in the dimension of this vector. To avoid trivial solutions, the principal component is restricted to unit length. This way, the first principal component will be the most informative, in the sense that the diversity between the datapoints, seen by looking only at the data projected to this dimension, is maximized (i.e. higher than the diversity of the data in any other direction). The other principal components are then iteratively added in the same way, but orthogonal to the previous principal components. The second principal component will thus be the second most informative vector. As a bonus, PCA will reduce the noise, given that the variance of the noise is equal in all dimensions, since the variance of the signal will be higher than average in the first principal components, while the variance of the noise stays the same.

Implementations of Principal Component Analysis can make use of eigenvector decomposition, because it can be shown that the first principal component equals the eigenvector of the data covariance matrix with the largest eigenvalue. Dimensionality reduction can now be performed by using only the first n principal components. PCA works well as long as three assumptions are met: the original dimensions of the data are comparable; the variance of the data projected to a vector is a meaningful criterion for informativeness; and the variances of the data can be well separated in a linear subspace. The first assumption is usually met by performing feature normalization first. This way, each dimension will start off with unit variance. If the data is not normalized, PCA will consider the dimensions with the highest variance to be more important, which might indeed be preferred when the dimensions are of comparable origin. The second assumption states that PCA should not be used when the variance is not the most informative criterion. This might be the case when other information is available, for instance the class labels of each datapoint, in which case the use of Fisher’s linear discriminant [20] would be more appropriate. The last assumption can push the researcher towards non-linear dimensionality reduction methods such as auto-encoders [48]. Depending on the separability of the data in a linear projection, a non-linear reduction gives more informative components. This, of course, comes at a cost: the computational complexity of non-linear dimensionality reduction algorithms is higher than the complexity of plain PCA.
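The eigendecomposition route can be sketched compactly in NumPy; the toy data below is made up purely for illustration:

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its first principal components via eigendecomposition."""
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]       # sort components by decreasing variance
    return Xc @ eigvecs[:, order[:n_components]]

# toy data: most variance lies along a single direction
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1)) @ np.array([[3.0, 1.0]]) + rng.normal(scale=0.1, size=(100, 2))
Z = pca(X, 1)  # 2 dimensions reduced to 1
```

By construction, the variance of the one-dimensional projection is at least as large as the variance of any single original dimension.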

3.2 Unsupervised clustering

Unsupervised machine learning groups data into clusters without using labeled data: the algorithm does not receive any feedback. Instead, it is assigned a distance function (usually the Euclidean distance) and uses one of two possible approaches: hierarchical or flat clustering. [58, p. 645]

Hierarchical clustering creates a tree-like structure where leaves with the same parents represent data points that are close to each other, and intermediate nodes with


the same parents represent clusters close to each other. From the bottom up, each level of intermediate nodes groups the data into a smaller number of clusters. The final clustering is determined by choosing a level, returning each node of this level, with its children, as a separate cluster. While basic approaches like single-linkage clustering have a high computational complexity of O(n²), newer algorithms like BIRCH can achieve O(n). [60] Hierarchical clustering is especially strong when the data exhibits an underlying hierarchical ordering.

Flat clustering, on the other hand, groups the data by minimizing a certain distance-based criterion. As computing the optimal cluster locations and cluster assignments is typically NP-hard, approximation algorithms have been developed. Well-known algorithms, each minimizing a different criterion, include k-means, mixture models and DBSCAN: [58]

1. K-means: the well-known k-means6 algorithm is the go-to algorithm for most data scientists because of its simplicity and good performance. It minimizes the sum of squared distances between each data point and the centroid (mean) of its cluster, by iteratively performing two steps: keeping the clusters constant and assigning the data points, and subsequently keeping the assignments constant and refining the clusters based on these assignments. [6, pp. 424-428]
Pros: K-means converges fast (O(n)) and the implementation is simple.
Cons: K-means is sensitive to the initialization and can converge to local optima. Furthermore, the square makes the algorithm sensitive to outliers and a bad choice for data with clusters that do not have a hyperspherical shape. Lastly, if the distribution of points over the clusters is heavily skewed, a large ‘natural cluster’ might be divided at the expense of grouping small ‘natural clusters’ together.

2. Mixture model: a mixture model assumes that the data is generated by K distributions, where each distribution belongs to a cluster. After a choice of a distribution (e.g. a Gaussian or Bernoulli distribution), the joint probability of each point belonging to its cluster is maximized by maximum likelihood estimation. To approximate the maximum likelihood solution, expectation-maximization is often applied. This algorithm works in a way similar to the k-means algorithm, by iteratively assigning each data point to a cluster and refining each cluster given these assignments. [6, pp. 430-448]
Pros: Firstly, a mixture model provides insight into its uncertainty, by computing the probability that each data point belongs to each cluster (making so-called soft cluster assignments). Secondly, a mixture model has more parameters: not only the centroid of each cluster is estimated, but also the covariance. This makes mixture models better suited for non-hyperspherical (but ellipsoidal) cluster shapes of different sizes.
Cons: Expectation-maximization has a slow convergence rate: it needs more, and more expensive, iterations than k-means. [6, p. 438] Furthermore, a mixture model needs more initial parameters than k-means, to which it is sensitive, it can converge to local optima, and it cannot handle perfectly correlated features (which yield a singular covariance matrix).

6 To be precise, the proper name would be k-medoids when a distance function other than the Euclidean distance is used.


3. DBSCAN: DBSCAN is a density-based clustering algorithm. Clusters are formed by checking the density around a data point, within a predetermined distance. If this density is higher than a predetermined threshold, a cluster is formed, with all close-by data points in it (extended iteratively as long as the density remains high). Data points that do not belong to any cluster, because the density around them is too low, are labeled as noise. [19]
Pros: The density-based approach makes DBSCAN suitable for any cluster shape. Furthermore, the algorithm is robust to outliers and needs only two initial parameters. The run-time complexity, O(n log n), is acceptable for most applications as well. [19, p. 229]
Con: The algorithm will not work well unless the density is fairly constant amongst clusters.
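To make the iterative scheme of k-means concrete, here is a plain NumPy sketch of Lloyd's algorithm on made-up, well-separated toy data (a production system would use a library implementation):

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Lloyd's algorithm: alternate between assigning points and refining centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # init from random points
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # refinement step: move each centroid to the mean of its assigned points
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return labels, centroids

# two well-separated toy clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)), rng.normal(5.0, 0.3, size=(50, 2))])
labels, centroids = kmeans(X, k=2)
```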

3.3 Supervised classification

Supervised machine learning trains on a labeled dataset to classify new data points into the right group. Many approaches have been proposed over the years, including kernel-based methods (e.g. Support Vector Machines), variational inference and graphical models like Decision Trees or Bayesian Networks. [6] And although the No Free Lunch Theorem states that each method has its benefits and problems, in recent years two models stood out in overall performance: Neural Networks and Random Forests. [57]

Neural Networks originated from an attempt to mimic the human brain, but soon diverged from their natural counterpart by dropping restrictions that were deemed unnecessary. Nowadays the term Neural Network refers to a broad range of classification models, identifiable by their multilayered structure. A neural network consists of layers, which themselves consist of nodes. It all starts with an input layer including a node for each feature (e.g. for each processed attribute of the events). It ends with an output layer that has a node for each class (e.g. for each cluster of alarms). Whenever a data point is fed into the input layer, it will result in a value for each output node. The output class is then the output node with the highest value. In between there can be one or multiple hidden layers. In its simplest form, the network is called a Feedforward Neural Network, and all nodes in each layer are connected to all nodes in the previous and next layers. A Neural Network makes predictions by performing a feed-forward round, whereby the values of all hidden and output nodes are computed as a function of the nodes they are connected to in the previous layer(s). To be exact, the value of each node is computed as a nonlinear function (the activation function) of the weighted sum over all the values in the previous layer(s). The network learns by adjusting its weights to reduce the prediction error on the training data, for which the Backpropagation algorithm is responsible. [6, pp. 225-236]
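The feed-forward round for a tiny, untrained network can be sketched in a few lines of NumPy; the layer sizes and random weights are made up, and the point is only the weighted-sum-plus-activation computation:

```python
import numpy as np

def forward(x, weights, biases):
    """One feed-forward pass: hidden layers apply an activation to a weighted sum."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.tanh(a @ W + b)             # hidden layer: nonlinear activation
    output = a @ weights[-1] + biases[-1]  # output layer: one value per class
    return int(output.argmax())            # predicted class = highest output node

rng = np.random.default_rng(0)
# 4 input features -> 8 hidden nodes -> 3 output classes (weights are untrained)
weights = [rng.normal(size=(4, 8)), rng.normal(size=(8, 3))]
biases = [np.zeros(8), np.zeros(3)]
pred = forward(rng.normal(size=4), weights, biases)
```

Training would then adjust `weights` and `biases` by backpropagation; that part is omitted here.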

The advantages of Neural Networks include a typically good accuracy and an interesting tendency to find hidden structures in the data: similar to neurons in the human brain, nodes in layers further from the input layer tend to activate on complex and specific input combinations. For example, when used on visual images, the input layer depicts single pixels, while more complex nodes would ‘fire’ on straight lines, or even on faces. This partly explains the recent tendency to use more layers than ever before, denoted by the hyped-up term ‘Deep Learning’. Other explanations include the improvement of algorithms, the improvement of hardware and the realization


that Neural Network computations are well suited to be run on GPUs. The main disadvantage of Neural Networks can already be read through the previous lines: training is computationally expensive (a simple Feedforward Neural Network has a complexity of O(n²)). Furthermore, they are prone to overfitting and can be difficult to optimize due to the many tunable parameters.

This is where Random Forests come in: they promise good accuracy with a more modest computational training complexity (O(n log n)) and fewer tunable parameters, which is especially useful when there are no hidden structures to exploit. Furthermore, their design makes them highly resistant to overfitting.

Random Forests [10] are an ensemble of Decision Trees. Every tree gets a random subset of the features and then votes for the classes it believes the data points belong to. The algorithm then outputs the average or the majority vote as its prediction. Each tree consists of nodes, in such a way that the top node will check every data point on one of the features, splitting the data points into two subsets, each of which will flow to another node, until the leaf nodes are reached. The leaf nodes then state that every data point in them belongs to the class most of their training instances belonged to. To train the tree, algorithms (most famously the C4.5 algorithm) try to choose splits that separate the classes as well as possible (C4.5 does so by maximizing the information gain ratio). The training of each Decision Tree is deliberately hindered by pruning the trees (i.e. reducing their size), making the Random Forest resistant to overfitting. The typically good accuracy is thus acquired from the combined votes of Decision Trees, which each receive a random sub-sample of the available features.
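As a sketch of how such a classifier is used, here is a Random Forest on made-up alarm features (the feature values and labels are purely illustrative, and scikit-learn grows CART-style trees rather than C4.5 ones):

```python
from sklearn.ensemble import RandomForestClassifier

# toy alarm features: [severity, port], with two hypothetical alarm groups
X = [[1, 22], [2, 22], [1, 23], [9, 80], [8, 80], [9, 443]]
y = [0, 0, 0, 1, 1, 1]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
print(clf.predict([[2, 22], [8, 443]]))  # → [0 1]
```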

To conclude, the decision between Neural Networks and Random Forests can be simplified to this consideration: is there enough hidden structure in the data to justify the computationally more expensive Neural Networks?

3.4 Semi-supervised learning

Semi-supervised machine learning relies on a dataset that is partly labeled. It can be split into two distinct approaches: semi-supervised clustering and semi-supervised classification. [61]

Semi-supervised clustering (also known as constrained clustering) learns from an unlabeled dataset that is enriched with labeled data. The labeled data consists of two types of pairwise relations: must-links (two datapoints that must be in the same cluster) and cannot-links (two datapoints that must be in different clusters). This approach relies on the same assumptions as normal clustering approaches, and is thus appropriate when datapoints within a single class have similar features.

Semi-supervised classification takes the opposite route, by starting out with the labeled data. This approach will first fit a classifier, after which the unlabeled data is used to improve the classifications. At first sight, this might come across as an impossible task: what type of information could possibly be distilled out of the unlabeled data? But researchers have shown that the unlabeled data can be used in multiple ways.

First of all, the labeled data can be used to fit multiple generative models, whereafter the unlabeled data allows the validation of these models. This is possible since generative models learn the distribution of datapoints in the labeled training set, which can be compared with the actual distribution of datapoints in the unlabeled dataset. Hereafter, the generative model that fits the unlabeled dataset best is used.


This way, the unlabeled data can be used to find the best generative model. This approach works well when the dataset can be split into well separated clusters, and when the assumptions of the generative model are appropriate (i.e. when the datapoints follow a type of distribution that is correctly identified).

An approach comparable to generative modelling is the use of semi-supervised Support Vector Machines (SVMs). While training, an SVM fits a decision boundary in order to separate two classes. To use SVMs on more classes, multiple SVMs can be combined into a multi-class SVM, by assigning each SVM a single class and training it to fit a decision boundary between its class and all other classes. Similar to the usage of multiple generative models, multiple multi-class SVMs can be trained on the labeled dataset, and validated on the unlabeled dataset. This time, the chosen multi-class SVM is the one whose decision boundaries avoid most unlabeled instances by the largest margin. This approach is a natural extension of normal SVMs, and therefore recommended when an SVM was already implemented for the classification problem.

A third approach, called self-training, iteratively grows its labeled dataset by adding unlabeled datapoints. In each round, it trains on the labeled dataset and makes a prediction for each unlabeled datapoint. The algorithm then adds unlabeled datapoints to the labeled dataset, but only those datapoints for which it is confident that it performed a correct classification. This approach is mainly recommended when a complicated supervised classifier is already in use, since self-training is easy to implement.
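The self-training loop can be sketched as follows; the nearest-centroid "classifier" and its distance-based confidence threshold are toy stand-ins for a real classifier and its confidence score:

```python
import numpy as np

def self_train(X_lab, y_lab, X_unlab, threshold=1.0, max_rounds=10):
    """Grow the labeled set by pseudo-labeling unlabeled points the model is confident about."""
    X_lab = [np.asarray(x, dtype=float) for x in X_lab]
    y_lab = list(y_lab)
    pool = [np.asarray(x, dtype=float) for x in X_unlab]
    for _ in range(max_rounds):
        # "train": one centroid per class from the current labeled set
        classes = sorted(set(y_lab))
        centroids = {c: np.mean([x for x, y in zip(X_lab, y_lab) if y == c], axis=0)
                     for c in classes}
        remaining = []
        for x in pool:
            dists = {c: float(np.linalg.norm(x - centroids[c])) for c in classes}
            best = min(dists, key=dists.get)
            if dists[best] < threshold:   # confident enough: add with its predicted label
                X_lab.append(x)
                y_lab.append(best)
            else:
                remaining.append(x)
        if len(remaining) == len(pool):   # no point was added this round: stop
            break
        pool = remaining
    return y_lab, pool

labels, leftover = self_train([[0, 0], [5, 5]], [0, 1],
                              [[0.5, 0.5], [4.5, 5.2], [2.5, 2.5]])
```

Here the two points close to a centroid are pseudo-labeled, while the ambiguous point halfway between the classes stays unlabeled.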

The last approach, co-training, is similar to self-training. This time, two classifiers are trained, each on half of the labeled training set. Each of them then predicts the labels for the unlabeled data, and passes to the other classifier those datapoints on which it is confident that it performed a correct classification. Both will train on their new dataset, whereafter the process repeats. Co-training can be used for problems similar to those of self-training, but relies on more assumptions: the features need to be split into two views that are conditionally independent given the class, whereby each view should by itself enable a classifier to give good predictions.

All semi-supervised classification approaches can be used for either transductive or inductive learning. In the former case, the model is trained to predict the unlabeled data, and will not be validated on unseen datapoints. In the latter case, the model is trained to generalize from the labeled and the unlabeled data, so that unseen datapoints will be classified correctly as well.

4 Related work

Now that the necessary background is in place, we can delve into solutions to problems similar to ours. To quickly recap the problem statement (section 1.2): we aim to group the alarms of the OPTOSS intrusion detector into alarm groups, so that the alarms inside an alarm group have the same root cause, a problem that belongs to the realm of alarm mining. Our contributions can be characterized by the usage of both an unsupervised clusterer and a supervised classifier, to group alarms together that originate from a separate detector based on manually created rules. The advantage of the clusterer is that it provides good clusters without needing a manually labeled


dataset; the advantage of the classifier is that it allows human experts to change the outcome of the clusterer.

To the best of our knowledge, we are the first to propose a “hybrid alarm mining” approach, as we like to call it, using both alarm clustering and alarm classification to group together the alerts of a separate detector. Closely related works may be found, though, in approaches that use clustering methods as detector, followed by an alarm classification module. We will first regard the similarities with and key differences to those works. Thereafter, some other approaches in alarm clustering and alarm classification will be listed, in order to understand the possibilities and choices that encompass this thesis.

4.1 Works similar to hybrid alarm mining

Although we apply unsupervised machine learning to cluster the alarms of a separate detector, clustering can also be used in the detector itself. During training, such a system will cluster all activity into groups of similar events. In the application phase, any event that falls outside the known clusters will trigger an alarm. This contrasts with alarm clustering, where alarms, instead of events, are used as input.

When the detector is based on clustering techniques, and the alarms are refined by an alarm classification module, the system comes reasonably close to our proposed system. We found five such approaches in the literature, originating from 2012 onwards, and differing chiefly in the algorithms used. These works are listed in Table 1. To the best of our knowledge, the first such approach was a work of Muniyandi et al. of the Tamil Nadu University titled “Network Anomaly Detection by Cascading K-means Clustering and C4.5 Decision Tree algorithm” [40].

The related works mostly employ K-means clusterers, in combination with a Decision Tree, Naive Bayes or Random Forest classifier. For us, their choice of classification algorithm is of higher importance than their choice of clustering algorithm, since they used the clusterer in the detector itself, not as an alarm clusterer. In terms of accuracy of the classifier, one would expect the Random Forest classifier to outperform both the Decision Tree and the Naive Bayes classifiers, in return for a higher computational complexity. We flag the work of Veeramachaneni et al. [56] as the paper most interesting to our research, since they employed a Random Forest classifier and showed a thorough and clear experiment setup. Although we do not directly compare the results of our thesis to this work, or indeed to any other work, it has functioned as inspiration for feature engineering and experiment design.

TABLE 1: Related works similar to hybrid alarm mining.
* Veeramachaneni et al. apply an ensemble of Replicator Neural Networks, matrix-decomposition-based and density-based outlier analysis.

Year  Clusterer         Classifier          Authors
2012  K-means           C4.5 Decision Tree  Muniyandi et al. [40]
2013  Weighted K-means  Random Forest       Elbasiony et al. [18]
2014  Weighted K-means  Naive Bayes         Emami et al. [59]
2016  K-means           Naive Bayes         Muda et al. [39]
2016  Ensemble*         Random Forest       Veeramachaneni et al. [56]


The difference between the two approaches (i.e. using a clusterer as detector or using a clusterer to group the alarms) has a large impact on the requirements and training performance, and possibly also on the accuracy of the system. These considerations hold for all the works mentioned in Table 1.

Firstly, regarding the requirements, the approaches sit at opposite sides of a trade-off between the need for a large training set and the need for expert availability. Our approach, relying on experts to form rules that capture all intrusions, is rooted in our observation that the network intrusion market contains many highly skilled professionals whose job is exactly that: finding intrusions in network data. Our main assumption is that those manual rules are able to filter out almost all intrusions with a fairly low number of false positives. The requirements for the other approach are mostly met by using training data that is supposed to contain no intrusions. This approach thus assumes that an up-to-date dataset exists that contains no intrusions, but otherwise contains events similar to the actual production environment. A changing environment, or new types of intrusions for which new information is needed to filter them out, could then necessitate a new dataset (or, equivalently, new expert rules), giving this trade-off long-lasting effects. To conclude, the difference in requirements boils down to a matter of trust: do you put your trust in the ability of experts to form rules, or are you confident that you have a dataset good enough to train a detector?

The second difference lies in the training performance. In our approach, the presence of a rule-based detector ensures that the clusterer only needs to group alarms together, whereas it needs to group events together when the clusterer acts as detector. Our approach results in an enormous reduction in training instances and therefore seems better suited to handle large streams of data, while the performance should be of the same order once the training phase has been completed.

For the last difference, concerning the accuracy of the system, we lack studies comparing the accuracy of both approaches. Ideally, we would like to assess the true and false positives of both approaches on the same dataset. As mentioned in subsection 1.4, such studies would be difficult to perform, since intrusion detection datasets label single events, whereas our system flags bags of events. Lacking those studies, we can only refer to the results section, where we will try to make plausible that our results are promising. It remains an open question which approach yields better results under which conditions.

To conclude, our work differs from works in alarm classification that use a clusterer as detector, since we replaced the detector with a rule-based variant and added a clusterer to group the alarms. We chose this direction because we have access to highly skilled experts, who we believe can make excellent rules for the intrusion detector. This belief remains an assumption throughout this thesis, since the accuracy of such expert-made rules, relative to clustering techniques, is not known. Since we relied on experts to manually create rules, we abandoned the dependency on a large, up-to-date, realistic and intrusion-free dataset. We can now use production data directly to train our clusterer and classifier. Lastly, we expect our approach to be better suited for large streams of data, since the number of instances to train on is drastically lower for our clusterer.


4.2 Alarm clustering

The alarm clustering module is an important segment of our approach, for which an extensive body of research has been established. Alarm clustering approaches are not able to label the alarm groups, in contrast to the alarm classification methods treated in the next subsection. In return, they can operate on an unlabeled dataset, to which unsupervised machine learning algorithms can be applied. We will review four approaches to alarm clustering: based on Attribute-Oriented Induction, based on K-means, based on meta-alarms and based on soft clustering techniques.

Each of the works in this subsection differs from our work on at least two points: the use of a different detector (we used OPTOSS NG-NetMS whereas most others used Snort [47]) and the absence of an alarm classification module. The main difference between Snort and the OPTOSS detector is that Snort is network-based, whereas OPTOSS is host-based. Snort is therefore able to detect threats on the network before they have breached a specific device, whereas OPTOSS relies on logs from a host device and will thus only detect threats once they are already active on the devices. On the other hand, this enables OPTOSS to give broader protection by detecting unexpected behavior independently of network activity. A second difference is that Snort is a purely signature-based detector, relying on a database of malicious events, while OPTOSS exhibits a more elaborate approach, alarming on a group of events. The more straightforward approach makes Snort “incapable of performing detection on the basis of a bunch of events, each of which only hints at the possibility of an attack” [13, p.83]; such detections are possible with the alarm representation of OPTOSS. A third difference is that the rules of Snort are open source, whereas the rules of OPTOSS are proprietary. Both approaches have pros and cons regarding the safety of the system: open source allows everybody to improve the rules, but it also gives attackers knowledge of how the rules might be circumvented [32]. Lastly, a more practical difference between the detectors, favouring Snort, is that Snort can build upon a larger body of research than OPTOSS, including the application of rule learners to automatically derive the rules (e.g. [50]).

The absence of an alarm classification module in the other works of this subsection means they cannot incorporate expert knowledge. In our architecture, the classifier is a practical addition that allows the experts to change alarm groups that are inconveniently clustered together. These two differences, the different detector and the absence of a classifier, characterize all other works. Further differences are mentioned when the respective approaches are discussed.

The alarm clustering problem was coined by former IBM researcher Julisch in his 2001 work “Mining alarm clusters to improve alarm handling efficiency” [27]. After proving that the problem is NP-complete, he continued to approximate a solution using Attribute-Oriented Induction, an O(n) [25] combination of machine learning techniques and insights from relational database operations. This algorithm creates taxonomies (trees) for each of the attributes, such that the root node and interior nodes are generalizations of the leaf nodes, which are the possible values of the attribute. An example of such a taxonomy is shown in Figure 1, with root node “All alarms”, interior nodes such as “All durations” or “The facility is a service”, and leaf nodes such as “duration = 7” (this is the duration node for Alarm1). Clusters consist of a node for each attribute, such that each alarm of the cluster has attributes that are generalizable to those nodes (e.g. “duration = 7” is generalizable to “duration < 10”).


The algorithm then minimizes, for each alarm, the number of generalizations to its cluster. It can do this by changing the taxonomy (by deleting generalizations, that is, interior nodes; or by adding possible generalizations that are specified by human experts), by assigning alarms to the most similar cluster, and by making all cluster nodes as specific (low-level) as possible. In the example, the algorithm will cluster Alarm1 and Alarm4 using interior node “duration < 10”, which is more specific than “All durations”, and it will delete the interior nodes “duration < 15” and “duration ≥ 15”.

All alarms
├── All durations
│   ├── duration < 10
│   │   ├── 7 (Alarm1)
│   │   └── 9 (Alarm4)
│   └── duration ≥ 10
│       ├── duration < 15
│       │   └── 13 (Alarm2)
│       └── duration ≥ 15
│           └── 16 (Alarm3)
└── All facilities
    ├── Services
    │   ├── sshd (Alarm2)
    │   └── cron (Alarm3)
    └── Other
        └── traffic (Alarm1 & Alarm4)

FIGURE 1: A fabricated example of a possible taxonomy created by Attribute-Oriented Induction. Only four alarms are shown. Alarm1, for instance, has a duration of 7 and traffic as facility. Given this taxonomy, two alarm clusters might be formed: “duration < 10 and facility = traffic” and “duration ≥ 10 and facility = Services”. Alarm1 and Alarm4 then need only two generalizations to reach their cluster, whereas both Alarm2 and Alarm3 need 4 generalizations.
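The generalization counting that Attribute-Oriented Induction minimizes can be made concrete with a small sketch. The taxonomy below mirrors the duration branch of Figure 1 (written with ASCII ">=" instead of "≥"); the function and node names are our own illustration, not code from Julisch's implementation.

```python
# Illustrative sketch: counting generalization steps in an attribute
# taxonomy, as minimized by Attribute-Oriented Induction. The taxonomy
# follows the fabricated duration branch of Figure 1; all names are ours.

# Each node maps to its parent; the root has parent None.
DURATION_TAXONOMY = {
    "7": "duration < 10",
    "9": "duration < 10",
    "13": "duration < 15",
    "16": "duration >= 15",
    "duration < 10": "All durations",
    "duration >= 10": "All durations",
    "duration < 15": "duration >= 10",
    "duration >= 15": "duration >= 10",
    "All durations": None,
}

def generalization_steps(taxonomy, leaf, cluster_node):
    """Number of parent links from a leaf value up to the cluster node."""
    steps, node = 0, leaf
    while node is not None and node != cluster_node:
        node = taxonomy[node]
        steps += 1
    if node is None:
        raise ValueError(f"{cluster_node} does not generalize {leaf}")
    return steps

# Alarm1 (duration 7) needs one generalization to reach "duration < 10".
print(generalization_steps(DURATION_TAXONOMY, "7", "duration < 10"))   # 1
# Alarm2 (duration 13) needs two to reach "duration >= 10".
print(generalization_steps(DURATION_TAXONOMY, "13", "duration >= 10")) # 2
```

Minimizing the sum of these counts over all alarms is what drives the algorithm to delete uninformative interior nodes such as “duration < 15”.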

Multiple works of Julisch, as well as later works of Zhang et al. [2], showed good results for alarm clustering based on Attribute-Oriented Induction. The main difference between these works and our approach is the clustering mechanism: Attribute-Oriented Induction measures the difference to the cluster center separately for each attribute, whereby the alarm will not be assigned to the cluster when one attribute is too different, while K-means (the algorithm we used) looks at the total distance of the attributes to the cluster center. This means that a single alarm with one attribute that differs from its cluster center forces Attribute-Oriented Induction to generalize its rules, thereby penalizing all alarms inside the cluster, while K-means will assign a penalty only to the divergent alarm. As long as the concept of “total distance to cluster center” is meaningful (i.e. when the distance of one attribute can be compensated for by another attribute), the mechanism of K-means should be preferred: we expect that this difference results in Attribute-Oriented Induction needing more clusters in order to create meaningful (specific) clusters. This expectation is in line with the fact that the researchers only tried to create clusters for the alarms that occur most often,


reasoning that those are probably misconfigurations that need to be fixed. Indeed, this contrasting goal is a second difference, directly stemming from (or preceding) the different clustering mechanism. The last difference of their approach is that it provides intuitive insight into each cluster, consisting of simple if-then rules regarding each attribute, which is not as easy to obtain for K-means.

The second approach, using a K-means clusterer, is the approach we have used as well. In 2004, Law et al. [31] were the first to apply this algorithm to alarm clustering.7

Five years later, a PhD dissertation was written by Dey [16], combining the K-means algorithm with an Incremental Stream Clustering approach. Instead of the default K-means algorithm, an online learning approach is used, updating the cluster centers every time a new alarm is clustered. This differs from our offline learning approach, which could suffer from badly formed cluster centers. In practice, we have not found this to be a problem, since the number of completely new alarms is small when a sufficiently large dataset is used for training.
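The online center update that distinguishes such incremental approaches from default K-means can be sketched as follows; the two-dimensional feature vectors and initial centers are invented for illustration and do not correspond to the features used in this thesis.

```python
# Sketch of a stream-style K-means update: each new alarm is assigned
# to the nearest center, which is then shifted toward the alarm with a
# running-mean step size. All vectors here are illustrative.
import math

def nearest(centers, x):
    """Index of the center closest to x (Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: math.dist(centers[i], x))

def update(centers, counts, x):
    """Assign x to its nearest center and move that center online."""
    i = nearest(centers, x)
    counts[i] += 1
    eta = 1.0 / counts[i]                      # running-mean step size
    centers[i] = [c + eta * (xi - c) for c, xi in zip(centers[i], x)]
    return i

centers = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]
for alarm in ([1.0, 1.0], [9.0, 11.0], [0.0, 2.0]):
    update(centers, counts, alarm)
print(centers)
```

Because each alarm is processed once and then discarded, such an update is well suited to streams, at the cost of centers that depend on arrival order.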

The third approach, based on forming meta-alarms, can be considered a simpler version of K-means clustering and was proposed by Perdisci et al. in 2006 [44]. First, a classifier is trained to label each alarm with a class, for instance portscan, DoS or NoClass. Next, the clusterer takes over and instantiates an empty list of meta-alarms for each class. For each alarm, the clusterer will check the distance to each meta-alarm of the same class. If the distance is below a prespecified threshold, the alarm will be grouped into the same alarm group as the meta-alarm. Otherwise, the alarm is promoted to meta-alarm. This will yield similar results as K-means clustering, with three differences: more information is added by the classifier, the cluster centers are determined only by the first alarm of that class, and a different value needs to be specified in advance (a distance threshold versus the number of cluster centers). The former might be an advantage for the method of Perdisci et al., although it does require a labeled dataset, which is a requirement that is often difficult to meet. Furthermore, we believe that the information is easily visible by examining the alarms. The second difference puts the meta-alarm approach at a disadvantage compared to K-means, since K-means is able to create cluster centers fitting the alarms perfectly (at least while training). Regarding the last difference, the specification of a distance threshold might be more straightforward and more easily generalizable to other problems than the specification of a number of clusters.
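The meta-alarm procedure described above can be sketched in a few lines; the class labels, feature vectors and distance threshold below are invented for illustration, and Euclidean distance stands in for whatever metric the original work actually used.

```python
# Sketch of meta-alarm clustering: within each class, an alarm joins
# the group of the first meta-alarm closer than a distance threshold;
# otherwise it is promoted to a new meta-alarm. All data is invented.
import math

def cluster_meta_alarms(alarms, threshold):
    """alarms: list of (class_label, feature_vector).
    Returns a (class_label, group_id) pair per alarm."""
    meta = {}          # class label -> list of meta-alarm vectors
    groups = []
    for label, vec in alarms:
        candidates = meta.setdefault(label, [])
        for gid, m in enumerate(candidates):
            if math.dist(m, vec) < threshold:
                groups.append((label, gid))
                break
        else:
            candidates.append(vec)             # promote to meta-alarm
            groups.append((label, len(candidates) - 1))
    return groups

alarms = [("portscan", [0.0, 0.0]),
          ("portscan", [0.5, 0.5]),
          ("portscan", [5.0, 5.0]),
          ("DoS",      [0.2, 0.1])]
print(cluster_meta_alarms(alarms, threshold=2.0))
```

Note how the first alarm of each group permanently fixes the group's "center", which is the disadvantage relative to K-means discussed above.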

The last alarm clustering approach was established when Smith et al. decided to bring out the big guns in 2008 [53]. Three clustering techniques were applied to alarm clustering: mixture models, self-organizing maps and a variation on auto-encoders. These algorithms perform soft clustering (each datapoint may belong to multiple clusters) and tend to perform better than K-means, at the cost of a higher computational complexity and more complex algorithms and tuning. Of the three algorithms that were tested, self-organizing maps performed worst, and, since we already covered mixture models in subsection 3.2, we continue with a quick contemplation of the variation on the auto-encoder that the researchers developed.

An auto-encoder is a type of artificial neural network (see subsection 3.3) which is trained to recreate the input data. This may sound like a trivial task, which it indeed

7 Although Law et al. themselves saw it as a work in alarm classification, as demonstrated by the title “IDS false alarm filtering using KNN classifier”, the term ‘semi-supervised classification’ would have been more appropriate, since they trained the machine on intrusion-less data. In any case, the algorithm remains a clustering algorithm.


is when the hidden layers are large enough to represent all the nuances of the data. However, it gets interesting when the neural network is forced to put the data through a small hidden layer: the network now needs to encode the data into a lower dimension. Whereas a common use of the auto-encoder is to use the (smallest) hidden layer for dimensionality reduction, Smith et al. used it to form density-based clusters using the reconstruction error (the difference between the input and output data). Specifically, alarms were clustered together when the reconstruction errors (one-dimensional values) of the alarms differed by less than a threshold value, forming links of alarms that are grouped together. The intuition here is that similar alarms will have a similar reconstruction error. The results show that it indeed performs well. Comparison with our approach follows the same argument as made in subsection 3.3: the accuracy of these more complex methods is typically better than that of simple K-means, with a potency to find hidden structures in the data. K-means, on the other hand, needs less computational power and is easier to finetune, which makes it a reasonable alternative when processing large streams of data.
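One simple way to realize this reconstruction-error linking is to sort the one-dimensional errors and cut wherever the gap between neighbours exceeds the threshold; the error values below are invented, and in the real system they would come from the trained auto-encoder.

```python
# Sketch of density-style linking on one-dimensional reconstruction
# errors: alarms whose errors differ by less than a threshold end up
# chained into the same cluster. Error values are invented.
def link_by_error(errors, threshold):
    """Sort alarms by error; start a new cluster at each gap > threshold.
    Returns clusters as lists of alarm indices."""
    order = sorted(range(len(errors)), key=errors.__getitem__)
    clusters, current = [], [order[0]]
    for prev, nxt in zip(order, order[1:]):
        if errors[nxt] - errors[prev] <= threshold:
            current.append(nxt)        # still linked to the chain
        else:
            clusters.append(current)   # gap too large: close the chain
            current = [nxt]
    clusters.append(current)
    return clusters

errors = [0.10, 0.12, 0.80, 0.11, 0.84]
print(link_by_error(errors, threshold=0.05))
```

Because links are transitive, two alarms can share a cluster even when their own errors differ by more than the threshold, as long as intermediate alarms bridge the gap.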

4.3 Alarm classification

After clustering the alarms, we perform an extra step of alarm classification to allow human experts to refine the alarm groups based on domain knowledge unavailable to the clusterer. We already mentioned a couple of classification algorithms in subsection 4.1: Decision Trees, Naive Bayes and Random Forest. In this subsection we will treat other approaches, starting with the two founding papers of alarm classification, which used the C4.5 Decision Tree and the RIPPER rule learning algorithm, after which we give a brief overview of works based on support vector machines and neural networks. We will finish this section by studying two papers that exhibit original views that might be incorporated into our system in future work. As in the previous subsection, all of these works differ from our work in at least two points: the use of a different detector and the absence of an alarm clustering module. Other differences are, again, mentioned below.

Two comparable papers introduced the field of alarm classification in 2004, using different algorithms: Shin et al. [51] applied a C4.5 Decision Tree, whereas Pietraszek used the RIPPER rule learner [45]. We already briefly introduced the Decision Tree algorithm as part of a Random Forest (see subsection 3.3). RIPPER, also known as Repeated Incremental Pruning to Produce Error Reduction, is fairly similar to the C4.5 algorithm. RIPPER has a growing and a pruning phase. In the growing phase it behaves similarly to C4.5 and builds rules based on the most informative attributes. But unlike C4.5, a rule is written in first-order logic, and the algorithm does not terminate when a maximum depth is reached or when the leftover attributes are non-informative, but continues until it classifies the complete training set correctly. Once overfitted nicely, it starts to prune away the least informative rules, defending itself against overfitting (a serious problem of the C4.5 algorithm) while trying to keep a high accuracy. As Cohen, the inventor of RIPPER, pointed out, it yields higher accuracy than C4.5 for large noisy datasets [12]. In contrast to our approach, these papers (as do many subsequent papers) perform binary classification, grouping the alarms into false and true positives (i.e. harmless and intrusive alerts). In comparison with a Random Forest, a single Decision Tree is more prone to overfitting and therefore, in general, less accurate. RIPPER is better


suited to compete with a Random Forest and has the additional advantage of giving clear insight into the produced rules, which might be the best argument for using it: Pietraszek stated in his 2006 PhD dissertation that he “considered the requirement for symbolic representation a fundamental one, and therefore [...] not even consider[ed] any non-symbolic learning methods” [46, p.55].

Bolzoni et al. [7] applied Support Vector Machines (SVMs) to the problem of alarm classification in 2009. SVMs perform binary classification by placing a border between the two classes in such a way as to maximize the distance between the decision boundary and the closest instances of each class (so-called support vectors). The decision boundary can be linear, as in the original implementation, or non-linear, using the ‘kernel trick’. Although a single SVM performs only binary classification, multi-class classification can easily be obtained by training a one-versus-all SVM for each class. The algorithm became popular after the discovery that any decision boundary could be obtained, but its use has been in decline in recent years, since its accuracy is in general lower than that of modern Neural Networks and Random Forests [57]. Regarding alarm classification, Bolzoni et al. showed that their SVM implementation outperforms RIPPER only when the number of training instances per class does not exceed 60.

Neural Networks are the natural competitors of Random Forests, but are found less frequently in works on alarm classification. We only found the 2007 work of Alshammari [3], combining a Neural Network with fuzzy logic to perform binary classification. To be fair, more works that apply a Neural Network can be found in the broader domain of intrusion detection: the 2004 work of Moradi et al. [38] using a Neural Network as detector, or the work of Thomas and Balakrishnan [54] using a Neural Network to decide how to combine multiple detectors. But since we are concerned with alarm classification, we shall only treat the work of Alshammari et al. The fuzzy logic used in their approach makes it possible for an alarm to be assigned a probability of being a false alarm, which is indeed a difference with our work. The results show a comparison with RIPPER, which is favorable towards the Neural Network in terms of reducing the number of false positives, although RIPPER has a lower rate of false negatives. This result is difficult to compare with our approach, since we use multi-class classification.

To finish up this subsection, we would like to highlight two papers that show an interesting and different angle. Firstly, Benferhat et al. [5] created a Bayesian framework capable of incorporating expert knowledge into the classifier, such as “it is expected that 80 percent of traffic will be normal”. Although this might be a helpful addition to our system, we leave this direction for future work. The second paper, by Parikh and Chen [43], shows the promise of boosting (combining SVMs, Decision Trees or multi-layered perceptrons) in intrusion detection. The proposed algorithm, called dLEARNIN, exploits the fact that the intrusion detection data can be split into data sources: information about the event (e.g. the protocol), information about the occurrence of similar events, and information about the state of the operating system. For each of these data sources, and for each class, dLEARNIN creates an ensemble of better-than-random classifiers (equal to a Random Forest when Decision Trees are used). These ensembles are then combined in such a way that the ones with the smallest validation error have the highest influence on the final classification decision. Such boosting algorithms tend to overcome problems of overfitting and obtain high accuracies. Although this paper applies the boosting algorithm to form an intrusion


detector, similar techniques could be applied to alarm classification as well. Again, we have not taken this approach, although we do like to encourage future work in this direction, as we believe that it would be interesting to compare it with our simpler Random Forest classifier.

5 OPTOSS, the original system

In this thesis we build upon the existing components of Opt/Net’s real-time OPerator Time Optimized decision Support System for ICT infrastructures (OPTOSS) [1]. This section serves to make the reader familiar with OPTOSS, starting with an introduction of the components of the original system. We will focus on the functional goals and steps of the components, leaving the implementation details undisclosed. After the architecture of OPTOSS is treated, we describe the system’s responses to different types of attacks. We will finish this section by giving a categorization of OPTOSS and by listing points of improvement, which formed the motivation for this thesis.

5.1 Architecture

The OPTOSS is designed to assist the human expert in network management. First of all, it provides the expert with a graphical user interface to easily assess the state of the network. You may see it as a visual skin wrapped around (network) logs. In principle, the logs are not restricted to originate from computer networks, as any time-series data can be parsed and visualized by the system. Here, as in most cases the tool is used, we applied it to logs from a computer network. Aside from the visual skin, the OPTOSS also provides automatic intrusion detection and raises an alarm every time anomalous behaviour is found. It also performs a crude clustering, grouping similar alarms together. This clustering should enable the human expert to check fewer alarms, and it is exactly the component which we aim to improve in this thesis. Lastly, the graphical user interface allows the expert to label the alarm groups as good, bad or unknown. These labels enable the system to assign the same label to any new alarm that is clustered into a known group.

The components of the original system are shown in Figure 4a. The system consists of a Collector, a Detector and a Severity Clusterer. The figure shows that the logs of the network are parsed by the Collector, which feeds the events to the Detector. As stated in the introduction, an event corresponds to a log entry and contains a time-stamp, but does not have a duration. The Detector then raises alarms, which are clustered by the Severity Clusterer, resulting in a list of alarms. Figure 4a focuses only on components that are part of the intrusion detection system, and thus disregards the graphical user interface which allows the user to observe the results.

The first component, the Collector, not only gathers the log data, but softens the road for the Detector by assigning a severity score to each event. This severity score is the pivotal number of the OPTOSS system. It is computed from rules written by a domain expert and signifies how interesting the event is. Events that obtain a severity score of zero are thrown away. In the next step, the events are aggregated over time (1 second), and joined into event groups spanning multiple seconds. Each event


group starts, and ends, when the deviation of the aggregated severity is lower than a threshold for a certain time.8 These event groups are then the input of the Detector.

The Detector produces an alarm for every event group that has a severity deviation higher than a prespecified threshold.9 Each alarm consists of all the events in the event group. The simplicity of this component is possible only because the real intelligence resides in the rule files, which are already parsed by the Collector.
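A strongly simplified sketch of this Collector/Detector pipeline follows. The real OPTOSS logic keys on the deviation of the aggregated severity and on proprietary thresholds; here, for illustration only, event grouping is reduced to a gap rule and alarming to a peak threshold, with all numbers invented.

```python
# Simplified, illustrative sketch of the Collector/Detector pipeline:
# per-second aggregation of rule-based severities, grouping of nearby
# active seconds, and an alarm per group with a high enough peak.
# The grouping and alarming rules here are our own simplification.
from collections import defaultdict

def aggregate(events):
    """events: (timestamp_second, severity) pairs -> severity per second."""
    per_second = defaultdict(int)
    for ts, sev in events:
        if sev > 0:                    # zero-severity events are dropped
            per_second[ts] += sev
    return dict(per_second)

def detect(per_second, group_gap=2, alarm_threshold=10):
    """Group active seconds closer than group_gap; alarm on high peaks."""
    alarms, group = [], []
    for ts in sorted(per_second):
        if group and ts - group[-1] > group_gap:
            if max(per_second[t] for t in group) >= alarm_threshold:
                alarms.append(group)
            group = []
        group.append(ts)
    if group and max(per_second[t] for t in group) >= alarm_threshold:
        alarms.append(group)
    return alarms

events = [(1, 3), (1, 4), (2, 12), (3, 0), (9, 2), (10, 1)]
per_second = aggregate(events)
print(detect(per_second))
```

In the sketch, only the burst around seconds 1-2 reaches the alarm threshold; the low-severity activity at seconds 9-10 forms a group but raises no alarm, mirroring how modest severities alone do not trigger the Detector.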

The last of the intrusion detection components is the Severity Clusterer, which performs a crude clustering of the alarms. This clustering is based only on the severity of the events within each alarm, disregarding all other information. To be more precise, a graph is created for each alarm showing the severity over time, whereby the aggregated severity of all events per second is used. Next, a distance metric is used to signify the closeness of the severity graphs of two alarms. This results in a list of alarms, grouped together when the severities of the events inside the alarms are similar.
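As a minimal illustration of such a metric, one could take the Euclidean distance between per-second severity curves padded to equal length; the actual OPTOSS metric is proprietary and may well differ.

```python
# Illustrative distance between two severity-over-time curves: pad the
# shorter curve with zero-severity seconds, then take the Euclidean
# distance. This is our own stand-in for the proprietary OPTOSS metric.
import math

def severity_distance(curve_a, curve_b):
    """Euclidean distance between per-second severity curves."""
    n = max(len(curve_a), len(curve_b))
    a = curve_a + [0] * (n - len(curve_a))
    b = curve_b + [0] * (n - len(curve_b))
    return math.dist(a, b)

# Two spiky curves of similar shape are close; a flat curve is far.
print(severity_distance([0, 9, 1], [0, 8, 2]))
print(severity_distance([0, 9, 1], [3, 3, 3]))
```

Under any such metric, alarms whose severity graphs have a similar shape and height end up in the same group, which is exactly the behaviour the Severity Clusterer aims for.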

Finally, the graphical user interface enables the user to exploit the network visualizations, to analyze the alarms and to operate the system. The visualizations show the user plots of the aggregated severity (i.e. the sum of the rule-based severity scores of all events that occurred within a time-span of one second, minute or hour) over time. This is the same data that the Detector and the Severity Clusterer receive, and it can be truly insightful, as different attack types can be recognized by different shapes of the severity curves. Other parts of the graphical user interface enable the user to check the alarm list, change the classification of the alarms (good, bad, unknown) and operate the OPTOSS (e.g. performing automatic network scans and archiving old events).

5.2 Responses to attack types

The distinctive design of the OPTOSS produces an original view on the state of the network, in which attacks can be recognized by the shape of the severity graph. We will now show characteristic OPTOSS responses to different attack types (which were listed in section 2.1). Note that the severity values are assigned using rules that are created by an expert. Any severity graph is thus a reflection of the rules, and might differ greatly when using the system with different rules on other devices. The graphs are therefore meant to characterize typical OPTOSS responses on a well-configured system. Three severity graphs can be found in Figure 2, showing a Remote to Local, a Denial-of-Service and a Probing attack with various display settings. Using these three severity graphs, OPTOSS’ response to the five known attack types, mentioned in section 2.1, will be described.

First of all, the severity graph of the Remote to Local attack (Figure 2a) shows four high, pointy spikes and two shallow spikes of medium height. The highest spikes stem from three unsuccessful attempts to log in to the device, whereby the two closest spikes, in the center, signify a single attempt to log in as local admin. This way, the human expert and the Severity Clusterer can both recognize the attacks by just looking

8 In fact, the Collector does not aggregate the events, but leaves this to a separate component, called the Profiler. Similarly, the Severity Clusterer is in fact implemented inside the OPTOSS component that is called Detector. To keep the explanation clear, we decided to base our explanation on the functional steps taken by the system.

9 In the vocabulary of OPTOSS, an alarm is called an Anomaly History and a group of alarms is denoted as an Anomaly. We will stick to our own terminology.


at the height and shape of the severity graph (the shape gives information about events that occurred in the seconds around the main event(s)). Of course, human experts need to look into the events of each alarm cluster to check what happened, which can be accomplished using the alarm summary, of which an example is shown in Figure 3. If the Severity Clusterer performs accurately, an alarm summary should only be opened once. Thereafter, the known shapes can be matched with known intrusions (or, of course, known but harmless events).

Discovering User to Root and Data Compromise attacks also relies on content-based features. The same detection mechanism thus applies: attacks consist of characteristic events in a specified order, so that fixed patterns emerge in the severity graphs.

Detecting Denial-of-Service attacks can be done in much the same way, although the underlying mechanism differs slightly. Whereas the characteristic events of content-based attacks should be granted a high severity value, the characteristic events of Denial-of-Service attacks should only obtain a modest severity value. This way, single occurrences of these events will not trigger an alarm, but since the Collector sums the severity of all events in a time-span, the total severity will be high in case of multiple events, which can be seen in the severity graphs. The severity rules thus set up an elegant system to exploit both content-based and time-based features in similar fashion. A second way to recognize Denial-of-Service attacks using the severity graphs is by zooming out. The severity graph will show a sudden rise in activity in case of a flooding attack, as seen in Figure 2b. Although this second method can certainly be helpful to the human expert, the Detector does not rely on it at this stage of development.

The last type of attack, probing, could be picked up using mostly connection-based features: one IP address is connecting to multiple devices. At the moment, simultaneous events over multiple devices are not correlated by the OPTOSS, which makes picking up probing attacks more difficult. Still, some of these attacks can be detected by using a well-written rules file, as seen in Figure 2c. The yellow highlights signify that three different alarms are raised for the probing attack. The high spikes in this particular attack are similar to the Remote to Local attack seen in Figure 2a, which is not a coincidence in this case, since the probing attack contained login attempts.

In summary, a well-configured OPTOSS is able to pick up attacks from all known attack types (which are enumerated in section 2.1). Different attacks can be distinguished, by both the human expert and the Severity Clusterer, based on the severity graphs, which characterize the events that accompany the alarm.

5.3 Categorization

Most of the characteristics of the OPTOSS will now be clear. It is host-based, because it runs on a device on which it receives logs of network activity, in contrast with network-based detectors that directly monitor the packets of network traffic. The detector is based on manually crafted rules (instead of machine learning techniques) and is capable of autonomous actions after a known intrusion is detected, making it an intrusion prevention system.

Still, it is unclear whether it employs signature-based or anomaly-based techniques. At first sight, it relies on a rule file to characterize the severity of events, making it rather similar to signature-based approaches. But, in contrast with signature-based


(A) Remote to Local

(B) Denial-of-Service

(C) Probing

FIGURE 2: Severity graphs of various attacks, whereby the severity (y-axis) is plotted against the time (x-axis): (a) an unsuccessful Remote to Local attempt as seen from a two minute time interval; (b) a zoomed out view reveals a possible Denial-of-Service attack; and (c) a probing attack, whereby the alarm highlights (yellow) are turned on.

approaches, it will not directly raise an alarm when an event is recognized and a severity is distributed: only a deviation of severity will lead to an alarm. Reacting to changing activity is, indeed, a characteristic of anomaly-based approaches. Moreover, it aggregates the severity score of all events that happen in the same second, so that alarms might be raised due to events that have a default severity: it can raise alarms that are not based on the rules file. Therefore, we classify the OPTOSS as a hybrid detector, being neither completely signature-based nor completely anomaly-based.

The OPTOSS can thus be seen as a signature-based and anomaly-based HIPS (host-based intrusion prevention system), based on manually crafted rules. These characteristics allow it to apply domain knowledge instead of using a dataset (similar to many signature-based approaches), while it is still able to detect unknown alarms (similar to anomaly-based approaches). Moreover, it is able to perform real-time detection on larger networks than detectors that are based on machine learning. Lastly, it allows for automatic responses to known attacks, making it an effective defence mechanism. To the best of our knowledge, this makes the OPTOSS a unique system.


FIGURE 3: The alarm summary of an unsuccessful Remote to Local attack, containing a list of events which can be correlated with the severity, as visualized by the red line in the graph.

5.4 Areas for improvement

A deficiency of the current OPTOSS is the typically large number of alarm groups, many of which are heterogeneous (i.e. many groups consist of alarms that have varying root causes). This not only forces the human expert to assess each of the many alarm groups, a time-intensive task, but also makes missed intrusions probable, since the human expert will falsely assume that alarms inside heterogeneous alarm groups are similar. These problems are caused by the low accuracy of the Severity Clusterer. It often creates different alarm groups for similar alarms, and sometimes groups different alarms together.

A second issue is that the expert needs to dig into the alarms, checking all the individual events, in order to judge their effect on the system. The system could assist the expert better if it were able to give a better summary of the information at the basis of the alarm.

Other areas for improvement include the detection of long-term anomalies (using severity summed over one minute or hour) and applying alarm correlation (clustering alarm groups, ideally originating from all devices).

6 Proposed system

In this section we will introduce the proposed system, starting with a solution overview that describes our overall approach, and an outlook on the new architecture, followed by a treatment of the design decisions of the individual components. The design decisions we list concern the features, the clusterer and the classifier. We will end this section with some notes about the implementation.


(A) The original architecture.

(B) The proposed architecture.

FIGURE 4: The software architecture of (a) the original and (b) the proposed intrusion detection system.


6.1 Solution overview

Our main contribution is an improvement of the way that the alarms of the OPTOSS detector are grouped together. As explained in section 5.4, this was an important deficiency in the current system, and resulted in a burden of manual labour. Improving the grouping of alarms should lower the amount of human labour, since the human expert now only needs to check each alarm group once, that is, if s/he is confident that the alarm groups are sufficiently homogeneous. Homogeneity of an alarm group here means that the alarms inside the group have the same root cause as defined in section 1.2 (examples of distinct root causes are “successful attempt to get root access” and “failed attempt to get root access”).

We improved the grouping of alarms mainly by including a new unsupervised alarm mining component to cluster the alarms. The results of this clusterer are subsequently assessed by a human expert, who is allowed to make changes whenever the groups, or the group assignments, are not appropriate. After the human expert has finished, the changes should, of course, be applied to new alarms as well. To that end, we added another alarm mining component. This second component learns the correct group assignments from the human expert, and is thus a supervised machine learning component. The addition of the classifier is not meant to directly improve the accuracy of the clusterer, but only to make it possible to incorporate some domain knowledge into the groupings. The architecture of the new system, and more details, will be described in the following subsections.

Another contribution is the automatic generation of descriptions for each alarm group and the improvement of data visualization, tackling the second issue related to the current OPTOSS (see section 5.4). These improvements were merely of a practical nature, and will therefore not feature prominently in this thesis. However, they will be explained briefly in section 6.8.

The other areas in which the current OPTOSS could be improved, being the detection of long-term anomalies and the application of alarm correlation, were scoped out of this thesis.

6.2 Architecture

Our proposed architecture should allow the new components to group the alarms into meaningful groups, and to automatically provide information about each group. In this subsection, we will introduce the components we added on top of the original architecture. Again, we will focus on the functional goals rather than the details of the system. The details will be presented in the next subsections, as well as the changes in the graphical user interface that are needed to support this new architecture.

The architecture of the proposed system is shown in Figure 4b. In comparison with the original system (Figure 4a), four extra components are included. The first two, the Feature Extractor and the Feature Preprocessor, are responsible for composing a dataset. The other two components, the Clusterer and the Classifier, train on this dataset to group the alarms. Of course, the classifier is not satisfied with an unsupervised dataset, and requires it to be labeled. This is done by the human expert, who refines the groups that are found by the Clusterer.

The first of the new components is the Feature Extractor. Directly after each alarm, it will gather extra attributes of the events inside that alarm. Which kind of


attributes it gathers depends on the Feature Selection decisions that will be explained in section 6.3. In any case, it relies on the Collector to supply the right attributes and on the Detector to signify when a new alarm is formed. For information about the sequence of severity scores inside the alarm, it relies on the Severity Clusterer. In contrast to the data of the events, which are often deleted to reduce the size of the database, the information gathered by the Feature Extractor should be kept. This way, the clusterer and the classifier are able to train long after the original events are deleted. This is a necessary precaution since the OPTOSS system is capable of handling more events per day than can reasonably be saved in a database, and the size of the training set is normally in the order of weeks.

After the right attributes are collected by the Feature Extractor, the attributes are polished by the Feature Preprocessor. This component will clean and transform the data into the right format for the machine learning components. The procedures used, such as whitening (transforming the dataset to zero mean and unit variance) and PCA (dimensionality reduction), rely on a dataset to fit their parameters. Therefore, the Feature Preprocessor is started only after the Feature Extractor has produced a sufficient number of features. This component produces the dataset used by the Clusterer and Classifier.
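The two procedures named above can be sketched with plain NumPy. This is a minimal reconstruction, not the actual OPTOSS code; the function names and the synthetic data are our own.

```python
import numpy as np

def whiten(X):
    """Per-column whitening: zero mean and unit variance per feature."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0          # guard against constant columns
    return (X - mu) / sigma

def pca(X, n_components):
    """Project centered data onto its top principal components."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6)) * [1, 2, 3, 4, 5, 6]   # columns on different scales
Z = pca(whiten(X), n_components=2)
print(Z.shape)   # (100, 2)
```

Both `whiten` and `pca` fit their parameters (means, variances, principal directions) on the dataset itself, which is exactly why the Feature Preprocessor must wait for a sufficiently large dataset before it can run.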

The Clusterer runs directly after the Feature Preprocessor and groups the alarms of the Detector. It also tries to deduce information about each alarm group, based on information from the Feature Extractor, as well as statistics showing the uniformity of the alarms within the group. This information is stored to assist the user when assessing and labelling the alarms. Similar to the Feature Preprocessor (and the Classifier), the Clusterer trains its parameters the first time it is run, after which it will be able to quickly cluster new alarms as belonging to a known alarm group, or else create a new alarm group.

In the next phase, the alarm groups are corrected and labeled by the human expert, preparing the labeled dataset on which the Classifier trains. This way, domain knowledge can be used to change the groupings of the Clusterer. After the Classifier is trained, it groups the new alarms (overruling the decision of the Clusterer). Once the training is complete, the Classifier makes sure that new instances of known alarm groups are classified inside the correct alarm group.

In summary, the Collector, Detector, Severity Clusterer (i.e. the existing component, in contrast with the Clusterer) and Feature Extractor are permanently running to create a dataset of alarms. Once the size of the dataset satisfies the user, s/he can let the Feature Preprocessor polish the current dataset and any new alarm. The Clusterer can then be trained, supplying the analyst with a database of alarms to correct and label. Lastly, the Classifier learns from the combined information of the Clusterer and the human expert to group the alarms.

6.3 Feature selection

To select the optimal features for the Clusterer and Classifier, we started by analyzing the way the human experts analyze and assess alarms. Using the original OPTOSS tool, two views of an alarm can be obtained: checking the shape of the aggregated severity, which gives a quick and high-level overview, and a detailed view which shows the list of events present in the alarm. Although the high-level view is certainly helpful, it does not give enough information to determine the exact nature of the alarm


(which explains the suboptimal performance of the Severity Clusterer, see also section 5.4). We therefore decided to use information about the events inside each alarm, besides the information about the severity. Since the number of events inside each alarm differs, the next challenge was to obtain a descriptive summary for each alarm. We decided to use the following features:

1. The device on which the events were found.

2. The severity shape, giving information about the severity of all events, aggregated per second, over time. As explained in section 5.2, the shape of the severity graph can be used to distinguish dissimilar alarms.

3. The description of the event with the highest severity. The description contains the logged information of an event in raw text format. An example of such a description might be “ADM: Local admin authentication failed for login name [NAME]: invalid login name ([DATE-TIME]).”

4. The facilities of the events. The facility describes the kind of program that logged the event. Examples include cron (UNIX scheduled job), sshd (Secure Shell daemon), traffic (data traffic over the network), and unknown. As feature we used two variants: the total number of events for each facility, and the total number of events for each facility weighted by the severity of each event. The latter emphasizes the facilities related to the characteristic events of each alarm.

5. Other features: the mean and maximum severity of the events, and the duration of the alarm.
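As an illustration of the fourth feature, the two facility variants can be computed from an alarm's event list as follows. The event layout and field names here are assumptions for the sake of the sketch, not the thesis data model.

```python
from collections import Counter

# illustrative facility vocabulary; the real set depends on the network
FACILITIES = ["cron", "sshd", "traffic", "unknown"]

def facility_features(events):
    """Return (counts, severity-weighted counts) per known facility.

    events: list of dicts with at least 'facility' and 'severity' keys.
    Facilities outside the vocabulary are folded into 'unknown'.
    """
    counts = Counter()
    weighted = Counter()
    for e in events:
        fac = e["facility"] if e["facility"] in FACILITIES else "unknown"
        counts[fac] += 1
        weighted[fac] += e["severity"]
    return ([counts[f] for f in FACILITIES],
            [weighted[f] for f in FACILITIES])

events = [{"facility": "sshd", "severity": 8},
          {"facility": "sshd", "severity": 2},
          {"facility": "cron", "severity": 1}]
print(facility_features(events))   # ([1, 2, 0, 0], [1, 10, 0, 0])
```

The weighted variant makes a single high-severity sshd event stand out against many low-severity cron events, which is the emphasis on characteristic events described above.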

This way we hope to convey as much of the information about the alarms as possible - information that is used by the human experts as well. Some concessions are made to keep the computational complexity manageable, such as disregarding the descriptions of the events that do not have the highest severity. Of course, we assessed the accuracy of the Clusterer and Classifier on different combinations of these features, the results of which will be shown in section 8.

To reflect their relative importance, we attached weights to each feature. First of all, we wanted to create a separate alarm group for each device, and subsequently granted the device such a high weight that alarms inside a group will always originate from a single device (more explanation of this method can be found in the next subsection). To determine the other weights, we experimented with different values.

6.4 Feature engineering and preprocessing

The sceptic reader might object that “using the description as feature” is easier said than done. Indeed, we need to transform the features into numeric values: a problem of feature engineering. Subsequently, the values need to be polished and brought back to an N by M matrix of N alarms with M feature values per alarm. This step, of polishing the data, is an act of feature preprocessing. Using a single value to denote the device, n values for the description, m for the severity shape, p for the facility and q for the other features, we want to arrive at an N by 1 + n + m + p + q matrix as seen in Figure 5. In this subsection we will describe how we squashed the information for each feature into their respective matrix values.

The first attribute is the device: a categorical attribute. As explained in Section 3.1, caution is required when handling a categorical attribute, since many machine


X =

            Device   Description        Severity           Facility           Other
Alarm 1     a1       d1,1 ... d1,n      s1,1 ... s1,m      f1,1 ... f1,p      o1,1 ... o1,q
Alarm 2     a2       d2,1 ... d2,n      s2,1 ... s2,m      f2,1 ... f2,p      o2,1 ... o2,q
...         ...      ...                ...                ...                ...
Alarm N     aN       dN,1 ... dN,n      sN,1 ... sN,m      fN,1 ... fN,p      oN,1 ... oN,q

FIGURE 5: The dataset consists of a row for each of the N alarms. Each row contains a value denoting the device on which the alarm was raised; n values about the descriptions of the events in the alarm; m values expressing the shape of the severity over time, for all events inside the alarm; p values that indicate the ratio of events from each facility in the alarm; and finally q other values, such as the duration of the alarm.

learning algorithms (including K-means) will incorrectly assume that a device with value 1 will be closer to device 2 than to device 3. But contrary to our advice in the theory section, we did not apply one-hot encoding. Instead, we multiplied each device value with a disproportionally large weight. This way, the machine learning algorithms will never group two alarms from different devices together, and we can get away with using only a single feature value for the device. Another possibility was to separate the events and alarms based on their device, and then train a separate clusterer for each device, an approach we did not take because the clustering algorithm would then force us to specify the number of alarm groups per device. Using the disproportionally large weight for the device, we only have to determine the total number of alarm groups.
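The effect of the disproportionally large device weight can be sketched in a few lines. The weight value and function name here are illustrative assumptions; any value far above the other feature scales has the same effect.

```python
import numpy as np

DEVICE_WEIGHT = 1e6   # illustrative; must dwarf the other feature scales

def add_device_column(features, device_ids):
    """Prepend a heavily weighted device column to the feature matrix.

    With the weight this large, the squared distance between alarms
    from different devices is dominated by the device term, so a
    distance-based clusterer will never merge them into one cluster.
    """
    col = np.asarray(device_ids, dtype=float).reshape(-1, 1) * DEVICE_WEIGHT
    return np.hstack([col, np.asarray(features, dtype=float)])

# Two alarms with identical features but different devices:
X = add_device_column([[0.1, 0.2], [0.1, 0.2]], device_ids=[1, 2])
print(np.linalg.norm(X[0] - X[1]))   # 1000000.0
```

Even though the two feature rows are identical, the device column keeps the alarms a million units apart, which is why K-means will always put them in separate clusters.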

The severity shape was already used by the Severity Clusterer, an existing OPTOSS component that is able to compute the difference between the severity graphs of two alarms. We reprogrammed this component as a feature extractor to create clusters of alarms with similar severity graphs, saving the distance between clusters. This way, we obtained a high number of values for each alarm, denoting their distance to all other severity clusters.

Next, the severity distances were whitened. Instead of normal whitening, we applied the same transformation to all severity-distance columns. Each separate column can thus have a nonzero mean and a variance unequal to one. This way, the columns remain comparable and a notion of distance is preserved.
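The difference with normal whitening can be made concrete with a small sketch (our reconstruction, not the thesis code): a single mean and standard deviation, computed over the whole block of columns, is applied to every column.

```python
import numpy as np

def whiten_jointly(X):
    """Whiten a block of distance columns with ONE shared mean and
    standard deviation, so relative distances between columns survive.
    Individual columns may keep a nonzero mean and non-unit variance.
    """
    mu = X.mean()            # scalar: mean over the whole block
    sigma = X.std() or 1.0   # scalar: std over the whole block
    return (X - mu) / sigma

D = np.array([[0.0, 10.0],
              [2.0, 12.0]])
W = whiten_jointly(D)
# Column differences are preserved up to one global scale factor:
print(W[:, 1] - W[:, 0])
```

Per-column whitening would force both columns to unit variance, erasing the fact that the second column holds systematically larger distances than the first; the joint transformation keeps that information.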

As a last step for the severity shape, we applied PCA to reduce the dimensionality to m, so that it fits the dataset. We selected this dimensionality reduction technique because the distance values were rather noisy. This noise stems from the fact that the used distance metric is nonsymmetric (e.g. if the distance from severity group A to group B is 1, the distance from B to A does not necessarily equal 1); the distances do show interesting tendencies, but with large, random-looking fluctuations. When interpreting these random-looking distances as noise around ‘real’ distances, we hoped that PCA could assist by not only reducing the dimensionality, but also by finding a more expressive summary of the distances. This hope is grounded in the fact that PCA has a natural tendency to remove noise, as explained in section 3.1.

To obtain the description values, we computed a string distance from each string


to n example strings. As string metric, we used either the Jaro distance[26] or the Levenshtein distance[35], both adequate to group descriptions together when, for instance, an extra word is added in one description. We found little reason to use a more sophisticated model such as a bag-of-words model, which would be more helpful when we hope to group descriptions together that are likely to denote a similar event but have a completely dissimilar word order, or use synonyms, of which we found very limited examples in our dataset.
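A minimal Levenshtein implementation, of the kind that could back these description distances, looks like this (the production system may well use a library; this is only a sketch):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: the minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Adding one word changes the distance only modestly, which is exactly
# the behaviour that makes this metric suitable for log descriptions:
print(levenshtein("Local admin authentication failed",
                  "Local admin authentication failed again"))   # 6
```

Descriptions that differ by one inserted word stay close under this metric, while unrelated descriptions end up far apart, so the distances to the n example strings form a usable coordinate system.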

The description values were prepared in four steps: preprocessing the strings to ignore values like dates, selecting n example strings, computing the distance between each string and the example strings, and finally whitening these distances. The first step, removing dates and numbers from the descriptions, ensured that descriptions are matched based on their subject rather than on their low-level values. To that end, we also removed values from the descriptions when they matched the format “label1=value1 label2='value2'”. Although the original descriptions were retained so that they could be shown to the user, the Clusterer and Classifier did not use this information.
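The first step could be reconstructed as a handful of regular expressions. The exact patterns used in the thesis are not specified, so the ones below are assumptions that merely illustrate the idea:

```python
import re

def clean_description(desc: str) -> str:
    """Strip volatile parts of a log description so that two
    occurrences of the same event compare as equal."""
    # drop key='value' and key=value pairs
    desc = re.sub(r"\b\w+\s*=\s*'[^']*'", "", desc)
    desc = re.sub(r"\b\w+\s*=\s*\S+", "", desc)
    # drop dates such as 2016-10-26 and times such as 12:34:56
    desc = re.sub(r"\d{4}-\d{2}-\d{2}", "", desc)
    desc = re.sub(r"\d{2}:\d{2}:\d{2}", "", desc)
    # drop remaining bare numbers and collapse whitespace
    desc = re.sub(r"\b\d+\b", "", desc)
    return re.sub(r"\s+", " ", desc).strip()

a = clean_description("login failed for user=alice at 2016-10-26 12:34:56")
b = clean_description("login failed for user=bob at 2016-01-11 08:00:01")
print(a == b)   # True
```

After cleaning, two occurrences of the same event with different user names and timestamps map to the same string, so their string distance becomes zero.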

For the second and third step, we selected n example descriptions that were highly different from each other. The first is chosen at random. After a new description is selected, the distances to all other descriptions are computed. Then the next example description is chosen to maximize the minimum distance to each of the example descriptions. We preferred this dimensionality reduction method over PCA because it distributes the example descriptions evenly, also when many descriptions are similar, which is not guaranteed for the linear combinations of description-vectors that are used by PCA. Furthermore, PCA cannot benefit from its tendency to reduce noise in our case (since there is none to start with) and is computationally more expensive.
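This greedy farthest-point selection can be sketched as follows. As a stand-in string metric we use `difflib` from the standard library (the thesis used Jaro or Levenshtein); the seed index is fixed here instead of random, purely to keep the example deterministic.

```python
import difflib

def string_distance(a: str, b: str) -> float:
    """1 - similarity ratio: 0 for identical strings, up to 1."""
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()

def select_examples(descriptions, n, seed_index=0):
    """Greedy farthest-point selection: repeatedly pick the description
    whose minimum distance to the already chosen examples is largest."""
    chosen = [descriptions[seed_index]]
    while len(chosen) < n:
        best = max((d for d in descriptions if d not in chosen),
                   key=lambda d: min(string_distance(d, c) for c in chosen))
        chosen.append(best)
    return chosen

logs = ["login failed", "login failed badly", "disk full", "disk almost full"]
print(select_examples(logs, 2))
```

Because each new example maximizes its minimum distance to all previous ones, the second pick jumps to the "disk" descriptions rather than to the near-duplicate "login failed badly", which is exactly the even spread that PCA does not guarantee.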

In the last step, the description distances were whitened. Again we applied the same transformation to all columns so that the meaning of the distances remains.

The facility features were the last features for which dimensionality reduction was needed. As explained in the previous subsection, we saved two versions of this feature for each alarm: an unweighted variant saving the number of events of each facility and a weighted variant saving the summed severity of the events of each facility. This yields a matrix with a column for each of the possible facilities. Since each of the columns in this matrix denotes a genuinely different dimension of the data, we applied default whitening (per column, in contrast to the ‘total whitening’ used for both of the distance-based features). As dimensionality reduction we applied PCA, since we have no better way than variance to measure the usefulness of the data. Furthermore, PCA might reduce the noise in the facility data, which stems from small deviations in the events of similar alarms.

Lastly, the Other features are rather uncomplicated. They contain values such as the duration, the mean severity and the maximum severity of the alarm. The only preprocessing operation for these values was column-wise whitening.

To summarize the feature engineering and preprocessing, we applied separate whitening and dimensionality reduction to each of the features (device, description, severity shape, facility and other features), obtaining the dataset visualized in Figure 5.


6.5 Clusterer

Sometimes the simplest algorithms are best. In our case, we started with the most straightforward clustering algorithm, K-means, and were never seduced to part with it, because the results proved to be good. This can partly be explained by the shape of the clusters, which should be hyperspherical for the K-means algorithm. The clusters are hyperspherical when, for each class and each dimension, the within-class variance is similar (i.e. when the dimensions are normalized and each dimension is equally relevant). We expect this to be the case: first of all, the variance of each dimension is similar. This is caused by the normalization method, which makes the average variance of all dimensions unit, and therefore the variance of each dimension almost unit. Thereafter, the dimensionality reduction might cause some minor deviations by increasing the variances of some dimensions, resulting in variances that will be relatively similar. Second of all, when a dimension is less relevant (for instance the facility) and thus shows great deviation within a ‘natural cluster,’ we place a smaller weight upon it, causing the clusters to be practically hyperspherical.

The main drawback of the K-means algorithm is the requirement to provide it with the number of clusters. As an alternative, we could have turned towards the DBSCAN algorithm, or towards iteratively running the K-means algorithm until the mean distance to the cluster center was below a prespecified threshold. The advantage of these other approaches is that the required threshold stays roughly the same for different datasets. In practice, however, we found that the number of alarm clusters could be finetuned quite easily (especially with the data visualization additions described in subsection 6.8). We therefore decided to keep the plain K-means algorithm, motivated by the good results and the acceptable finetuning.

Regarding the implementation, we chiefly needed to ensure that the algorithm would not get stuck in a local optimum. We therefore reran the K-means algorithm multiple times with random initialization, which is the default tactic when applying K-means. At the end, we selected the run with the lowest mean squared distance from each event to its cluster center.
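A minimal version of this restart scheme, in plain NumPy rather than the production code, can look like this:

```python
import numpy as np

def kmeans_once(X, k, rng, iters=100):
    """One run of Lloyd's algorithm from a random initialization."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    sse = ((X - centers[labels]) ** 2).sum()
    return centers, labels, sse

def kmeans_restarts(X, k, n_init=10, seed=0):
    """Rerun K-means and keep the run with the lowest within-cluster SSE."""
    rng = np.random.default_rng(seed)
    return min((kmeans_once(X, k, rng) for _ in range(n_init)),
               key=lambda run: run[2])

# Two well-separated synthetic clusters around 0 and 5:
X = np.vstack([np.random.default_rng(1).normal(0, 0.1, (20, 2)),
               np.random.default_rng(2).normal(5, 0.1, (20, 2))])
centers, labels, sse = kmeans_restarts(X, k=2)
print(np.sort(centers[:, 0]).round())
```

Keeping the run with the lowest sum of squared errors is exactly the "lowest mean squared distance" criterion described above; with enough restarts, the probability of ending in a poor local optimum becomes negligible.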

As an alternative to random initialization, we also experimented with the refined start algorithm developed by Bradley and Fayyad[8]. This algorithm relies on the usage of multiple subsamples of the dataset, each of which is clustered by a K-means clusterer. From the resulting set of cluster centers of the runs, each using a different subsample, the best cluster centers are then computed. To do this, the squared distance of each data instance to the nearest cluster center is minimized. Using this refined start algorithm, the performance and accuracy might be improved.

6.6 Classifier

For the next component of our pipeline, the Classifier, we put our trust in the Swiss army knife of supervised machine learning: the Random Forest algorithm. As explained in the Theory section (3.3), the main contestants for this part of the code are Neural Networks and Random Forests, where the former is especially useful to exploit hidden structures in the data. Although such hidden structures might indeed exist (for instance the attack type), we do not believe that the accuracy of the Classifier depends on those high-level pieces of information. We expect the Classifier to assist in two


ways: to divide an existing cluster and to combine clusters. Both actions reflect a difference found by the human expert (e.g. two descriptions with similar characters but different semantics, such as “successful login” and “unsuccessful login”) and should normally not require any high-level understanding of the data. We thus do not need the Classifier to infer abstract representations from the data, and favor the Random Forest algorithm over a Neural Network, since it has a lower computational complexity and is easier to tune.

One disadvantage of the default Random Forest algorithm is the offline training: it needs to be retrained after the training data has changed, which may be time-consuming. Although online solutions exist, for instance the efficient and accurate Mondrian Forests[30] algorithm, we decided not to take that path. In practice, we expect the human expert to label the data only on a few occasions: once at the start and, sporadically, when new types of attacks are discovered that are very similar to existing attacks, and thus clustered incorrectly. Therefore we expect few annoyances when using the default, offline Random Forest algorithm.

A second disadvantage might actually originate from a celebrated benefit of the Random Forest algorithm: that it is naturally guarded against overfitting. In our case, we want the algorithm to generalize in moderation. We especially do not want it to undo any of the changes that are made by the human expert. Luckily for us, the Random Forest algorithm has parameters to influence the amount of overfitting, most prominently the maximum depth of each tree: when allowing each tree to grow larger, the tree will cease to be a weak classifier, and the forest might overfit. We used this parameter to control the amount of generalization of the Random Forest.
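The effect of the maximum tree depth can be illustrated with scikit-learn on synthetic data. The parameters below are illustrative, not the configuration used in the thesis: shallow trees generalize aggressively, while unrestricted trees can reproduce the training labels almost exactly, which in our setting means they preserve the expert's corrections.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Compare a heavily restricted forest against an unrestricted one;
# the training accuracy shows how closely each memorizes the labels.
for depth in (2, None):
    clf = RandomForestClassifier(n_estimators=50, max_depth=depth,
                                 random_state=0).fit(X, y)
    print("max_depth =", depth, "-> train accuracy:", round(clf.score(X, y), 2))
```

In the alarm-classification setting, a larger `max_depth` therefore moves the forest towards faithfully reproducing the expert's relabelled groups, at the cost of weaker generalization to genuinely new alarms.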

6.7 Computational complexity

Computational efficiency and scalability are of utmost importance to the OPTOSS in order to guarantee real-time performance on large streams of data. In this thesis, the following components were introduced, where N stands for the number of alarms:

1. The Feature Extractor: O(N).

2. The Feature Preprocessor: O(N).

3. Training the K-means Clusterer: O(N).

4. Training the Forest Classifier: O(N logN).

The total complexity of the extra components thus equals O(N logN): it is linearithmic in the number of alarms. Since the number of alarms is drastically smaller than the number of events, which need to be processed in real-time, the complete pipeline remains efficient.

The one drawback of the new OPTOSS might be that the training of both machine learning components (the Clusterer and the Classifier) will produce a temporary surge in the computational requirements of the system, whereas the old OPTOSS trains online. These new bursts of CPU (and RAM) usage might prove to be an issue when training on enormous datasets, although they could easily be dealt with by allocating appropriate hardware.


6.8 Graphical user interface

Now that the human expert is responsible for assessing the alarm groups, it is important to make this task as easy as possible. To that end, we adjusted the graphical user interface in three ways.10

The first adjustment was meant to assist the human expert when choosing which alarms to label. To this end, the list of alarms was augmented with information about the description, facility and severity shape of each alarm: for each of these features we showed the distance to the center of the alarm group the alarm belongs to (which is possible since the dataset contains separate values for each of the features, as was shown in Figure 5).

The second change shows a summary of the alarm groups, which is primarily helpful when checking whether the number of clusters for K-means was chosen appropriately. The summary is visualized by three graphs that show the distribution within and outside the alarm group regarding the description, facility and severity shape. Since the dimensionality of these features does not allow them to be plotted easily, we use PCA to obtain the two most informative principal components of each feature, which were visualized by scatter plots. An example of these plots is presented in Figure 6, visualizing the variance in description, facility and severity shape of two alarm groups. Ideally, the variance of each feature is small within each group. In the example, the variance of the description is small for both groups, while the variance of the severity shape is slightly higher for the first alarm group, and the variance of the facility values is slightly higher for the second alarm group. Moreover, the description values form a spherical cluster, whereas the other clusters show different shapes. This is possible because the weights for the description values are high, so that the Clusterer focuses mainly on reducing the distance of these values to the cluster center. These considerations might prompt the expert to use fewer clusters (when multiple alarms show that the variance is small and some alarm groups should be combined) or to adjust the weights (when the cluster groups are not homogeneous).

Lastly, the user interface was augmented with the possibility to change the assignment of an alarm to another alarm group, in order to allow the human expert to manually label the alarms.

7 Experimental setup

In this section we will describe the experiments we conducted to verify that the clustering module is able to generate homogeneous alarm groups, and that the classification module is able to correctly alter the alarm groups and the group assignments of new alarms, based on expert feedback. We will treat the dataset, the creation of the ground truth, the types of experiments and some implementation details. Recall that the metrics are already described in the introduction (section 1.4).

10Some other adjustments to the graphical user interface were made as well. These adjustments were not so much needed to allow the usage of the new components, but were made to improve the OPTOSS in other areas. They included the addition of a severity graph underneath the event list, as seen in Figure 3, to enable the human expert to easily learn the correlation between severity graphs and the type of alarm.


(A) IP spoofing

(B) Harmless logging

FIGURE 6: Visualization of the distribution of severity shape distances, facilities and description distances of two alarm groups. The red dots denote the values from the alarm that is being viewed; the yellow dots show the values from the alarms belonging to the same alarm group; and the black dots denote the alarms from other alarm groups on the same device.


7.1 Dataset

Following Veeramachaneni et al. [56], we conducted our experiments on a dataset we assembled ourselves, as motivated in the introduction (section 1.4). To obtain this dataset, we gathered data from a test network of OPT/Net, consisting of a switch, a couple of routers and a couple of devices with various operating systems. Each of these devices sent its syslog to the Collector, which put the OPTOSS, including the new components, to work.

The network activity originated from four sources: normal activity of the devices, warnings raised to continually check the working of the system, attacks from outsiders, and our own attacks. Although outsiders are always keen to try to break into the network of OPT/Net, we have noticed no successful attacks. Of course, the failed attacks are also of interest to us, as are misconfigurations of the devices (justly raised warnings and errors), and our own attack simulations.

The dataset contains 201061 alarms, raised by the Detector in 50 days of network activity on 5 devices. On average, each alarm is based on 10.2 events. Many of the alarms were exact duplicates of other alarms, in the sense that they were based on the exact same events in the exact same sequence. When disregarding these duplicates, 6321 unique alarms are left, which were divided into 2388 alarm groups by the Severity Clusterer. The dataset is heavily skewed in multiple ways: 88 percent of the alarms stem from only 3 alarm groups and 92 percent of alarms originate from only 2 devices.

7.2 Creating the ground truth

To validate the alarm mining components, the results were compared with a ground truth: a grouping of alarms that is considered to be correct. This ground truth was not given in advance, but was created by manually looking into each alarm group. We will now describe the way we created the ground truth and the assumptions that underlie this approach.

It was not feasible to create the ground truth by checking every single alarm, because the number of alarms was too big. Therefore, we relied on the new Feature Extraction and Feature Preprocessing components to identify alarms that are completely equal. Two alarms were considered to be equal when (1) the description of the most important event inside the alarm was equal, (2) the distribution of events from each facility within the alarms was equal, and (3) the severity shape was equal. After a list of unique alarms was created, each was manually inspected by checking the events that caused the alarm. For all the events, an origin and, sometimes, consequences could be identified, leading to the formulation of a root cause. This way we obtained a list of root causes that could be correlated with the description, facility and severity shape. Any grouping of the alarms was then validated based on this list.
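The duplicate test described above can be sketched as a fingerprint function over the three attributes. The field names (`main_description`, `events`, `facility`, `severity`) are hypothetical stand-ins for the actual data model.

```python
from collections import Counter

def alarm_key(alarm):
    """Hashable fingerprint: two alarms with the same key are treated as
    exact duplicates (same main description, facility distribution and
    severity shape)."""
    facility_dist = tuple(sorted(Counter(e["facility"] for e in alarm["events"]).items()))
    severity_shape = tuple(e["severity"] for e in alarm["events"])
    return (alarm["main_description"], facility_dist, severity_shape)

def unique_alarms(alarms):
    """Keep the first occurrence of each fingerprint."""
    seen, unique = set(), []
    for a in alarms:
        k = alarm_key(a)
        if k not in seen:
            seen.add(k)
            unique.append(a)
    return unique

# Toy data: two duplicates and one distinct alarm.
alarms = [
    {"main_description": "sshd: failed login",
     "events": [{"facility": "auth", "severity": 5}, {"facility": "auth", "severity": 3}]},
    {"main_description": "sshd: failed login",
     "events": [{"facility": "auth", "severity": 5}, {"facility": "auth", "severity": 3}]},
    {"main_description": "sshd: accepted login",
     "events": [{"facility": "auth", "severity": 2}]},
]
```

Applying `unique_alarms` to the toy list collapses the two duplicates, mirroring how 201061 alarms were reduced to 6321 unique ones.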

Our way of creating the ground truth rests upon three assumptions: the root cause of alarms with equal attributes (e.g. description of main event) is equal, each alarm has only one root cause, and the origin and possible consequences of events are interpreted correctly.

The first assumption rests upon the observation that two alarms mostly share the same root cause when the main events inside the alarms have the same description. The description is very revealing, since it is obtained from the original line inside the


log. Of course, even when the main description is equal, the root cause of two alarms can still be different. But if this is the case, then some events (i.e. other than the main event) should differ. If two event groups are different (i.e. when at least one event differs), the difference will be picked up, except when both of the event groups have exactly the same distribution of facilities, and the same sequence of severity scores. Although this is possible, it would be highly unlikely, given a well-configured rules file. Therefore we are confident we reduced the complete list of alarms to a smaller set of unique alarms without missing any root cause.

Secondly, we assumed that each alarm has only a single root cause. This is a simplification. In reality, an alarm is raised over all consecutive events that happen in a certain period. Such events might originate from multiple root causes. Whenever this happened, we disregarded all events that we could not correlate with the main event inside the alarm (i.e. all events that were unrelated to the event with the highest severity score). Given that the rules file is written correctly, this event should signify the most important activity. We thus allowed the possibility that we missed some intrusive events. We justify this by stating that this will be the way that the tool will be used in reality as well: first, the focus will be only on the most important issues, and less important intrusions will show once the main intrusions are solved.

The last assumption, concerning correct interpretation of the root causes, acknowledges the possibility of human mistakes while assessing the data. It is therefore part of any attempt to create a ground truth.

To complete the formation of the ground truth, each root cause was labeled as intrusive or harmless. We used the convention that an intrusive root cause needs to be taken care of, while harmless root causes can be ignored. This means that we not only flagged intrusive attacks, but also misconfigurations of the system that needed to be fixed.

7.3 Experiments related to the alarm Clusterer

The experiments related to the alarm Clusterer focused on the homogeneity of the resulting alarm groups. The experiments were conducted after the original OPTOSS had collected 50 days of events, aided by the new feature extractor, and after the feature preprocessor had created the final dataset. The way the features are created has a large impact on the quality of the Clusterer, because the Clusterer is based upon the assumption that two alarms with the same root cause have similar features. Therefore, we experimented with the parameters of the preprocessor, which have been explained in sections 6.3 and 6.4. This resulted in the dataset visualized in Figure 5. The parameters of the preprocessor, along with their default values and other values that have been tested, are:

• The metric used to obtain description distances:
  – Jaro distance (default)
  – Levenshtein distance

• The dimensionality reduction method for description values:
  – Based on distance to n distinctive descriptions (default)
  – PCA

• The number of description dimensions:


  – 5 (default)
  – 10

• The number of facility dimensions: 5

• The number of severity dimensions: 5
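The Levenshtein metric listed above is the standard dynamic-programming edit distance; a minimal sketch follows (the Jaro metric plays the same role in the pipeline but is omitted for brevity).

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string s into string t."""
    prev = list(range(len(t) + 1))          # row for the empty prefix of s
    for i, cs in enumerate(s, 1):
        curr = [i]                          # deleting i characters of s
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]
```

For example, `levenshtein("kitten", "sitting")` is 3 (two substitutions and one insertion), which illustrates why this metric suits free-form log descriptions of arbitrary length.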

Once the dataset was created, the Clusterer was run. As with the preprocessor, different values for the parameters were tested, as described in section 6.5. The parameters of the Clusterer, again with their values, are:

• K:
  – 125 (default)
  – 170

• Weights: { Descriptions, facilities, severity }
  – {1, 1, 1}
  – {10, 1, 1} (default)
  – {100, 1, 1}

• Initialization policy:
  – Randomized (default)
  – Refined start
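The weights can be understood as a rescaling of the feature vectors before K-means: multiplying a feature block by the square root of its weight scales that block's contribution to the squared Euclidean distance by the weight itself. A sketch, assuming the 5-dimensional blocks listed above; the function name is ours, not part of the implementation.

```python
import math

def apply_weights(vector, weights=(10.0, 1.0, 1.0), block=5):
    """Scale each feature block (description, facility, severity) by the
    square root of its weight, so that K-means' squared distances are
    effectively weighted by `weights`."""
    out = []
    for b, w in enumerate(weights):
        factor = math.sqrt(w)
        out.extend(x * factor for x in vector[b * block:(b + 1) * block])
    return out

# Toy 15-dimensional vector; weight 4 on the description block doubles it.
scaled = apply_weights(list(range(15)), weights=(4.0, 1.0, 1.0))
```

With this transformation an off-the-shelf (unweighted) K-means implementation, such as the one in mlpack used by our system, can cluster under the weighted metric without modification.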

This way, we obtained seven settings for the preprocessor and the Clusterer (keeping all but one parameter on the default value), along with one default run. As baseline for the unsupervised approach, we determined the homogeneity of the alarm groups that were created by the original OPTOSS.

The values that were tested for each parameter were chosen by hand. We did not perform any form of grid search, for two reasons. Firstly, the results of the default run were very satisfying already, making it unnecessary to perform more experiments with the goal of improving the results. Secondly, the used values were sufficient to determine the effect of their parameters.

7.4 Experiments related to the alarm Classifier

The experiments related to the alarm Classifier show the accuracy of the Classifier in its task of determining the alarm groups of previously unseen alarms. The experiments are performed after the alarm Clusterer has created alarm groups. The procedure starts by dividing the alarms into a training set and a test set, whereafter some assignments of the training set are changed (i.e. the human expert decides that some assignments of the Clusterer were inappropriate). The task of the Classifier consists of grouping unseen alarms into the right alarm group. Hereby, the Classifier should mimic the actions of the Clusterer, except for instances where unseen alarms are similar to alarms whose assignment has been manually changed.

We changed the assignments of some alarms that were put in an inappropriate alarm group by the Clusterer. In all these cases, the Clusterer had grouped together alarms that had similar features but a different root cause, which is exactly the scenario in which the Classifier should assist. We changed the assignments of alarms that belonged to three distinct alarm groups, splitting those groups into two (we thus created three new alarm groups). The first alarm group originally contained


both successful and unsuccessful login attempts. The second alarm group contained logs from a program, some of which contained errors. The last alarm group contained probing attacks that were picked up by the firewall, of which some were directed towards our demo device. The tasks of the Classifier vary in difficulty: the first alarm group can be easily split into two, because the successful and unsuccessful login attempts create alarms that are quite dissimilar in both description and severity shape. The last alarm group was the most difficult to split, since the only difference between the alarms was a subtle difference in the description.

We allocated the first three quarters of the clustered alarms to train our Classifier, leaving the remaining quarter as test set. Additional experiments were conducted while training on half, or even on only a quarter, of the complete set of alarms, leaving the remainder as test set.
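The split above can be sketched as a simple chronological cut over the time-ordered alarms: no shuffling, mirroring how new alarms arrive only after the training period. The function name is an illustrative assumption.

```python
def chronological_split(alarms, train_ratio=0.75):
    """Split a time-ordered list of alarms into (train, test) without
    shuffling; the first `train_ratio` fraction becomes the training set."""
    cut = int(len(alarms) * train_ratio)
    return alarms[:cut], alarms[cut:]

train, test = chronological_split(list(range(8)), train_ratio=0.75)
```

A chronological rather than random split is the natural choice here, since in deployment the Classifier will always be trained on past alarms and applied to future ones.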

As with the Clusterer, we conducted multiple experiments while trying different parameters. The used parameters of the Classifier, along with their values, are:

• Used features:
  – Device, description (5 dimensions), facility (5 dimensions) and severity (5 dimensions)

• The maximum number of features per tree:
  – 4 (default)
  – 1, 2, 3, 4, 5, 6 or 10

• The maximum depth of each tree:
  – 25 (default)
  – 3, 5, 10, 15 or 50

• The maximum number of trees:
  – 1000

As baseline, we used the accuracy of the Clusterer. Since the Clusterer is not aware of the manual changes, it will make mistakes on all of the reassigned alarms. Moreover, we compensated for the fact that some of the classes did not appear in the training set, but only in the test set. The Classifier cannot know that these classes exist, prompting mistakes on these alarms as well (normally, the Classifier should be retrained after new alarm groups are discovered by the Clusterer). As baseline we thus obtained a score of 0.9971 (of the 50235 alarms in the test set, 144 were changed and 37 were new). Admittedly, this is a high baseline, caused by the fact that the number of changed alarms was low. The theoretical limit of the accuracy is 0.9993, whereby all alarms of known alarm groups would be classified correctly (of the 50235 alarms in the test set, only the 37 new alarms will then be misclassified).
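The baseline and the theoretical limit follow from simple counting over the test set, as the following sketch with the numbers quoted above shows.

```python
# Counts from our test set (section 7.4).
test_size = 50235        # alarms in the test set
reassigned = 144         # alarms whose group the expert changed
unseen_groups = 37       # alarms of groups absent from the training set

# The Clusterer is unaware of the manual changes, so it misses exactly
# the reassigned alarms.
baseline = 1 - reassigned / test_size

# Even a perfect Classifier must miss the alarms of unseen groups.
theoretical_limit = 1 - unseen_groups / test_size
```

Rounding to four decimals reproduces the 0.9971 baseline and the 0.9993 limit stated above.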

7.5 Implementation

The new components were implemented in an architecture similar to the existing OPTOSS components, and were mainly written in C++ with a plain SQL database. For the implementation of PCA we relied on the OpenCV [9] library. For the machine learning algorithms we relied on mlpack [14] (for K-means) and, again, on OpenCV (for Random Forest). The graphs that became part of the graphical user interface were constructed using PyGTK in combination with matplotlib [24].


8 Results and discussion

8.1 Clusterer

The Clusterer creates alarm groups that are judged on their homogeneity. Following section 1.4, two measures of homogeneity are used: based on the distinction between true and false positives (i.e. intrusions and harmless activity), and based on their root causes. Incrementing the number of alarm groups of course makes it easier to obtain homogeneous alarm groups. Therefore, the reduction ratio (the number of alarm groups compared to the number of unique alarms) is closely related to group similarity. We will start by describing the main results, by showing the reduction ratios versus group similarity (based on the root causes). Subsequently, we will show the effect of varying the parameters, first for the root cause similarity and then for the true and false positives.

Experiment   k      Reduction ratio   Group similarity   Mixed
Baseline     2314   0.634             0.8766             0.35
k = 240      240    0.962             1.0000             0.00
k = 170      170    0.973             0.9868             0.05
k = 125      125    0.980             0.9794             0.11

TABLE 2: Results of the Clusterer, showing the trade-off between the reduction ratio and the homogeneity of the alarm groups, whereby the baseline is also shown, formed by the clustering of the original OPTOSS. The homogeneity is represented by the group similarity (the average ratio of alarms that have the same root cause as the most frequent root cause of that group) and the ratio of mixed alarm groups (alarm groups that contain multiple root causes).

The main results of the Clusterer are shown in Table 2 and show an immense improvement over the baseline (the alarm clustering of the existing OPTOSS). While the baseline has an average group similarity of 0.88, the new system obtains a perfect score, and still manages to perform a tenfold reduction of the number of alarm groups. This average group similarity shows the ratio of alarms for which the root cause is the most occurring root cause in the alarm group, averaged over all alarm groups. Even when increasing the reduction ratio further by using only 125 alarm groups, the average group similarity of the new system remains almost 0.98.

Although the average group similarity forms a strong indicator of the accuracy of the Clusterer, it does not say anything about the number of alarm groups that contain multiple root causes. A high average group similarity can still mean that many alarm groups contain a single alarm with a deviating root cause, which will still oblige the human expert to manually assess multiple alarms in every alarm group, because he will not know which alarm groups are mixed. Therefore, the ratio of mixed alarm groups (i.e. groups that contain multiple root causes) is also an important metric, and is also shown in Table 2. The results show a similar tendency to the group similarity with, again, an immense improvement compared to the baseline. The new system that was allowed to create 240 alarm groups again shows a perfect result by managing to form zero mixed alarm groups.
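Both homogeneity metrics, as well as the reduction ratio, are straightforward to compute. A sketch with our own (illustrative) function names, where each alarm group is represented by the list of root-cause labels of its alarms:

```python
from collections import Counter

def group_similarity(groups):
    """Mean, over all groups, of the fraction of alarms sharing the
    group's most frequent root cause."""
    ratios = [Counter(g).most_common(1)[0][1] / len(g) for g in groups]
    return sum(ratios) / len(ratios)

def mixed_ratio(groups):
    """Fraction of groups containing more than one distinct root cause."""
    return sum(len(set(g)) > 1 for g in groups) / len(groups)

def reduction_ratio(n_groups, n_unique_alarms):
    """1 minus the number of alarm groups relative to unique alarms."""
    return 1 - n_groups / n_unique_alarms

# Toy example: one mixed group and one pure group.
toy = [["ssh-probe", "ssh-probe", "disk-error"], ["ntp-drift", "ntp-drift"]]
```

With our dataset's 6321 unique alarms, `reduction_ratio(240, 6321)` reproduces the 0.962 reported for k = 240 in Table 2.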


Comparing the two metrics, we observe that the ratio of mixed alarm groups is significantly further from optimal than the group similarity (except for the experiments that did produce a perfect score). This signifies that the number of undesirable alarms within each mixed alarm group is typically low, which makes it easier to separate them manually, especially because the undesirable alarms can be easily isolated by the new sorting capabilities of the tool. But even though the alarm groups may be easily refined by the human experts, it will still take time, and a 0.11 ratio of mixed alarm groups (obtained by the new system with k = 125) might still be rather high, spurring the human expert to decrease the reduction ratio.

The trade-off between a high reduction ratio versus a high group similarity can be difficult to assess, because all alarm groups need to be checked manually to determine the group similarity. Furthermore, the optimal solution to this trade-off might be based on personal preference: a lower reduction ratio distinguishes the alarms with a higher level of detail. We suggest to aim for a perfect group similarity by trying different reduction ratios, each time taking a quick glance at the results, and stopping when around half of the alarm groups seem duplicates at first sight. This way, we arrived quickly at k = 240 with a perfect group similarity.

Although the results are promising, we need to be careful about the interpretation of a “perfect group similarity”: our approach rests on the assumption that each alarm contains only a single root cause. This assumption, described in section 7.2, is a simplification. In reality, our alarm representation is based on a bag-of-events, and events might thus originate from multiple attacks. We made the assumption to make it possible to rely heavily on the description of the main event. But when two attacks happen concurrently, the description of the main event will only refer to one attack (the attack that is the most important, according to the rules file). This means that the alarm that contains two attacks might be grouped together with an alarm containing a single attack, when the other attributes (the distribution of facilities and the sequence of severity scores) are sufficiently similar. In our dataset, we have not encountered examples of such “hidden attacks” inside alarms, but this might become more problematic in larger networks, where concurrent attacks are more probable. When these hidden attacks become problematic, lowering the weight of the description features should help, allowing for more focus on differences in the facilities and severity scores.

This brings us to the next part of the clustering results, the experiments that show the effect of various parameters on the root cause similarity. The results are shown in Table 3. Along with the already shown results, it contains experiments with diverse parameters of the feature preprocessor (respectively the metric for description distances, the dimensionality reduction method for these distances, and the number of description dimensions) and the Clusterer (the weights and the initialization policy).

A significant difference is shown regarding the dimensionality reduction method for the description: determining the distance from each description to n example descriptions that are highly dissimilar, which is our default method, outperforms PCA. This is not surprising, since PCA, being blind to any problem-specific knowledge, fixates on maximizing the variance of the data along its output vectors, while we have an excellent way to determine the most interesting description distances (by using the example descriptions).

The usage of more description values (ten instead of the default five) results in


Description             k      Group similarity   Mixed
Baseline                2314   0.8766             0.35
k = 240                 240    1.0000             0.00
k = 170                 170    0.9868             0.05
Default                 125    0.9723             0.13
Levenshtein distance    125    0.9700             0.13
PCA for descriptions    125    0.9634             0.14
10 description values   125    0.9674             0.13
Weights {1, 1, 1}       125    0.9380             0.31
Weights {100, 1, 1}     125    0.9794             0.11
Refined start           125    0.9616             0.23

TABLE 3: Results of the Clusterer related to the root cause similarity within each alarm group. The columns denote the experiment description, the number of alarm groups (k), the average ratio of alarms having the same root cause as the most frequent root cause of that group (Group similarity), and finally the ratio of alarm groups that contain more than one root cause (Mixed).

worse performance as well. We believe that this can be caused by a high similarity between the additional example descriptions and existing example descriptions, resulting in description values that are correlated. Correlated example descriptions lead to a disproportional focus on the distance to those descriptions, comparable to using some example descriptions twice.

Experiments with various weights for the description values (1, the default 10, or 100), while keeping the weights for the facility and severity values at 1, show the best results when the description weights are at maximum. This is a direct consequence of our assumption that each alarm has only a single root cause, which can be identified by the (description of the) event with the highest severity. As explained earlier in this section, this assumption made sense in our dataset, since there were no concurrent attacks inside alarms that were grouped together with single-attack alarms. For datasets with a higher number of concurrent attacks, we suggest to reduce the weight for the description values and to increase the number of clusters (in order to still obtain a high group similarity).

Using the Levenshtein distance, instead of the Jaro distance, does not make a significant difference. Both are respectable string metrics and behave accordingly. Still, we expected the Levenshtein distance to perform better, because it should be better suited to longer strings, whereas the Jaro distance was developed for strings such as personal names. The reality proves to be recalcitrant, however, and shows similar results for both string metrics, which is a surprising result.

The last parameter of the Clusterer is the initialization policy. The results clearly show that the default, randomized start performs better than the refined start that is suggested by Bradley and Fayyad. We do not have an explanation for these results.

The last experiments of the Clusterer show the effect of various parameters on the similarity of harmless versus intrusive alarms. The results, as shown in Table 4,


Experiment              k      Similarity TP   Similarity FP   TP     FP     Mixed
Baseline                2314   0.9758          0.8848          0.13   0.68   0.19
k = 170                 170    0.9847          0.9928          0.69   0.27   0.04
Default                 125    0.9911          0.9812          0.62   0.30   0.09
Levenshtein distance    125    0.9924          0.9946          0.47   0.46   0.07
PCA for descriptions    125    0.9736          0.9889          0.72   0.18   0.10
10 description values   125    0.9662          0.9758          0.73   0.17   0.10
Weights {1, 1, 1}       125    0.9762          0.9636          0.54   0.26   0.21
Weights {100, 1, 1}     125    0.9822          0.9920          0.66   0.26   0.08
Refined start           125    0.9775          0.9802          0.58   0.30   0.09

TABLE 4: Results of the Clusterer, regarding the true and false positives. The columns denote the experiment description, the number of alarm groups (k), the average ratio of true positives in alarm groups that contain mostly true positives (Similarity TP), a similar column regarding the false positives (Similarity FP), and finally the ratio of alarm groups that only contain true positives (TP), only false positives (FP) and that are mixed (Mixed).

show similar trends11 to the results regarding the root cause similarity. A notable difference is the overall higher group similarity (divided into a value for groups that have mainly true positives and a value for groups that have mainly false positives) with a maximum of above 0.99, whereas the root cause similarity remained under 0.98 for k = 125. This is easily explained, since two different root causes will often both be harmless, or both intrusive.

Two main differences are found in the patterns of the results as shown in Table 4, versus the patterns of the root cause similarity (Table 3). Firstly, the Levenshtein distance results in improved similarity while discriminating between harmless and intrusive alarms. The reason that the Levenshtein distance outperforms the Jaro distance might be that it is better suited for larger strings. Indeed, this reason is equally fitting for the task of discriminating root causes, although it did not live up to those expectations. We cannot think of any reason that either of these tasks is better suited for the Levenshtein distance, and believe that the differences are coincidental, originating from the specific descriptions inside the dataset.

The second difference between the results regarding harmless and intrusive events and those regarding root cause similarity is that the higher description weights worsen the discrimination between harmless and intrusive alarms. This most probably stems from the importance of the facility values when discriminating between intrusive and harmless events.

11An interesting finding about the original OPTOSS is that its ‘false positive similarity’ is significantly lower than its ‘true positive similarity’. This shows that the groups that mainly contain harmless alarms are quite pure (i.e. do not contain many intrusive alarms), and can most often be discarded. While this is good to know, it does not significantly reduce the problem of the large number of alarm groups that need to be assessed, because the overwhelming majority of alarm groups that are created by it contain more intrusive alarms than harmless alarms.


Experiment          Total accuracy   Accuracy of reassigned alarms
Baseline            0.99710          0
Theoretical limit   0.99930          1
Best settings       0.99905          1

TABLE 5: The accuracy of the Classifier using the best settings, in comparison with the baseline and the theoretical limit. The last column, the accuracy of the reassigned alarms, shows how well the alarms are classified that formed the reason to use a Classifier in the first place: alarms similar to alarms of which the group assignment was manually changed.

8.2 Classifier

The main results of the Classifier are presented in Table 5. The Classifier obtained an accuracy close to the theoretical limit. Besides the total accuracy, the Classifier reaches a perfect accuracy on the changed alarms, even for the difficult groups, in which the alarms were split on only a subtle description difference.

Certain mistakes prevented the Classifier from reaching the theoretical limit. These mistakes of the Classifier were consequences of previously unseen combinations of the description, facility and severity values: while the K-means Clusterer just grouped them based on the distance to cluster centers, the Random Forest has learned more complex rules. The ability to form such complex rules is necessary, because the human expert might form alarm groups that are not hyperspherical (i.e. groups whose alarms cannot be distinguished based on the distance to cluster centers).

There are multiple ways to improve the accuracy. For this dataset, we could have reached the theoretical limit by adopting the assignments of the Clusterer for each alarm assigned to an unchanged alarm group (i.e. to an alarm group in which the human expert did not make changes). As such, the Classifier could have focused on the three changed alarm groups only, for which the accuracy was already optimal. This tactic would make the task of the Classifier easier, especially when only a couple of alarm groups have been changed, as was the case in our dataset. When more alarm groups are changed, a more elaborate approach would be possible by training a distinct Classifier for each changed alarm group. Each Classifier would then be aware of only a couple of alarm groups (only those groups into which some alarms of the original alarm groups were assigned by the human expert). It will thus never assign an alarm to any of the other alarm groups, as such preventing certain mistakes from happening and resulting in a better accuracy.
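The first of these improvements amounts to a simple routing rule. A sketch, in which `classify` is a hypothetical stand-in for the trained Random Forest and `changed_groups` holds the identifiers of the groups the expert modified:

```python
def route(alarm, cluster_assignment, changed_groups, classify):
    """Return the final group for `alarm`: adopt the Clusterer's
    assignment when it points to an untouched group, and consult the
    learned Classifier only for alarms landing in changed groups."""
    if cluster_assignment not in changed_groups:
        return cluster_assignment      # expert made no changes here
    return classify(alarm)             # defer to the learned refinement
```

By construction this rule can never misassign an alarm of an unchanged group, which is why it would have reached the theoretical limit on our dataset.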

We did not implement these methods because the accuracy was close to optimal already, and because these misclassifications can already be avoided by retraining the Classifier periodically. Moreover, a slightly better accuracy would not necessarily imply a better result: it is likely that some of the Classifier’s mistakes were caused by some unseen alarms that are, in fact, assigned into an alarm group with the correct root cause. Such mistakes are likely, because alarm groups that have the same root cause do most often share similar attribute values. In short, we believe that the Classifier is able to obtain very acceptable results, especially when it is periodically retrained.


(A) Maximum depth

(B) Maximum number of features

(C) Size of the training set

FIGURE 7: The accuracy of the Classifier while varying two parameters of the Random Forest algorithm: (A) the maximum depth of each tree, and (B) the maximum number of features per tree; and (C) the accuracy versus the size of the training set. Note that the y axes are broken.


As for the Clusterer, we conducted several experiments while varying the parameters of the Classifier. The results are visualized in Figure 7. First of all, the accuracy peaks when the maximum depth of each tree is around fifteen, as shown in Figure 7a. The max depth influences the accuracy of each single tree, thereby potentially inducing an overfit. The shape of the graph is therefore as expected: when the maximum depth is too low, the trees become too weak, but when it is too high, the Random Forest overfits. Still, the accuracy remains high even when the maximum depth grows to fifty. This means that overfitting is not a big problem, which is, again, as expected: the training set is sufficiently large to make generalization to unseen alarms seldom necessary.

The accuracy is plotted against the maximum number of features per tree in Figure 7b. As long as the maximum number of features is at least three, the accuracy remains largely constant, although the default value of four (the square root of the number of features) results in slightly lower accuracies. We cannot explain these fluctuations: they reflect this particular dataset and should be determined empirically. It is, however, interesting to see that a higher number of features (10 out of 16) still gives a good accuracy: this again shows that overfitting is not a large issue on this dataset.
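The same kind of empirical sweep applies to the number of features considered per split. Again a hedged sketch on synthetic data, not the thesis experiment itself; in scikit-learn the parameter is `max_features`, and `4` corresponds to the square-root default for 16 features:

```python
# Hedged sketch: compare a few max_features settings with cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 16-feature data, standing in for the private alarm dataset.
X, y = make_classification(n_samples=1500, n_features=16, n_informative=8,
                           random_state=0)

results = {}
for m in (1, 3, 4, 10):  # 4 == int(sqrt(16)), the usual default for classification
    clf = RandomForestClassifier(n_estimators=100, max_features=m, random_state=0)
    results[m] = cross_val_score(clf, X, y, cv=3).mean()

for m, acc in sorted(results.items()):
    print(f"max_features={m:>2}: mean CV accuracy={acc:.3f}")
```

Which setting wins is dataset-specific, which is exactly why the text recommends determining it empirically.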

Lastly, the accuracy is plotted against the size of the training set in Figure 7c. A larger training set should improve the accuracy, which is reflected in the graph. This graph underlines the point we made earlier: the Classifier should be retrained periodically, in order to keep the ratio of training and test instances balanced.
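The training-set-size experiment can be sketched as a simple learning curve: hold out a fixed test set and train on growing fractions of the remaining data. As before, the data and fractions below are assumptions chosen for illustration, not the thesis setup:

```python
# Hedged sketch of a learning curve: accuracy versus training-set fraction.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=16, n_informative=8,
                           random_state=0)
# Fixed test set; the training pool is subsampled at several fractions.
X_pool, X_te, y_pool, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

curve = {}
for frac in (0.1, 0.5, 1.0):
    n = max(10, int(len(X_pool) * frac))
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_pool[:n], y_pool[:n])
    curve[frac] = accuracy_score(y_te, clf.predict(X_te))

for frac, acc in sorted(curve.items()):
    print(f"train fraction={frac:.1f}: accuracy={acc:.3f}")
```

As new alarms keep arriving, the trained model covers an ever-smaller fraction of the data, which is the learning-curve argument for periodic retraining.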

8.3 Performance

The performance of both the Clusterer and the Classifier was excellent. Training each component was accomplished in a matter of minutes, whereafter the clustering or classification of new alarms was almost instant. These results reflect the relatively low computational complexity of the used algorithms, as explained in section 6.7.

In contrast to the performance of the machine learning algorithms, the performance of the database proved to be troublesome. After each training session, the alarms needed to be updated, which took hours. This was acceptable (although inconvenient) for our dataset, but can be problematic for larger networks. Although the database queries were fairly optimized, time savings can probably still be obtained by improving the queries or by using other database techniques, which was out of scope for this thesis.
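One common remedy for this kind of bottleneck is to batch the per-alarm updates into a single transaction instead of committing row by row. The sketch below is hypothetical: it uses SQLite and invented table and column names (`alarms`, `group_id`), not the OPTOSS schema, purely to illustrate the batching idea:

```python
# Hypothetical sketch: batch group reassignments with executemany in one
# transaction, avoiding one commit per updated alarm.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE alarms (id INTEGER PRIMARY KEY, group_id INTEGER)")
conn.executemany("INSERT INTO alarms (id, group_id) VALUES (?, ?)",
                 [(i, 0) for i in range(10_000)])

# New assignments produced by a training run: (new_group_id, alarm_id) pairs.
assignments = [(i % 240, i) for i in range(10_000)]

with conn:  # a single transaction instead of 10,000 autocommits
    conn.executemany("UPDATE alarms SET group_id = ? WHERE id = ?", assignments)

n_groups = conn.execute(
    "SELECT COUNT(DISTINCT group_id) FROM alarms").fetchone()[0]
print(n_groups)  # 240 distinct alarm groups after the batched update
```

Whether this closes the gap on a production database depends on the actual schema and engine, but transaction batching and indexed key lookups are the usual first steps.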

9 Conclusion

In this thesis, an existing intrusion detection system (OPTOSS) was augmented with two alarm mining components, in order to group similar alarms together. The OPTOSS collects events from network logs and raises alarms, each consisting of all consecutive events within a time span. We characterized each alarm by a root cause: the process or person that caused the most important events. This root cause similarity was used to assess the accuracy of the first alarm mining component, the alarm Clusterer, which clusters the alarms based on three features, of which the description (the text of the complete log entry) of the main event was most informative. This way we obtained alarm groups containing alarms that have the same root cause, so that the same action should be undertaken whenever an alarm from a known alarm group is encountered.

Furthermore, we made it possible to change alarm assignments, in order to allow human experts to incorporate domain knowledge. An alarm Classifier was then trained on the resulting dataset (created by the alarm groups of the Clusterer and the manual changes). This way, the Classifier creates the final group assignments according to the wishes of the human experts.

The experiments were conducted on a private dataset, and the results were excellent compared to the existing OPTOSS. The results of the alarm Clusterer were expressed in terms of the average group similarity: the average ratio of alarms that have the most prevalent root cause of their alarm group. We obtained a perfect group similarity using only 240 alarm groups, while the original system reaches a group similarity of only 88% by using almost ten times as many alarm groups. Even when using 125 alarm groups, the group similarity remained at almost 98%.
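The average group similarity metric defined above is straightforward to compute. A minimal implementation, on toy labels rather than the thesis dataset:

```python
# Minimal implementation of the average group similarity metric: per alarm
# group, the fraction of alarms sharing the group's most prevalent root cause,
# averaged over all (non-empty) groups.
from collections import Counter

def average_group_similarity(groups):
    """groups: list of alarm groups, each a list of root-cause labels."""
    ratios = [Counter(g).most_common(1)[0][1] / len(g) for g in groups if g]
    return sum(ratios) / len(ratios)

# Toy example with invented root-cause names:
groups = [
    ["scan", "scan", "scan"],          # perfectly homogeneous -> 1.0
    ["brute", "brute", "misconfig"],   # 2/3 share the majority cause
]
print(average_group_similarity(groups))  # (1.0 + 2/3) / 2, about 0.833
```

A group similarity of 1.0, as reported for the 240-group configuration, means every group is dominated entirely by a single root cause.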

The alarm Classifier was able to correctly predict the alarm group for almost all alarms, and obtained an accuracy of 99.91%, close to the theoretical limit.

These results are essential to the usability of the OPTOSS, by allowing the human expert to assess only a single alarm from each of the few alarm groups. Once the root cause is determined, the expert can be confident that the remaining (and future) alarms of the alarm group will share the same root cause. This not only saves time while assessing all alarms, but also makes it reasonable to use automated scripts to handle known intrusions (a function that already existed in the OPTOSS). In this way, the results of this thesis make it much less time-consuming to use the OPTOSS, an important outcome since the OPTOSS is unique in its capability to handle large streams of network data in real time.

For future research, it would be interesting to test the system on different datasets in order to assess the generality of the obtained results. In particular, it would be interesting to delve into the effects of concurrent attacks, which were rare in our dataset. Alarm correlation would be another interesting approach, generating attack scenarios from correlated alarms. Furthermore, the OPTOSS currently creates alarms over events that are aggregated per second. It would be interesting to extend the current detector by raising alarms for long-term intrusions that generate minutes or even hours of network activity. Lastly, the database proved to be the performance bottleneck of the new components. Using more advanced techniques or tools could probably reduce this issue.

But for now, we can rest assured, knowing that the OPTOSS can tirelessly protect large networks against intruders and misconfigurations, along with human experts who are finally able to keep up.
