Data Mining for Cyber Threat Analysis Vipin Kumar Army High Performance Computing Research Center...

Data Mining for Cyber Threat Analysis

Vipin Kumar

Army High Performance Computing Research Center

Department of Computer Science

University of Minnesota

http://www.cs.umn.edu/~kumar

Project Participants: A. Lazarevic, V. Kumar, J. Srivastava, H. Ramanni, L. Ertoz, M. Joshi, E. Eilertson, S. Ketkar

Mining Large Data Sets - Motivation

Examples:
• Computational simulations
• Information Assurance & Network intrusion
• Sensor networks
• Homeland Defense

There is often information “hidden” in the data that is not readily evident.

Human analysts may take weeks to discover useful information.

Much of the data is never analyzed at all.

Computational Simulations

Network Intrusion Detection

Sensor Networks

Data Mining for Homeland Defense

“Data mining and data warehousing are part of a much larger FBI plan to … discover patterns and relationships that indicate criminal activity” (network intrusions, cyber attacks, terroristic calls, …). Federal Computer Week, June 3, 2002

FBI Director Robert Mueller: “New information technology is critical to conducting business in a different way, critical to analyzing and sharing information on a real time basis”

Homeland Defense: Key issues

Information fusion from diverse data sources including intelligence agencies, law enforcement, profile …

Data mining on this information base to uncover latent models and patterns

Visualization and display tools for understanding the relationships between persons, events and patterns of behavior

[Figure: system diagram with the components Cultural Data, Intelligence Data, Law Enforcement Data, Information Fusion, Event Recognition, Association Analysis, Threat Predictor, and Threat Visualizer]

Information Assurance: Introduction

As the cost of information processing and Internet accessibility falls, more and more organizations are becoming vulnerable to potential cyber threats: “unlawful attacks and threats of attack against computers, networks, and the information stored therein when done to intimidate or coerce a government or its people” (D. Denning)

Incidents Reported to CERT/CC

[Chart: incidents reported to CERT/CC per year, 1990 to 2001; vertical axis from 0 to 60,000]

Information Assurance: Intrusion Detection

Intrusion Detection: detecting a set of actions that compromise the integrity, confidentiality, or availability of information resources
• Viruses and Internet worms
• Theft of classified information from DOD computers

Problem of identifying individuals
• who are using computers without authorization
• who have legitimate access but are abusing their privileges

Intrusion Detection System (IDS)
• combination of software and hardware that attempts to perform intrusion detection
• raises the alarm when a possible intrusion happens

Data Mining on Intelligence Databases

Purpose:
• Develop methods to identify potential threats
• Mine intelligence database

Example: Forecasting Militarized Interstate Disputes (FORMIDs).
• Data: social, political, economic, geographical information for pairs of countries
  – ratio of military capability
  – democracy index
  – level of trade
  – distance
• Predict: the likelihood of militarized interstate disputes (MIDs).

Overall Objective: predict likely instabilities involving pairs of countries.

Collaborators: Sean O’Brien, Center of Army Analysis (CAA), Kosmo Tatalias (NCS).

Data Mining in Commercial World

[Figure: example decision tree over the attributes Employed, # of years, # of years in school, and Married, with Yes/No outcomes]

• Classification / Predictive Modeling (Direct Marketing, Fraud Detection)
• Clustering (Market Segmentation)
• Association Patterns (Marketing / Sales Promotions)

TID  Items
1    Bread, Milk
2    Beer, Diaper, Bread, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Bread, Diaper, Milk

{Diaper, Milk} → {Beer}
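
As a rough illustration of how an association pattern is scored, the sketch below computes support and confidence for the candidate rule {Diaper, Milk} → {Beer} over the five transactions listed above. The function names are hypothetical and chosen for this example only; the slides do not specify an implementation.

```python
# The five market-basket transactions from the slide (TID -> items).
transactions = {
    1: {"Bread", "Milk"},
    2: {"Beer", "Diaper", "Bread", "Eggs"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Bread", "Diaper", "Milk"},
}


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in transactions.values() if itemset <= items)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    both = support_count(antecedent | consequent, transactions)
    ante = support_count(antecedent, transactions)
    support = both / len(transactions)
    confidence = both / ante if ante else 0.0
    return support, confidence


support, confidence = rule_metrics({"Diaper", "Milk"}, {"Beer"}, transactions)
print(f"support={support:.2f}, confidence={confidence:.2f}")
# prints: support=0.40, confidence=0.67
```

Two of the five transactions contain all three items, and three contain both Diaper and Milk, which gives the support of 2/5 and the confidence of 2/3.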

Given its success in commercial applications, data mining holds great promise for analyzing large data sets.


Key Technical Challenges

Large data size
• Gigabytes of data are common

High dimensionality
• Thousands of dimensions are possible

Spatial/temporal nature of the data
• Data points that are close in time and space are highly related

Skewed class distribution
• Interesting events are very rare
• Looking for the “needle in a haystack”

Data Fusion & Data Preprocessing
• Data from multiple sources
• Data of multiple types (text, images, voice, … )
• Data cleaning – missing value imputation, scaling, mismatch handling

“Mining needle in a haystack. So much hay and so little time”

Intrusion Detection Research at AHPCRC

Misuse Detection - Predictive models
• Mining needle in a haystack – models must be able to handle skewed class distribution, i.e., class of interest is much smaller than other classes.
• Learning from data streams - intrusions are sequences of events

Anomaly and Outlier Detection
• Able to detect novel attacks through outlier detection schemes
• Detect deviations from “normal” behavior as anomalies

Construction of useful features that may be used in data mining

Modifying signature based intrusion detection (SNORT) systems to incorporate anomaly detection algorithms

Summer Institute Projects
• Implementing Anomaly/Outlier detection algorithms
• Investigating algorithms for classification of rare classes
• Visualization tool for monitoring network traffic and suspicious network behavior

Learning from Rare Class

Key Issue: Examples from rare classes get missed in standard data mining analysis

Over-sampling the small class or under-sampling the large class

PNrule and related work [Joshi, Agarwal, Kumar, SIAM 2001, SIGMOD 2001]

RareBoost [Joshi, Agarwal, Kumar, ICDM 2001, KDD 2002]

SMOTEBoost [Lazarevic, et al, in review]

Classification based on association - add frequent items as “meta-features” to original data set

PN-rule Learning and Related Algorithms*

P-phase:
• cover most of the positive examples with high support
• seek good recall

N-phase:
• remove false positives (FP) from examples covered in P-phase
• N-rules give high accuracy and significant support

Existing techniques can possibly learn erroneous small signatures for absence of C

PNrule can learn strong signatures for presence of NC in N-phase

[Figure: two panels showing regions of class C and class NC]

* SIAM 2001, SIGMOD 2001, ICDM 2001, KDD 2002

SMOTE and SMOTEBoost

SMOTE (Synthetic Minority Oversampling Technique) generates artificial examples from minority (rare) class along the boundary line segment

Generalization of over-sampling technique

Combination of SMOTE and boosting further improves the prediction performance on rare classes
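
The interpolation idea behind SMOTE can be sketched as follows. This is a simplified illustration and not the implementation used for the reported experiments. It picks base points at random, rather than generating a fixed number of synthetic examples per minority point, and it does not include the boosting step of SMOTEBoost. The function name `smote` and its parameters are assumptions made for this example.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority-class points by interpolating between a
    minority point and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class: four points on the corners of the unit square.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority_points, n_synthetic=10, k=2)
```

Every synthetic point lies on a line segment between two existing minority points, so the minority region is filled in rather than duplicated, which is what distinguishes this from plain over-sampling.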

SMOTE and SMOTEBoost Results

Experimental Results on modified KDDCup 1999 data set

Final values for recall, precision and F-value for U2R class on KDDCup-99 intrusion data set

Method                                Recall   Precision  F-value
Standard RIPPER                       57.35    84.78      68.42
SMOTE + RIPPER, Nu2r=100, Nr2l=100    61.76    86.60      72.1
SMOTE + RIPPER, Nu2r=300, Nr2l=100    80.15    85.16      82.58
SMOTE + RIPPER, Nu2r=500, Nr2l=100    75.74    88.03      81.42
Standard Boosting                     80.147   90.083     84.83
SMOTE-Boost, Nu2r=100, Nr2l=100       87.5     87.5       87.5
SMOTE-Boost, Nu2r=300, Nr2l=100       88.24    90.91      89.55
SMOTE-Boost, Nu2r=500, Nr2l=100       87.5     92.97      90.15

Final values for recall, precision and F-value for R2L class on KDDCup-99 intrusion data set

Method                                Recall   Precision  F-value
Standard RIPPER                       75.98    96.72      85.11
SMOTE + RIPPER, Nu2r=100, Nr2l=100    87.94    94.47      86.90
SMOTE + RIPPER, Nu2r=300, Nr2l=100    77.45    92.03      84.11
SMOTE + RIPPER, Nu2r=500, Nr2l=100    71.75    89.38      79.06
Standard Boosting                     95.46    96.83      96.14
SMOTE-Boost, Nu2r=100, Nr2l=100       97.02    96.54      96.78
SMOTE-Boost, Nu2r=300, Nr2l=100       97.07    95.25      96.15
SMOTE-Boost, Nu2r=500, Nr2l=100       97.38    95.84      96.72

Classification Based on Associations

Current approaches use confidence-like measures to select the best rules to be added as features into the classifiers.
• This may work well only if each class is well-represented in the data set.

For the rare class problems, some of the high recall itemsets could be potentially useful, as long as their precision is not too low.

Our approach:
• Apply frequent itemset generation algorithm to each class.
• Select itemsets to be added as features based on precision, recall and F-Measure.
• Apply classification algorithm, e.g., RIPPER, to the new data set.
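
The selection step can be sketched as below, assuming candidate itemsets have already been generated (frequent itemset mining itself is not shown). Each itemset is scored as a predictor of the target class by precision, recall, and F-measure, and the top-ranked itemsets become binary meta-features. The record layout, item names, and function names are hypothetical and chosen for illustration only.

```python
def itemset_scores(itemset, records, target_class):
    """Precision, recall and F-measure of 'record contains itemset' as a
    predictor of target_class. records is a list of (items, label) pairs."""
    covered = [label for items, label in records if itemset <= items]
    true_pos = sum(1 for label in covered if label == target_class)
    class_size = sum(1 for _, label in records if label == target_class)
    precision = true_pos / len(covered) if covered else 0.0
    recall = true_pos / class_size if class_size else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def select_by_f_measure(candidates, records, target_class, top_n):
    """Keep the top_n candidate itemsets ranked by F-measure."""
    ranked = sorted(
        candidates,
        key=lambda s: itemset_scores(s, records, target_class)[2],
        reverse=True,
    )
    return ranked[:top_n]


def add_meta_features(records, selected):
    """Append one 0/1 feature per selected itemset to every record."""
    return [
        (items, [int(s <= items) for s in selected], label)
        for items, label in records
    ]


# Hypothetical toy records: (set of attribute=value items, class label).
records = [
    ({"service=telnet", "root_shell=1"}, "u2r"),
    ({"service=telnet", "root_shell=1"}, "u2r"),
    ({"service=telnet", "root_shell=0"}, "normal"),
    ({"service=http", "root_shell=0"}, "normal"),
    ({"service=http", "root_shell=0"}, "normal"),
    ({"service=ftp", "root_shell=1"}, "normal"),
]
candidates = [
    frozenset({"service=telnet"}),
    frozenset({"root_shell=1"}),
    frozenset({"service=telnet", "root_shell=1"}),
]
best = select_by_f_measure(candidates, records, "u2r", top_n=1)
augmented = add_meta_features(records, best)
```

In this toy data each single item covers both u2r records plus one normal record (precision 2/3, recall 1), while the combined itemset covers only the two u2r records, so it ranks first by F-measure. The augmented records would then be passed to a rule learner such as RIPPER.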

Experimental Results (on modified KDD Cup 1999 data)

For rare classes, rules ordered according to F-Measure produce the best results.

Original RIPPER
Class    Precision  Recall   F-Measure
dos      99.50%     99.16%   99.33%
u2r      84.78%     57.35%   68.42%
r2l      96.72%     75.98%   85.11%
probe    97.22%     90.27%   93.62%
normal   95.52%     99.30%   97.37%

RIPPER with high Precision rules
Class    Precision  Recall   F-Measure
dos      99.51%     98.95%   99.23%
u2r      90.09%     73.53%   80.97%
r2l      92.90%     75.88%   83.53%
probe    96.96%     96.69%   96.83%
normal   96.29%     98.88%   97.57%

RIPPER with high Recall rules
Class    Precision  Recall   F-Measure
dos      99.53%     99.65%   99.59%
u2r      88.57%     68.38%   77.18%
r2l      96.54%     78.91%   86.84%
probe    97.40%     94.89%   96.13%
normal   96.80%     99.25%   98.01%

RIPPER with high F-measure rules
Class    Precision  Recall   F-Measure
dos      99.48%     99.79%   99.63%
u2r      94.17%     83.09%   88.28%
r2l      96.14%     84.31%   89.84%
probe    98.25%     92.11%   95.08%
normal   97.18%     99.26%   98.21%

Anomaly and Outlier Detection

Main Assumptions
• All anomalous activities need closer inspection
• Determine “normal activity profile” and flag an alarm when the state differs from the “normal profile”
• Expert analyst examines suspicious activity to make final determination whether activity is indeed an intrusion

Drawbacks
• Possible large number of false alarms and not recognizing attacks

Supervised (with access to normal data) vs. Unsupervised (with NO access to normal data) determining “normal behavior”
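
A minimal sketch of the supervised setting follows, assuming a single numeric feature and a normal profile summarized by its mean and standard deviation. The three-standard-deviation threshold, the feature (connections per minute), and the function names are assumptions for illustration; the slides do not specify how the normal profile is represented.

```python
import statistics


def build_profile(normal_values):
    """Summarize known-normal observations as (mean, standard deviation)."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)


def is_anomalous(value, profile, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations
    away from the mean of the normal profile."""
    mean, std = profile
    if std == 0:
        return value != mean
    return abs(value - mean) / std > threshold


# Hypothetical connections-per-minute counts observed during normal use.
normal_traffic = [10, 12, 11, 9, 13, 10, 11, 12]
profile = build_profile(normal_traffic)

alarm_on_burst = is_anomalous(40, profile)   # far from the profile
alarm_on_typical = is_anomalous(12, profile)  # within normal variation
```

Lowering the threshold catches more attacks at the cost of more false alarms, which is the trade-off listed under Drawbacks. Flagged activity would still go to an analyst for the final determination.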

[Figure: normal profile versus anomalous activities, with regions marked as false alarms and missed attacks]

Outlier Detection

Outlier is defined as a data point which is very different from the rest of the data (“normal data”) based on some measure of similarity

Outlier detection approaches:
• Statistics based approaches
• Distance based techniques
• Clustering based approaches
• Density based schemes

Distance and density based schemes

Distance based approaches (NN approach) - Outliers are points that do not have enough neighbors

Density based approach (LOF approach) finds outliers based on the densities of local neighborhoods

Concept of locality becomes difficult to define due to data sparsity in high dimensional space

Clustering based approaches define outliers as points which do not lie in clusters

Implicitly define outliers as background noise
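
The two scoring schemes can be sketched as follows. This is a simplified illustration under stated assumptions, not the code used in the experiments. The NN score here is the distance to the k-th nearest neighbour, and the LOF computation uses exactly k neighbours per point (ignoring ties at the k-distance) and assumes no point has k or more exact duplicates. Function names are hypothetical.

```python
import math


def knn(points, i, k):
    """(distance, index) pairs for the k nearest neighbours of points[i]."""
    dists = sorted(
        (math.dist(points[i], p), j) for j, p in enumerate(points) if j != i
    )
    return dists[:k]


def nn_scores(points, k):
    """Distance-based score: distance to the k-th nearest neighbour."""
    return [knn(points, i, k)[-1][0] for i in range(len(points))]


def lof_scores(points, k):
    """Density-based score: ratio of the neighbours' local reachability
    density to the point's own (values well above 1 suggest an outlier)."""
    n = len(points)
    neigh = [knn(points, i, k) for i in range(n)]
    k_dist = [nb[-1][0] for nb in neigh]
    lrd = []
    for i in range(n):
        # Reachability distance from i to each neighbour j.
        reach = [max(k_dist[j], d) for d, j in neigh[i]]
        lrd.append(len(reach) / sum(reach))
    return [
        sum(lrd[j] for _, j in neigh[i]) / (len(neigh[i]) * lrd[i])
        for i in range(n)
    ]


# Toy data: a tight group of five points and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
nn = nn_scores(points, k=2)
lof = lof_scores(points, k=2)
most_outlying_nn = max(range(len(points)), key=lambda i: nn[i])
most_outlying_lof = max(range(len(points)), key=lambda i: lof[i])
```

Both schemes rank the distant point highest on this toy data. They tend to differ when clusters have different densities, because LOF compares each point only with its local neighbourhood.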

Outlier Detection Results (on DARPA’98 data)

Detection rate (false alarm rate was fixed to 1%)

Attack type          NN approach      LOF approach
Bursty attacks       15/19 (78.9%)    14/19 (73.7%)
Non-bursty attacks   9/25 (36.0%)     14/25 (56.0%)

[Figure: the score values assigned to network connections from bursty attacks; connection score (0 to 1) plotted against connection number (0 to 100) for the LOF approach and the NN approach]

Modifying SNORT

SNORT contains a simple anomaly detector, SPADE (Statistical Packet Anomaly Detection Engine)
• SPADE only compares the statistics of packets

Our approach
• Integrate our implemented outlier detection schemes into SNORT, since it is open source
• Improve the detection of novel intrusions and suspicious behavior by using sophisticated outlier detection schemes

SNN Clustering – Finding Patterns in Noisy Data

• Finds clusters of arbitrary shapes, sizes and densities

• Handles high dimensional data

• Density invariant

• Built-in noise removal

Topics from Los Angeles Times (Jan. 1989)

• 3204 articles, 31472 words (LA Times, January 1989)

• afghanistan embassi guerrilla kabul moscow rebel soviet troop ussr withdraw

• chancellor chemic export german germani kadafi kohl libya libyan plant poison weapon west

• ahead ball basket beate brea chri coach coache consecut el final finish foul fourth free game grab half halftim hill host jef lead league led left los lost minut miss outscor overal plai player pointer quarter ralli rank rebound remain roundup scor score scorer season shot steal straight streak team third throw ti tim trail victori win won

• ab bengal bowl cincinnati craig dalla denver esiason field football francisco giant jerri joe miami minnesota montana nfl oppon pass pittsburgh quarterback rice rush super table taylor terri touchdown yard

Topics from FBI web site

• uss rocket cabin aircraft fuel hughes twa missile redstone
  – (1999 - Congressional Statement - TWA 800)

• theft art stolen legislation notices recoveries
  – (FBI - Art Theft Program)

• signature ink writings genuine printing symbols
  – (forensic science communications)

• forged memorabilia authentic bullpen
  – (FBI Major Investigation - operation bullpen)

• arabia saudi towers dhahran
  – (June 25 1996 bombing of the Khobar Towers military housing complex in Dhahran Kingdom of Saudi Arabia REWARD)

• classified philip quietly ashcroft hanssen drop cia affidavit tenet dedication compromised kgb successor helen volunteered
  – (Agent Robert Philip Hanssen - espionage)

Topics from FBI web site

• afghanistan ramzi bin yousef islamic jihad egyptian bombings egypt pakistan hamas yemen headquartered usama laden kenya tanzania nairobi embassies dar salaam rahman mohamed abdel affiliated camps opposed deserve legat enemies vigilance plots casualties

• enterprise asian chinese enterprises korean vietnamese italian cartels heroin cosa nostra sicilian lcn

• firearms firearm bullets ammunition cartridges

• perpetrating bioterrorism responders credible exposed biological articulated covert hoax wmd assumes

Future Applications

Unclassified telephone calls data to be provided by INSCOM

Goal: to determine a terrorist in a haystack

Nodes: people

Edges: telephone calls (date / time / duration)
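
One possible in-memory representation of such a call graph is sketched below, with people as nodes and each call stored as an edge record carrying its start time and duration. The class names, field names, and sample identifiers are hypothetical; the slides do not describe the data format to be provided.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Call:
    """One edge of the graph: a single telephone call."""
    caller: str
    callee: str
    start: datetime
    duration_s: int


class CallGraph:
    """Undirected multigraph: each person maps to the calls they took part in."""

    def __init__(self):
        self.calls_by_person = defaultdict(list)

    def add_call(self, call):
        self.calls_by_person[call.caller].append(call)
        self.calls_by_person[call.callee].append(call)

    def contacts(self, person):
        """Everyone who shared at least one call with `person`."""
        return {
            c.callee if c.caller == person else c.caller
            for c in self.calls_by_person.get(person, [])
        }

    def total_duration(self, a, b):
        """Total seconds of calls between persons a and b."""
        return sum(
            c.duration_s
            for c in self.calls_by_person.get(a, [])
            if {c.caller, c.callee} == {a, b}
        )


# Hypothetical sample data with anonymous identifiers.
graph = CallGraph()
graph.add_call(Call("A", "B", datetime(2002, 6, 3, 9, 0), 120))
graph.add_call(Call("B", "A", datetime(2002, 6, 3, 17, 30), 60))
graph.add_call(Call("A", "C", datetime(2002, 6, 4, 8, 15), 30))
```

Keeping each call as a separate edge preserves the date, time, and duration, so temporal patterns remain available to later analysis instead of being collapsed into a single weight.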

Conclusions

Predictive models specifically designed for rare class can help in improving the detection of small attack types

Simple outlier detection approaches appear capable of detecting anomalies

Clustering based approaches show promise in identifying novel attack types

Integrating data mining techniques into SNORT should improve the detection rate

Data Mining Process

• Data mining – “non-trivial extraction of implicit, previously unknown and potentially useful information from data”


Modified KDDCup 1999 Data Set

• KDDCup 1999 data is based on DARPA 1998 data set

• Remove duplicates and merge new train and test data sets

• Sample 69,980 examples from the merged data set
  – Sample from neptune and normal subclass. Other subclasses remain intact.

• Divide in equal proportion to training and test sets
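
The preparation steps above might be sketched as follows. The slides do not say how the sample was drawn or how the split was made, so the per-record sampling fraction, the random half-and-half split, and the record layout (a tuple whose last field is the subclass label) are all assumptions for illustration.

```python
import random


def prepare(train, test, sampled_subclasses, keep_fraction, seed=0):
    """Merge two record lists, drop exact duplicates, thin out the listed
    subclasses, and split the result into two halves."""
    rng = random.Random(seed)
    # Merge and remove exact duplicates while keeping first-seen order.
    merged = list(dict.fromkeys(train + test))
    # Sample only from the listed subclasses; keep all other records.
    kept = [
        rec for rec in merged
        if rec[-1] not in sampled_subclasses or rng.random() < keep_fraction
    ]
    rng.shuffle(kept)
    half = len(kept) // 2
    return kept[:half], kept[half:]


# Hypothetical toy records: (feature value, subclass label).
old_train = [(1, "neptune"), (2, "normal"), (3, "rootkit")]
old_test = [(3, "rootkit"), (4, "normal"), (5, "smurf")]

new_train, new_test = prepare(
    old_train, old_test, {"neptune", "normal"}, keep_fraction=1.0
)
```

With `keep_fraction=1.0` nothing is dropped except the duplicated record. A smaller fraction would reduce only the neptune and normal subclasses, leaving the rare attack subclasses intact.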


DARPA 1998 Data Set

DARPA 1998 data set (prepared and managed by MIT Lincoln Lab) includes a wide variety of intrusions simulated in a military network environment

• 9 weeks of raw TCP dump data
  – 7 weeks for training (5 million connection records)
  – 2 weeks for testing (2 million connection records)

• Connections are labeled as normal or attacks (4 main categories of attacks, 38 attack types)
  – DOS: Denial Of Service
  – Probe: e.g., port scanning
  – U2R: unauthorized access to root privileges
  – R2L: unauthorized remote login to machine

• Two types of attacks
  – Bursty attacks: involve multiple network connections
  – Non-bursty attacks: involve single network connections


Terrorist Threat Analyzer & Predictor (T-TAP)

Operational Capability:
• Ability to match data from multiple sources, resolving structural and semantic conflicts.
• Ability to recognize events and behaviors, each of whose (partial) information is available in different data streams.
• Ability to identify latent associations between suspects and their linkages to events/behaviors.
• Capability to predict threatening activities and behaviors with high probability.
• Ability to visualize interesting and significant events and behaviors.

Proposed Technical Approach: New Effort

Key Technologies:
• High dimensional data clustering – METIS.
• Spatio-temporal change point detection.
• Association analysis and belief revision.
• High dimensional classification – SVM, NN, boosting.

Task List:
• T1: Develop data fusion algorithms to match diverse intelligence, law enforcement, and cultural data.
• T2: Develop event & behavior recognition algorithms across multiple, multi-media, data streams.
• T3: Association analysis algorithms to determine hidden connections between suspects, events and behaviors; develop networks of associations.
• T4: Predictive models of threatening events and behaviors.
• T5: Interestingness & relevance based visualization models of significant events and behaviors.

Rough Order of Magnitude Cost and Schedule:
• Tasks 1, 2, 3 will each proceed in parallel for the first 12 months, with version 1 released at the end of 6 months. At this point tasks 4, 5 will start and proceed in parallel. Cross feedback across various tasks will lead to final, refined tools at the end of the 18 month period.

Deliverables:
• Database schema and structures to store terrain info.
• Software implementation of concealed cavities prediction model.
• User manuals, test reports, database schema
• Quarterly technical and status reports, and final report.

Corporate Information:
• Vipin Kumar (POC)
• Army Research Center, University of Minnesota, 1100 Washington Avenue SE, Minneapolis, MN 55455
• Phone (612)626-8095; E-mail: [email protected]

AHPCRC, University of Minnesota

[Figure: T-TAP System diagram with the components Cultural Data, Intelligence Data, Law Enforcement Data, Information Fusion, Event Recognition, Association Analysis, Threat Predictor, and Threat Visualizer]

Distributed Virtual Integrated Threat Analysis Database (DVITAD)

Operational Capability:
• Global integrated view across multiple databases: integrated schema, global dictionary and directory, common ontology
• Threat analysis object repository
  – comprehensive suspect dossier
  – temporal activity tracks
  – association networks
• Interactive database exploration
  – Field and value based querying
  – Approximate match based retrieval

Proposed Technical Approach: New Effort

Key Technologies:
• Semantic object matching
• Clustering large datasets – METIS
• Latent/Indirect association analyzer

Task List:
• T1: Data normalization through the development of wrappers/connectors for various databases.
• T2: Integrated schema creation to model information found in all databases.
• T3: Matching entities (suspects, events, etc.) across multiple data sources; resolving conflicting attribute values for entities across databases.
• T4: Clustering of suspect profiles.
• T5: Building networks of hidden associations between suspects.
• T6: Constructing temporal activity tracks of events and linkage of suspects to the tracks.

Rough Order of Magnitude Cost and Schedule:
• Tasks 1 and 2 will each proceed in parallel for the first 12 months, with version 1 released at the end of 6 months. At this point tasks 3,4,5,6 will start and proceed in parallel. Cross feedback across various tasks will lead to final, refined tools at the end of the 18 month period.

Deliverables:
• Global schema, dictionary, and directory for the integrated database.
• Software that realizes the virtual integrated DB view.
• User manuals, test reports, database schema
• Quarterly technical and status reports, and final report.

Corporate Information:
• Vipin Kumar (POC)
• Army Research Center, University of Minnesota, 1100 Washington Avenue SE, Minneapolis, MN 55455
• Phone (612)626-8095; E-mail: [email protected]

AHPCRC, University of Minnesota

[Figure: DVITAD diagram with the components Database Connectors; Information integration (matching, clustering, profiles, networks, tracks); Virtual Integrated Database; and Threat Analysis Tools]
