Network Attack Analysis via K-Means Clustering
Dept. of Computer Science
B. Thomas Golisano College of Computing and Information Sciences
- By Team Cinderella
Chandni ([email protected])
Priyanka ([email protected])
10/27/2015
CONTENTS
• Recap of project overview
• Analysis of Research Paper 1
• Analysis of Research Paper 2
• Analysis of Research Paper 3
PROJECT OVERVIEW: SUMMARY
• Problem statement: Detection of network intrusion. The network is prone to the following kinds of attacks: DoS (Denial of Service), probing or surveillance attacks, User-to-Root (U2R), and Remote-to-Local (R2L).
• Our approach: K-Means clustering, to find the hidden patterns in the data. K-Means clustering is an NP-hard problem.
• Our goal: To parallelize the K-Means algorithm.
• Data to be used: NSL-KDD Cup 1999 dataset. Link: https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
RESEARCH PAPER 1
Title: K-Means Clustering Approach to Analyze NSL-KDD Intrusion Detection Dataset
Authors: Vipin Kumar, Himadri Chauhan, Dheeraj Panwar
Journal Name: International Journal of Soft Computing and Engineering
ISSN: 2231-2307, Volume 3, Issue 4
Date: 09/01/2013
Page Number: 1-4
URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.413.589&rep=rep1&type=pdf
RESEARCH PAPER 1: PROBLEM STATEMENT
• Detection of network intrusion by applying the unsupervised data mining technique of clustering to unlabeled data.
• Analysis of the NSL-KDD dataset using K-Means clustering.
• Clustering the dataset into normal traffic and the following network attack types: DoS (Denial of Service), Probe, R2L, and U2R.
RESEARCH PAPER 1: APPROACH TO THE PROBLEM
• K-Means is one of the simplest unsupervised clustering algorithms and can cluster unlabeled data.
• Algorithm [1]:
Initialize k prototypes (w1, ..., wk), where k is the number of clusters; each cluster Ci is associated with prototype wi
repeat
    for each input data record di, where i ∈ {1, ..., n} do
        Assign di to the cluster Cj with the nearest prototype wj
    end for
    for each cluster Cj, where j ∈ {1, ..., k} do
        Update the prototype wj to be the centroid of all samples currently in Cj, so that wj = (∑ di∈Cj di) / |Cj|
    end for
until the prototypes no longer change
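The assignment/update loop above maps directly onto a few lines of plain Python. This is a minimal sketch of standard K-Means, not the paper's implementation; the random initialization from the input records and the fixed iteration cap are simplifying assumptions:

```python
import random

def kmeans(records, k, max_iters=100):
    """Cluster a list of numeric feature vectors into k clusters."""
    # Initialization: pick the k prototypes w1..wk from the input records.
    prototypes = random.sample(records, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Assignment step: attach each record di to the nearest prototype wj.
        clusters = [[] for _ in range(k)]
        for d in records:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(d, prototypes[j])))
            clusters[j].append(d)
        # Update step: move each prototype to the centroid of its cluster.
        new_prototypes = [[sum(dim) / len(c) for dim in zip(*c)] if c else prototypes[j]
                          for j, c in enumerate(clusters)]
        if new_prototypes == prototypes:  # prototypes stable: converged
            break
        prototypes = new_prototypes
    return prototypes, clusters

# Usage: two well-separated 2-D blobs yield two distinct centroids.
random.seed(0)
data = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]]
centroids, clusters = kmeans(data, k=2)
```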
RESEARCH PAPER 1: CONTRIBUTION
• Understanding the multi-dimensional NSL-KDD dataset. Records from the dataset contain the following 41 attributes plus a class attribute. [1]
duration, src_bytes, dst_bytes, land, wrong_fragment, urgent, num_failed_logins, logged_in, num_compromised, root_shell, su_attempted, num_root, num_shells, num_access_files, num_outbound_cmds, is_host_login, is_guest_login, count, serror_rate, srv_serror_rate, rerror_rate, srv_rerror_rate, same_srv_rate, diff_srv_rate, dst_host_count, dst_host_srv_count, dst_host_same_srv_rate, dst_host_diff_srv_rate, dst_host_same_src_port_rate, dst_host_srv_diff_host_rate, dst_host_srv_serror_rate, dst_host_rerror_rate, dst_host_srv_rerror_rate, protocol_type, service, flag, hot, num_file_creations, srv_count, srv_diff_host_rate, dst_host_serror_rate, class
• The class attribute specifies the class of the network traffic: it may be normal or one of 37 different kinds of attacks.
• Categorization of the 37 attack types into 4 attack categories [1]:
DoS: Back, Land, Neptune, Pod, Smurf, Teardrop, Mailbomb, Processtable, Udpstorm, Apache2, Worm
Probe: Satan, Ipsweep, Nmap, Portsweep, Mscan, Saint
R2L: Guess_Password, Ftp_write, Imap, Phf, Multihop, Warezmaster, Xlock, Xsnoop, Snmpguess, Snmpgetattack, Httptunnel, Sendmail, Named
U2R: Buffer_overflow, Loadmodule, Rootkit, Perl, Sqlattack, Xterm, Ps
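The categorization above is effectively a lookup table. A small sketch (the lowercase normalization is an assumption, since NSL-KDD class labels are lowercase while the slide mixes capitalizations):

```python
# Mapping of NSL-KDD attack labels to the four attack categories from [1].
ATTACK_CATEGORY = {
    **dict.fromkeys(["back", "land", "neptune", "pod", "smurf", "teardrop",
                     "mailbomb", "processtable", "udpstorm", "apache2", "worm"], "DoS"),
    **dict.fromkeys(["satan", "ipsweep", "nmap", "portsweep", "mscan", "saint"], "Probe"),
    **dict.fromkeys(["guess_password", "ftp_write", "imap", "phf", "multihop",
                     "warezmaster", "xlock", "xsnoop", "snmpguess", "snmpgetattack",
                     "httptunnel", "sendmail", "named"], "R2L"),
    **dict.fromkeys(["buffer_overflow", "loadmodule", "rootkit", "perl",
                     "sqlattack", "xterm", "ps"], "U2R"),
}

def categorize(label):
    """Map a class label to its attack category; anything unlisted is 'normal'."""
    return ATTACK_CATEGORY.get(label.lower(), "normal")
```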
RESEARCH PAPER 1: DRAWBACKS
• K-Means requires the entire dataset to be stored in main memory.
• It involves high computational complexity: O(n·k·i·d), where n = number of data records, k = number of clusters, i = number of iterations, and d = dimensionality of each record.
RESEARCH PAPER 1: TAKEAWAY FOR OUR IMPLEMENTATION
• Knowing the attributes to work with.
• Understanding the attack types and their assignment to one of the 4 attack categories.
• The prototype to be considered is the centroid of the cluster.
• With a cluster count of k = 4, the optimal result can be achieved.
• Using the Euclidean distance metric to compute the dissimilarity between data records.
RESEARCH PAPER 2
Title: Parallelizing k-means Algorithm for 1-d Data Using MPI
Authors: Ilias K. Savvas and Georgia N. Sofianidou
Published in: 2014 IEEE 23rd International WETICE Conference
Date: 2014
Page Number: 179-184
URL: http://ieeexplore.ieee.org.ezproxy.rit.edu/stamp/stamp.jsp?tp=&arnumber=6927046
RESEARCH PAPER 2: PROBLEM STATEMENT
• Reducing the computational complexity of K-Means by implementing a parallel K-Means.
RESEARCH PAPER 2: APPROACH TO THE PROBLEM
• The algorithm follows the Single Instruction Multiple Data (SIMD) paradigm.
• It works on the concept of master and worker nodes.
• Communication between the master and worker nodes is performed using MPI (Message Passing Interface).
RESEARCH PAPER 2: CONTRIBUTION
• Parallel K-Means algorithm [2]:
The master node M finds the number of available worker nodes, assumed to be (N − 1)
M splits the input dataset D into (N − 1) subsets of size |D|/(N − 1)
M transfers one subset to each worker node
for all worker nodes wi, where i ∈ {1, 2, ..., N − 1}, do in parallel
    Receive the data chunk from the master
    Perform K-Means clustering on the chunk
    Send the local centroids, and the number of data records assigned to each of them, to M
end for
M receives the centroids and the corresponding record counts
M sorts the centroid list
M calculates the global centroids by applying the weighted arithmetic mean and produces the final clustering
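The master/worker scheme can be sketched in Python. Threads stand in for the MPI processes here (a real run would use MPI, e.g. via mpi4py); the simple 1-D chunk clustering and the equal-size grouping of the sorted centroid list are simplifying assumptions for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def cluster_chunk(chunk, k):
    """Worker node: run 1-D K-Means on one chunk, return (centroid, count) pairs."""
    chunk = sorted(chunk)
    # Seed the k local centroids with evenly spaced values from the sorted chunk.
    centroids = [chunk[i * len(chunk) // k] for i in range(k)]
    for _ in range(50):
        groups = [[] for _ in range(k)]
        for x in chunk:
            groups[min(range(k), key=lambda j: abs(x - centroids[j]))].append(x)
        centroids = [sum(g) / len(g) if g else centroids[j]
                     for j, g in enumerate(groups)]
    return [(c, len(g)) for c, g in zip(centroids, groups)]

def master(dataset, n_workers, k):
    """Master node M: split D into subsets, fan out, merge by weighted mean."""
    size = len(dataset) // n_workers
    chunks = [dataset[i * size:(i + 1) * size] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        local = [pair for result in pool.map(lambda c: cluster_chunk(c, k), chunks)
                 for pair in result]
    local.sort()           # M sorts the centroid list
    run = len(local) // k  # each global centroid merges one equal-sized run
    return [sum(c * n for c, n in local[g * run:(g + 1) * run]) /
            sum(n for _, n in local[g * run:(g + 1) * run])
            for g in range(k)]

# Usage: 40 one-dimensional records split across 4 workers, k = 2.
result = master(list(range(1, 41)), n_workers=4, k=2)  # → [10.5, 30.5]
```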
RESEARCH PAPER 2: CONTRIBUTION (contd.)
[Diagram: the master node M splits the dataset among worker nodes W1-W4; each worker returns its local centroids C1-C4, which M collects into the global centroid list.]
RESEARCH PAPER 2: UNDERSTANDING PARALLEL K-MEANS WITH AN EXAMPLE
Input: dataset with 40 data records, k = 2
Creation of the chunks:
Subset 1: 12, 3, 30, 8, 18, 15, 23, 24, 5, 6
Subset 2: 13, 7, 31, 2, 4, 25, 10, 9, 14, 1
Subset 3: 39, 19, 32, 16, 17, 22, 38, 33, 40, 11
Subset 4: 20, 21, 26, 27, 28, 29, 34, 35, 36, 37
Local centroids with the number of data records assigned to each:
15.5 (7 records), 22.5 (3), 8.2 (6), 16.7 (4), 35.4 (5), 27 (5), 23.5 (6), 36 (4)
Sorting the centroids:
8.2 (6), 15.5 (7), 16.7 (4), 22.5 (3), 23.5 (6), 27 (5), 35.4 (5), 36 (4)
Calculating the global centroids (weighted arithmetic mean):
C1 = (8.2·6 + 15.5·7 + 16.7·4 + 22.5·3) / (6 + 7 + 4 + 3) = 14.6
C2 = (23.5·6 + 27·5 + 35.4·5 + 36·4) / (6 + 5 + 5 + 4) ≈ 29.9
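The merge step in this example is a weighted arithmetic mean over equal-sized runs of the sorted centroid list. A sketch that reproduces the numbers above (the equal-size grouping is an assumption that happens to hold here, where each of the k groups contains one local centroid per worker):

```python
def merge_centroids(local, k):
    """Merge (centroid, count) pairs from all workers into k global centroids.

    Sorts the pairs by centroid value, then averages each consecutive run of
    len(local) // k entries, weighted by its record count.
    """
    local = sorted(local)
    run = len(local) // k
    merged = []
    for g in range(k):
        group = local[g * run:(g + 1) * run]
        total = sum(count for _, count in group)
        merged.append(sum(c * count for c, count in group) / total)
    return merged

# The eight local centroids from the four workers, with their record counts.
local_centroids = [(15.5, 7), (22.5, 3), (8.2, 6), (16.7, 4),
                   (35.4, 5), (27, 5), (23.5, 6), (36, 4)]
global_centroids = merge_centroids(local_centroids, k=2)  # ≈ [14.6, 29.85]
```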
RESEARCH PAPER 2: RESULTS
• The efficiency of parallel K-Means increases with an increasing number of clusters.
• Parallel K-Means is highly scalable.
Fig 1: Increasing K and number of nodes [2]
Fig 2: With K = 2, increasing number of nodes [2]
RESEARCH PAPER 2: DRAWBACK
• Message-passing overhead is involved in sending and receiving information between the master and worker nodes.
RESEARCH PAPER 2: TAKEAWAY FOR OUR IMPLEMENTATION
• Improved performance of K-Means clustering through a divide-and-merge (parallelization) mechanism.
• The results obtained are the same as those produced by the original K-Means clustering.
RESEARCH PAPER 3
Title: High performance parallel k-means clustering for disk-resident datasets on multi-core CPUs
Authors: Ali Hadian, Saeed Shahrivari
Journal Name: The Journal of Supercomputing
ISSN: 0920-8542
Date: 08/2014
Page Number: 845-863
URL: http://link.springer.com.ezproxy.rit.edu/article/10.1007%2Fs11227-014-1185-y
RESEARCH PAPER 3: PROBLEM STATEMENT
• Implementation of parallel K-Means clustering that utilizes the full capabilities of a computer's multiple cores.
RESEARCH PAPER 3: APPROACH TO THE PROBLEM
• The algorithm takes a parallel processing approach, utilizing the available CPU cores.
• Divide the dataset into multiple chunks.
• Cluster each of the chunks.
• Aggregate the chunk clusterings to produce the final clusters.
• The algorithm makes only a single pass over the dataset.
RESEARCH PAPER 3: CONTRIBUTION
• The concepts of a master thread and chunk-clustering threads are integral parts of the algorithm.
• Algorithm: Master thread [3]
Input: K, chunk_size, dataset
Initialize chunks_queue [queue shared between threads]
Initialize centroids[] [centroid list shared between threads]
while not end of dataset do
    new_chunk ← load_next_chunk(dataset, chunk_size)
    enqueue(new_chunk, chunks_queue)
    if chunks_queue is full then
        wait(chunks_queue) [wait until a consumer thread dequeues a chunk]
    end if
end while
while chunks_queue is not empty do
    wait(chunks_queue) [wait for all chunks in the queue to be clustered]
end while
C ← Cluster(centroids[], K) [using K-Means++ clustering]
RESEARCH PAPER 3: CONTRIBUTION (contd.)
• Algorithm: Chunk-clustering thread [3]
Input: chunks_queue, centroids[], K (number of clusters for each chunk)
while queue is not empty and main thread is alive do
    chunk_instances ← dequeue(queue)
    C ← Cluster(chunk_instances, K)
    centroids[] ← centroids[] ∪ C
    if queue is empty then
        wait(queue) [wait until the master thread enqueues a chunk]
    end if
end while
• With the chunks clustered, the master thread loads the centroid list.
• With the centroid list, the master thread performs K-Means++ clustering to find the final clusters.
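The two threads above form a bounded-queue producer-consumer pair, which Python's queue module expresses directly. A minimal sketch: the per-chunk mean stands in for the per-chunk K-Means step, and a sentinel-based shutdown replaces the paper's "main thread is alive" check (both are simplifying assumptions):

```python
import queue
import threading

def run_pipeline(dataset, chunk_size, n_threads, queue_capacity=4):
    """Stream chunks through a bounded queue; workers collect per-chunk results."""
    chunks_queue = queue.Queue(maxsize=queue_capacity)  # put() blocks when full
    centroids = []                                      # shared centroid list
    lock = threading.Lock()

    def chunk_clustering_thread():
        while True:
            chunk = chunks_queue.get()
            if chunk is None:                   # sentinel: no more chunks
                break
            local = sum(chunk) / len(chunk)     # stand-in for per-chunk K-Means
            with lock:
                centroids.append(local)

    workers = [threading.Thread(target=chunk_clustering_thread)
               for _ in range(n_threads)]
    for w in workers:
        w.start()
    # Master thread: stream the dataset chunk by chunk into the bounded queue.
    for i in range(0, len(dataset), chunk_size):
        chunks_queue.put(dataset[i:i + chunk_size])
    for _ in workers:                           # one shutdown sentinel per worker
        chunks_queue.put(None)
    for w in workers:
        w.join()
    return sorted(centroids)

# Usage: 40 records in chunks of 10 give four per-chunk means.
result = run_pipeline(list(range(1, 41)), chunk_size=10, n_threads=2)
# → [5.5, 15.5, 25.5, 35.5]
```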
RESEARCH PAPER 3: RESULTS
• The algorithm results in higher speedups when the data size is large enough.
Fig 3: Speedup of the algorithm [3]
RESEARCH PAPER 3: TAKEAWAY FOR OUR IMPLEMENTATION
• Utilizing the machine's multiple cores to cluster the dataset as fast as possible.
REFERENCES
[1] Vipin Kumar, Himadri Chauhan, Dheeraj Panwar, "K-Means Clustering Approach to Analyze NSL-KDD Intrusion Detection Dataset," International Journal of Soft Computing and Engineering, ISSN 2231-2307, Volume 3, Issue 4, 09/01/2013, pp. 1-4. URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.413.589&rep=rep1&type=pdf
[2] Ilias K. Savvas and Georgia N. Sofianidou, "Parallelizing k-means Algorithm for 1-d Data Using MPI," 2014 IEEE 23rd International WETICE Conference, 2014, pp. 179-184. URL: http://ieeexplore.ieee.org.ezproxy.rit.edu/stamp/stamp.jsp?tp=&arnumber=6927046
[3] Ali Hadian, Saeed Shahrivari, "High performance parallel k-means clustering for disk-resident datasets on multi-core CPUs," The Journal of Supercomputing, ISSN 0920-8542, 08/2014, pp. 845-863. URL: http://link.springer.com.ezproxy.rit.edu/article/10.1007%2Fs11227-014-1185-y