CHAPTER 5
MODIFIED J48 DECISION TREE ALGORITHM FOR
EVALUATION
5.1 Introduction
After the 64-byte protocol structure standardization and the application of the genetic mutation and crossover functions, the fitness-based identification of the protocol device is carried out using the modified J48 decision tree algorithm. The implementation of the decision tree algorithm and the identified results are discussed in this chapter.
The main objective of developing this modified J48 decision tree algorithm is to minimize the search process against the Current Active Directory List (CADL). The MAC address of each device is available in the CADL, and this list is used as an input to identify the intruder. The modified J48 decision tree algorithm produces the same result as the genetic algorithm (GA), but consumes less time. The process flow of the modified J48 decision tree algorithm is as follows.
Precondition
All the Current Active Directory List elements are sorted in ascending order, and a list with a unique representation of each device in the network is generated.
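This precondition can be sketched as follows; the CADL entries are modelled here as integer MAC values, which is an illustrative simplification rather than the thesis implementation:

```cpp
#include <algorithm>
#include <vector>

// Sort the CADL entries in ascending order and drop duplicates so that
// each network device is represented exactly once.
std::vector<long> buildUniqueSortedCADL(std::vector<long> cadl) {
    std::sort(cadl.begin(), cadl.end());                              // ascending order
    cadl.erase(std::unique(cadl.begin(), cadl.end()), cadl.end());    // unique devices
    return cadl;
}
```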
Step 1: The generated MAC value is evaluated against the Current Active Directory List (CADL).
Step 2: The CADL elements are formed into a tree based on the values of the existing MAC addresses of the devices.
Step 3: At each node, the left subtree is formed with the smaller-valued elements and the right subtree with the greater-valued elements.
Step 4: Execute Step 3 recursively until all the elements in the CADL are included in the tree.
Step 5: Evaluate the Least Significant Byte (LSB) to identify the gain (similarity) between the observed device value and the value generated from the suspected packet.
Step 6: If the bits are equal, return 1 as the gain; otherwise, return 0.
Step 7: Count the gain value.
Step 8: If the maximum gain value is in the CADL, then the device is an authenticated device; otherwise, it is recommended as a suspected device.
This process is implemented as part of the data mining approach used to build the IDS.
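Steps 5 to 7 above can be sketched as a bitwise comparison of the least significant byte, where each matching bit contributes 1 to the gain; this is an illustrative reading of the procedure, not the thesis implementation itself:

```cpp
// Compare the least significant bytes of the observed device value and
// the value generated from the suspected packet. Each equal bit
// contributes a gain of 1 (Step 6), and the gains are summed (Step 7).
int lsbGain(unsigned observed, unsigned suspected) {
    unsigned diff = (observed ^ suspected) & 0xFFu;  // LSB bits that differ
    int gain = 0;
    for (int bit = 0; bit < 8; ++bit)
        if (((diff >> bit) & 1u) == 0) ++gain;       // equal bit -> gain of 1
    return gain;                                     // 8 means a full LSB match
}
```

A gain of 8 corresponds to an exact LSB match, i.e. the maximum gain tested in Step 8.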
5.2 Data Mining approaches for IDS decision
Data mining is the process of extracting patterns from data. Data mining is an
important tool by which modern business transforms data into business intelligence
giving an informational advantage. It is currently used in a wide range of profiling
practices. The most important reason for using data mining is to assist in the analysis
of collection and observation of behavior. Data mining technology is advanced for
processing huge amounts of data, and to discover hidden and ignored information.
Data mining commonly involves four classes of tasks: clustering, classification, regression and association. Clustering is the task of discovering “similar” groups and structures
in the data, without using the known structures. Classification is the task of
generalizing known structure to apply to new data. Common algorithms include
decision tree learning, nearest neighbor, Naive Bayesian classification [30][71],
neural networks and support vector machines. Regression attempts to find a function
which models the data with the least error. Association rule learning searches for
relationships between variables. The data mining approach is used to discover unknown patterns of interest in domain applications. Network security researchers have attempted to build intrusion detection systems using various data mining techniques.
5.2.1 Decision Tree Algorithm
A decision tree performs the classification of a given data sample through
various levels of decisions to help us reach a final decision. Such a sequence of
decisions is represented in a tree structure. The tree structure is used in classifying
unknown data records. A decision tree with a range of discrete (symbolic) class labels
is called a classification tree, whereas a decision tree with a range of continuous
(numeric) values is called a regression tree. CART (Classification and Regression Trees) is a well-known program used in the design of decision trees. Decision trees make use of the IDE3, C4.5 and CART algorithms [91][92].
5.2.2 IDE3 Algorithm
The IDE3 (Iterative Dichotomiser 3) decision tree algorithm was introduced in 1986 by Ross Quinlan [81][80]. It is based on Hunt’s algorithm, and is serially implemented. Like other decision tree algorithms, the tree is constructed in two phases: tree growth and tree pruning. Data is sorted at every node during the tree-building phase in order to select the best single splitting attribute [20]. The IDE3 uses an information gain measure in choosing the splitting attribute. It only accepts categorical attributes in building a tree model [81][80]. The IDE3 does not give accurate results when there is too much noise or detail in the training data set; thus, intensive pre-processing of the data is carried out before building a decision tree model with the IDE3.
5.2.3 C4.5 Algorithm
The C4.5 algorithm is an improvement on the IDE3 algorithm, developed by Ross Quinlan (1993). It is based on Hunt’s algorithm and, like the IDE3, is serially implemented. Pruning takes place in C4.5 by replacing an internal node with a leaf node, thereby reducing the error rate (Podgorelec et al., 2002). Unlike the IDE3, C4.5 accepts both continuous and categorical attributes in building the decision tree. It has an enhanced method of tree pruning that reduces misclassification errors due to noise or too much detail in the training data set. Like the IDE3, the data is sorted at every node of the tree in order to determine the best splitting attribute. It uses the gain ratio impurity method to evaluate the splitting attribute [79].
5.2.4 CART Algorithm
The CART (Classification and Regression Trees) algorithm was introduced by Breiman [15]. It builds both classification and regression trees. The classification tree construction by CART is based on binary splitting of the attributes. It is also based on Hunt’s model of decision tree construction, and can be implemented serially [15]. It uses the Gini index splitting measure in selecting the splitting attribute. Pruning is done in CART by using a portion of the training data set. CART uses both numeric and categorical attributes for building the decision tree, and has in-built features that deal with missing attributes.
CART differs from other Hunt’s-based algorithms in that it is also used for regression analysis, with the help of regression trees. The regression analysis feature is used in forecasting a dependent variable (result), given a set of predictor variables over a specific period of time. It uses many single-variable splitting criteria, such as the Gini index and symgini, and one multi-variable (linear combinations) criterion in determining the best split point, and the data is sorted at every node to determine the best splitting point. The linear combination splitting criterion is used during the regression analysis.
5.2.5 Support Vector Machines (SVM)
The SVM first maps the input vector into a higher dimensional feature space,
and then obtains the optimal separating hyper-plane in the higher dimensional feature
space. An SVM classifier is designed for binary classification. The generalization in
this approach usually depends on the geometrical characteristics of the given training
data, and not on the specifications of the input space [42]. This procedure transforms the training data into a feature space of very high dimension.
5.2.6 Fuzzy Logic
Fuzzy logic processes the input data from the network, and describes measures that are significant for anomaly detection. Fuzzy logic [38] is a form of many-valued logic. It deals with reasoning that is approximate rather than fixed and exact. In contrast to traditional logic theory, where binary sets have two-valued (true or false) logic, fuzzy logic variables may have a truth value that ranges in degrees between 0 and 1. Fuzzy
algorithms have been successfully applied to a variety of industrial applications,
including automobiles, autonomous vehicles, chemical processes, and robotics. A
fuzzy system [84] comprises a group of linguistic statements based on expert
knowledge. This knowledge is usually in the form of if-then rules. A case or an object
can be distinguished by applying a set of fuzzy logic rules, based on the attributes’
linguistic values. A comparison of the various decision tree algorithms is presented
below in Table 5.1.
1. Decision tree [91][92]
   Method: Based on the binary classification tree.
   Parameters: Positive and negative instances.
   Advantages: (1) Construction does not require any domain knowledge. (2) Can handle high-dimensional data. (3) Able to process both numerical and categorical data.
   Disadvantages: (1) The output attribute must be categorical. (2) Limited to one output attribute. (3) Decision tree algorithms are unstable. (4) Trees created from numeric data sets can be complex.

2. Iterative Dichotomiser 3 (IDE3) [81][80]
   Method: Data is sorted at every node during the tree-building phase in order to select the best single splitting attribute.
   Parameters: Accepts only categorical attributes in building a tree model.
   Advantages: (1) Easy to detect in the sorted elements.
   Disadvantages: (1) Does not give accurate results when there is too much noise or detail in the training data set.

3. C4.5 [79]
   Method: Uses the gain ratio impurity method to evaluate the splitting attribute.
   Parameters: Accepts both continuous and categorical attributes in building the decision tree.
   Advantages: (1) Reduces the error rate by replacing an internal node with a leaf node.
   Disadvantages: (1) Unable to detect in unsorted non-categorical data sets. (2) Noisy data requires pre-processing for detection.

4. Classification and Regression Trees (CART) [15]
   Method: Based on binary splitting of the attributes.
   Parameters: Uses both numeric and categorical attributes for building the decision tree.
   Advantages: (1) In-built features that deal with missing attributes.
   Disadvantages: (1) Data is sorted at every node to determine the best splitting point. (2) The linear combination splitting criterion is used during the regression analysis.

5. Support Vector Machine [42]
   Method: Support vector machine.
   Parameters: The effectiveness of SVM lies in the selection of the kernel and soft margin parameters.
   Advantages: (1) Highly accurate. (2) Able to model complex nonlinear decision boundaries.
   Disadvantages: (1) High algorithmic complexity and extensive memory requirements. (2) The choice of the kernel is difficult.

6. Fuzzy [38]
   Method: Fuzzy logic helps to smooth the abrupt separation of normality and abnormality.
   Parameters: Range values of zero and one.
   Advantages: (1) Able to detect non-malicious port scans launched against the system from the local domain.
   Disadvantages: (1) Unable to detect a greater diversity of intrusions.

7. Modified J48
   Method: Range determination for the communication device based on sequential value.
   Parameters: Ranges are determined based on the value and the minimal attributes.
   Advantages: (1) Reduces the search time for sorted elements. (2) The value is regenerated for each iteration with a minimal tree approach.
   Disadvantages: (1) The search requires sorting as a pre-process.

Table 5.1 Comparison of various decision tree algorithms
The limitations of these algorithms are overcome in the implementation of the genetic approach, and the highlights are drawn from similar research work.
5.3 Modified J48 Decision Tree Algorithm
The 16-bit representation of the device MAC address is present in the Current Active Directory List. The modified J48 decision tree algorithm examines the normalized information gain that results from choosing an attribute for splitting the data. To make the decision, the attribute with the highest normalized information gain is used. The algorithm then recurs on the smaller subsets. The splitting procedure stops if all instances in a subset belong to the same class.
A leaf node is then created in the decision tree telling it to choose that class. Otherwise, the modified J48 decision tree algorithm creates a decision node higher up in the tree using the expected value of the class. If the generated LSB value in the CADL and the incoming protocol device MAC address are the same, then the device is authenticated; otherwise, the device is recommended as an intruder.
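The normalized information gain (gain ratio) used here for split selection can be sketched for a two-class, binary split as follows; this is a generic C4.5-style computation offered for illustration, not the exact thesis code:

```cpp
#include <cmath>

// Shannon entropy of a two-class distribution given the counts of
// positive and negative instances.
double entropy(int pos, int neg) {
    double total = pos + neg;
    if (pos == 0 || neg == 0) return 0.0;        // pure node: no uncertainty
    double p = pos / total, q = neg / total;
    return -p * std::log2(p) - q * std::log2(q);
}

// Normalized information gain (gain ratio) of a binary split that sends
// (lpos, lneg) instances to the left child and the rest to the right.
double gainRatio(int pos, int neg, int lpos, int lneg) {
    int rpos = pos - lpos, rneg = neg - lneg;
    double total = pos + neg;
    double wl = (lpos + lneg) / total, wr = (rpos + rneg) / total;
    double gain = entropy(pos, neg)
                - wl * entropy(lpos, lneg) - wr * entropy(rpos, rneg);
    double splitInfo = entropy(lpos + lneg, rpos + rneg);  // intrinsic value
    return splitInfo > 0.0 ? gain / splitInfo : 0.0;       // normalize the gain
}
```

A split that separates the classes perfectly yields a gain ratio of 1, while a split that leaves both children with the same class mixture yields 0; the attribute with the highest value is chosen for the decision node.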
The structure of the modified J48 decision tree is shown in Figure 5.1. The
first level of the tree is a single header node. It is just a pointer node to its children.
The second level of the tree has two subtrees, labeled 1 and 2.
Figure 5.1 Structure of the modified J48 decision tree
5.3.1 Pseudo code for the modified J48 algorithm
The following pseudo code is used to build the decision tree:
1. Check for the base cases {initial device list from the Current Active Directory List}
2. For each attribute a {from the captured packets: the device MAC address}, find the normalized information gain from splitting on a {select the 16-bit device value from the least significant bits}
3. Let a_best be the attribute with the highest normalized information gain {allowed to communicate on the network}
4. Create a decision node that splits on a_best {select the least significant bits or the significant bit for crossover}
5. Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of the node
The incoming value from the MAC device is converted using the IANA format, and the values are compared. C++ code is used for the implementation of the modified J48 decision tree algorithm.
The generated MAC address values are passed as parameters to the modified J48 decision tree algorithm, as mentioned in Steps 1 and 2. The sample implementation process is explained using the manufacturer and device information 08-00-06-04-00-01.
The value 08H is passed as a parameter, and the search for 08H within the CADL is then verified. The modified J48 decision tree algorithm employs two pruning methods. The first is known as subtree replacement. The second type of pruning used in the modified J48 decision tree algorithm is termed subtree raising. The subtree raising is implemented with the procedure given in Table 5.2.
Procedure
    Assign CADL as the root of the tree for the search process
    Tree = { elements of CADL }      // where CADL consists of SMA, SDA, DMA, DDA
Calculation
    Assign L = first element         // L is the low-range value
    Assign H = last element          // H is the highest value
    D = | L - H |                    // difference
    M = loc(D / 2)                   // M - selection of the middle element
Search process
    If (M = e)                       // e - search element
        Declare element found
    Else
        If (e < M)
            subtree = Tree(L) ... Tree(M);
        Else
            subtree = Tree(M) ... Tree(H);
        End if
        Tree = subtree;
    End if
Table 5.2 Procedure for the modified J48 algorithm
While the search is carried out in the list, the mid value of the separated list is identified, and the test is processed.
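The halving search of Table 5.2 can be sketched in C++ as follows, assuming the CADL is modelled as a sorted vector of integer byte values (an illustrative simplification of the tree structure):

```cpp
#include <vector>

// Binary search over the sorted CADL, mirroring the halving procedure of
// Table 5.2: at each step the middle element is tested, and the search
// continues in the left or right half depending on the comparison.
bool searchCADL(const std::vector<int>& cadl, int e) {
    int low = 0, high = static_cast<int>(cadl.size()) - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2;   // middle element of the list
        if (cadl[mid] == e) return true;    // element found: authenticated
        if (e < cadl[mid]) high = mid - 1;  // continue in the left subtree
        else               low  = mid + 1;  // continue in the right subtree
    }
    return false;                           // not in the CADL: suspected device
}
```

For example, searching for 08H in a sorted CADL finds it in O(log n) comparisons, which is the source of the search-time reduction reported later in Table 5.8.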
5.4 Illustration
The modified J48 decision tree algorithm is explained with an example. Let the tree represent the manufacturer OUI value and the device value fetched from the Current Active Directory List.
Level 0: Create a decision tree.
Level 1: The sub trees are generated based on the values instead of the Index.
CADL List
Level 2:
The parameter 08H is less than the mid value 11H. Therefore, the left subtree is identified as the dependent element and selected for further evaluation.
Level 3:
The parameter 08H is less than the mid value 0CH. Therefore, the left subtree is identified as the dependent element and selected for further evaluation.
The mid value is the fourth element, 08H. The element and the mid value are the same; therefore, the element is identified and recommended as an authenticated value.
The existence of 08H is confirmed; therefore, the return value is 1. This represents that the device communicating in the network is authenticated. A sample detection result is produced below in Figure 5.2.
Figure 5.2 Intrusion Detection.
According to the procedure explained above, intrusions are detected using the ARP, ICMP and SNMP-ALG protocol standardizations, and their results are obtained.
5.5 Experimental Results
The intrusion detection process is implemented with real-time packets, which are collected from protocol observations and converted into a file. This set is differentiated by the features of the normal connections and those of the attacks. Three types of protocol structure (ARP, SNMP-ALG, and ICMP) in the hybrid academic network with 800 nodes are analyzed.
The intrusion detection level in the ARP protocol against the initiated intrusion is given in Eqn (5.1):

ARP_IDS = (Id_ARP / Initiated Intrusion) × 100    (5.1)

where Id_ARP is the number of identified intruders in the ARP protocol. The efficiency of the IDS process is based on the identification from the packets. The number of packets permitted to communicate in an authenticated manner is determined after the manufacturer and device evaluations. Table 5.3 illustrates the result of the ARP protocol packet intrusion detection.
S.No | Number of packets | Initiated intrusion | Number of ARP packets | ARP identified | % of ARP identified vs total | % of ARP identified vs initiated | % of identified ARP vs ARP
1    | 52489    | 6650    | 28010 | 1525    | 2.91 | 22.93 | 5.44
2    | 52489    | 4107    | 28010 | 1271    | 2.42 | 30.95 | 4.54
3    | 49676    | 3881    | 23424 | 549     | 1.11 | 14.15 | 2.34
4    | 49676    | 5252    | 23424 | 891     | 1.79 | 16.96 | 3.80
5    | 51305    | 8606    | 25252 | 2244    | 4.37 | 26.07 | 8.89
6    | 51305    | 7478    | 25252 | 1766    | 3.44 | 23.62 | 6.99
7    | 49781    | 3779    | 26489 | 884     | 1.78 | 23.39 | 3.34
8    | 57756    | 3728    | 27169 | 824     | 1.43 | 22.10 | 3.03
9    | 63383    | 6869    | 24464 | 848     | 1.34 | 12.35 | 3.47
10   | 54761    | 4478    | 27906 | 1072    | 1.96 | 23.94 | 3.84
Min  | 49676.00 | 3728.00 | 23424 | 549.00  | 1.11 | 12.35 | 2.34
Max  | 63383.00 | 8606.00 | 28010 | 2244.00 | 4.37 | 30.95 | 8.89
Avr  | 53262.10 | 5482.80 | 25940 | 1187.40 | 2.25 | 21.65 | 4.57
Table 5.3 Result of the ARP protocol packet intrusion detection (10 files)
The intrusion detection level in the SNMP-ALG protocol against the initiated intrusion is given in Eqn (5.2):

SNMP-ALG_IDS = (Id_SNMP-ALG / Initiated Intrusion) × 100    (5.2)

where Id_SNMP-ALG is the number of identified intruders in the SNMP-ALG protocol. The result summary of the SNMP-ALG protocol packet intrusion detection is presented in Table 5.4, given below.
S.No | Number of packets | Initiated intrusion | SNMP-ALG packets | SNMP-ALG identified | % of SNMP-ALG identified vs total | % of SNMP-ALG identified vs initiated | % of identified SNMP-ALG vs SNMP-ALG
1    | 52489    | 6650    | 3154 | 203    | 0.39 | 3.05 | 6.44
2    | 52489    | 4107    | 696  | 19     | 0.04 | 0.46 | 2.73
3    | 49676    | 3881    | 2583 | 98     | 0.20 | 2.53 | 3.79
4    | 49676    | 5252    | 932  | 45     | 0.09 | 0.86 | 4.83
5    | 51305    | 8606    | 720  | 44     | 0.09 | 0.51 | 6.11
6    | 51305    | 7478    | 746  | 51     | 0.10 | 0.68 | 6.84
7    | 49781    | 3779    | 4260 | 139    | 0.28 | 3.68 | 3.26
8    | 57756    | 3728    | 1080 | 25     | 0.04 | 0.67 | 2.31
9    | 63383    | 6869    | 2325 | 98     | 0.15 | 1.43 | 4.22
10   | 54761    | 4478    | 2989 | 117    | 0.21 | 2.61 | 3.91
Min  | 49676.00 | 3728.00 | 696  | 19.00  | 0.04 | 0.46 | 2.31
Max  | 63383.00 | 8606.00 | 4260 | 203.00 | 0.39 | 3.68 | 6.84
Avr  | 53262.10 | 5482.80 | 2478 | 83.90  | 0.16 | 1.65 | 4.44
Table 5.4 Result of the SNMP-ALG protocol packet intrusion detection (10 files)
The intrusion detection level in the ICMP protocol against the initiated intrusion is given in Eqn (5.3):

ICMP_IDS = (Id_ICMP / Initiated Intrusion) × 100    (5.3)

where Id_ICMP is the number of identified intruders in the ICMP protocol. The result summary of the ICMP protocol packet intrusion detection is presented in Table 5.5, given below.
S.No | Number of packets | Initiated intrusion | ICMP packets | ICMP identified | % of ICMP identified vs total | % of ICMP identified vs initiated | % of identified ICMP vs ICMP
1    | 52489    | 6650    | 6061    | 529    | 1.01 | 7.95 | 8.73
2    | 52489    | 4107    | 4033    | 113    | 0.22 | 2.75 | 2.80
3    | 49676    | 3881    | 4799    | 168    | 0.34 | 4.33 | 3.50
4    | 49676    | 5252    | 3189    | 128    | 0.26 | 2.44 | 4.01
5    | 51305    | 8606    | 2643    | 168    | 0.33 | 1.95 | 6.36
6    | 51305    | 7478    | 4732    | 317    | 0.62 | 4.24 | 6.70
7    | 49781    | 3779    | 4822    | 241    | 0.48 | 6.38 | 5.00
8    | 57756    | 3728    | 6964    | 184    | 0.32 | 4.94 | 2.64
9    | 63383    | 6869    | 7759    | 302    | 0.48 | 4.40 | 3.89
10   | 54761    | 4478    | 6645    | 184    | 0.34 | 4.11 | 2.77
Min  | 49676.00 | 3728.00 | 2643.00 | 113.00 | 0.22 | 1.95 | 2.64
Max  | 63383.00 | 8606.00 | 7759.00 | 529.00 | 1.01 | 7.95 | 8.73
Avr  | 53262.10 | 5482.80 | 5164.70 | 233.40 | 0.44 | 4.35 | 4.64
Table 5.5 Result of the ICMP protocol packet intrusion detection (10 files)
As per the analysis of Tables 5.3, 5.4 and 5.5, the ARP protocol identified and detected 2.25% of the total packets; this corresponds to 21.65% of the total initiated intrusions and 4.57% of the ARP protocol packets. The SNMP-ALG protocol identified and detected fewer packets of the total fetched from the network compared to the ARP packets: the detected intrusion is 0.16% of the total number of packets, corresponding to 1.65% of the initiated intrusions and 4.44% of the total SNMP-ALG protocol packets. Similarly, for ICMP the detected intrusion is 0.44% of the total number of packets, 4.35% of the total initiated intrusions, and 4.64% of the captured ICMP protocol packets. As per the results, the ARP protocol performed the best.
The result comparison of IDS (10 files) in ARP, SNMP-ALG and ICMP
protocols with initiated intrusion is depicted graphically in Figure 5.3.
Figure 5.3 Comparison of IDS in ARP,SNMP-ALG and ICMP (10 files)
with initiated intrusion
The same process was executed for 211 files. The result summary of the identified percentage of intrusion detection in each protocol against the total number of packets is presented in Table 5.6, given below.
     | Total number of observed packets | Initiated intrusion | % of ARP identified vs total | % of SNMP-ALG identified vs total | % of ICMP identified vs total
Min  | 49272    | 3114     | 0.49 | 0.03 | 0.04
Max  | 84957    | 13102    | 7.11 | 0.95 | 1.63
Avr  | 66471.11 | 7051.93  | 2.05 | 0.28 | 0.55
Table 5.6 Percentage of detected intrusions in the total number of observed packets (211 files)
In the implementation of the 211 sets of observed packets, the ARP, SNMP-ALG and ICMP protocols identified the intrusions. The ARP protocol identified 2.05%, the SNMP-ALG protocol 0.28% and the ICMP protocol 0.55% of the total number of packets. Of the total 211 files of observed protocols, the ARP, SNMP-ALG and ICMP protocols were adopted for the intrusion detection. The detected intrusion data set is listed below in Figure 5.4.
Figure 5.4 Detected Intrusion list
The result comparison of IDS in ARP, SNMP-ALG and ICMP protocols is
depicted graphically in Figure 5.5.
Figure 5.5 Comparison of IDS in ARP,SNMP-ALG and ICMP( 10 and 211 files)
As per the observations, the ARP protocol plays a significant role in the identification of intrusions among the total observed packets in real-time data transfer across the hybrid academic network. The detailed protocol result analysis is given in the Appendix. The identified percentage of detected intrusions in each protocol over the 211 files is presented in Table 5.7, given below.
     | Total observed packets | Initiated intrusion | % of ARP identified vs initiated | % of identified ARP vs total ARP | % of SNMP-ALG identified vs initiated | % of identified SNMP-ALG vs SNMP-ALG | % of ICMP identified vs initiated | % of identified ICMP vs ICMP total
Min  | 49272    | 3114    | 6.97  | 1.91  | 0.28 | 1.86  | 0.51  | 1.87
Max  | 84957    | 13102   | 43.43 | 10.58 | 7.16 | 10.44 | 15.31 | 11.95
Avr  | 66471.11 | 7051.93 | 19.23 | 5.09  | 2.68 | 5.30  | 5.26  | 5.50
Table 5.7 Percentage of intrusions detected with individual protocols from the initiated intrusions
Comparing the identified intrusions in each protocol, the ARP identified up to 43.43%, the ICMP 15.31% and the SNMP-ALG 7.16% of the initiated intrusions. According to the captured protocols, the determined and blocked packets and their related resources are classified. In this observation only these three protocols are considered; they largely cover the basic network subnet architecture of the OSI model. The searching time is minimized using the modified J48 decision tree algorithm when compared to the searching time of the genetic algorithm.
The comparison of the searching process time using the Genetic algorithm and
the modified J48 algorithm is shown below in Table 5.8.
Nodes | Number of packets | Initiated intrusion | Search time using GA (sec) | Search time using modified J48 (sec)
10    | 53262 | 5483 | 363 | 350
100   | 66923 | 7620 | 403 | 383
500   | 69273 | 7615 | 404 | 401
600   | 68105 | 7267 | 420 | 374
800   | 66471 | 7052 | 412 | 363
Table 5.8 Comparison of the search times using the genetic algorithm and the modified J48 algorithm
The searching time comparison of different network capacities (10 nodes, 100
nodes, 500 nodes, 600 nodes, 800 nodes) using the GA and the modified J48
algorithm is depicted graphically in Figure 5.6.
Figure 5.6 Comparison of searching time using Genetic algorithm
and modified J48 algorithm
As the number of files in the nodes increases, the search time for the IDS also increases, and the time complexity becomes unbalanced. However, with the modified J48 algorithm the time is comparatively lower than with the GA method, because the device ranges are determined based on the value index. If packets are identified as intrusions, those devices are listed as suspected devices of the network; therefore, the average time remains low even while evaluating a larger number of files across a high number of nodes in the network.
5.6 Summary
The implementation of the modified J48 decision tree algorithm and the evaluation of the search process were discussed in this chapter. The search time comparison across different network capacities (10, 100, 500, 600 and 800 nodes) using the GA and the modified J48 algorithm was discussed. The implementation of the process and its impact, as observed across the different protocols, were also discussed.