CHAPTER 5
MODIFIED J48 DECISION TREE ALGORITHM FOR
EVALUATION
5.1 Introduction
After the 64-byte protocol structure standardization and the application of the genetic mutation and crossover functions, the fitness-based identification of the protocol device is carried out using the modified J48 decision tree algorithm. The implementation of the decision tree algorithm and the identified results are discussed in this chapter.
The main objective of developing this modified J48 decision tree algorithm is to minimize the search process against the Current Active Directory List (CADL). The MAC address of each device is available in the CADL, and this list is used as an input to identify the intruder. The modified J48 decision tree algorithm produces the same result as the genetic algorithm (GA), but consumes less time. The process flow of the modified J48 decision tree algorithm is as follows.
Precondition
All the Current Active Directory List elements are sorted in ascending order, and a list with a unique representation of each device in the network is generated.
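This precondition can be sketched as follows; the CADL entries are modelled here as integer MAC values, which is an illustrative simplification rather than the thesis implementation:

```cpp
#include <algorithm>
#include <vector>

// Sort the CADL entries in ascending order and drop duplicates so that
// each network device is represented exactly once.
std::vector<long> buildUniqueSortedCADL(std::vector<long> cadl) {
    std::sort(cadl.begin(), cadl.end());                              // ascending order
    cadl.erase(std::unique(cadl.begin(), cadl.end()), cadl.end());    // unique devices
    return cadl;
}
```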
Step 1: The generated MAC value is evaluated against the Current Active Directory List (CADL).
Step 2: The CADL elements are formed into a tree based on the values of the existing MAC addresses of the devices.
Step 3: At each node, the left subtree is formed with the smaller-valued elements and the right subtree with the greater-valued elements.
Step 4: Execute Step 3 recursively until all the elements in the CADL are included in the tree.
Step 5: Evaluate the Least Significant Byte (LSB) to identify the gain (similarity) between the observed device value and the value generated from the suspected packet.
Step 6: If the bits are equal, return 1 as the gain; otherwise, return 0.
Step 7: Count the gain value.
Step 8: If the maximum gain value is in the CADL, then the device is an authenticated device; otherwise, it is recommended as a suspected device.
This process is implemented as part of the data mining approach used to build the IDS.
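Steps 5 to 7 above can be sketched as a bitwise comparison of the least significant byte, where each matching bit contributes 1 to the gain; this is an illustrative reading of the procedure, not the thesis implementation itself:

```cpp
// Compare the least significant bytes of the observed device value and
// the value generated from the suspected packet. Each equal bit
// contributes a gain of 1 (Step 6), and the gains are summed (Step 7).
int lsbGain(unsigned observed, unsigned suspected) {
    unsigned diff = (observed ^ suspected) & 0xFFu;  // LSB bits that differ
    int gain = 0;
    for (int bit = 0; bit < 8; ++bit)
        if (((diff >> bit) & 1u) == 0) ++gain;       // equal bit -> gain of 1
    return gain;                                     // 8 means a full LSB match
}
```

A gain of 8 corresponds to an exact LSB match, i.e. the maximum gain tested in Step 8.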
5.2 Data Mining approaches for IDS decision
Data mining is the process of extracting patterns from data. Data mining is an
important tool by which modern business transforms data into business intelligence
giving an informational advantage. It is currently used in a wide range of profiling
practices. The most important reason for using data mining is to assist in the analysis
of collection and observation of behavior. Data mining technology is advanced for
processing huge amounts of data, and to discover hidden and ignored information.
Data mining commonly involves four classes of tasks: clustering, classification, regression and association. Clustering is the task of discovering “similar” groups and structures
in the data, without using the known structures. Classification is the task of
generalizing known structure to apply to new data. Common algorithms include
decision tree learning, nearest neighbor, Naive Bayesian classification [30][71],
neural networks and support vector machines. Regression attempts to find a function
which models the data with the least error. Association rule learning searches for
relationships between variables. The data mining approach is used to discover unknown patterns of interest in domain applications. Network security researchers have attempted to build intrusion detection systems using various data mining techniques.
5.2.1 Decision Tree Algorithm
A decision tree performs the classification of a given data sample through
various levels of decisions to help us reach a final decision. Such a sequence of
decisions is represented in a tree structure. The tree structure is used in classifying
unknown data records. A decision tree with a range of discrete (symbolic) class labels
is called a classification tree, whereas a decision tree with a range of continuous
(numeric) values is called a regression tree. CART (Classification and Regression Trees) is a well-known program used in the design of decision trees. Decision trees make use of the IDE3, C4.5 and CART algorithms [91][92].
5.2.2 IDE3 Algorithm
The IDE3 (Iterative Dichotomiser 3) decision tree algorithm was introduced in 1986 by Ross Quinlan [81][80]. It is based on Hunt’s algorithm, and is serially implemented. Like other decision tree algorithms, the tree is constructed in two phases: tree growth and tree pruning. Data is sorted at every node during the tree-building phase in order to select the best single splitting attribute [20]. The IDE3 uses an information gain measure in choosing the splitting attribute. It only accepts categorical attributes in building a tree model [81][80]. The IDE3 does not give accurate results when there is too much noise or detail in the training data set; thus, intensive pre-processing of the data is carried out before building a decision tree model with the IDE3.
5.2.3 C4.5 Algorithm
The C4.5 algorithm is an improvement on the IDE3 algorithm, developed by Ross Quinlan (1993). It is based on Hunt’s algorithm and, like the IDE3, is serially implemented. Pruning takes place in C4.5 by replacing an internal node with a leaf node, thereby reducing the error rate (Podgorelec et al., 2002). Unlike the IDE3, C4.5 accepts both continuous and categorical attributes in building the decision tree. It has an enhanced method of tree pruning that reduces misclassification errors due to noise or too much detail in the training data set. Like the IDE3, the data is sorted at every node of the tree in order to determine the best splitting attribute. It uses the gain ratio impurity method to evaluate the splitting attribute [79].
5.2.4 CART Algorithm
The CART (Classification and Regression Trees) algorithm was introduced by Breiman [15]. It builds both classification and regression trees. The classification tree construction by CART is based on binary splitting of the attributes. It is also based on Hunt’s model of decision tree construction, and can be implemented serially [15]. It uses the Gini index splitting measure in selecting the splitting attribute. Pruning is done in CART by using a portion of the training data set. CART uses both numeric and categorical attributes for building the decision tree, and has in-built features that deal with missing attributes.
CART differs from other Hunt’s-based algorithms in that it is also used for regression analysis, with the help of regression trees. The regression analysis feature is used in forecasting a dependent variable (result), given a set of predictor variables over a specific period of time. It uses many single-variable splitting criteria, such as the Gini index and symgini, and one multi-variable (linear combinations) criterion in determining the best split point, and the data is sorted at every node to determine the best splitting point. The linear combination splitting criterion is used during the regression analysis.
5.2.5 Support Vector Machines (SVM)
The SVM first maps the input vector into a higher dimensional feature space,
and then obtains the optimal separating hyper-plane in the higher dimensional feature
space. An SVM classifier is designed for binary classification. The generalization in
this approach usually depends on the geometrical characteristics of the given training
data, and not on the specifications of the input space [42]. This procedure transforms the training data into a feature space of very high dimension.
5.2.6 Fuzzy Logic
Fuzzy logic processes the input data from the network, and describes measures that are significant for anomaly detection. Fuzzy logic [38] is a form of many-valued logic. It deals with reasoning that is approximate rather than fixed and exact. In contrast to traditional logic theory, where binary sets have two-valued (true or false) logic, fuzzy logic variables may have a truth value that ranges in degrees between 0 and 1. Fuzzy
algorithms have been successfully applied to a variety of industrial applications,
including automobiles, autonomous vehicles, chemical processes, and robotics. A
fuzzy system [84] comprises a group of linguistic statements based on expert
knowledge. This knowledge is usually in the form of if-then rules. A case or an object
can be distinguished by applying a set of fuzzy logic rules, based on the attributes’
linguistic values. A comparison of the various decision tree algorithms is presented
below in Table 5.1.
1. Decision tree [91][92]
   Method: Based on the binary classification tree.
   Parameters: Positive and negative instances.
   Advantages: (1) Construction does not require any domain knowledge. (2) Can handle high-dimensional data. (3) Able to process both numerical and categorical data.
   Disadvantages: (1) The output attribute must be categorical. (2) Limited to one output attribute. (3) Decision tree algorithms are unstable. (4) Trees created from numeric data sets can be complex.

2. Iterative Dichotomiser 3 (IDE3) [81][80]
   Method: Data is sorted at every node during the tree-building phase in order to select the best single splitting attribute.
   Parameters: Accepts only categorical attributes in building a tree model.
   Advantages: (1) Easy to detect in the sorted elements.
   Disadvantages: (1) Does not give accurate results when there is too much noise or detail in the training data set.

3. C4.5 [79]
   Method: Uses the gain ratio impurity method to evaluate the splitting attribute.
   Parameters: Accepts both continuous and categorical attributes in building the decision tree.
   Advantages: (1) Reduces the error rate by replacing an internal node with a leaf node.
   Disadvantages: (1) Unable to detect in unsorted non-categorical data sets. (2) Noisy data requires pre-processing for detection.

4. Classification and Regression Trees (CART) [15]
   Method: Based on binary splitting of the attributes.
   Parameters: Uses both numeric and categorical attributes for building the decision tree.
   Advantages: (1) In-built features that deal with missing attributes.
   Disadvantages: (1) Data is sorted at every node to determine the best splitting point. (2) The linear combination splitting criterion is used during the regression analysis.

5. Support Vector Machine [42]
   Method: Support vector machine.
   Parameters: The effectiveness of SVM lies in the selection of the kernel and soft margin parameters.
   Advantages: (1) Highly accurate. (2) Able to model complex nonlinear decision boundaries.
   Disadvantages: (1) High algorithmic complexity and extensive memory requirements. (2) The choice of the kernel is difficult.

6. Fuzzy [38]
   Method: Fuzzy logic helps to smooth the abrupt separation of normality and abnormality.
   Parameters: Range values of zero and one.
   Advantages: (1) Able to detect non-malicious port scans launched against the system from the local domain.
   Disadvantages: (1) Unable to detect a greater diversity of intrusions.

7. Modified J48
   Method: Range determination for the communication device based on sequential value.
   Parameters: Ranges are determined based on the value and the minimal attributes.
   Advantages: (1) Reduces the search time for sorted elements. (2) The value is regenerated for each iteration with a minimal tree approach.
   Disadvantages: (1) The search requires sorting as a pre-process.

Table 5.1 Comparison of various decision tree algorithms
The limitations of these algorithms are overcome in the implementation of the genetic approach, and the highlights are drawn from similar research work.
5.3 Modified J48 Decision Tree Algorithm
The 16-bit representation of the device MAC address is present in the Current Active Directory List. The modified J48 decision tree algorithm examines the normalized information gain that results from choosing an attribute for splitting the data. To make the decision, the attribute with the highest normalized information gain is used. The algorithm then recurs on the smaller subsets. The splitting procedure stops if all instances in a subset belong to the same class.
A leaf node is then created in the decision tree telling it to choose that class. Otherwise, the modified J48 decision tree algorithm creates a decision node higher up in the tree using the expected value of the class. If the generated LSB value in the CADL and the incoming protocol device MAC address are the same, then the device is authenticated; otherwise, the device is recommended as an intruder.
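The normalized information gain (gain ratio) used here for split selection can be sketched for a two-class, binary split as follows; this is a generic C4.5-style computation offered for illustration, not the exact thesis code:

```cpp
#include <cmath>

// Shannon entropy of a two-class distribution given the counts of
// positive and negative instances.
double entropy(int pos, int neg) {
    double total = pos + neg;
    if (pos == 0 || neg == 0) return 0.0;        // pure node: no uncertainty
    double p = pos / total, q = neg / total;
    return -p * std::log2(p) - q * std::log2(q);
}

// Normalized information gain (gain ratio) of a binary split that sends
// (lpos, lneg) instances to the left child and the rest to the right.
double gainRatio(int pos, int neg, int lpos, int lneg) {
    int rpos = pos - lpos, rneg = neg - lneg;
    double total = pos + neg;
    double wl = (lpos + lneg) / total, wr = (rpos + rneg) / total;
    double gain = entropy(pos, neg)
                - wl * entropy(lpos, lneg) - wr * entropy(rpos, rneg);
    double splitInfo = entropy(lpos + lneg, rpos + rneg);  // intrinsic value
    return splitInfo > 0.0 ? gain / splitInfo : 0.0;       // normalize the gain
}
```

A split that separates the classes perfectly yields a gain ratio of 1, while a split that leaves both children with the same class mixture yields 0; the attribute with the highest value is chosen for the decision node.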
The structure of the modified J48 decision tree is shown in Figure 5.1. The
first level of the tree is a single header node. It is just a pointer node to its children.
The second level of the tree has two subtrees, labeled 1 and 2.
Figure 5.1 Structure of the modified J48 decision tree
5.3.1 Pseudo code for the modified J48 algorithm
The following pseudo code is used to build the decision tree:
1. Check for the base cases {initial device list from the Current Active Directory List}
2. For each attribute a {from the captured packets: the device MAC address}, find the normalized information gain from splitting on a {select the 16-bit device value from the least significant bits}
3. Let a_best be the attribute with the highest normalized information gain {allowed to communicate on the network}
4. Create a decision node that splits on a_best {select the least significant bits or the significant bit for crossover}
5. Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of the node
The incoming value from the MAC device is converted using the IANA format, and the values are compared. C++ code is used for the implementation of the modified J48 decision tree algorithm.
The generated MAC address values are passed as parameters to the modified J48 decision tree algorithm, as mentioned in Steps 1 and 2. The sample implementation process is explained using the manufacturer and device information 08-00-06-04-00-01.
The value 08H is passed as a parameter, and the search for 08H within the CADL is then verified. The modified J48 decision tree algorithm employs two pruning methods. The first is known as subtree replacement. The second type of pruning used in the modified J48 decision tree algorithm is termed subtree raising. The subtree raising is implemented with the procedure given in Table 5.2.
Procedure
    Assign CADL as the root of the tree for the search process
    Tree = { elements of CADL }      // where CADL consists of SMA, SDA, DMA, DDA
Calculation
    Assign L = first element         // L is the low-range value
    Assign H = last element          // H is the highest value
    D = | L - H |                    // difference
    M = loc(D / 2)                   // M - selection of the middle element
Search process
    If (M = e)                       // e - search element
        Declare element found
    Else
        If (e < M)
            subtree = Tree(L) ... Tree(M);
        Else
            subtree = Tree(M) ... Tree(H);
        End if
        Tree = subtree;
    End if
Table 5.2 Procedure for the modified J48 algorithm
While the search is carried out in the list, the mid value of the separated list is identified, and the test is processed.
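The halving search of Table 5.2 can be sketched in C++ as follows, assuming the CADL is modelled as a sorted vector of integer byte values (an illustrative simplification of the tree structure):

```cpp
#include <vector>

// Binary search over the sorted CADL, mirroring the halving procedure of
// Table 5.2: at each step the middle element is tested, and the search
// continues in the left or right half depending on the comparison.
bool searchCADL(const std::vector<int>& cadl, int e) {
    int low = 0, high = static_cast<int>(cadl.size()) - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2;   // middle element of the list
        if (cadl[mid] == e) return true;    // element found: authenticated
        if (e < cadl[mid]) high = mid - 1;  // continue in the left subtree
        else               low  = mid + 1;  // continue in the right subtree
    }
    return false;                           // not in the CADL: suspected device
}
```

For example, searching for 08H in a sorted CADL finds it in O(log n) comparisons, which is the source of the search-time reduction reported later in Table 5.8.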
5.4 Illustration
The modified J48 decision tree algorithm is explained with an example. Let the tree represent the manufacturer OUI value and the device value fetched from the Current Active Directory List.
Level 0: Create a decision tree.
Level 1: The sub trees are generated based on the values instead of the Index.
CADL List
Level 2:
The parameter 08H is less than the mid value 11H. Therefore, the left subtree is identified as the dependent element and selected for further evaluation.
Level 3:
The parameter 08H is less than the mid value 0CH. Therefore, the left subtree is identified as the dependent element and selected for further evaluation.
The mid value is the fourth element, 08H. The element and the mid value are the same; therefore, the element is identified and recommended as an authenticated value.
The existence of 08H is confirmed; therefore, the return value is 1. This represents that the device communicating in the network is authenticated. A sample detection result is produced below in Figure 5.2.
Figure 5.2 Intrusion Detection.
According to the procedure explained above, intrusions are detected using the ARP, ICMP and SNMP-ALG protocol standardizations, and their results are obtained.
5.5 Experimental Results
The intrusion detection process is implemented with real-time packets, which are collected from protocol observations and converted into a file. This set is differentiated by the features of the normal connections and those of the attacks. Three types of protocol structure (ARP, SNMP-ALG, and ICMP) in the hybrid academic network with 800 nodes are analyzed.
The intrusion detection level in the ARP protocol against the initiated intrusion is given in Eqn (5.1):

ARP_IDS = (Id_ARP / Initiated Intrusion) × 100    (5.1)

where Id_ARP is the number of identified intruders in the ARP protocol. The efficiency of the IDS process is based on the identification from the packets. The number of packets permitted to communicate in an authenticated manner is determined after the manufacturer and device evaluations. Table 5.3 illustrates the result of the ARP protocol packet intrusion detection.
S.No | Number of packets | Initiated intrusion | Number of ARP packets | ARP identified | % of ARP identified vs total | % of ARP identified vs initiated | % of identified ARP vs ARP
1    | 52489    | 6650    | 28010 | 1525    | 2.91 | 22.93 | 5.44
2    | 52489    | 4107    | 28010 | 1271    | 2.42 | 30.95 | 4.54
3    | 49676    | 3881    | 23424 | 549     | 1.11 | 14.15 | 2.34
4    | 49676    | 5252    | 23424 | 891     | 1.79 | 16.96 | 3.80
5    | 51305    | 8606    | 25252 | 2244    | 4.37 | 26.07 | 8.89
6    | 51305    | 7478    | 25252 | 1766    | 3.44 | 23.62 | 6.99
7    | 49781    | 3779    | 26489 | 884     | 1.78 | 23.39 | 3.34
8    | 57756    | 3728    | 27169 | 824     | 1.43 | 22.10 | 3.03
9    | 63383    | 6869    | 24464 | 848     | 1.34 | 12.35 | 3.47
10   | 54761    | 4478    | 27906 | 1072    | 1.96 | 23.94 | 3.84
Min  | 49676.00 | 3728.00 | 23424 | 549.00  | 1.11 | 12.35 | 2.34
Max  | 63383.00 | 8606.00 | 28010 | 2244.00 | 4.37 | 30.95 | 8.89
Avr  | 53262.10 | 5482.80 | 25940 | 1187.40 | 2.25 | 21.65 | 4.57
Table 5.3 Result of the ARP protocol packet intrusion detection (10 files)
The intrusion detection level in the SNMP-ALG protocol against the initiated intrusion is given in Eqn (5.2):

SNMP-ALG_IDS = (Id_SNMP-ALG / Initiated Intrusion) × 100    (5.2)

where Id_SNMP-ALG is the number of identified intruders in the SNMP-ALG protocol. The result summary of the SNMP-ALG protocol packet intrusion detection is presented in Table 5.4, given below.
S.No | Number of packets | Initiated intrusion | SNMP-ALG packets | SNMP-ALG identified | % of SNMP-ALG identified vs total | % of SNMP-ALG identified vs initiated | % of identified SNMP-ALG vs SNMP-ALG
1    | 52489    | 6650    | 3154 | 203    | 0.39 | 3.05 | 6.44
2    | 52489    | 4107    | 696  | 19     | 0.04 | 0.46 | 2.73
3    | 49676    | 3881    | 2583 | 98     | 0.20 | 2.53 | 3.79
4    | 49676    | 5252    | 932  | 45     | 0.09 | 0.86 | 4.83
5    | 51305    | 8606    | 720  | 44     | 0.09 | 0.51 | 6.11
6    | 51305    | 7478    | 746  | 51     | 0.10 | 0.68 | 6.84
7    | 49781    | 3779    | 4260 | 139    | 0.28 | 3.68 | 3.26
8    | 57756    | 3728    | 1080 | 25     | 0.04 | 0.67 | 2.31
9    | 63383    | 6869    | 2325 | 98     | 0.15 | 1.43 | 4.22
10   | 54761    | 4478    | 2989 | 117    | 0.21 | 2.61 | 3.91
Min  | 49676.00 | 3728.00 | 696  | 19.00  | 0.04 | 0.46 | 2.31
Max  | 63383.00 | 8606.00 | 4260 | 203.00 | 0.39 | 3.68 | 6.84
Avr  | 53262.10 | 5482.80 | 2478 | 83.90  | 0.16 | 1.65 | 4.44
Table 5.4 Result of the SNMP-ALG protocol packet intrusion detection (10 files)
The intrusion detection level in the ICMP protocol against the initiated intrusion is given in Eqn (5.3):

ICMP_IDS = (Id_ICMP / Initiated Intrusion) × 100    (5.3)

where Id_ICMP is the number of identified intruders in the ICMP protocol. The result summary of the ICMP protocol packet intrusion detection is presented in Table 5.5, given below.
S.No | Number of packets | Initiated intrusion | ICMP packets | ICMP identified | % of ICMP identified vs total | % of ICMP identified vs initiated | % of identified ICMP vs ICMP
1    | 52489    | 6650    | 6061    | 529    | 1.01 | 7.95 | 8.73
2    | 52489    | 4107    | 4033    | 113    | 0.22 | 2.75 | 2.80
3    | 49676    | 3881    | 4799    | 168    | 0.34 | 4.33 | 3.50
4    | 49676    | 5252    | 3189    | 128    | 0.26 | 2.44 | 4.01
5    | 51305    | 8606    | 2643    | 168    | 0.33 | 1.95 | 6.36
6    | 51305    | 7478    | 4732    | 317    | 0.62 | 4.24 | 6.70
7    | 49781    | 3779    | 4822    | 241    | 0.48 | 6.38 | 5.00
8    | 57756    | 3728    | 6964    | 184    | 0.32 | 4.94 | 2.64
9    | 63383    | 6869    | 7759    | 302    | 0.48 | 4.40 | 3.89
10   | 54761    | 4478    | 6645    | 184    | 0.34 | 4.11 | 2.77
Min  | 49676.00 | 3728.00 | 2643.00 | 113.00 | 0.22 | 1.95 | 2.64
Max  | 63383.00 | 8606.00 | 7759.00 | 529.00 | 1.01 | 7.95 | 8.73
Avr  | 53262.10 | 5482.80 | 5164.70 | 233.40 | 0.44 | 4.35 | 4.64
Table 5.5 Result of the ICMP protocol packet intrusion detection (10 files)
As per the analysis of Tables 5.3, 5.4 and 5.5, the ARP protocol identified and detected 2.25% of the total packets; this corresponds to 21.65% of the total initiated intrusions and 4.57% of the ARP protocol packets. The SNMP-ALG protocol identified and detected fewer packets of the total fetched from the network compared to the ARP packets: the detected intrusion is 0.16% of the total number of packets, corresponding to 1.65% of the initiated intrusions and 4.44% of the total SNMP-ALG protocol packets. Similarly, for ICMP the detected intrusion is 0.44% of the total number of packets, 4.35% of the total initiated intrusions, and 4.64% of the captured ICMP protocol packets. As per the results, the ARP protocol performed the best.
The result comparison of IDS (10 files) in ARP, SNMP-ALG and ICMP
protocols with initiated intrusion is depicted graphically in Figure 5.3.
Figure 5.3 Comparison of IDS in ARP,SNMP-ALG and ICMP (10 files)
with initiated intrusion
The same process was executed for 211 files. The result summary of the identified percentage of intrusion detection in each protocol against the total number of packets is presented in Table 5.6, given below.
     | Total number of observed packets | Initiated intrusion | % of ARP identified vs total | % of SNMP-ALG identified vs total | % of ICMP identified vs total
Min  | 49272    | 3114     | 0.49 | 0.03 | 0.04
Max  | 84957    | 13102    | 7.11 | 0.95 | 1.63
Avr  | 66471.11 | 7051.93  | 2.05 | 0.28 | 0.55
Table 5.6 Percentage of detected intrusions in the total number of observed packets (211 files)
In the implementation of the 211 sets of observed packets, the ARP, SNMP-ALG and ICMP protocols identified the intrusions. The ARP protocol identified 2.05%, the SNMP-ALG protocol 0.28% and the ICMP protocol 0.55% of the total number of packets. Of the total 211 files of observed protocols, the ARP, SNMP-ALG and ICMP protocols were adopted for the intrusion detection. The detected intrusion data set is listed below in Figure 5.4.
Figure 5.4 Detected Intrusion list
The result comparison of IDS in ARP, SNMP-ALG and ICMP protocols is
depicted graphically in Figure 5.5.
Figure 5.5 Comparison of IDS in ARP,SNMP-ALG and ICMP( 10 and 211 files)
As per the observations, the ARP protocol plays a significant role in the identification of intrusions among the total observed packets in real-time data transfer across the hybrid academic network. The detailed protocol result analysis is given in the Appendix. The identified percentage of detected intrusions in each protocol over the 211 files is presented in Table 5.7, given below.
     | Total observed packets | Initiated intrusion | % of ARP identified vs initiated | % of identified ARP vs total ARP | % of SNMP-ALG identified vs initiated | % of identified SNMP-ALG vs SNMP-ALG | % of ICMP identified vs initiated | % of identified ICMP vs ICMP total
Min  | 49272    | 3114    | 6.97  | 1.91  | 0.28 | 1.86  | 0.51  | 1.87
Max  | 84957    | 13102   | 43.43 | 10.58 | 7.16 | 10.44 | 15.31 | 11.95
Avr  | 66471.11 | 7051.93 | 19.23 | 5.09  | 2.68 | 5.30  | 5.26  | 5.50
Table 5.7 Percentage of intrusions detected with individual protocols from the initiated intrusions
Comparing the identified intrusions in each protocol, the ARP identified up to 43.43%, the ICMP 15.31% and the SNMP-ALG 7.16% of the initiated intrusions. According to the captured protocols, the determined and blocked packets and their related resources are classified. In this observation only these three protocols are considered; they largely cover the basic network subnet architecture of the OSI model. The searching time is minimized using the modified J48 decision tree algorithm when compared to the searching time of the genetic algorithm.
The comparison of the searching process time using the Genetic algorithm and
the modified J48 algorithm is shown below in Table 5.8.
Nodes | Number of packets | Initiated intrusion | Search time using GA (sec) | Search time using modified J48 (sec)
10    | 53262 | 5483 | 363 | 350
100   | 66923 | 7620 | 403 | 383
500   | 69273 | 7615 | 404 | 401
600   | 68105 | 7267 | 420 | 374
800   | 66471 | 7052 | 412 | 363
Table 5.8 Comparison of the search times using the genetic algorithm and the modified J48 algorithm
The searching time comparison of different network capacities (10 nodes, 100
nodes, 500 nodes, 600 nodes, 800 nodes) using the GA and the modified J48
algorithm is depicted graphically in Figure 5.6.
Figure 5.6 Comparison of searching time using Genetic algorithm
and modified J48 algorithm
As the number of files in the nodes increases, the search time for the IDS also increases, and the time complexity becomes unbalanced. However, with the modified J48 algorithm the time is comparatively lower than with the GA method, because the device ranges are determined based on the value index. If packets are identified as intrusions, those devices are listed as suspected devices of the network; therefore, the average time remains low even while evaluating a larger number of files across a high number of nodes in the network.
5.6 Summary
The implementation of the modified J48 decision tree algorithm and the evaluation of the search process were discussed in this chapter. The search time comparison across different network capacities (10, 100, 500, 600 and 800 nodes) using the GA and the modified J48 algorithm was discussed. The implementation of the process and its impact, as observed across the different protocols, were also discussed.