Computational Intelligence, Volume 0, Number 0, 2014

CREATING DECISION TREES FROM RULES USING RBDT-1

AMANY ABDELHALIM, ISSA TRAORE, AND YOUSSEF NAKKABI

Department of Electrical and Computer Engineering, University of Victoria, Victoria, British Columbia, Canada

Most of the methods that generate decision trees for a specific problem use the examples of data instances in the decision tree–generation process. This article proposes a method called RBDT-1 (rule-based decision tree) for learning a decision tree from a set of decision rules that cover the data instances rather than from the data instances themselves. The goal is to create on demand a short and accurate decision tree from a stable or dynamically changing set of rules. The rules could be generated by an expert, induced from examples of decision instances by an inductive rule learning program such as an AQ-type rule induction program, or extracted from a tree generated by another method, such as ID3 or C4.5. In terms of tree complexity (number of nodes and leaves in the decision tree), RBDT-1 compares favorably with AQDT-1 and AQDT-2, which are methods that create decision trees from rules. RBDT-1 also compares favorably with ID3 and is as effective as C4.5; both are well-known methods that generate decision trees from data examples. Experiments show that the classification accuracies of the decision trees produced by all methods under comparison are indistinguishable.

Received 31 July 2009; Revised 5 August 2012; Accepted 29 August 2013

Key words: attribute selection criteria, decision rules, data-based decision tree, rule-based decision tree, tree complexity.

1. INTRODUCTION

The decision tree is one of the most popular classification algorithms used in data mining and machine learning for creating knowledge structures that guide the decision-making process (Michalski and Imam 1994, 1997; Akiba et al. 1998; Wang et al. 2000; Szydło et al. 2005; Chen and Hung 2009).

The most common methods for creating decision trees are those that create decision trees from a set of examples (data records). We refer to these methods as data-based decision tree methods. The attribute selection criteria (e.g., entropy reduction and the GINI index of diversity) are the essential characteristics in all those methods (Imam 1996). These criteria are used to choose the best attributes to be assigned to the nodes of the decision tree (Mingers 1989; Cestnik and Karalie 1991; Chen et al. 2009; Hu et al. 2009; Li et al. 2009; Liu et al. 2010).

On the other hand, to our knowledge, there are only a few published approaches that create decision trees from rules, which we refer to as rule-based decision tree methods.

There is a major difference between building a decision tree from examples and building it from rules. When building a decision tree from rules, the method assigns attributes to the nodes using criteria based on the properties of the attributes in the decision rules, rather than statistics regarding their coverage of the data examples (Michalski and Imam 1994).

A decision tree can be an effective tool for guiding a decision process as long as no changes occur in the data set used to create the decision tree. Thus, for the data-based decision tree methods, once there is a significant change in the data, restructuring the decision tree becomes a desirable task. However, it is difficult to manipulate or restructure decision trees. This is because a decision tree is a procedural knowledge representation, which imposes an evaluation order on the attributes. In contrast, rule-based decision tree methods handle manipulations in the data through the rules induced from the data rather than through the decision tree itself. A declarative representation, such as a set of decision rules, is easier to update incrementally as new data are encountered than a procedural one. This ease is due to the absence of constraints on the order of evaluating the rules (Imam and Michalski 1993a, 1993b).

Address correspondence to Amany Abdelhalim, Department of Electrical and Computer Engineering, University of Victoria, PO Box 3055 STN CSC, Victoria, BC, V8W 3P6, Canada; e-mail: [email protected]

© 2014 Wiley Periodicals, Inc.

On the other hand, to be able to make a decision for some situation, we need to decide the order in which tests should be evaluated in the rules. In that case, a decision tree will be created from the rules. Thus, the methods that create decision trees from rules combine the best of both worlds. On the one hand, they easily allow changes to the data (when needed) by modifying the rules rather than the decision tree itself. On the other hand, they take advantage of the structure of the decision tree to organize the rules in the concise and efficient way required to make the best decision. Thus, knowledge can be stored in a declarative rule form and then be transformed (on the fly) into a decision tree only when needed for a decision-making situation (Imam and Michalski 1993a, 1993b).

Generating a decision tree from decision rules can potentially be performed faster than generating it from training examples because the number of decision rules per decision class is usually much smaller than the number of training examples per class. Thus, this process could be performed on demand without any noticeable delay (Quinlan 1979; Witten and MacDonald 1988). Although rule-based decision tree methods create decision trees from rules, they could also be used to create decision trees from examples by considering each example as a rule. Data-based decision tree methods create decision trees from data only. Thus, for generating a decision tree for problems where no data are available and only rules are provided by an expert, rule-based decision tree methods are the only applicable solution.

There is a wide range of applications that are rule dependent and could benefit from rule-based decision tree methods. These include machine language translation systems, which generally require rule-based methods to parse the source text and create an intermediate, symbolic representation from which the text in the target language is generated. Some computer-aided legal reasoning systems are also rule dependent and could benefit from transforming the rules into a concise, meaningful decision tree once a judgment needs to be made. Other rule-based applications that could benefit from rule-based decision tree methods are air traffic control systems, intrusion detection systems, firewall systems, syntax analysis applications, and clinical decision support systems.

This article presents a new rule-based decision tree method called RBDT-1. To derive the tree, the RBDT-1 method uses in sequence three different criteria to determine the fittest attribute for each node of the tree, referred to as attribute effectiveness (AE), attribute autonomy (AA), and minimum value distribution (MVD). In the article, the RBDT-1 method is compared with the AQDT-1 and AQDT-2 methods, which are rule-based decision tree methods, along with ID3 and C4.5, which are two of the most well-known data-based decision tree methods. The attribute selection criteria of the RBDT-1 method give equal or better results than the other methods' criteria in terms of tree complexity, as we show empirically using several publicly available data sets. Part of the work presented in this article has been previously published in the study of Abdelhalim and Traore (2009).

The rest of the article is structured as follows. Section 2 summarizes the related work. Section 3 describes the rule notation used in this work and the RBDT-1 method. Section 4 provides an illustration of the method using a small data set, presents a comparative study, and outlines the application of RBDT-1 to an identity fraud detector. Section 5 discusses the complexity of the proposed algorithm. Section 6 presents the results of an experiment in which, based on several public data sets, the proposed method is compared with the AQDT-1 and AQDT-2 methods, which create decision trees from declarative rules, and with the ID3 and C4.5 algorithms, which create decision trees from data examples. Finally, in Section 7, we make some concluding remarks and outline our future work.

2. RELATED WORK

Few published works describe creating decision structures from declarative rules. The AQDT-1 method introduced by Imam and Michalski (1993a, 1993b) is the first approach to create a decision tree from decision rules. It uses four attribute selection criteria, namely cost, disjointness, dominance, and extent, which are applied in that order. In the default setting, the cost equals 1 for all the attributes. Thus, the disjointness criterion is treated as the first criterion of the AQDT-1 and AQDT-2 methods in the decision tree–building experiments throughout this article.

Michalski and Imam (1994) introduced the AQDT-2 method, which is a variant of AQDT-1. AQDT-2 uses five attribute selection criteria, namely cost, disjointness, information importance, value distribution, and dominance, which are applied in that order. In both the AQDT-1 and AQDT-2 methods, the first criterion for choosing the fittest attribute is the attribute disjointness. However, as we will show in section 4, although the disjointness is an important criterion for choosing the best attribute, using it as the first criterion is not the optimal solution for creating the shortest tree.

The calculation of the information importance in the AQDT-2 method depends on the training examples, which contradicts the method's fundamental idea of being a rule-based decision tree method rather than a data-based decision tree method. AQDT-2 requires both the examples and the rules to calculate the information importance at certain nodes where the first criterion, disjointness, is not enough for choosing the fittest attribute. Because AQDT-2 depends on the examples as well as the rules, the running time of the algorithm increases remarkably on large data sets, especially those with a large number of attributes.

In contrast, the RBDT-1 method, proposed in this work, depends only on the rules induced from the examples and does not require the presence of the examples themselves.

Akiba et al. (1998) proposed a rule-based decision tree method for learning a single decision tree that approximates the classification decision of a majority voting classifier. In the proposed method, if-then rules are extracted from each classifier, which is a decision tree generated using the data-based decision tree method C4.5. The extracted rules are used to learn a single decision tree. Because the final learning result is represented as a single decision tree, intelligibility, classification speed, and storage consumption are improved. The proposed method depends both on the real examples used to create the classifiers (decision trees) and on a set of training examples created using the rules extracted from the classifiers. The procedure followed in selecting the best attribute at each node of the tree is based on the C4.5 method as well. The size of a decision tree learned by the proposed method using the rules extracted from multiple classifiers built by C4.5 is about 1.2 to 4.2 times the size of a decision tree learned by C4.5 from the data. As will be shown later in the experiments, when using the rules extracted from a C4.5 decision tree, the RBDT-1 method generates a tree that is the same size as the decision tree learned by C4.5 from data, or even smaller in some cases, with equal accuracy.

Chen and Hung (2009) proposed a method called associative classification tree (ACT) for building a decision tree from association rules rather than from the data. Associative classification usually has high accuracy, but the rules are not well organized; they are difficult to understand and are full of conflicts. On the other hand, the decision tree representation is a comprehensive way to organize knowledge. However, because of the greedy search strategy of decision trees, their classification accuracy is usually not as high as that of other approaches. Therefore, the goal of the authors was to use the decision tree representation to summarize associative classification rule sets and generate a decision tree classification model that has better accuracy than a decision tree model.

Chen and Hung proposed two splitting algorithms for choosing attributes in the ACT method. The first algorithm is based on the confidence gain criterion, and the second algorithm is based on the entropy gain criterion. In both splitting algorithms, the attribute selection process at each node relies on both the rules and the data itself. Unlike RBDT-1, ACT is not capable of building a decision tree from the rules in the absence of data or from data (considering them as rules) in the absence of rules. The experimental evaluation of the ACT method was based on four data sets from the UCI Machine Learning Database Repository. It was claimed that ACT achieved slightly better classification accuracy than C4.5.

Szydło et al. (2005) discuss how inductive databases can benefit from rule-based decision tree methods. Inductive databases are databases that, in addition to data, also contain intentionally defined generalizations about the data. The authors present a framework that integrates the AQDT-2 method as a module of the inductive database system VINLEN, which aims at integrating conventional databases with a range of inductive inference capabilities.

Hu (2010) uses rule-based decision trees in the assessment of airport terminal buildings of different places. Hu analyzes the assessment factors for terminal buildings in the form of rules, and based on these rules, the assessment model of the terminal buildings is derived as a decision tree model. It is not clear which rule-based decision tree method was used; however, a reference was made to the ACT rule-based decision tree method proposed by Chen and Hung (2009).

3. THE RBDT-1 METHOD

In this section, we describe our general notation and present the RBDT-1 method in detail.

3.1. Notation

The input to RBDT-1 consists of rules, which can be provided by an expert or generated algorithmically.

Let A1, ..., An denote the attributes characterizing the data under consideration, and let D1, ..., Dn denote the domains of these attributes, respectively. Let C1, ..., Cm represent the decision classes associated with the data set.

The RBDT-1 method does not have any particular restriction on the type of rule set that can be used as input for decision tree generation. However, to perform a fair comparison with the AQDT methods, we used in our experiments rule sets that follow the disjoint covers mode, in which there is no rule shared by two classes. This mode usually produces a complex set of rules in terms of the number of rules and the number of conditions. We use these kinds of rules to allow a fair comparison of our proposed method with the AQDT methods in terms of their capability of transforming complex rules into small decision trees while preserving their accuracy. In this case, the decision classes induce a partition over the complete set of rules.

Let R denote the complete set of rules and Ri denote the set of rules associated with decision class Ci. Hence, we have i ≠ j ⇒ Ri ∩ Rj = ∅, where 1 ≤ i, j ≤ m, and R = ∪_{1≤i≤m} Ri.


3.2. Preparing the Rules

The decision rules must be prepared in the proper format used by the RBDT-1 method. Each rule is submitted to RBDT-1 in the form of an (n + 1)-tuple, where the last value gives the decision class and the remaining n values are either specific attribute values or "*", which represents any possible value from the corresponding attribute's domain.

For example, suppose that we have three attributes A1, A2, and A3, each with V1, V2, and V3 as possible values.

Let us assume that the following rules correspond to class C1:

R11: C1 ← A1 = V1 & A2 = V2

R12: C1 ← A1 = V3

The previous rules will be represented as follows:

R11: (V1, V2, *, C1)

R12: (V3, *, *, C1)
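To make this format concrete, the following minimal Python sketch (our own naming, not part of the published RBDT-1 implementation) encodes the two prepared rules above as plain tuples and shows how a rule is matched against an instance:

# Hypothetical encoding of the two prepared rules above; '*' matches any value.
ATTRIBUTES = ["A1", "A2", "A3"]

# (value of A1, value of A2, value of A3, decision class)
R11 = ("V1", "V2", "*", "C1")
R12 = ("V3", "*", "*", "C1")

def rule_covers(rule, example):
    """Return True when every non-'*' value in the rule matches the example."""
    return all(rv in ("*", ev) for rv, ev in zip(rule[:-1], example))

# The instance (V3, V1, V2) satisfies R12, so it would be assigned class C1.
print(rule_covers(R12, ("V3", "V1", "V2")))  # True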

3.3. Attribute Selection Criteria

The RBDT-1 method applies three criteria to the attributes to select the fittest attribute to assign to a node of the decision tree. These criteria are AE, AA, and MVD.

3.3.1. Attribute Effectiveness (AE). The AE is the first criterion to be examined for the attributes. It prefers an attribute that has the most influence in determining the decision classes. In other words, it prefers the attribute that has the fewest "*" values across the class decisions in the rules, as this indicates its high relevance for discriminating among the rule sets of the given decision classes. On the other hand, an attribute that is omitted from all the rules (i.e., has a "*" value) for a certain class decision does not contribute to producing that decision. Thus, it is considered less important than the other attributes that are mentioned in the rules producing a decision of that class. Choosing attributes based on this criterion maximizes the chances of reaching leaf nodes faster, which in turn minimizes the branching process and leads to a smaller tree.

Using the notation provided earlier, let Vij denote the set of values for attribute Aj involved in the rules in Ri, the set of rules associated with decision class Ci, 1 ≤ i ≤ m. Let * denote the generalized value; we calculate Cij(*) as follows:

Cij(*) = 1 if * ∈ Vij, and 0 otherwise.

Given an attribute Aj, where 1 ≤ j ≤ n, the corresponding AE is given by

AE(Aj) = (m − Σ_{i=1}^{m} Cij(*)) / m

where m is the total number of classes characterizing the data set. The attribute with the highest AE is selected as the fittest attribute. If more than one attribute achieves the highest AE score, we use the next criterion in our method, the AA, to determine the best attribute among them. We illustrate how we calculate the AE values for some attributes through the following example.


Consider that we have the following rules and we want to choose the attribute with the maximum AE:

C1 ← A1 = V1 & A2 = "*"
C2 ← A1 = V2 & A2 = V1
C3 ← A1 = "*" & A2 = V2
C3 ← A1 = "*" & A2 = V1
C3 ← A1 = V1 & A2 = "*"

Based on the calculations in Table 1, the attribute with the highest AE is attribute A1.
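The AE computation is easy to express in code. The following Python sketch (our own function names, assuming the (n + 1)-tuple rule format of Section 3.2, not the authors' implementation) reproduces the values of Table 1:

from collections import defaultdict

def attribute_effectiveness(rules, n_attributes):
    """AE(Aj) = (m - sum_i Cij(*)) / m, for rules given as (n+1)-tuples."""
    classes = sorted({r[-1] for r in rules})
    m = len(classes)
    values = defaultdict(set)          # values[(class, j)] = Vij
    for r in rules:
        for j in range(n_attributes):
            values[(r[-1], j)].add(r[j])
    ae = []
    for j in range(n_attributes):
        c_star = sum(1 for c in classes if "*" in values[(c, j)])   # sum_i Cij(*)
        ae.append((m - c_star) / m)
    return ae

# The five rules of the example above, as (A1, A2, class) tuples.
rules = [
    ("V1", "*", "C1"),
    ("V2", "V1", "C2"),
    ("*", "V2", "C3"),
    ("*", "V1", "C3"),
    ("V1", "*", "C3"),
]
print(attribute_effectiveness(rules, 2))  # [0.666..., 0.333...], i.e., 0.67 and 0.33 as in Table 1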

3.3.2. Attribute Autonomy (AA). The AA is the second criterion to be examined for the attributes in the RBDT-1 method. This criterion is examined when the highest AE score is obtained by more than one attribute. It prefers the attribute that will decrease the number of subsequent nodes required further down the branch before reaching a leaf node. Thus, it selects the attribute that is least dependent on the other attributes in deciding on the decision classes. We calculate the AA for each attribute, and the one with the highest score is selected as the fittest attribute.

For the sake of simplicity, let us assume that the set of attributes that achieved the highest AE score is A1, ..., As, 2 ≤ s ≤ n. Let vj1, ..., vjpj denote the set of possible values for attribute Aj, including "*", and let rji denote the rule subset consisting of the rules in which Aj appears with the value vji, where 1 ≤ j ≤ s and 1 ≤ i ≤ |Dj|, with |Dj| the maximum number of values for Aj. Note that rji also includes the rules that have "*" values for Aj.

The AA criterion is computed in terms of the attribute disjointness score (ADS), which was introduced by Michalski and Imam (1994). For each rule subset rji, let MaxADSji denote the maximum ADS value, and let ADS_listji denote a list that contains the ADS for each attribute Al, where 1 ≤ l ≤ s, l ≠ j.

According to Michalski and Imam (1994), given an attribute Aj and two decision classes Ci and Ck (where 1 ≤ i, k ≤ m and 1 ≤ j ≤ s), the degree of disjointness between the rule set for Ci and the rule set for Ck with respect to attribute Aj is defined as follows:

ADS(Aj, Ci, Ck) =
  0 if Vij ⊇ Vkj
  1 if Vij ⊂ Vkj
  2 if Vij ∩ Vkj ≠ ∅, Vij, or Vkj
  3 if Vij ∩ Vkj = ∅

TABLE 1. AE Calculations for Attributes A1 and A2.

Attribute   C1j(*)   C2j(*)   C3j(*)   ΣCij(*)   AE(Aj)
A1          0        0        1        1         0.67
A2          1        0        1        2         0.33


The attribute disjointness score of attribute Aj, ADS(Aj), is the summation of the degrees of class disjointness ADS(Aj, Ci, Ck):

ADS(Aj) = Σ_{i=1}^{m} Σ_{1≤k≤m, k≠i} ADS(Aj, Ci, Ck)

Thus, the number of ADS_lists created for each attribute Aj, as well as the number of MaxADS values calculated, is equal to pj. The MaxADSji value, as defined by Michalski and Imam (1994), is 3·m·(m − 1), where m is the total number of classes in rji.

We introduce the AA as a new criterion for attribute Aj as follows:

AA(Aj) = 1 / Σ_{i=1}^{pj} AA(Aj, i)

where AA(Aj, i) is defined as

AA(Aj, i) =
  0 if MaxADSji = 0
  1 if (MaxADSji ≠ 0) and ((s = 2) or (∃l : MaxADSji = ADS_listji[l]))
  1 + [ (s − 1)·MaxADSji − Σ_{l=1, l≠j}^{s} ADS_listji[l] ] otherwise

The AA for each of the attributes is calculated using the previous formula, and the attribute with the highest AA score is selected as the fittest attribute. According to the previous formula, AA(Aj, i) equals zero when the class decisions for the examined rule subset correspond to one class. In that case, MaxADS = 0, which indicates that a leaf node is reached (the best case for a branch). AA(Aj, i) equals 1 when s equals 2 or when one of the attributes in the ADS_list has an ADS score equal to the MaxADS value (the second-best case). The second-best case indicates that only one extra node will be required to reach a leaf node. Otherwise, AA(Aj, i) equals 1 plus the shortfall of the ADS scores of the attributes in the ADS_list relative to the MaxADS value, which indicates that more than one node will be required before reaching a leaf node.

The AA score focuses primarily on minimizing the depth of the decision tree. Thus, it targets choosing an attribute that will lead to minimizing the number of attributes needed for each single path of the tree to reach a decision, which as a consequence will reduce the width of the tree. The AA score has been designed to overcome limitations of the ADS score. For instance, the ADS score could be the same for two attributes, which in some cases is an indication that both require assistance from other attributes in reaching the decision classes. While in these cases the ADS criterion is not capable of deciding which attribute will require less assistance, the AA criterion is capable of doing so, producing as a result a less complex tree.

As an illustration of how we calculate the AA, consider that we have the following nine rules and we want to choose the fittest attribute among A1, A2, and A3:

C2 ← A1 = V1 & A2 = V1 & A3 = V1    C1 ← A1 = V1 & A2 = V3 & A3 = V1    C1 ← A1 = V1 & A2 = V2 & A3 = V1
C1 ← A1 = V2 & A2 = V1 & A3 = V1    C1 ← A1 = V2 & A2 = V3 & A3 = V1    C2 ← A1 = V2 & A2 = V2 & A3 = V1
C1 ← A1 = V3 & A2 = V1 & A3 = V1    C2 ← A1 = V3 & A2 = V3 & A3 = V1    C1 ← A1 = V3 & A2 = V2 & A3 = V1


TABLE 2. Example of AA Calculations.

Attribute A1: A1 = V1: MaxADS 6, ADS_list [6, 0]; A1 = V2: MaxADS 6, ADS_list [6, 0]; A1 = V3: MaxADS 6, ADS_list [6, 0]; AA(A1) = 0.33
Attribute A2: A2 = V1: MaxADS 6, ADS_list [6, 0]; A2 = V2: MaxADS 6, ADS_list [6, 0]; A2 = V3: MaxADS 6, ADS_list [6, 0]; AA(A2) = 0.33
Attribute A3: A3 = V1: MaxADS 6, ADS_list [0, 0]; AA(A3) = 0.077

As we can see from the calculations in Table 2, the attributes with the highest AA are A1 and A2.
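The ADS and AA computations can be sketched in the same style. The Python code below follows the formulas above under our own naming; it treats "*" literally when collecting attribute values (none of the nine example rules contain "*"), so it is an illustration rather than the authors' implementation. Running it reproduces the AA scores of Table 2.

from itertools import permutations

def class_values(rules, j):
    """Vij: for each class, the set of values of attribute j appearing in its rules."""
    v = {}
    for r in rules:
        v.setdefault(r[-1], set()).add(r[j])
    return v

def ads(rules, j):
    """Attribute disjointness score of attribute j over a rule set (Michalski and Imam 1994)."""
    v = class_values(rules, j)
    score = 0
    for ci, ck in permutations(v, 2):            # ordered pairs of distinct classes
        vi, vk = v[ci], v[ck]
        if vi >= vk:
            score += 0                           # Vij contains Vkj
        elif vi < vk:
            score += 1                           # Vij is a proper subset of Vkj
        elif vi & vk:
            score += 2                           # partial overlap
        else:
            score += 3                           # disjoint
    return score

def attribute_autonomy(rules, j, candidates):
    """AA(Aj) for attribute index j, given the indices of the tied candidate attributes."""
    s = len(candidates)
    others = [l for l in candidates if l != j]
    total = 0
    for value in sorted({r[j] for r in rules}):               # each value vji of Aj
        subset = [r for r in rules if r[j] in (value, "*")]   # rule subset rji
        m = len({r[-1] for r in subset})
        max_ads = 3 * m * (m - 1)
        ads_list = [ads(subset, l) for l in others]
        if max_ads == 0:
            total += 0                                        # leaf reached
        elif s == 2 or max_ads in ads_list:
            total += 1                                        # one more node suffices
        else:
            total += 1 + ((s - 1) * max_ads - sum(ads_list))
    return 1.0 / total if total else float("inf")             # guard: all subsets are leaves

# The nine rules of the example, as (A1, A2, A3, class) tuples.
rules = [
    ("V1", "V1", "V1", "C2"), ("V1", "V3", "V1", "C1"), ("V1", "V2", "V1", "C1"),
    ("V2", "V1", "V1", "C1"), ("V2", "V3", "V1", "C1"), ("V2", "V2", "V1", "C2"),
    ("V3", "V1", "V1", "C1"), ("V3", "V3", "V1", "C2"), ("V3", "V2", "V1", "C1"),
]
for j in range(3):
    print(f"AA(A{j + 1}) =", round(attribute_autonomy(rules, j, [0, 1, 2]), 3))
# AA(A1) = 0.333, AA(A2) = 0.333, AA(A3) = 0.077, as in Table 2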

3.3.3. Minimum Value Distribution (MVD). The MVD criterion is concerned with the number of values that an attribute has in rji. When the highest AA score is obtained by more than one attribute, this criterion selects the attribute with the minimum number of values in rji. This criterion minimizes the size of the tree because fewer values mean that fewer branches will be involved and, consequently, a smaller tree (Michalski and Imam 1994). For the sake of simplicity, let us assume that the set of attributes that achieved the highest AA score is A1, ..., Aq, 2 ≤ q ≤ s. Given an attribute Aj (where 1 ≤ j ≤ q), we compute the corresponding MVD value as

MVD(Aj) = |∪_{1≤i≤m} Vij|   (where |X| denotes the cardinality of set X).

For example, given that AA(A1) = AA(A2):

The set of values for A1 in the rules = {V1, V2}.
The set of values for A2 in the rules = {V1, V2, V3}.

Hence, MVD(A1) < MVD(A2), and the fittest attribute is A1. When the lowest MVD score is obtained by more than one attribute, any of these attributes can be selected randomly as the fittest attribute. In our experiments, in cases where more than two attributes have the lowest MVD score, we take the first attribute.
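A sketch of the MVD tie-breaker in the same style follows (again our own naming; whether "*" counts as a value is our assumption, since a "*" condition does not add a branch). The three rules below are made up solely to reproduce the value sets quoted in the example above.

def minimum_value_distribution(rules, j):
    """MVD(Aj): number of distinct values of attribute j across the rule set ('*' excluded)."""
    return len({r[j] for r in rules} - {"*"})

# A1 takes the values {V1, V2} and A2 takes {V1, V2, V3}, so MVD(A1) = 2 < MVD(A2) = 3
# and A1 is selected, as in the example above.
rules = [("V1", "V1", "C1"), ("V2", "V2", "C1"), ("V1", "V3", "C2")]
print(minimum_value_distribution(rules, 0), minimum_value_distribution(rules, 1))  # 2 3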

3.4. Building the Decision Tree

We describe, in this section, the RBDT-1 approach for building a decision tree from a set of decision rules.

Figure 1 summarizes the main steps involved in this process. The algorithm receives, as input, the total set of rules, denoted TR in the pseudocode, and produces, as output, a decision tree. The algorithm uses three intermediary variables, RR, A, and B, which denote, respectively, the current set of rules, the current set of attributes, and the current set of branches under consideration at a particular step. The algorithm starts by initializing RR and A, respectively, to the whole set of rules and the whole set of attributes, while B is set to empty. The decision tree is built iteratively in such a way that at each phase the algorithm outputs either an attribute, which is represented as a nonleaf node in the tree, or a decision, which corresponds to a leaf node. Each iteration proceeds as follows. If all the rules in RR belong to the same decision class, a leaf node is created and assigned the value of that decision class. Otherwise, if the current set of attributes under consideration A is empty, a leaf node is created and assigned the value of the most frequent class found in the whole set of rules. Otherwise, the algorithm selects an attribute a from A based on the attribute selection criteria outlined in section 3 and assigns a to a new nonleaf node. The new node is expanded by generating from it a collection of branches, each assigned a separate value of the fittest attribute a. Set A is reduced by removing the fittest attribute, while set B is expanded by adding the newly generated branches. For each branch b in B, the previous process is repeated using a reduced set B from which b has been removed and an updated set RR consisting of the subset of rules that satisfy the attribute value assigned to b. The algorithm continues until each branch from the root node is terminated with a leaf node and no more branching is required.

FIGURE 1. Pseudocode for the decision tree–building process employed by RBDT-1.
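Because the pseudocode of Figure 1 is not transcribed here, the short Python sketch below mirrors the process just described; the function and variable names are ours, and select_fittest stands for a helper that applies the AE, AA, and MVD criteria of Section 3.3 in order.

from collections import Counter

def build_tree(rules, attributes, select_fittest):
    """Recursive sketch of the RBDT-1 tree-building process described above.

    rules          -- list of (v1, ..., vn, class) tuples, with '*' meaning any value
    attributes     -- list of attribute indices still available for splitting
    select_fittest -- callable applying the AE/AA/MVD criteria, returns an attribute index
    """
    classes = [r[-1] for r in rules]
    if len(set(classes)) == 1:                  # all rules agree: create a leaf
        return classes[0]
    if not attributes:                          # no attributes left: majority-class leaf
        return Counter(classes).most_common(1)[0][0]
    a = select_fittest(rules, attributes)       # fittest attribute for this node
    node = {"attribute": a, "branches": {}}
    remaining = [x for x in attributes if x != a]
    for value in sorted({r[a] for r in rules if r[a] != "*"}):
        subset = [r for r in rules if r[a] in (value, "*")]   # rules satisfying the branch
        node["branches"][value] = build_tree(subset, remaining, select_fittest)
    return node

Note that this sketch takes the majority class of the current rule subset when no attributes remain, whereas the description above assigns the most frequent class of the whole rule set, which a full implementation would track separately.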

4. ILLUSTRATION OF THE RBDT-1 METHOD AND COMPARATIVE STUDY

In this section, the RBDT-1 method is illustrated on a publicly available data set named the weekend problem and then compared with the AQDT-1, AQDT-2, ID3, and C4.5 methods in terms of the complexity and accuracy of the decision trees produced. Furthermore, the design of a fraud detector using RBDT-1 is outlined.


4.1. The Weekend Problem

In this subsection, the steps for generating a decision tree by the RBDT-1 method are explained in detail using a small data set called the weekend problem. The weekend problem is a data set that consists of ten data records, listed in Table 3. The data set describes different ways of spending the weekend. We obtained the data set from an online document used in a lecture on decision trees by Colton (2004).

The data set involves the following three attributes: parents-visiting, weather, and money. The parents-visiting attribute has two values (yes and no), the weather attribute has three values (sunny, windy, and rainy), and the money attribute has two values (rich and poor). The decision class is one of four values (cinema, tennis, shopping, or stay-in). We used the AQ19 rule induction program (Michalski and Kaufman 2001) to induce a rule set, which was then converted into the format suitable for RBDT-1, as shown in Table 4.

To choose the fittest attribute for each node of the tree, we apply the three criteria of the RBDT-1 method to the attributes, in the same order explained in section 3. The candidate attribute with the highest AE, as shown in Table 5, is the parents-visiting attribute.

TABLE 3. Weekend Problem Data Set.

Record #   Parents-visiting   Weather   Money   Decision
1          Yes                Sunny     Rich    Cinema
2          No                 Sunny     Rich    Tennis
3          Yes                Windy     Rich    Cinema
4          Yes                Rainy     Rich    Cinema
5          No                 Rainy     Poor    Stay-in
6          Yes                Rainy     Poor    Cinema
7          No                 Windy     Poor    Cinema
8          No                 Windy     Rich    Shopping
9          Yes                Windy     Rich    Cinema
10         No                 Sunny     Rich    Tennis

TABLE 4. The Weekend Rule Set after Preparation for RBDT-1.

Rule #   Description
1        Cinema ← Parents-visiting = "yes" & Weather = "*" & Money = "rich"
2        Tennis ← Parents-visiting = "no" & Weather = "sunny" & Money = "*"
3        Shopping ← Parents-visiting = "no" & Weather = "windy" & Money = "rich"
4        Cinema ← Parents-visiting = "no" & Weather = "windy" & Money = "poor"
5        Stay-in ← Parents-visiting = "no" & Weather = "rainy" & Money = "poor"

TABLE 5. AE Calculations for Parents-Visiting, Money, and Weather Attributes.

Attribute          Cinema(*)   Tennis(*)   Shopping(*)   Stay-in(*)   ΣCij(*)   AE(Aj)
Parents-visiting   0           0           0             0            0         1
Money              0           1           0             0            1         0.75
Weather            1           0           0             0            1         0.75
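As a quick cross-check, feeding the five prepared rules of Table 4 to the attribute_effectiveness sketch from Section 3.3.1 (attribute order: parents-visiting, weather, money) reproduces the AE scores of Table 5:

weekend_rules = [
    ("yes", "*", "rich", "cinema"),
    ("no", "sunny", "*", "tennis"),
    ("no", "windy", "rich", "shopping"),
    ("no", "windy", "poor", "cinema"),
    ("no", "rainy", "poor", "stay-in"),
]
print(attribute_effectiveness(weekend_rules, 3))  # [1.0, 0.75, 0.75]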


A subset of the weekend rule set is assigned to the branch where parents-visiting = "yes," as shown in Table 6. This subset of rules consists of all the rules with parents-visiting = "yes" or "*." The corresponding subset contains only one rule, with "cinema" as a decision. Thus, a leaf node "cinema" is created and assigned to this branch.

Similarly, another subset of the weekend rule set is assigned to the branch where parents-visiting = "no," as shown in Table 7. For this branch, the AE is calculated for both the money and weather attributes.

In Table 8, the candidate attribute with the highest AE is the weather attribute. Thus, it is selected as the fittest attribute for the branch parents-visiting = "no."

Three branches extend from the weather node, corresponding to its three values: sunny, windy, and rainy. For the branch where parents-visiting = "no" and weather = "sunny," there is only one rule that corresponds to that branch, with the class decision tennis, as shown in Table 9. Thus, a leaf node with tennis as a decision is created and assigned to that branch.

There are two rules corresponding to the branch where parents-visiting = "no" and weather = "windy."

TABLE 6. Subset of Rules for Branch Parents-Visiting = "Yes".

Cinema ← Parents-visiting = "yes" & Weather = "*" & Money = "rich"

TABLE 7. Subset of Rules for Branch Parents-Visiting = "No".

Tennis ← Parents-visiting = "no" & Weather = "sunny" & Money = "*"
Shopping ← Parents-visiting = "no" & Weather = "windy" & Money = "rich"
Cinema ← Parents-visiting = "no" & Weather = "windy" & Money = "poor"
Stay-in ← Parents-visiting = "no" & Weather = "rainy" & Money = "poor"

TABLE 8. AE Calculations for Weather and Money Attributes.

Attribute   Cinema(*)   Tennis(*)   Shopping(*)   Stay-in(*)   ΣCij(*)   AE(Aj)
Weather     0           0           0             0            0         1
Money       0           1           0             0            1         0.75

TABLE 9. Subset of Rules for Parents-Visiting = "No" and Weather = "Sunny".

Tennis ← Parents-visiting = "no" & Weather = "sunny" & Money = "*"

TABLE 10. Subset of Rules for Parents-Visiting = "No" and Weather = "Windy".

Shopping ← Parents-visiting = "no" & Weather = "windy" & Money = "rich"
Cinema ← Parents-visiting = "no" & Weather = "windy" & Money = "poor"


As shown in Table 10, one of the rules corresponds to the decision class shopping, and the other rule corresponds to the decision class cinema. Both rules depend on the value of the money attribute to produce the decision. The money attribute is chosen as the fittest attribute because it is the only candidate attribute left. Thus, a node is created and assigned the attribute money. Two branches extend from the money node. A leaf node is created and assigned the class decision cinema for the branch where money = "poor." Another leaf node is created and assigned the class decision shopping for the branch where money = "rich."

For the branch where parents-visiting = "no" and weather = "rainy," there is only one rule that corresponds to that branch, with the class decision stay-in, as shown in Table 11. Thus, a leaf node with the decision stay-in is created and assigned to that branch. Overall, the corresponding decision tree created by the proposed RBDT-1 method for the weekend problem is shown in Figure 2. It consists of three nodes and five leaves, with 100% classification accuracy for the data in Table 3.

4.2. Comparative Study

In this subsection, we apply the AQDT-1, AQDT-2, ID3, and C4.5 methods to the weekend problem and compare the outcomes with the results obtained by the RBDT-1 method outlined in the previous section. Figure 3 illustrates the decision tree created using the AQDT-1 and AQDT-2 methods for the weekend problem, using the same rule set of Table 4 that was used by RBDT-1. The attribute selection process for both the AQDT-1 and AQDT-2 methods identifies the fittest attribute according to the score of the attribute disjointness as the first criterion. The attribute disjointness for the weather, parents-visiting, and money attributes in the weekend rule set of Table 4 is [21, 3, 10], respectively. In that case, the weather attribute is selected as the fittest attribute because it has the highest attribute disjointness. Starting with the weather attribute as the root node results in a more complex (i.e., bigger) tree compared with the decision tree created using the RBDT-1 method.

TABLE 11. Subset of Rules for Parents-Visiting = "No" and Weather = "Rainy".

Stay-in ← Parents-visiting = "no" & Weather = "rainy" & Money = "poor"

FIGURE 2. The decision tree generated by RBDT-1 for the weekend problem.
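The tree of Figure 2 can equivalently be written out as a nested structure, shown here in the shape returned by the build_tree sketch of Section 3.4 (attribute names in place of indices); the three dictionaries correspond to the three nodes and the five strings to the five leaves reported above.

weekend_tree = {
    "attribute": "parents-visiting",
    "branches": {
        "yes": "cinema",
        "no": {
            "attribute": "weather",
            "branches": {
                "sunny": "tennis",
                "windy": {"attribute": "money",
                          "branches": {"poor": "cinema", "rich": "shopping"}},
                "rainy": "stay-in",
            },
        },
    },
}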


FIGURE 3. The decision tree generated by AQDT-1, AQDT-2, and ID3 for the weekend problem.

By looking at the decision tree in Figure 3, we can see that the test parents-visiting = "yes" results in the decision class cinema for all the branches (values) of the weather attribute. In that case, starting with the weather attribute results in redundant branches (a bigger tree). On the other hand, starting with the attribute parents-visiting as the root node, as performed by the RBDT-1 method, avoids the redundancy seen in Figure 3, which results in a smaller tree. As explained earlier, parents-visiting was chosen as the root node because it had the highest AE score. The AE criterion, which is the first criterion in the method, selects the most effective attribute in classifying the data. Because the attribute parents-visiting has no "*" values in any of the class rule subsets, it was considered the most important attribute among the others in contributing to the data classification process.

Using the attribute disjointness as the first criterion for choosing the fittest attribute is not the best choice. It performs well only in cases where, by coincidence, the attribute with the highest AE also has the highest ADS; examples of such situations are illustrated later, where the same tree is obtained by all three methods (RBDT-1, AQDT-1, and AQDT-2). Although the tree created by the AQDT-1 and AQDT-2 methods consists of five nodes and seven leaves, which is a bigger tree than the decision tree created by the RBDT-1 method, both have the same classification accuracy.

Although both the AQDT-1 and AQDT-2 methods produce the same tree, the AQDT-2 method, from our point of view, contradicts its own idea of being a rule-based decision tree method. This is because the second criterion employed by the method for choosing the fittest attribute is the information importance, and the training examples are needed to compute this criterion. Thus, the method is not independent of the examples as claimed; it appears that it depends on both the rules and the examples. Given a set of rules, the method would therefore not be able to build a decision tree if the examples used to induce the rules were unavailable.

We also compared the RBDT-1 decision tree with the decision tree produced by the ID3 method. The ID3 method creates decision trees from data examples by calculating the information gain for each attribute and selecting the attribute that has the highest information gain (Quinlan 1979). The ID3 method creates the same redundant tree created by AQDT-1 and AQDT-2, as shown in Figure 3. Thus, the tree created by ID3 for the weekend problem is also bigger than the RBDT-1 tree. This occurs because ID3 employs the information gain, which is biased toward choosing attributes with a large number of values. This bias is based on the assumption that such attributes divide the data examples into subsets that are more likely to be pure (correspond to only one decision class). A special case where the decision trees produced by ID3 and RBDT-1 are the same occurs by coincidence when the fittest attributes selected by the RBDT-1 criteria have the highest information gain as well. Applying C4.5, running with its default parameters, to the data set of the weekend problem resulted in a tree of the same size and accuracy as the pruned decision tree produced by RBDT-1.

4.3. Application for RBDT-1

We illustrate, in this section, the application of RBDT-1 to the design of an application fraud detector.

Application fraud occurs when an individual or an organization applies for an identity certificate (e.g., passport, credit card, etc.) using someone else's identification. A common challenge in developing fraud detectors is the lack of publicly available real data for model building and evaluation. The approach commonly adopted in the industry and in the literature to address this issue is to encode expert knowledge and past knowledge of fraudulent behavior into rule bases. However, because of the fast pace at which new fraud methods are created and used by fraudsters, the rule bases are subject to constant changes and usually tend to grow at an accelerating rate, quickly reaching unmanageable size. In this case, a decision tree represents an effective alternative to rule-based reasoning for a quicker decision. In particular, using a rule-based decision tree technique allows the creation on demand of a short and accurate decision tree from a stable or dynamically changing set of rules. Our application fraud detector extracts identity information related to the applicant from different identity information sources and crosses such information with the information contained in the application form, to detect and report possible inconsistencies or anomalies. We designed the fraud detector using our rule-based decision tree generation technique fed with a set of simple heuristics that define a general and common understanding of the notion of fraudulent and normal behaviors. The retrieval engine uses a name and a set of keywords to search the identity information sources and returns a collection of identity information related to the identity claim corresponding to the application being checked.

The returned identity information is organized into separate identity patterns based on the following five attributes: social security number, mother maiden name, date of birth, address, and telephone number. The identity information extracted from the (current) application form is also structured into a similar attribute vector, yielding what we refer to as the reference identity pattern. Fraud detection is performed by matching a returned identity pattern vector against the reference pattern vector. The output of the match is a single feature vector. The elements of the feature vector correspond to the outcome of the comparison between each attribute value in the returned identity pattern vector and the corresponding attribute value in the reference pattern vector. The outcome of each comparison is either "1," "0," or "?": "1" indicates that both attribute values are identical, "0" indicates that they are different, and "?" indicates that the value of this attribute is missing in either of the pattern vectors. The matching decision can be one of the following: fraud (F), normal (N), or suspicious (S). We consider two levels of suspicion, suspicious-low (S−) and suspicious-high (S+), which refer to unlikely and highly unlikely situations, respectively.
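A minimal Python sketch of this matching step is shown below; the attribute abbreviations are those of Table 12, while the function name and the example values are ours and purely illustrative.

ATTRS = ["SSN", "MMN", "DOB", "ADD", "TEL"]

def match_patterns(returned, reference):
    """Compare a returned identity pattern with the reference pattern and build the
    feature vector described above: '1' identical, '0' different, '?' value missing."""
    vector = []
    for a in ATTRS:
        r, ref = returned.get(a), reference.get(a)
        if r is None or ref is None:
            vector.append("?")       # attribute missing in either pattern
        elif r == ref:
            vector.append("1")
        else:
            vector.append("0")
    return vector

# Example: the SSNs match but the dates of birth differ, a straightforward fraud
# case such as rule #1 in Table 12.
returned  = {"SSN": "123", "MMN": "Smith", "DOB": "1980-01-01", "ADD": "1 Main St"}
reference = {"SSN": "123", "MMN": "Smith", "DOB": "1975-05-05", "ADD": "1 Main St", "TEL": "555-0100"}
print(match_patterns(returned, reference))  # ['1', '1', '0', '1', '?']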

We derived 82 rules for the initial implementation of our proof of concept. Table 12 illustrates a sample of the rules. In the rules, "yes" corresponds to "1" in the feature vector, "no" corresponds to "0," and "na" corresponds to "?".


TABLE 12. Sample of the Rules in the Rule Base for the Fraud Detector.

Rule #   Description
1        F ← SSN = "YES" & MMN = "*" & DOB = "NO" & ADD = "*" & TEL = "*"
2        F ← SSN = "YES" & MMN = "NO" & DOB = "YES" & ADD = "*" & TEL = "*"
3        F ← SSN = "YES" & MMN = "NO" & DOB = "NA" & ADD = "*" & TEL = "*"
4        F ← SSN = "*" & MMN = "*" & DOB = "*" & ADD = "NO" & TEL = "YES"
5        N ← SSN = "NO" & MMN = "NO" & DOB = "*" & ADD = "NO" & TEL = "NO"
6        N ← SSN = "NA" & MMN = "NO" & DOB = "*" & ADD = "NO" & TEL = "NO"
7        N ← SSN = "YES" & MMN = "YES" & DOB = "YES" & ADD = "YES" & TEL = "YES"

SSN, social security number; MMN, mother maiden name; DOB, date of birth; ADD, address; TEL, telephone number.

The symbol "*" is a placeholder for any of the possible values of the feature vector elements (i.e., 1, 0, "?"). Some rules, such as rules #1 to #3, are straightforward fraud cases because, while the social security numbers match, either the dates of birth or the mother maiden names do not match. Other straightforward cases of fraud occur when the home telephone numbers match while the addresses do not, such as rule #4. In both of the previous cases, one could infer that the same individual is impersonating two different individuals: in the first case, by changing either the mother maiden name or the date of birth, and in the second case, by operating from different locations. Two straightforward cases are when either none of the attributes match or all the attributes match, such as rules #5 and #7, respectively. We assume, in the case when none of the attributes matches, that the nonmatching patterns belong to different individuals and have not been used by the applicant in some past attempt to defraud the system. Obviously, the case when all the attributes match is normal. A variant of this case is when everything matches except the telephone numbers; this is considered normal because the same individual may simply own several telephone lines.

The decision tree obtained by applying the RBDT-1 method to the rule base is illustrated in Figure 4. The resulting tree corresponds to 57 rules.

For this particular problem, using AQDT-1 to generate a decision tree results in a decision tree that is exactly the same as the one in Figure 4. On the other hand, AQDT-2 could not be used to generate a decision tree for this problem because of the absence of data.

5. COMPLEXITY OF RBDT-1

In this section, we analyze the complexity of the RBDT-1 method and compare it with that of the AQDT methods with respect to tree construction.

For the calculation of the complexity of the decision tree construction for RBDT-1, consider the following:

Let r be the number of rules in our rule set R, let a be the number of attributes in R, and let k be the size of the largest attribute domain.

The decision tree T generated by RBDT-1 has the following characteristics:

T is a k-ary tree, and the maximum height of T is a.


FIGURE 4. The decision tree produced by RBDT-1 for the fraud detector.

Each path in T is unique, and an attribute Aj appears in a path H only once.

The decision tree has two types of nodes: attribute nodes and leaf nodes. Let c be the computational cost of constructing one attribute node in T, where c is at most r·a.

The total computational cost C for constructing T equals the total number of attribute nodes in T times the cost c of constructing one attribute node.

The number of attribute nodes AN = the total number of nodes in the tree N − the number of leaf nodes LN.

Thus, C = r·a·AN. Based on the previously mentioned characteristics of T, and given that, for a k-ary tree of maximum height a, the maximum number of nodes is N = (k^(a+1) − 1)/(k − 1), the maximum number of attribute nodes in T is AN = (k^a − 1)/(k − 1) (Preiss 1999). Assuming that the tree has the maximum number of attribute nodes, the worst-case cost of constructing T by RBDT-1 is O(r·a·(k^a − 1)/(k − 1)), which is the same as with the AQDT methods.
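As a small worked illustration (our own numbers, not from the article): for the weekend rule set of Section 4, r = 5, a = 3, and k = 3, so the maximum number of attribute nodes is AN = (3^3 − 1)/(3 − 1) = 13 and the worst-case construction cost is bounded by r·a·AN = 5·3·13 = 195 elementary node-construction steps.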

6. EXPERIMENTS

To evaluate the RBDT-1 method, we conducted experiments using 16 publicly available data sets summarized in Table 13, including the weekend data set used in section 4. Other than the weekend data set, all the data sets were obtained from the UCI Machine Learning Repository (Asuncion and Newman 2007). The evaluation consisted mainly of comparing the RBDT-1 method with the AQDT-1, AQDT-2, and ID3 methods in terms of the complexity and accuracy of the decision trees produced. Because the ID3 method does not handle data sets with missing values, we used only the 12 complete data sets from Table 13 in this experiment. In this section, we first describe our evaluation setup and data sets and then summarize and discuss the results obtained based on all 12 data sets. Because we were comparing our proposed method with the AQDT-1 and AQDT-2 methods, which are also rule-based decision tree methods, it was useful to compare their performance on rule sets produced by different methods. Thus, besides using ID3-based rules, we conducted an experiment comparing all three rule-based methods along with ID3 using AQ-based rules generated by AQ19.

We also present a follow-up experiment in which we compare RBDT-1 with AQDT-1, AQDT-2, and C4.5. The input rules to RBDT-1, AQDT-1, and AQDT-2 in this experiment are the two sets of C4.5-based rules. In this experiment, all 16 data sets summarized in Table 13 were used because C4.5 is capable of handling data sets with missing values.

Because RBDT-1 is also capable of creating a decision tree from data examples, we compared RBDT-1 with the ID3 method. The results show that RBDT-1 is less effective using data examples instead of rules; the reasons are explained in section 6.2.

Finally, we investigate the capability of our method in handling unseen examples by conducting tenfold cross-validation using several of the public data sets.

Tables 14 to 18 present the results of the comparisons by giving the name of the method that produced the least complex trees under the method column; the "=" symbol indicates that the same tree was obtained by all methods under comparison.

6.1. Settings

We used the PYTHON language to implement RBDT-1, AQDT-1, AQDT-2, and the rule extractor for the ID3-based rules and the C4.5-based rules. We also used the PYTHON programming language to implement a rule format converter for converting the rules produced by AQ19 into the rule format used by our method. An implementation of the ID3 method written in PYTHON was obtained from the study of Roach (2006). For C4.5, we used Orange (Demsar et al. 2004), which is a component-based data mining software suite that includes a module for C4.5. The experiments were conducted on a COMPAQ PC with an AMD Sempron 3000+ processor (2 GHz, 512 MB of RAM).

6.2. Using Complete Data Sets

In this subsection, we summarize and discuss the results obtained by applying all four methods to the 12 data sets.

Our evaluation consisted of comparing the decision trees produced by the RBDT-1, AQDT-1, AQDT-2, and ID3 methods for each data set in terms of tree complexity (number of nodes and leaves) and accuracy.


TABLE 13. A Summary of the Data Sets Used in Our Experiments.

Data set name | Brief description | # Records | # Attributes | # Attribute values | # Class decision values | Missing
Weekend | A data set describing different ways of spending the weekend | 10 | 3 | 2–3 | 4 | No
Lenses | A data set for fitting contact lenses | 24 | 4 | 2–3 | 3 | No
King Rook versus King Knight chess | A data set for the King Knight–King Rook chess end game | 2,021 | 16 | 2–3 | 2 | No
Car evaluation | A data set for car evaluation | 1,728 | 6 | 3–4 | 4 | No
Shuttle-landing-control | A data set of a space shuttle autolanding domain | 253 | 6 | 2–4 | 2 | No
Connect-4 | A data set of all legal eight-ply positions in the game of connect-4 in which neither player has won yet, and in which the next move is not forced | 67,557 | 42 | 3 | 3 | No
Nursery | A data set derived from a hierarchical decision model originally developed to rank applications for nursery schools | 12,960 | 8 | 3–5 | 5 | No
Balance scale | A data set based on the balance scale weight and distance database | 625 | 4 | 1–5 | 3 | No
MONK's | Contains three different data sets, MONK's 1, MONK's 2, and MONK's 3, which were the basis of the MONK's problem, the first international comparison of learning algorithms | 432 in each | 6 in each | 2–4 in each | 2 in each | No
Zoo | A zoo data set | 101 | 16 | 2–6 | 7 | No
Breast cancer | Breast cancer data set | 286 | 9 | 2–13 | 2 | Yes
Lung cancer | Lung cancer data set | 32 | 56 | 3 | 3 | Yes
Primary tumor | Primary tumor data set | 339 | 17 | 2–3 | 22 | Yes
Voting | 1984 U.S. congressional voting records | 435 | 16 | 2 | 2 | Yes


TABLE 14. Comparison of Tree Complexities of the RBDT-1, AQDT-1, AQDT-2, and ID3 Methods Using ID3-Based Rules.

Data set | Method | Data set | Method
Weekend | RBDT-1 | MONK's 1 | RBDT-1
Lenses | RBDT-1, ID3 | MONK's 2 | RBDT-1, AQDT-1 and 2
Chess | = | MONK's 3 | =
Car | RBDT-1, ID3 | Zoo | =
Shuttle-L-C | = | Nursery | RBDT-1
Connect-4 | RBDT-1 | Balance | RBDT-1, ID3

TABLE 15. Comparison of Tree Complexities of the RBDT-1, AQDT-1, AQDT-2, and ID3 Methods Using AQ-Based Rules.

Data set | Method | Data set | Method
Weekend | RBDT-1 | MONK's 2 | RBDT-1
Lenses | RBDT-1, ID3 | MONK's 3 | =
Zoo | RBDT-1 | Chess | =
Car | RBDT-1 | Balance | RBDT-1, ID3
MONK's 1 | RBDT-1 | Shuttle-L-C | RBDT-1, AQDT-1 and 2

TABLE 16. Comparison between ID3 and RBDT-1 Decision Trees Generated from the Examples.

Data set | Complexity | Accuracy
The weekend problem | = | =
The fitting contact lenses problem | ID3 | =
MONK's 1 problem | RBDT-1 | =
MONK's 2 problem | RBDT-1 | =
MONK's 3 problem | ID3 | =
The balance problem | ID3 | =
The car evaluation problem | ID3 | =
King Rook versus King Knight chess problem | ID3 | =
The nursery problem | ID3 | =
The zoo problem | RBDT-1 | =
The poker problem | ID3 | =

All four methods run under the assumption that they will produce a complete and consistent decision tree yielding 100% correct recognition on the training examples. In the first part of our experiment, the rules used as input to RBDT-1, AQDT-1, and AQDT-2 were extracted from the ID3 decision tree. All examples were used for building the decision tree by the ID3 method. Thus, the extracted ID3-based rules have 100% coverage, and the number of rules is equal to the number of leaves of the ID3 tree.


TABLE 17. Comparison of Tree Complexities of the RBDT-1, AQDT-1, AQDT-2, and C4.5 Using C4.5-Based Rules.

Data set | Method (Experiment 1) | Method (Experiment 2) | Data set | Method (Experiment 1) | Method (Experiment 2)
Weekend | RBDT-1, C4.5 | RBDT-1, C4.5 | MONK's 1 | RBDT-1 | =
Lenses | RBDT-1, C4.5 | RBDT-1, C4.5 | MONK's 2 | = | =
Chess | = | = | MONK's 3 | AQDT-1 and 2 | AQDT-1 and 2
Car | RBDT-1, C4.5 | RBDT-1, C4.5 | Zoo | RBDT-1, C4.5 | RBDT-1, C4.5
Shuttle-L-C | = | = | Breast-C | = | =
Connect-4 | RBDT-1 | RBDT-1, C4.5 | Lung-C | RBDT-1, C4.5 | RBDT-1, C4.5
Nursery | RBDT-1, AQDT-1 and 2 | = | Primary-T | RBDT-1, C4.5 | RBDT-1, C4.5
Balance | = | = | Voting | AQDT-1 and 2 | =

TABLE 18. Comparison of Tree Complexities and Accuracies of the RBDT-1, AQDT-1, AQDT-2, and ID3Using ID3-Based Rules.

Data set | Method (Tree complexity) | Method (Tree accuracy)
Weekend | RBDT-1 | =
Lenses | RBDT-1, ID3 | RBDT-1, AQDT-1 and 2
Zoo | = | RBDT-1, ID3
Car | RBDT-1, ID3 | ID3
MONK's 1 | RBDT-1 | =
Nursery | RBDT-1 | RBDT-1, AQDT-1 and 2
MONK's 2 | RBDT-1, AQDT-1 and 2 | =
MONK's 3 | = | =
Chess | = | =
Balance | RBDT-1 | RBDT-1, AQDT-1 and 2
Shuttle-L-C | = | =
Connect-4 | RBDT-1 | RBDT-1


In some cases, such as the MONK's 1, balance scale, car, and connect-4 problems, AQDT-1 and AQDT-2 led to a syntactically unstable decision tree. That is, the structure of the final decision tree varied considerably in size each time the process was repeated for the same original training examples. This syntactic instability is undesirable, for instance, if the methods are applied in knowledge acquisition. For example, for the MONK's 1 problem, 325 ID3-based rules were used as input to the RBDT-1, AQDT-1, and AQDT-2 methods. These rules were extracted from the ID3 decision tree, which consisted of 325 leaves and 172 nodes.

RBDT-1 generated a decision tree consisting of 28 leaves and 13 nodes. AQDT-1 and AQDT-2 produced a tree that varied syntactically in size from 109 leaves and 58 nodes to 325 leaves and 172 nodes. The size fluctuated because of a limitation of the attribute disjointness (ADS) criterion in choosing the fittest attribute when the choice is among more than two attributes. Specifically, in the MONK's 1 problem, there were several nodes during tree building where the choice was among more than two attributes, all of which achieved the same ADS score. In that case, all of them were considered fit from the criterion's perspective and were passed to the subsequent criteria. At certain nodes, the attribute that would have led to a smaller tree was eliminated, and one of the other attributes was selected randomly.

Thus, using the AQDT-1 and AQDT-2 methods to create a decision tree for the MONK's 1 data set could result in a different tree in terms of size each time (all of them larger than the RBDT-1 tree). In contrast, the RBDT-1 method, as described earlier, calculates the AA criterion to select the fittest attribute. In that case, although all attributes could achieve the same ADS score, only the attributes with the maximum AA are selected and passed to the final criterion. This is why the RBDT-1 decision tree was smaller than those produced by the other methods.
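The following is a minimal sketch of this difference in tie handling. The scoring functions ads_score, aa_score, and final_score are placeholders standing in for the published criteria (they are not the actual formulas from the paper); the point is only the cascade, in which attributes tied on one criterion are narrowed down by the next criterion instead of being chosen at random.

def select_attribute(candidates, rules, criteria):
    """Narrow down `candidates` with each scoring function in `criteria`.

    `criteria` is an ordered list of functions f(attribute, rules) -> number,
    where higher is better. Ties survive to the next criterion; if ties remain
    after the last criterion, the first survivor is returned deterministically.
    """
    survivors = list(candidates)
    for score in criteria:
        best = max(score(a, rules) for a in survivors)
        survivors = [a for a in survivors if score(a, rules) == best]
        if len(survivors) == 1:
            break
    return survivors[0]

# Usage with hypothetical scorers:
#   attribute = select_attribute(attrs, rules, [ads_score, aa_score, final_score])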

In our experiments, we also compared the decision tree complexity and accuracy of the RBDT-1, AQDT-1, AQDT-2, and ID3 methods using AQ-based rules generated by the AQ19 rule induction program. The rules generated are complete rules with 100% correct recognition on each data set. The results for some of the data sets are presented in Table 15. Based on these results, the RBDT-1 method performed better than the AQDT-1 and 2 and ID3 methods in most cases in terms of tree complexity, producing trees with an average of 33.1 fewer nodes than AQDT-1 and 2 and 88.5 fewer nodes than the ID3 method. The decision tree classification accuracies of all four methods were equal.

In summary, based on the comparison in Table 15, the RBDT-1 method performs better than the AQDT-1 and AQDT-2 methods in most cases in terms of tree complexity and achieves the same level of accuracy when using the rules produced by the AQ19 program.

RBDT-1 was also used to create decision trees directly from examples by considering each data instance as a rule (a brief sketch after the following list illustrates this view). The decision trees produced by RBDT-1 were compared with the trees produced by the ID3 method; the results are presented in Table 16. Although the RBDT-1 method produced the same or even less complex decision trees for some of the data sets, the ID3 method produced less complex decision trees in most cases. This is because the RBDT-1 criteria are designed with rules in mind rather than examples. The method's criteria are thus based on the characteristics that differentiate rules from examples, such as the following:

A rule mostly describes a group of examples; in contrast, an example is a single, separate instance.

The attribute distribution in the examples is constant; on the other hand, in a rule, one or more attributes could be absent.

The presence of an attribute in an example does not carry the same direct importance as its presence in a rule.

The number of rules is usually much smaller than the number of examples.
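As a minimal sketch of the example-as-rule view assumed when RBDT-1 is fed raw data instances: every attribute is present in the condition part, and the resulting rule covers exactly one instance. The field names below are illustrative only.

def example_to_rule(example, class_attribute="decision"):
    """Turn one data record (a dict) into a (conditions, decision) rule."""
    conditions = {a: v for a, v in example.items() if a != class_attribute}
    return conditions, example[class_attribute]

record = {"weather": "sunny", "money": "rich", "decision": "tennis"}
print(example_to_rule(record))
# ({'weather': 'sunny', 'money': 'rich'}, 'tennis')
# A learned rule, by contrast, may omit attributes and cover many records,
# e.g. ({'weather': 'rainy'}, 'stay in').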

6.3. Using Incomplete Data Sets

In this subsection, we assess the strength of our proposed method in handling data sets that have missing values.

In the previous experiments, we used two different rule sets: one extracted from the decision tree created using the ID3 method and another produced by an AQ-type rule induction program. Here, we present an experiment in which we used two different C4.5-based rule sets. These rule sets are extracted from two decision trees generated with the C4.5 method for each data set: one with the Orange pruning option turned on and the other without pruning. Unlike the ID3 method, which handles only complete data sets, C4.5 is capable of handling data sets with missing values. Thus, we could experiment with rules extracted from both complete and incomplete data sets. The comparison was based on 16 data sets; accordingly, 32 C4.5-based rule sets were extracted and used in this experiment. The rule sets were given as input to the three rule-based methods under comparison.
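The following is a minimal sketch of how the two C4.5-based rule sets per data set could be produced. Here build_c45_tree and tree_to_rules are hypothetical wrappers, not the actual Orange or C4.5 API; a function like extract_rules from the earlier sketch could serve as tree_to_rules.

def make_rule_sets(data_sets, build_c45_tree, tree_to_rules):
    """Return {name: {'unpruned': rules, 'pruned': rules}} for each data set."""
    rule_sets = {}
    for name, examples in data_sets.items():
        rule_sets[name] = {
            "unpruned": tree_to_rules(build_c45_tree(examples, prune=False)),
            "pruned": tree_to_rules(build_c45_tree(examples, prune=True)),
        }
    return rule_sets

# Each resulting rule set is then fed, unchanged, to RBDT-1, AQDT-1, and
# AQDT-2, giving the 32 rule sets used in experiments 1 and 2.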

In Table 17, we illustrate the results of the comparison between RBDT-1, AQDT-1, AQDT-2, and C4.5 in two experiments: experiment 1 and experiment 2. In each experiment, the C4.5 decision tree was generated from the whole set of examples of each data set, and the rules extracted from that tree served as input to the other three rule-based decision tree methods.

In experiment 1, the pruning option was turned off, while in experiment 2 it was turned on. Based on the results in Table 17, AQDT-1 and 2 produce a larger tree, by an average of 146.33 nodes, with the exception of three rule sets for which our proposed method's tree is larger by an average of three nodes. In addition, the results show that RBDT-1 is as effective as C4.5, except in experiment 1, where our method produced a slightly smaller tree for the connect-4, MONK's 1, and nursery rule sets. In terms of accuracy, the four methods perform equally, matching the classification accuracy of the C4.5-based rules.

6.4. Classifying Unseen Examples

To evaluate RBDT-1's classification capability on unseen examples, we conducted an experiment using tenfold cross-validation over the first 12 public data sets in Table 13. Each data set was divided into ten subsets, nine of which were used for training and one for testing in each validation round. At each fold, the training set was used to generate a set of ID3-based rules that was passed as input to the method under evaluation (e.g., RBDT-1) to generate a decision tree. The accuracy of the generated tree was then computed using the testing set, and the accuracy for the data set was obtained as the average of the accuracies of the decision trees over the ten folds.
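The following is a minimal sketch of this tenfold cross-validation loop. Here induce_rules, build_tree, and accuracy are hypothetical stand-ins for the rule learner (ID3-based rule extraction in our setup), the decision tree builder under test (e.g., RBDT-1), and the evaluation on held-out examples; the simple round-robin split below ignores stratification for brevity.

def ten_fold_accuracy(examples, induce_rules, build_tree, accuracy, k=10):
    """Average held-out accuracy over k folds of `examples` (a list)."""
    folds = [examples[i::k] for i in range(k)]        # round-robin split
    scores = []
    for i in range(k):
        test = folds[i]
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        tree = build_tree(induce_rules(train))        # rules -> decision tree
        scores.append(accuracy(tree, test))
    return sum(scores) / k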

In Table 18, we present the results of the comparison between RBDT-1, AQDT-1, AQDT-2, and ID3 in terms of tree complexities and accuracies. The experiment shows that, on average, RBDT-1's accuracy is better than that of the ID3-based rules by 7% for four of the data sets and the same for the rest, except for the car problem, where ID3 is 1% more accurate than RBDT-1.

The increase in accuracy achieved by RBDT-1 ranged from 1% (e.g., the connect-4 problem) to 24% (e.g., the balance problem). On average, the RBDT-1 method achieves the same accuracy level as the AQDT methods in most cases, except for one data set (the zoo data set), where RBDT-1 achieved 1% better accuracy. In terms of complexity, for six of the data sets RBDT-1 produced, on average, a decision tree with 50 fewer nodes than ID3 and 30 fewer nodes than the AQDT methods, and it achieved the same tree complexity for the rest.

7. CONCLUSIONS AND FUTURE WORK

The RBDT-1 method proposed in this work allows a decision tree to be generated for a specific decision-making situation from a set of rules covering the possible actions. Following this methodology, knowledge can be stored in declarative rule form and transformed into a decision structure when it is needed for decision making. Generating a decision tree from decision rules can potentially be performed much faster than generating it from training examples.

Modifications to the data are handled more easily in rule-based methods than in data-based methods, because the modifications are applied to the rules rather than to the decision tree itself. At the same time, rule-based methods can transform the rules into a decision tree once the order in which tests should be evaluated needs to be decided. Although rule-based decision tree methods are designed to create decision trees from rules, they can also generate decision trees from data examples. Moreover, they are the only option for generating a decision tree in applications where no data are available and only rules exist. The price of the RBDT-1 advantages is the need to generate rules before the tree can be built. However, efficient rule learning systems are available.


In our experiments, our proposed method was compared with two other rule-based decision tree methods, AQDT-1 and AQDT-2. RBDT-1 was also compared with the ID3 method, which is a data-based decision tree method. Experiments were performed using public data sets from the UCI repository, and in most cases the results for RBDT-1 were better than those of the other methods in terms of tree complexity. RBDT-1 also achieved, on average, better tree accuracies. In our experiments, we used rules extracted from ID3 and C4.5 decision trees, as well as rules generated by the AQ19 rule induction program, as input to the rule-based decision tree methods under comparison.

We also conducted experiments on data sets that have missing values using the RBDT-1 method. It was shown that RBDT-1 was as effective as C4.5 in terms of both accuracy and tree complexity.

In our future work, we intend to investigate how to integrate a cost factor into our method. The cost criterion will allow generating a decision structure that avoids evaluating attributes that are difficult or costly to measure. We believe that this criterion will enrich our method. We will also investigate the suitability of our proposed method for various applications involving problem-solving challenges.

ACKNOWLEDGMENTS

The authors would like to thank the UCI Machine Learning Repository for the data sets used in the presented experiments. We also thank Dr. Janusz Wojtusiak, the director of the Machine Learning and Inference Laboratory at George Mason University (http://www.mli.gmu.edu), for providing us with the AQ19 rule induction program.

REFERENCES

ABDELHALIM, A., and I. TRAORE. 2009. Converting declarative rules into decision trees. In The International Conference on Computer Science and Applications (WCECS2009), Vol. 2178 (1), San Francisco, pp. 206–212.

AKIBA, Y., S. KANEDA, and H. ALMUALLIM. 1998. Turning majority voting classifiers into a single decision tree. In Proceedings of the 10th IEEE International Conference on Tools with Artificial Intelligence, Taipei, Taiwan, pp. 224–230.

ASUNCION, A., and D. J. NEWMAN. 2007. UCI Machine Learning Repository (http://www.ics.uci.edu/mlearn/MLRepository.html). University of California, School of Information and Computer Science, Irvine, CA.

CESTNIK, B., and A. KARALIE. 1991. The estimation of probabilities in attribute selection measures for decision structure induction. In Proceedings of the European Summer School on Machine Learning, Priory Corsendonk, Belgium, pp. 22–31.

CHEN, G., Z. WANG, and Z. YU. 2009. Constructing decision tree by integrating multiple information metrics. In Proceedings of the Chinese Conference on Pattern Recognition, Nanjing, China, pp. 1–5.

CHEN, Y., and L. T. HUNG. 2009. Using decision trees to summarize associative classification rules. Expert Systems with Applications, 36(2): 2338–2351.

COLTON, S. 2004. Online Document. Available at: http://www.doc.ic.ac.uk/sgc/teaching/v231/lecture11.html [Accessed June 26, 2014].

DEMSAR, J., B. ZUPAN, and G. LEBAN. 2004. Orange: from experimental machine learning to interactive data mining. White Paper (www.ailab.si/orange), Faculty of Computer and Information Science, University of Ljubljana, Slovenia.


HU, B. 2010. Assessment method for terminal building of different place based on decision tree. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, WKDD 2010, Phuket, Thailand, pp. 299–301.

HU, J., J. DENG, and M. SUI. 2009. A new approach for decision tree based on principal component analysis. In Proceedings of the International Conference on Computational Intelligence and Software Engineering, Wuhan, China, pp. 1–4.

IMAM, I. F. 1996. An empirical comparison between learning decision trees from examples and from decision rules. In Proceedings of the Ninth International Symposium on Methodologies for Intelligent Systems, Zakopane, Poland.

IMAM, I. F., and R. S. MICHALSKI. 1993a. Should decision trees be learned from examples or from decision rules? Lecture Notes in Computer Science. In Proceedings of the 7th International Symposium on Methodologies, Trondheim, Norway, Vol. 689, pp. 395–404.

IMAM, I. F., and R. S. MICHALSKI. 1993b. Learning decision trees from decision rules: A method and initial results from a comparative study. Journal of Intelligent Information Systems, 2(3): 279–304.

LI, N., L. ZHAO, A. CHEN, Q. MENG, and G. ZHANG. 2009. A new heuristic of the decision tree induction. In Proceedings of the International Conference on Machine Learning and Cybernetics, Hebei University, Baoding, China, pp. 1659–1664.

LIU, Q., D. HU, and Q. YAN. 2010. Decision tree algorithm based on average Euclidean distance. In Proceedings of the 2nd International Conference on Future Computer and Communication, Wuhan, China, pp. 507–511.

MICHALSKI, R. S., and I. F. IMAM. 1994. Learning problem-oriented decision structures from decision rules: The AQDT-2 system. In Proceedings of 8th International Symposium Methodologies for Intelligent Systems, Lecture Notes in Artificial Intelligence 869. Springer Verlag: Heidelberg; pp. 416–426.

MICHALSKI, R. S., and I. F. IMAM. 1997. On learning decision structures. Fundamenta Informaticae, 31(1): 49–64.

MICHALSKI, R. S., and K. KAUFMAN. 2001. The AQ19 system for machine learning and pattern discovery: A general description and user's guide. In Reports of the Machine Learning and Inference Laboratory, MLI01-2, George Mason University, Fairfax, VA.

MINGERS, J. 1989. An empirical comparison of selection measures for decision-structure induction. Machine Learning, Kluwer Academic Publishers, 3(3): 319–342.

PREISS, B. R. 1999. Data structures and algorithms with object-oriented design patterns in Java. Available at: http://www.brpreiss.com/books/opus5/html/page257.html [Accessed June 26, 2014].

QUINLAN, J. R. 1979. Discovering rules by induction from large collections of examples. In Expert Systems in the Microelectronic Age. Edinburgh, UK: Edinburgh University Press; pp. 168–201.

ROACH, K. 2006. ID3 implementation with Python. Available at: http://www.onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html [Accessed June 26, 2014].

SZYDŁO, T., B. SNIEZYNSKI, and R. S. MICHALSKI. 2005. A rules-to-trees conversion in the inductive database system VINLEN. In Proceedings of the Intelligent Information Processing and Web Mining Conference, Gdansk, Poland, pp. 496–500.

VAN ROSSUM, G., and F. L. DRAKE (EDS.). 2001. Python Reference Manual. Falls Church, VA: PythonLabs. Available at: http://www.python.org [Accessed June 26, 2014].

WANG, K., S. ZHOU, and Y. HE. 2000. Growing decision trees on support-less association rules. In KDD '00: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, pp. 265–269.

WITTEN, I. H., and B. A. MACDONALD. 1988. Using concept learning for knowledge acquisition. International Journal of Man-Machine Studies, 29: 349–370.

