FRULEX - Fuzzy Rules Extraction Using Rapid Back Propagation Neural Networks

Mohamed Farouk Abdel Hady
Teaching Assistant
Information and Computer Science Department
Institute of Statistical Studies and Research, University of Cairo
[email protected]

Dr. Mahmoud Wahdan
Projects Manager
Ministry of Communications and Information Technology
[email protected]

Prof. Adel Elmaghraby
Acting Chair
Computer Engineering and Computer Science Department
J. B. Speed Scientific School, University of Louisville
[email protected]

Abstract: In this paper, we present a new approach for extracting fuzzy rules from numerical input-output data for pattern classification. The approach combines the merits of fuzzy logic theory and neural networks. It uses a rapid back propagation neural network (RBPNN), which can handle both quantitative (numerical) and qualitative (linguistic) knowledge. The network can be regarded both as an adaptive fuzzy inference system with the capability of learning fuzzy rules from data, and as a connectionist architecture provided with linguistic meaning. Fuzzy rules are extracted in three phases: initialization, optimization, and simplification of the fuzzy model. In the first phase, the data set is partitioned automatically into a set of clusters based on input-similarity and output-similarity tests. Membership functions associated with each cluster are defined according to statistical means and variances of the data points. Then, a fuzzy if-then rule is extracted from each cluster to form a fuzzy model. In the second phase, the extracted fuzzy model is used as a starting point to construct an RBPNN, and the fuzzy model parameters are refined by analyzing the nodes of the network, which is trained via the back propagation gradient descent method. In the third phase, a simplification method is used to reduce the antecedent parts of the extracted fuzzy rules.

Keywords: Feature selection, fuzzy rule base, gradient descent, local function neural networks, mean-square error, neural fuzzy systems, logical rule extraction, similarity test.

1. Introduction

System modeling is the task of modeling the operation of an unknown system from a combination of prior knowledge and measured input-output data. It plays a very important role in many areas such as communications, control, and expert systems. Through the simulated system model, one can easily understand the underlying properties of the unknown system and handle it properly. When we are trying to model a complex system, usually the only available information is a collection of imprecise data. Extracting a model from such data, in the form of fuzzy inference rules, is called fuzzy modeling. Zadeh [1] proposed fuzzy set theory to deal with this kind of uncertain information,


and many researchers have pursued research on fuzzy modeling. However, this approach lacks a definite method to determine the number of fuzzy rules required and the membership functions associated with each rule. It also lacks an effective learning algorithm to refine these functions to minimize output errors. Another approach using neural networks was proposed, which, like fuzzy modeling, is considered to be a universal approximator. In recent years, there has been a proliferation of methods using this approach (see [2] and [3] for surveys of the field). This approach has the advantages of excellent learning capability and high precision. However, it usually suffers from slow convergence, local minima, and low understandability. Considerable work has been done to integrate neural networks with fuzzy inference systems, resulting in neuro-fuzzy modeling approaches such as [5] and [6], which combine the benefits of these two powerful paradigms into a single capsule: adaptability, quick convergence, representation power, and high accuracy. Jang and Sun [4] have shown that fuzzy systems are functionally equivalent to a class of radial basis function (RBF) networks, based on the similarity between the local receptive fields of the network and the membership functions of the fuzzy system. The rest of this paper is organized as follows. The next section briefly describes rapid back propagation neural networks. Section 3 gives an overview of the FRULEX approach. The Self-Constructing Rule Generator, SCRG, is described in Section 4. Section 5 describes the training of RBP networks. Simplification of the extracted fuzzy model is described in Section 6. Experimental results are presented in Section 7. An evaluation of FRULEX is presented in Section 8. Finally, conclusions and future work are given in Section 9.

2. Rapid Back Propagation Neural Networks

In the field of Artificial Neural Networks (ANNs), there are several types of networks that utilize units with local response characteristics to solve classification and function approximation problems. While there are many methods for extracting rules from specialized networks, only a small number of published techniques extract rules from local basis function networks. Tresp, Hollatz and Ahmed [7] describe a method for extracting rules from a Gaussian Radial Basis Function (RBF) network. Berthold and Huber [8] describe a method for extracting rules from a specialized local function network, the Rectangular Basis Function (RecBF) network. Abe and Lan [9] describe a recursive method for constructing hyper-boxes and extracting fuzzy rules from them. Duch et al. [10] describe a method for extraction, optimization and application of sets of fuzzy rules based on soft trapezoidal membership functions. Lapedes and Farber [11] give a method for constructing locally responsive units using pairs of axis-parallel logistic sigmoid functions: subtracting the value of one sigmoid from the other constructs such a local response region. They did not, however, offer a training scheme for networks constructed of such units. Geva and Sitte [12] describe a parameterization and training scheme for networks composed of such sigmoid-based hidden units. Andrews and Geva [13], [14] propose a method to extract and refine crisp rules from these networks. RBP networks are similar to RBF networks in that the hidden layer consists of a set of locally responsive units.


Each local response unit (LRU) of the hidden layer of the RBP network is constructed as follows. In each input dimension, form a region of local response according to the equation

r(x_i; c_i, b_i, k_i) = \sigma^{+}(x_i; c_i, b_i, k_i) - \sigma^{-}(x_i; c_i, b_i, k_i)
                      = \sigma(k_i, x_i - (c_i - b_i)) - \sigma(k_i, x_i - (c_i + b_i))
                      = \frac{1}{1 + e^{-k_i(x_i - c_i + b_i)}} - \frac{1}{1 + e^{-k_i(x_i - c_i - b_i)}}    (1)

This construction forms an axis-parallel ridge function r(x_i; c_i, b_i, k_i) in the ith dimension of the input space, which is almost zero everywhere except in the region between the steepest parts of the two logistic sigmoid functions \sigma^{+}(x_i; c_i, b_i, k_i) and \sigma^{-}(x_i; c_i, b_i, k_i) (see Figure 1 and Figure 2). The parameters c_i, b_i, and k_i of the sigmoid functions represent the center, breadth, and edge steepness, respectively, of the ridge, and x_i is the input value.

Figure 1: Construction of a ridge. Figure 2: Cylindrical extension of a ridge.

The intersection of N such ridges with a common center produces a function f that represents a local peak at the point of intersection, with secondary ridges extending to infinity on either side of the peak in each dimension (see Figure 3). The function f is the sum of the N ridge functions:

f(x; c, b, k) = \sum_{i=1}^{N} r(x_i; c_i, b_i, k_i)    (2)

Figure 3: Intersection of two ridges. Figure 4: Production of an LRU.


To make the function local, these component ridges must be cut off by the application of a suitable sigmoid to leave a locally responsive region in input space (see Figure 4). The function that eliminates the unwanted regions of the radiated ridge functions is shown below:

\ell(x; c, b, k) = \sigma(K, f(x; c, b, k) - B)    (3)

where B is selected to ensure that the maximum value of the function f, located at x = c, coincides with the center of the linear region of the output sigmoid, and the parameter K determines the steepness of the output sigmoid \ell(x; c, b, k). B is set to produce appreciable activation only when each of the input values x_i lies in the ridge defined in the ith dimension, and K is chosen such that the output sigmoid cuts off the secondary ridges outside the boundary of the local function. Experiments have shown that good network performance can be obtained if B is set equal to the input dimensionality, B = N, and K is set in the range 2-4.
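For concreteness, here is a minimal Python sketch of Eqs. (1)-(3) (ours, not the authors' code; the names sigmoid, ridge, and local_response are hypothetical), with the recommended settings B = N and K in the range 2-4:

    import numpy as np

    def sigmoid(z):
        # Two-argument sigma(k, z) in Eqs. (1) and (3) is sigmoid(k * z).
        return 1.0 / (1.0 + np.exp(-z))

    def ridge(x_i, c_i, b_i, k_i):
        # Eq. (1): difference of two logistic sigmoids whose steepest parts lie
        # at c_i - b_i and c_i + b_i, i.e. an axis-parallel ridge of breadth 2*b_i.
        return sigmoid(k_i * (x_i - c_i + b_i)) - sigmoid(k_i * (x_i - c_i - b_i))

    def local_response(x, c, b, k, K=3.0):
        # Eq. (2): sum the N ridges sharing the common center c.
        # Eq. (3): cut off the secondary ridges with an output sigmoid, B = N.
        N = len(x)
        f = sum(ridge(x[i], c[i], b[i], k[i]) for i in range(N))
        return sigmoid(K * (f - N))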

A network that is suitable for function approximation and binary classification tasks can be created with an input layer, a hidden layer of ridge functions, a hidden layer of local functions, and an output unit (see Figure 5 and Figure 6).

Figure 5: Construction of a local function. Figure 6: Structure of an RBP network.

The activation of the output unit is given as

y(x) = \sum_{j=1}^{J} w_j \, \ell(x; c_j, b_j, k_j)    (4)

which is a linear combination of J local response functions with centers c_j, widths b_j, and steepness k_j, where w_j is the output weight associated with each of the individual local response functions \ell_j. (The network output is simply the weighted sum of the outputs of the local response functions.)

For multi-class classification problems, several such networks can be combined, one network per class, with the output class being the maximum of the activations of the individual networks; this combination is called an MCRBP network.
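Continuing the sketch above (again ours, with hypothetical names), the output of Eq. (4) is a weighted sum over the J local response units, and the MCRBP decision is simply an argmax over the per-class networks:

    def rbp_output(x, units, K=3.0):
        # Eq. (4): each unit is a tuple (w_j, c_j, b_j, k_j) of output weight
        # and LRU parameters; the output is the weighted sum of LRU activations.
        return sum(w_j * local_response(x, c_j, b_j, k_j, K)
                   for (w_j, c_j, b_j, k_j) in units)

    def mcrbp_classify(x, class_networks):
        # One RBP network per class; predict the class whose network is most active.
        activations = [rbp_output(x, units) for units in class_networks]
        return max(range(len(activations)), key=activations.__getitem__)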


The RBP network is trained using gradient descent on an error surface to adjust the parameters (output weights, and the individual ridge center, breadth, and edge steepness).

    3. Overview of FRULEX Approach

Our approach to acquiring fuzzy rules from a given data set is shown in Figure 7.

Figure 7: Phases of the FRULEX approach (Input-Output Data → Initialization Phase: Self-Constructing Rule Generator → Optimization Phase: Back Propagation Learning → Simplification Phase: Feature Subset Selection by Relevance → Final Fuzzy Rules)

    In the initialization phase, a set of initial fuzzy rules is extracted from the given data set with an adaptive self-constructing rule generator [15]. The jth fuzzy rule is defined to take the following form:

R_j: IF x_1 IS \mu_{1j}(x_1) AND ... AND x_i IS \mu_{ij}(x_i) AND ... AND x_N IS \mu_{Nj}(x_N)
     THEN y_1 IS w_{j1} AND ... AND y_k IS w_{jk} AND ... AND y_M IS w_{jM}    (5)

where w_{jk} is a constant representing the kth consequent part and the \mu_{ij}(x_i) are membership functions, each of which is a normalized ridge function with center c_{ij}, width b_{ij}, and steepness k_{ij}, i.e.,

\mu_{ij}(x_i) = r(x_i; c_{ij}, b_{ij}, k_{ij}) = \frac{\sigma(k_{ij}, x_i - (c_{ij} - b_{ij})) - \sigma(k_{ij}, x_i - (c_{ij} + b_{ij}))}{\sigma(k_{ij}, b_{ij}) - \sigma(k_{ij}, -b_{ij})}    (6)

Note that for each rule, the first antecedent corresponds to the first input, the second antecedent corresponds to the second input, etc., and the conclusion corresponds to the output. The firing strength of rule j is

\alpha_j = \prod_{i=1}^{N} r(x_i; c_{ij}, b_{ij}, k_{ij})    (7)

Also, we use the centroid defuzzification method to calculate the output of this fuzzy system as follows:

y_k = \frac{\sum_{j=1}^{J} \alpha_j \, w_{jk}}{\sum_{j=1}^{J} \alpha_j}    (8)
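As an illustration of how the initial fuzzy model is evaluated (our sketch; it reuses sigmoid() from the Section 2 sketch and assumes each rule is stored as a (c, b, k, w) tuple of antecedent parameter vectors and consequent vector), Eqs. (6)-(8) can be computed as:

    import numpy as np

    def membership(x_i, c, b, k):
        # Eq. (6): ridge function normalized by its peak value at x_i = c.
        num = sigmoid(k * (x_i - c + b)) - sigmoid(k * (x_i - c - b))
        den = sigmoid(k * b) - sigmoid(-k * b)
        return num / den

    def fuzzy_inference(x, rules):
        # Eq. (7): firing strength = product of the antecedent memberships.
        alphas = np.array([
            np.prod([membership(x[i], c[i], b[i], k[i]) for i in range(len(x))])
            for (c, b, k, w) in rules])
        # Eq. (8): centroid defuzzification, one output per consequent dimension.
        W = np.array([w for (_, _, _, w) in rules])   # shape (J, M)
        return (alphas @ W) / alphas.sum()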

In the parameter optimization phase, we improve the accuracy of the initial fuzzy system with neural network techniques. In the rule base simplification phase, FRULEX implements facilities for simplifying the optimized rule set in order to improve its comprehensibility.

A four-layer MCRBP neural network is constructed based on the fuzzy rules obtained by the SCRG method in the first phase, as shown in Figure 8.

The operations of the MCRBP neural network are described as follows:

• Layer 1 contains N nodes. Node i of this layer produces its output by transmitting its input signal directly to layer 2, i.e., for 1 ≤ i ≤ N,

O_i^{(1)} = x_i    (9)

• Layer 2 contains J groups and each group contains N nodes; each group represents the IF-part of a fuzzy rule. Node (i, j) of this layer produces its output by computing the value of the corresponding normalized ridge function, for 1 ≤ i ≤ N and 1 ≤ j ≤ J:

O_{ij}^{(2)} = r_{ij} = r(x_i; c_{ij}, b_{ij}, k_{ij}) = \frac{\sigma(k_{ij}, x_i - (c_{ij} - b_{ij})) - \sigma(k_{ij}, x_i - (c_{ij} + b_{ij}))}{\sigma(k_{ij}, b_{ij}) - \sigma(k_{ij}, -b_{ij})}    (10)

• Layer 3 contains J nodes. Node j of this layer produces its output by computing the value of the logistic function, i.e., for 1 ≤ j ≤ J,

O_j^{(3)} = \ell(x; c_j, b_j, K) = \sigma\left(K, \sum_{i=1}^{N} O_{ij}^{(2)} - B\right)    (11)

• Layer 4 contains M nodes. Node k of this layer produces its output by centroid defuzzification, i.e.,

O_k^{(4)} = \frac{\sum_{j=1}^{J} O_j^{(3)} w_{jk}}{\sum_{j=1}^{J} O_j^{(3)}}    (12)

Clearly, c_{ij}, b_{ij}, and w_{jk} are the parameters that can be tuned to improve the performance of the fuzzy system. We use the backpropagation gradient descent method to refine these parameters.
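A compact sketch of this four-layer forward pass follows (ours; hypothetical names; it reuses sigmoid() and membership() from the earlier sketches and fixes B = N as in Section 2). It also returns the intermediate activations needed by the backward pass of Section 5:

    import numpy as np

    def forward(x, C, Bw, Kp, W, K=3.0):
        # C, Bw, Kp: (N, J) arrays of c_ij, b_ij, k_ij; W: (J, M) array of w_jk.
        N, J = C.shape
        O1 = np.asarray(x, dtype=float)           # Eq. (9): layer 1 passes inputs on
        O2 = np.empty((N, J))
        for i in range(N):
            for j in range(J):
                O2[i, j] = membership(O1[i], C[i, j], Bw[i, j], Kp[i, j])   # Eq. (10)
        O3 = sigmoid(K * (O2.sum(axis=0) - N))    # Eq. (11), with B = N
        O4 = (O3 @ W) / O3.sum()                  # Eq. (12): centroid defuzzification
        return O1, O2, O3, O4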


Figure 8: Architecture of the Rapid Back Propagation neural network (inputs x_1, ..., x_i, ..., x_N; layer 2 groups 1, ..., j, ..., J of normalized ridge nodes O_{ij}^{(2)}; layer 3 local function nodes O_j^{(3)}; consequent weights w_{11}, ..., w_{jk}, ..., w_{JM}; layer 4 outputs O_1^{(4)}, ..., O_k^{(4)}, ..., O_M^{(4)})

    Trained RBP networks can be used for numeric inference, or final fuzzy rules can be extracted from networks for symbolic reasoning.

4. Self-Constructing Rule Generator

First, the given input-output data set is partitioned into fuzzy (overlapping) clusters. The degree of association is strong for data points within the same fuzzy cluster and weak for data points in different fuzzy clusters. Then, a fuzzy if-then rule describing the distribution of the data in each fuzzy cluster is obtained. These fuzzy rules form a rough model of the unknown system, and the precision of the description can be improved in the parameter identification phase. Unlike common clustering-based methods, such as the fuzzy c-means method [16], which require the number of clusters, and hence the number of rules, to be appropriately pre-selected, SCRG performs clustering with the ability to adapt the number of clusters as it proceeds. For a system with N inputs and M outputs, we define a fuzzy cluster j as a pair (\ell_j(x), w_j), where \ell_j(x) is defined as

\ell_j(x) = \ell(x; c_j, b_j, k_j) = \sigma\left(K, \sum_{i=1}^{N} r(x_i; c_{ij}, b_{ij}, k_{ij}) - B\right)    (13)

where x = [x_1, ..., x_N], c_j = [c_{1j}, ..., c_{Nj}], b_j = [b_{1j}, ..., b_{Nj}], k_j = [k_{1j}, ..., k_{Nj}], K, and w_j denote the input vector, center vector, width vector, steepness vector, output sigmoid steepness, and height vector, respectively, of cluster j. Let J be the number of existing fuzzy clusters and S_j be the size of cluster j.


Clearly, J initially equals zero. Consider an input-output instance v, (p_v, q_v), where p_v = [p_{v1}, ..., p_{vN}] and q_v = [q_{v1}, ..., q_{vM}]. We calculate \ell_j(p_v) for each existing cluster j, 1 ≤ j ≤ J. We say that instance v passes the input-similarity test on cluster j if

\ell_j(p_v) \ge \rho    (14)

where ρ, 0 ≤ ρ ≤ 1, is a predefined threshold. Then, we calculate

e_{vjk} = |q_{vk} - w_{jk}|    (15)

for each cluster j on which instance v has passed the input-similarity test. Let d_k = q_{k,max} - q_{k,min}, where q_{k,max} and q_{k,min} are the maximum and minimum values of the kth output, respectively, over the given data set. We say that instance v passes the output-similarity test on cluster j if

e_{vjk} \le \tau d_k    (16)

where τ, 0 ≤ τ ≤ 1, is another predefined threshold. We have two cases. First, if there is no existing fuzzy cluster on which instance v has passed both the input-similarity and output-similarity tests, we assume that instance v is not close enough to any existing cluster, and a new fuzzy cluster k = J + 1 is created with

c_k = p_v,  b_k = b_o,  and  w_k = q_v    (17)

where b_o = [b_{o1}, ..., b_{oi}, ..., b_{oN}] and b_{oi} = \sigma_o (X_{i,upper} - X_{i,lower}). Here X_{i,upper} and X_{i,lower} are the upper and lower limits of the ith input, respectively, over the given data set, and \sigma_o is a user-defined constant. Note that the new cluster k contains only one member, instance v, at this time. Of course, the number of existing clusters is increased by 1 and the size of cluster k is initialized to 1, i.e.,

J = J + 1  and  S_k = 1.    (18)

Second, if there exists a number of fuzzy clusters on which instance v has passed both the input-similarity and output-similarity tests, let these clusters be j_1, j_2, ..., j_f, and let cluster t be the cluster with the largest membership degree:

\ell_t(p_v) = \max(\ell_{j_1}(p_v), \ell_{j_2}(p_v), ..., \ell_{j_f}(p_v))    (19)

In this case, we assume that instance v is closest to cluster t, and cluster t should be modified to include instance v as a member. The modification of cluster t is shown below, for 1 ≤ i ≤ N:

b_{it}^{new} = \sqrt{\frac{(S_t - 1)(b_{it} - b_{oi})^2 + S_t c_{it}^2 + p_{vi}^2 - (S_t + 1)(c_{it}^{new})^2}{S_t}} + b_{oi}    (20)

c_{it}^{new} = \frac{S_t c_{it} + p_{vi}}{S_t + 1}    (21)

w_{tk}^{new} = \frac{S_t w_{tk} + q_{vk}}{S_t + 1}    (22)

S_t = S_t + 1    (23)

Note that J is not changed in this case. The above-mentioned process is iterated until all the input-output instances have been processed. At the end, we have J fuzzy clusters. Note that each cluster j is described as (\ell_j(x), w_j), where \ell_j(x) contains the center vector c_j and width vector b_j. We can represent cluster j by a fuzzy rule having the form of (5) with

\mu_{ij}(x_i) = r(x_i; c_{ij}, b_{ij}, k_{ij})    (24)

for 1 ≤ i ≤ N, and the conclusion is w_{jk} for 1 ≤ k ≤ M. Finally, we have a set of J initial fuzzy rules for the given input-output data set. With this approach, when new training data are considered, the existing clusters can be adjusted or new clusters can be created, without the necessity of regenerating the whole set of rules from scratch.
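The control flow of one SCRG pass over the data might look as follows (our condensed sketch; cluster_strength implements the ℓ_j of Eq. (13) reusing sigmoid() from the Section 2 sketch, and tying the steepness to the width as k_ij = K_o / b_ij, as in Eq. (35) of the next section, is our assumption):

    import numpy as np

    def cluster_strength(p, c, b, K_o=3.0, K=3.0):
        # Eq. (13): l_j(p) with B = N and k_ij = K_o / b_ij (assumed tie).
        k = K_o / b
        f = np.sum(sigmoid(k * (p - c + b)) - sigmoid(k * (p - c - b)))
        return sigmoid(K * (f - len(p)))

    def scrg(data, rho, tau, b_o, d):
        # data: iterable of (p, q) with p: (N,) inputs and q: (M,) outputs.
        # b_o: (N,) initial widths, b_o[i] = sigma_o * (X_i_upper - X_i_lower).
        # d:   (M,) output ranges, d[k] = q_k_max - q_k_min.
        clusters = []                                        # entries [c, b, w, S]
        for p, q in data:
            passed = [j for j, (c, b, w, S) in enumerate(clusters)
                      if cluster_strength(p, c, b) >= rho    # Eq. (14)
                      and np.all(np.abs(q - w) <= tau * d)]  # Eqs. (15)-(16)
            if not passed:                                   # Eqs. (17)-(18)
                clusters.append([p.copy(), b_o.copy(), q.copy(), 1])
            else:                                            # Eq. (19)
                t = max(passed, key=lambda j: cluster_strength(p, clusters[j][0],
                                                               clusters[j][1]))
                c, b, w, S = clusters[t]
                c_new = (S * c + p) / (S + 1)                # Eq. (21)
                var = ((S - 1) * (b - b_o) ** 2 + S * c ** 2 + p ** 2
                       - (S + 1) * c_new ** 2) / S
                b_new = np.sqrt(var) + b_o                   # Eq. (20)
                w_new = (S * w + q) / (S + 1)                # Eq. (22)
                clusters[t] = [c_new, b_new, w_new, S + 1]   # Eq. (23)
        return clusters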

5. Back Propagation Gradient-Descent Learning Algorithm

After the set of J initial fuzzy rules is obtained in phase one, we improve the accuracy of these rules with neural network techniques in the parameter optimization phase. First, a four-layer fuzzy rule-based RBP network is constructed by turning each fuzzy rule into a local response unit (LRU), as shown in Figure 8. A gradient-based optimization method performing steepest descent on a surface in the network parameter space is used. The goal of this phase is to adjust both the premise and consequent parameters so as to minimize the mean squared error function shown below:

E = \frac{1}{P} \sum_{v=1}^{P} E_v    (25)

where

E_v = \frac{1}{2} \sum_{k=1}^{M} e_{vk}^2,  e_{vk} = y_k(v) - q_{vk}

and y_k(v) = O_k^{(4)}(p_v) is the actual output for the vth training pattern. The update formula for a generic weight α is

\Delta\alpha = -\eta_\alpha \frac{\partial E}{\partial \alpha}    (26)

where \eta_\alpha is the learning rate for that weight. In summary, we are given a training set T of P training patterns, T = {(p_v, q_v) : v = 1, ..., P}, with p_v = (p_{v1}, ..., p_{vN}) and q_v = (q_{v1}, ..., q_{vM}). For the sake of simplicity, the subscript v indicating the current sample will be dropped in the following. Starting at the first layer, a forward pass is used to compute the activity levels of all the nodes in the network to obtain the current output values. Then, starting at the output layer, a backward pass is used to compute \partial E / \partial \alpha for all the nodes.


The complete learning algorithm is summarized as follows:

1) Initialize the weights {c_{ij}, b_{ij}, k_{ij}}, i = 1, ..., N, j = 1, ..., J, and {w_{jk}}, j = 1, ..., J, k = 1, ..., M, with the rule parameters obtained in the SCRG phase.

2) Select the next input vector p from T, propagate it through the network, and determine the outputs y_k = O_k^{(4)}.

3) Compute the error terms as follows:

\delta_k^{(4)} = O_k^{(4)} - q_k    (27)

\delta_j^{(3)} = \frac{\sum_{k=1}^{M} \delta_k^{(4)} (w_{jk} - O_k^{(4)})}{\sum_{t=1}^{J} O_t^{(3)}}    (28)

\delta_{ij}^{(2)} = \delta_j^{(3)} K O_j^{(3)} (1 - O_j^{(3)})    (29)

4) Update the gradients of {c_{ij}, b_{ij}} and {w_{jk}}, respectively, according to:

\frac{\partial E}{\partial c_{ij}} += -\delta_{ij}^{(2)} k_{ij} \frac{\sigma_{ij}^{+}(1 - \sigma_{ij}^{+}) - \sigma_{ij}^{-}(1 - \sigma_{ij}^{-})}{\sigma(k_{ij}, b_{ij}) - \sigma(k_{ij}, -b_{ij})}    (30)

\frac{\partial E}{\partial b_{ij}} += \delta_{ij}^{(2)} k_{ij} \frac{\sigma_{ij}^{+}(1 - \sigma_{ij}^{+}) + \sigma_{ij}^{-}(1 - \sigma_{ij}^{-})}{\sigma(k_{ij}, b_{ij}) - \sigma(k_{ij}, -b_{ij})}    (31)

\frac{\partial E}{\partial w_{jk}} += \delta_k^{(4)} \frac{O_j^{(3)}}{\sum_{t=1}^{J} O_t^{(3)}}    (32)

5) After applying the whole training set T, update the weights {c_{ij}, b_{ij}, k_{ij}} and {w_{jk}}, respectively, according to:

\Delta c_{ij} = -\eta \frac{\partial E}{\partial c_{ij}}    (33)

\Delta b_{ij} = -\eta \frac{\partial E}{\partial b_{ij}}    (34)

k_{ij} = \frac{K_o}{b_{ij}}    (35)

\Delta w_{jk} = -\eta \frac{\partial E}{\partial w_{jk}}    (36)

where η is the learning rate (by a proper selection of η, the speed of convergence can be varied) and K_o is the initial steepness.

6) If E < ε or the maximum number of iterations has been reached, stop; else go to step 2 (where ε is the error goal).
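Putting steps 1)-6) together, one epoch of this batch update could be sketched as follows (ours, not the authors' implementation; it reuses forward() and sigmoid() from the earlier sketches):

    import numpy as np

    def train_epoch(T, C, Bw, Kp, W, eta, K=3.0, K_o=3.0):
        dC, dB, dW = np.zeros_like(C), np.zeros_like(Bw), np.zeros_like(W)
        for p, q in T:                                   # step 2: forward pass
            O1, O2, O3, O4 = forward(p, C, Bw, Kp, W, K)
            S3 = O3.sum()
            d4 = O4 - q                                  # Eq. (27)
            d3 = ((W - O4[None, :]) @ d4) / S3           # Eq. (28)
            d2 = d3 * K * O3 * (1.0 - O3)                # Eq. (29)
            sp = sigmoid(Kp * (O1[:, None] - C + Bw))    # sigma+_ij
            sm = sigmoid(Kp * (O1[:, None] - C - Bw))    # sigma-_ij
            den = sigmoid(Kp * Bw) - sigmoid(-Kp * Bw)   # Eq. (10) denominator
            dC += -d2[None, :] * Kp * (sp * (1 - sp) - sm * (1 - sm)) / den  # Eq. (30)
            dB += d2[None, :] * Kp * (sp * (1 - sp) + sm * (1 - sm)) / den   # Eq. (31)
            dW += np.outer(O3, d4) / S3                  # Eq. (32)
        C -= eta * dC                                    # Eq. (33)
        Bw -= eta * dB                                   # Eq. (34)
        Kp = K_o / Bw                                    # Eq. (35): retie steepness
        W -= eta * dW                                    # Eq. (36)
        return C, Bw, Kp, W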

6. Feature Subset Selection by Relevance

Since in application areas like medicine not only the accuracy but also the simplicity and comprehensibility of the rules is important, the extracted fuzzy model is reduced by applying a feature selection algorithm to cope with the high dimensionality of real-world datasets.

Starting with an initial trained RBP network having the complete set of features, the algorithm iteratively produces a sequence of networks with smaller sets of input nodes. The iterative nature of our feature selection method allows a systematic investigation of the performance of reduced network models with fewer input nodes. The proposed feature selection method is computationally cheap, as the overall computational cost of each iteration depends mainly on the training of the reduced network. We first sort the features according to their relevance to the classification. That is, for each feature, an RBP neural network is created using the full feature set except that feature. The classification accuracy of the network on the test dataset is saved for that subset.

The feature whose corresponding network produces the smallest classification accuracy is the most relevant one.

♦ We sort the features in ascending order according to the test classification accuracy of their corresponding networks, so the list runs from the most relevant feature to the least relevant.

Then an RBP neural network is created using the best feature (the most relevant one). The classification accuracy of the network on the test dataset is saved for that subset. Next, the best two features are tested, followed by the best three features, and so on until the best N features (N being the number of features) are tested. For example, if the sorted list is {f1, f2, ..., fN}, we test the subsets {f1}, {f1, f2}, {f1, f2, f3}, ..., {f1, f2, ..., fN}. We find the subset with the best test set classification accuracy. Since we want the smallest feature subset, we take the full feature set accuracy (acc_full) as our base and find the smallest subset within a certain range of that accuracy (acc_next ≥ acc_full − β).

For example, if the accuracy of the full feature set is 95%, the best current subset with 3 features has an accuracy of 97%, the next best subset with 2 features has an accuracy of 92%, and β = 5%, then we choose the subset with 2 features (because 92% ≥ (95% − 5%)) and it becomes the best subset. An outline of the feature subset selection algorithm is given in Figure 9. Here is how we find the final best feature subset:

♦ In each fold, we find the best subset (as described above).
♦ For each feature, we find in how many folds that feature is a member of its best subset.
♦ Then, we find the average-times-in-best-subset value (the total of the times-in-best-subset values of all the features divided by the number of features).
♦ For the final feature subset, we choose the features that appeared in more best subsets than the average-times-in-best-subset value.
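A small sketch of this cross-fold aggregation (ours; names hypothetical):

    def final_feature_subset(best_subsets, all_features):
        # best_subsets: one best feature subset (a set of feature ids) per fold.
        counts = {f: sum(f in s for s in best_subsets) for f in all_features}
        # average-times-in-best-subset: total membership count / number of features
        avg = sum(counts.values()) / len(all_features)
        # keep the features that appear in more best subsets than the average
        return {f for f, c in counts.items() if c > avg}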

visitedList = emptySet
N = numFeats(fullFeatureSet)
for (i = 0; i < N; i++) {
    currentSubSet = fullFeatureSet - feature_i
    Construct an RBP network using currentSubSet and the training set
    Test the RBP network using the test set
    Find the classification accuracy (acc_i) on the test set
    Add the pair (feature_i, acc_i) to the visitedList
}
Sort the visitedList in ascending order according to accuracy
(Now the visitedList is sorted from the most relevant feature to the least)
bestAcc = -1
currentSubSet = emptySet
bestSubSet = fullFeatureSet   // start from the full set so any qualifying smaller subset replaces it
for (i = 0; i < N; i++) {
    if (bestAcc == 100 AND numFeats(bestSubSet) == 1) STOP
    Add the next most relevant feature to currentSubSet
    Construct an RBP network using currentSubSet and the training set
    Test the RBP network using the test set
    Find the classification accuracy (currentAcc) on the test set
    if ((currentAcc >= fullAcc - Beta) AND (numFeats(currentSubSet) < numFeats(bestSubSet))) {
        bestAcc = currentAcc
        bestSubSet = currentSubSet
    }
}
return bestSubSet

    Figure 9: Feature Subset Selection by Relevance Algorithm

    7. Experimental Results

The validity of our approach to fuzzy reasoning and rule extraction has been tested on a well-known benchmark problem from the literature [17].

7.1 Iris Flower Classification Problem

The classification problem of the Iris data consists of classifying three species of iris flowers (setosa, versicolor, and virginica). There are 150 samples for this problem, 50 of each class. Each sample is a four-dimensional pattern vector representing four attributes of the iris flower (see Table 1 and Table 2). Results obtained with various methods on this dataset are collected in Table 6.


ID  Class
1   Setosa
2   Versicolor
3   Virginica

Table 1: Classes for the iris flower classification dataset

ID  Feature       Feature values
F1  Sepal length  [4.3, 7.9]
F2  Sepal width   [2.0, 4.4]
F3  Petal length  [1.0, 6.9]
F4  Petal width   [0.1, 2.5]

Table 2: Features and feature values for the iris flower classification dataset

    A MCRBP network with 4 inputs and 3 outputs, corresponding to the 3 classes, was constructed. The whole data set was divided into two parts. A part consisting of 15 samples uniformly drawn from the three classes was used as a test set for the network trained with the remaining 135 data points.

The SCRG method described in Section 4 is used to determine the initial centers and widths of the membership functions of the input features. The results after finishing SCRG are shown in Table 3. (We take σ_o = 0.2, ρ = 0.01, and τ = 0.195.)

After SCRG (Iris)

                Training Set          Test Set              Whole Set
Run      Rules  Class.    No. of      Class.    No. of      Class.    No. of
                Accuracy  Misclass.   Accuracy  Misclass.   Accuracy  Misclass.
1        3      94.07     8           86.67     2           90.37     5
2        3      95.56     6           80.00     3           87.78     4.5
3        4      87.41     17          93.33     1           90.37     9
4        3      94.07     8           93.33     1           93.70     4.5
5        3      94.07     8           86.67     2           90.37     5
6        3      95.56     6           80.00     3           87.78     4.5
7        3      95.56     6           86.67     2           91.12     4
8        3      94.07     8           93.33     1           93.70     4.5
9        3      94.81     7           93.33     1           94.07     4
10       3      94.81     7           93.33     1           94.07     4
Average  3.1    94.00     8.1         88.67     1.7         91.33     4.9

Table 3: Results of the 10-fold cross validation after the SCRG phase for the iris flower classification dataset

The backpropagation gradient descent method (Section 5) is used to optimize the fuzzy rule base extracted in phase one. The results obtained after 100 epochs are shown in Table 4. (We take ε = 0.01 and η = 1.0.)


After BP (Iris)

                Training Set          Test Set              Whole Set
Run      Rules  Class.    No. of      Class.    No. of      Class.    No. of
                Accuracy  Misclass.   Accuracy  Misclass.   Accuracy  Misclass.
1        3      94.81     7           86.67     2           90.74     4.5
2        3      96.3      5           86.67     2           91.49     3.5
3        4      93.33     9           100.00    0           96.67     4.5
4        3      94.81     7           93.33     1           94.07     4
5        3      95.56     6           93.33     1           94.45     3.5
6        3      95.56     6           86.67     2           91.12     4
7        3      95.56     6           93.33     1           94.45     3.5
8        3      94.07     8           93.33     1           93.70     4.5
9        3      94.81     7           93.33     1           94.07     4
10       3      94.81     7           93.33     1           94.07     4
Average  3.1    94.96     6.8         92.00     1.2         93.48     4

Table 4: Results of the 10-fold cross validation after the BP learning phase for the iris flower classification dataset

Figure 10: Fuzzy rules obtained after the BP learning phase for the iris flower classification dataset

The Feature Subset Selection by Relevance method (Section 6) is used to simplify the fuzzy rule base extracted in phase one. The results obtained after this phase are shown in Table 5. (We take β = 0.)

Figure 11: Performance of the RBP network during the removal of input features of the iris flower classification dataset (chart "Features Sorting by Relevance"; x-axis: removed feature, F1-F4; y-axis: test classification accuracy)

Figure 12: Performance of the RBP network with different feature subsets of the iris flower classification dataset (chart "Features Subset Selection"; x-axis: added feature, F4, F3, F2, F1; y-axis: test classification accuracy)

After Simplification (Iris)

                Training Set        Test Set            Whole Set           No. of  Best
Run      Rules  Class.    No. of    Class.    No. of    Class.    No. of    Antec.  Feature
                Accuracy  Misclass. Accuracy  Misclass. Accuracy  Misclass.         Set
1        3      95.56     6         100.00    0         97.78     3         1       F4
2        3      95.56     6         100.00    0         97.78     3         1       F4
3        4      89.63     14        80.00     3         84.82     8.5       1       F4
4        3      87.41     17        93.33     1         90.37     9         2       F1, F3
5        3      95.56     6         100.00    0         97.78     3         1       F4
6        3      96.30     5         93.33     1         94.82     3         1       F4
7        3      87.41     17        93.33     1         90.37     9         2       F1, F3
8        3      95.56     6         100.00    0         97.78     3         1       F4
9        3      95.56     6         100.00    0         97.78     3         1       F4
10       3      96.30     5         93.33     1         94.82     3         1       F4
Average  3.1    93.49     8.8       95.33     0.7       94.41     4.75      1.2     F4

Table 5: Results of the 10-fold cross validation after the simplification phase for the iris flower classification dataset

Figure 13: Fuzzy rules obtained after the simplification phase for the iris flower classification dataset


Approach     Classification  Rules   Antecedents  Reference            Rule
             Accuracy        Number  Per Rule                          Type
MLP with BP  97.36%          N/A     N/A          Ster et al. [20]     N/A
BIO-RE       78.67%          4       3            Taha et al. [19]     C
Full-RE      97.33%          3       1 to 2       Taha et al. [19]     C
RULEX        94.0%           5       3            Andrews et al. [18]  C
FRULEX       94.41%          3       1            Our result           F

Table 6: Comparing the FRULEX approach to some other approaches for the iris flower classification dataset

8. Evaluation of FRULEX Approach

There are six different criteria for the evaluation of our approach, as follows:

A. Rule Format

It can be seen that FRULEX extracts fuzzy rules. In the directly extracted fuzzy system, each fuzzy rule contains an antecedent condition for each input dimension as well as a consequent, which describes the output class covered by that rule.

B. Complexity of the approach

FRULEX, unlike other decompositional algorithms, does not rely on any form of search to extract rules; rather, it relies on the direct analysis of the weights of the trained network. The initialization module is linear in the number of fuzzy clusters (or fuzzy rules) and the number of training patterns, O(J·P). The optimization module is linear in the number of iterations, the number of training patterns, and the number of hidden nodes, O(I·P·J). The module associated with rule simplification is linear in the number of features, the number of iterations, the number of training patterns, and the number of hidden nodes, O(N·I·P·J). Therefore, FRULEX is computationally efficient, and using it to provide an explanation facility adds little overhead to the learning phase of a neural network.

C. Quality of the extracted rules

As stated previously, the essential function of rule extraction algorithms such as FRULEX is to provide an explanation facility for the trained network. The rule quality criteria provide insight into the degree of trust that can be placed in the explanation. Rule quality is assessed according to the accuracy, fidelity, and comprehensibility of the extracted rules.

C.1. Accuracy

During the training phase, local response units will grow, shrink, and/or move to form a more accurate representation of the knowledge encoded in the training data.


C.2. Fidelity

Fidelity is closely related to accuracy, and the factors that affect accuracy also affect the fidelity of the rule sets. In general, the rule sets extracted by FRULEX display an extremely high degree of fidelity with the networks from which they were drawn.

C.3. Comprehensibility

In general, comprehensibility is inversely related to the number of rules and to the number of antecedents per rule. The RBP network is based on a greedy covering algorithm. Hence, its solutions are achieved with relatively small numbers of training iterations and are typically compact, i.e., the trained network contains a small number of local response units. Since FRULEX converts each local response unit into a single fuzzy rule, the extracted rule set contains the same number of rules as there are local response units in the trained network.

D. Consistency of the approach

Rule extraction algorithms that generate rules by querying the trained neural network with patterns drawn from the problem domain may generate a variety of different fuzzy models from any given training run of the neural network. Such algorithms may have low consistency. FRULEX, on the other hand, is a deterministic algorithm that always generates the same fuzzy model from any given training run. Hence, FRULEX always exhibits 100% consistency.

E. Translucency of the approach

FRULEX is a decompositional approach, as fuzzy rules are extracted at the level of the hidden layer units. Each local response unit is treated in isolation, with the output weights being converted directly into a fuzzy rule.

F. Portability of the approach

FRULEX is non-portable, having been specifically designed to work with RBP networks, which are local function networks. This means that it cannot be used as a general-purpose device for providing an explanation component for existing, trained neural networks. However, the RBP network is applicable to a broad range of problem domains (including continuous-valued and discrete-valued domains as well as domains with missing values). Hence, FRULEX is also applicable to a broad variety of problem domains.

9. Conclusions

We developed a new fuzzy rules extraction approach (FRULEX). FRULEX is able to provide an explanation facility for the trained MCRBP network. It extracts fuzzy systems from trained feedforward RBP networks and simplifies the fuzzy system in a way that maximizes the fidelity between the system and the neural network.


The main features and advantages of the proposed approach are:

♦ It is a general framework that combines two technologies, namely neural networks and fuzzy systems.
♦ The knowledge embedded inside the RBP network can be explained in terms of a fuzzy model and hence can be easily understood.
♦ The number of fuzzy rules is determined automatically, and the membership functions match closely the real distribution of the training data points.
♦ The selection of relevant features is automatic.
♦ It learns faster and produces higher classification accuracy than other machine learning methods, as shown in the experimental results section.

In many application areas, like medical diagnosis, not only the accuracy but also the simplicity and comprehensibility of the rules is important. Our approach takes care of this in the third phase.

9.1 Future Work

The following are the subjects of our on-going research:
1. Function approximation: We are planning to apply our approach to function approximation problems.
2. Mamdani-type fuzzy models: We can extend our proposed approach to apply to other types of fuzzy models, such as Mamdani-type fuzzy models.
3. Real-world problems: We expect that the proposed approach should be considered further with respect to a wider range of real-world problems.
4. Genetic Algorithms: The use of Genetic Algorithms (GAs) instead of the backpropagation learning algorithm; GAs do not suffer from convergence problems to the same degree that BP does.

To verify the effectiveness of the new approach, it was applied to a well-known benchmark problem.

References

[1] L. A. Zadeh, Fuzzy sets, Information and Control, Vol. 8, pp. 338-353, 1965.
[2] R. Andrews, A. B. Tickle and J. Diederich, A Survey and Critique of Techniques for Extracting Rules from Trained Artificial Neural Networks, Knowledge-Based Systems, Vol. 8, pp. 378-389, 1995.
[3] S. Mitra and Y. Hayashi, Neuro-Fuzzy Rule Generation: Survey in Soft Computing Framework, IEEE Trans. Neural Networks, Vol. 11, No. 3, May 2000.
[4] J.-S. R. Jang and C.-T. Sun, Functional equivalence between radial basis function networks and fuzzy inference systems, IEEE Trans. Neural Networks, Vol. 4, pp. 156-159, 1993.
[5] Y. Lin, G. A. Cunningham III and S. V. Coggeshall, Using fuzzy partitions to create fuzzy systems from input-output data and set the initial weights in a fuzzy neural network, IEEE Trans. Fuzzy Systems, Vol. 5, pp. 614-621, August 1997.
[6] W. A. Farag, V. H. Quintana and G. Lambert-Torres, A genetic-based neuro-fuzzy approach for modeling and control of dynamical systems, IEEE Trans. Neural Networks, Vol. 9, pp. 756-767, Oct. 1998.
[7] V. Tresp, J. Hollatz and S. Ahmed, Network Structuring and Training Using Rule-Based Knowledge, Advances in Neural Information Processing Systems (NIPS*6), pp. 871-878, 1993.
[8] M. Berthold and K. Huber, Building Precise Classifiers with Automatic Rule Extraction, in Proc. of the IEEE Int. Conf. on Neural Networks, Perth, Australia, Vol. 3, pp. 1263-1268, 1995.
[9] S. Abe and M. S. Lan, A Method for Fuzzy Rules Extraction Directly from Numerical Data and its Application to Pattern Classification, IEEE Trans. Fuzzy Systems, Vol. 3, No. 1, pp. 18-28, 1995.
[10] W. Duch, R. Adamczak and K. Grabczewski, Neural Optimization of Linguistic Variables and Membership Functions, Proc. of the 6th Int. Conf. on Neural Information Processing (ICONIP'99), Perth, Australia, Vol. 2, pp. 616-621, 1999.
[11] A. Lapedes and R. Farber, How Neural Networks Work, Neural Information Processing Systems, D. Z. Anderson (Ed.), American Institute of Physics, New York, pp. 442-456, 1987.
[12] S. Geva and J. Sitte, Constrained Gradient Descent, Proc. of the 5th Australian Conference on Neural Computing, Brisbane, Australia, 1994.
[13] R. Andrews and S. Geva, On the Effects of Initializing a Neural Network with Prior Knowledge, Proc. of the International Conference on Neural Information Processing, Perth, Western Australia, pp. 251-256, 1999.
[14] R. Andrews and S. Geva, RULEX & CEBP Networks As the Basis for a Rule Refinement System, in Hybrid Problems, Hybrid Solutions, J. Hallam (Ed.), IOS Press, pp. 1-12, 1995.
[15] S. J. Lee and C. S. Ouyang, A Neuro-Fuzzy System Modeling with Self-Constructing Rule Generation and Hybrid SVD-Based Learning, IEEE Trans. Fuzzy Systems, Vol. 11, pp. 341-353, June 2003.
[16] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[17] C. J. Merz and P. M. Murphy, UCI repository of machine learning databases. [Online]. Available: http://www.ics.uci.edu/pub/machine-learning-data-bases
[18] R. Andrews and S. Geva, Rule Extraction from Local Cluster Neural Nets, submitted to Neurocomputing, February 2000.
[19] I. Taha and J. Ghosh, Symbolic Interpretation of Artificial Neural Networks, IEEE Trans. Knowledge and Data Engineering, Vol. 11, No. 3, pp. 448-463, May 1999.
[20] B. Ster and A. Dobnikar, Neural networks in medical diagnosis: Comparison with other methods, Proc. of Int. Conf. EANN'96, A. Bulsari (Ed.), pp. 427-430, 1996.
