Neuro-Fuzzy Modeling and Control

Jyh-Shing Roger Jang, Chuen-Tsai Sun

Abstract: Fundamental and advanced developments in neuro-fuzzy synergisms for modeling and control are reviewed. The essential part of neuro-fuzzy synergisms comes from a common framework called adaptive networks, which unifies both neural networks and fuzzy models. The fuzzy models under the framework of adaptive networks are called ANFIS (Adaptive-Network-based Fuzzy Inference System), which possess certain advantages over neural networks. We introduce the design methods for ANFIS in both modeling and control applications. Current problems and future directions for neuro-fuzzy approaches are also addressed.

Keywords: Fuzzy logic, neural networks, fuzzy modeling, neuro-fuzzy modeling, neuro-fuzzy control, ANFIS.

I. Introduction

In 1965, Zadeh published the first paper on a novel way of characterizing non-probabilistic uncertainties, which he called fuzzy sets [118]. This year marks the 30th anniversary of fuzzy logic and fuzzy set theory, which has now evolved into a fruitful area containing various disciplines, such as the calculus of fuzzy if-then rules, fuzzy graphs, fuzzy interpolation, fuzzy topology, fuzzy reasoning, fuzzy inference systems, and fuzzy modeling. The applications, which are multi-disciplinary in nature, include automatic control, consumer electronics, signal processing, time-series prediction, information retrieval, database management, computer vision, data classification, decision-making, and so on.

Recently, the resurgence of interest in the field of artificial neural networks has injected a new driving force into the "fuzzy" literature. The back-propagation learning rule, which drew little attention until its application to artificial neural networks was discovered, is actually a universal learning paradigm for any smooth parameterized model, including fuzzy inference systems (or fuzzy models).
[This paper is to appear in the Proceedings of the IEEE, March 1995. Jyh-Shing Roger Jang is with the Control and Simulation Group, The MathWorks, Inc., Natick, Massachusetts. Email: [email protected]. Chuen-Tsai Sun is with the Department of Computer and Information Science, National Chiao Tung University, Hsinchu, Taiwan. Email: [email protected].]

As a result, a fuzzy inference system can now not only take linguistic information (linguistic rules) from human experts, but also adapt itself using numerical data (input/output pairs) to achieve better performance. This gives fuzzy inference systems an edge over neural networks, which cannot take linguistic information directly.

In this paper, we formalize adaptive networks as a universal representation for any parameterized model. Under this common framework, we reexamine the back-propagation algorithm and propose speedup schemes utilizing the least-squares method. We explain why neural networks and fuzzy inference systems are all special instances of adaptive networks when proper node functions

are assigned, and all learning schemes applicable to adaptive networks are also qualified methods for neural networks and fuzzy inference systems.

When represented as an adaptive network, a fuzzy inference system is called ANFIS (Adaptive-Network-based Fuzzy Inference System). For three of the most commonly used fuzzy inference systems, the equivalent ANFIS can be derived directly. Moreover, the training of ANFIS follows the spirit of the minimum disturbance principle [111] and is thus more efficient than sigmoidal neural networks. Once a fuzzy inference system is equipped with learning capability, all the design methodologies for neural network controllers become directly applicable to fuzzy controllers. We briefly review these design techniques and give related references for further study.

The arrangement of this article is as follows. In Section 2, an in-depth introduction to the basic concepts of fuzzy sets, fuzzy reasoning, fuzzy if-then rules, and fuzzy inference systems is given. Section 3 is devoted to the formalization of adaptive networks and their learning rules, where the back-propagation neural network and the radial basis function network are included as special cases. Section 4 explains the ANFIS architecture and demonstrates its superiority over back-propagation neural networks. A number of design techniques for fuzzy and neural controllers are described in Section 5. Section 6 concludes the paper by pointing out current problems and future directions.

II. Fuzzy Sets, Fuzzy Rules, Fuzzy Reasoning, and Fuzzy Models

This section provides a concise introduction to and a summary of the basic concepts central to the study of fuzzy sets. Detailed treatments of specific subjects can be found in the reference list.

A. Fuzzy Sets

A classical set is a set with a crisp boundary.
For example, a classical set A can be expressed as

    A = {x | x > 6},    (1)

where there is a clear, unambiguous boundary point 6 such that if x is greater than this number, then x belongs to the set A; otherwise x does not belong to this set. In contrast to a classical set, a fuzzy set, as the name implies, is a set without a crisp boundary. That is, the transition from "belonging to a set" to "not belonging to a set" is gradual, and this smooth transition is characterized by membership functions that give fuzzy sets flexibility in modeling commonly used linguistic expressions, such as "the water is hot" or "the temperature is high." As Zadeh pointed out in 1965 in his seminal paper entitled "Fuzzy Sets" [118],

such imprecisely defined sets or classes "play an important role in human thinking, particularly in the domains of pattern recognition, communication of information, and abstraction." Note that the fuzziness does not come from the randomness of the constituent members of the sets, but from the uncertain and imprecise nature of abstract thoughts and concepts.

Definition 1: Fuzzy sets and membership functions
If X is a collection of objects denoted generically by x, then a fuzzy set A in X is defined as a set of ordered pairs:

    A = {(x, μ_A(x)) | x ∈ X},    (2)

where μ_A(x) is called the membership function (MF for short) of x in A. The MF maps each element of X to a continuous membership value (or membership grade) between 0 and 1.

Obviously the definition of a fuzzy set is a simple extension of the definition of a classical set in which the characteristic function is permitted to have continuous values between 0 and 1. If the value of the membership function μ_A(x) is restricted to either 0 or 1, then A is reduced to a classical set and μ_A(x) is the characteristic function of A. Usually X is referred to as the universe of discourse, or simply the universe, and it may contain either discrete objects or continuous values. Two examples are given below.

Example 1: Fuzzy sets with discrete X
Let X = {1, 2, 3, 4, 5, 6, 7, 8} be the set of numbers of courses a student may take in a semester. Then the fuzzy set A = "appropriate number of courses taken" may be described as follows:

    A = {(1, 0.1), (2, 0.3), (3, 0.8), (4, 1.0), (5, 0.9), (6, 0.5), (7, 0.2), (8, 0.1)}.

This fuzzy set is shown in Figure 1 (a).

Example 2: Fuzzy sets with continuous X
Let X = R+ be the set of possible ages for human beings. Then the fuzzy set B = "about 50 years old" may be expressed as

    B = {(x, μ_B(x)) | x ∈ X},

where

    μ_B(x) = 1 / (1 + ((x - 50)/5)^4).

This is illustrated in Figure 1 (b).
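For concreteness, Examples 1 and 2 translate directly into code. The following Python sketch is illustrative only; the dictionary/function representations and the names `A` and `mu_B` are ours, not the paper's:

```python
# Example 1: discrete universe X = {1, ..., 8}; the fuzzy set A is just a
# table of (element, membership grade) pairs.
A = {1: 0.1, 2: 0.3, 3: 0.8, 4: 1.0, 5: 0.9, 6: 0.5, 7: 0.2, 8: 0.1}

# Example 2: continuous universe X = R+; the fuzzy set B = "about 50 years
# old" is represented by its membership function.
def mu_B(x):
    """mu_B(x) = 1 / (1 + ((x - 50) / 5)**4)."""
    return 1.0 / (1.0 + ((x - 50.0) / 5.0) ** 4)

print(A[4])        # 1.0: taking four courses is fully "appropriate"
print(mu_B(50.0))  # 1.0: the center of "about 50"
print(mu_B(55.0))  # 0.5: a crossover point
```

Note how the discrete set is a finite lookup table while the continuous set needs a function; this mirrors the two cases of equation (3) below.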
An alternative way of denoting a fuzzy set A is

    A = Σ_{x_i ∈ X} μ_A(x_i)/x_i,   if X is discrete;
    A = ∫_X μ_A(x)/x,               if X is continuous.    (3)

The summation and integration signs in equation (3) stand for the union of (x, μ_A(x)) pairs; they do not indicate summation or integration. Similarly, "/" is only a marker and

does not imply division. Using this notation, we can rewrite the fuzzy sets in Examples 1 and 2 as

    A = 0.1/1 + 0.3/2 + 0.8/3 + 1.0/4 + 0.9/5 + 0.5/6 + 0.2/7 + 0.1/8,

and

    B = ∫_{R+} [1 / (1 + ((x - 50)/5)^4)] / x,

respectively.

From Examples 1 and 2, we see that the construction of a fuzzy set depends on two things: the identification of a suitable universe of discourse and the specification of an appropriate membership function. It should be noted that the specification of membership functions is quite subjective, which means the membership functions specified for the same concept (say, "cold") by different persons may vary considerably. This subjectivity comes from the indefinite nature of abstract concepts and has nothing to do with randomness. Therefore the subjectivity and non-randomness of fuzzy sets is the primary difference between the study of fuzzy sets and probability theory, which deals with the objective treatment of random phenomena.

[Fig. 1. (a) A = "appropriate number of courses taken" (MF on a discrete X = number of courses); (b) B = "about 50 years old" (MF on a continuous X = age).]

Corresponding to the ordinary set operations of union, intersection, and complement, fuzzy sets have similar operations, which were initially defined in Zadeh's paper [118]. Before introducing these three fuzzy set operations, first we will define the notion of containment, which plays a central role in both ordinary and fuzzy sets. This definition of containment is, of course, a natural extension of the case for ordinary sets.

Definition 2: Containment or subset
Fuzzy set A is contained in fuzzy set B (or, equivalently, A is a subset of B, or A is smaller than or equal to B) if and only if μ_A(x) ≤ μ_B(x) for all x. In symbols,

    A ⊆ B  ⟺  μ_A(x) ≤ μ_B(x).    (4)

Definition 3: Union (disjunction)
The union of two fuzzy sets A and B is a fuzzy set C, written as C = A ∪ B or C = A OR B, whose MF is related to those of A and B by

    μ_C(x) = max(μ_A(x), μ_B(x)) = μ_A(x) ∨ μ_B(x).    (5)
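On a shared discrete universe, these pointwise operations (the union above, plus the intersection and complement defined next, and the probabilistic pair discussed later in this section) are one-liners over membership grades. A minimal Python sketch with invented grades:

```python
def fuzzy_or(mu_a, mu_b):    # union, eq. (5): pointwise max
    return {x: max(mu_a[x], mu_b[x]) for x in mu_a}

def fuzzy_and(mu_a, mu_b):   # intersection, eq. (6): pointwise min
    return {x: min(mu_a[x], mu_b[x]) for x in mu_a}

def fuzzy_not(mu_a):         # complement, eq. (7): 1 - grade
    return {x: 1.0 - m for x, m in mu_a.items()}

def prob_and(mu_a, mu_b):    # probabilistic T-norm (product)
    return {x: mu_a[x] * mu_b[x] for x in mu_a}

def prob_or(mu_a, mu_b):     # probabilistic T-conorm
    return {x: mu_a[x] + mu_b[x] - mu_a[x] * mu_b[x] for x in mu_a}

A = {0: 0.2, 1: 0.8, 2: 1.0}
B = {0: 0.5, 1: 0.4, 2: 0.9}
print(fuzzy_or(A, B))   # {0: 0.5, 1: 0.8, 2: 1.0}
print(fuzzy_and(A, B))  # {0: 0.2, 1: 0.4, 2: 0.9}
```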

As pointed out by Zadeh [118], a more intuitive and appealing definition of union is the smallest fuzzy set containing both A and B. Alternatively, if D is any fuzzy set that contains both A and B, then it also contains A ∪ B. The intersection of fuzzy sets can be defined analogously.

Definition 4: Intersection (conjunction)
The intersection of two fuzzy sets A and B is a fuzzy set C, written as C = A ∩ B or C = A AND B, whose MF is related to those of A and B by

    μ_C(x) = min(μ_A(x), μ_B(x)) = μ_A(x) ∧ μ_B(x).    (6)

As in the case of the union, it is obvious that the intersection of A and B is the largest fuzzy set which is contained in both A and B. This reduces to the ordinary intersection operation if both A and B are nonfuzzy.

Definition 5: Complement (negation)
The complement of fuzzy set A, denoted by Ā (¬A, NOT A), is defined as

    μ_Ā(x) = 1 - μ_A(x).    (7)

Figure 2 demonstrates these three basic operations: (a) illustrates two fuzzy sets A and B, (b) is the complement of A, (c) is the union of A and B, and (d) is the intersection of A and B.

[Fig. 2. Operations on fuzzy sets: (a) two fuzzy sets A and B; (b) Ā ("NOT A"); (c) A ∪ B ("A OR B"); (d) A ∩ B ("A AND B").]

Note that other consistent definitions for fuzzy AND and OR have been proposed in the literature under the names T-norm and T-conorm operators [16], respectively. Except for min and max, none of these operators satisfy the laws of distributivity:

    μ_{A∪(B∩C)}(x) = μ_{(A∪B)∩(A∪C)}(x),
    μ_{A∩(B∪C)}(x) = μ_{(A∩B)∪(A∩C)}(x).

However, min and max do incur some difficulties in analyzing fuzzy inference systems. A popular alternative is to

use the probabilistic AND and OR:

    μ_{A∩B}(x) = μ_A(x) μ_B(x),
    μ_{A∪B}(x) = μ_A(x) + μ_B(x) - μ_A(x) μ_B(x).

In the following, we shall give several classes of parameterized functions commonly used to define MF's. These parameterized MF's play an important role in adaptive fuzzy inference systems.

Definition 6: Triangular MF's
A triangular MF is specified by three parameters {a, b, c}, which determine the x coordinates of three corners:

    triangle(x; a, b, c) = max(min((x - a)/(b - a), (c - x)/(c - b)), 0).    (8)

Figure 3 (a) illustrates an example of the triangular MF defined by triangle(x; 20, 60, 80).

Definition 7: Trapezoidal MF's
A trapezoidal MF is specified by four parameters {a, b, c, d} as follows:

    trapezoid(x; a, b, c, d) = max(min((x - a)/(b - a), 1, (d - x)/(d - c)), 0).    (9)

Figure 3 (b) illustrates an example of a trapezoidal MF defined by trapezoid(x; 10, 20, 60, 95). Obviously, the triangular MF is a special case of the trapezoidal MF.

Due to their simple formulas and computational efficiency, both triangular MF's and trapezoidal MF's have been used extensively, especially in real-time implementations. However, since these MF's are composed of straight line segments, they are not smooth at the switching points specified by the parameters. In the following we introduce other types of MF's defined by smooth and nonlinear functions.

Definition 8: Gaussian MF's
A Gaussian MF is specified by two parameters {σ, c}:

    gaussian(x; σ, c) = e^{-((x - c)/σ)^2},    (10)

where c represents the MF's center and σ determines the MF's width. Figure 3 (c) plots a Gaussian MF defined by gaussian(x; 20, 50).

Definition 9: Generalized bell MF's
A generalized bell MF (or bell MF) is specified by three parameters {a, b, c}:

    bell(x; a, b, c) = 1 / (1 + |(x - c)/a|^{2b}),    (11)

where the parameter b is usually positive. Note that this MF is a direct generalization of the Cauchy distribution used in probability theory. Figure 3 (d) illustrates a generalized bell MF defined by bell(x; 20, 4, 50).

[Fig. 3. Examples of various classes of MF's: (a) triangle(x; 20, 60, 80); (b) trapezoid(x; 10, 20, 60, 95); (c) gaussian(x; 20, 50); (d) bell(x; 20, 4, 50).]

A desired generalized bell MF can be obtained by a proper selection of the parameter set {a, b, c}. Specifically, we can adjust c and a to vary the center and width of the MF, and then use b to control the slopes at the crossover points. Figure 4 shows the physical meaning of each parameter in a bell MF.

Because of their smoothness and concise notation, Gaussian MF's and bell MF's are becoming increasingly popular methods for specifying fuzzy sets. Gaussian functions are well known in the fields of probability and statistics, and they possess useful properties such as invariance under multiplication and Fourier transform. The bell MF has one more parameter than the Gaussian MF, so it can approach a nonfuzzy set as b → ∞.
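The parameterized MF's of Definitions 6-9 translate directly into code. A Python sketch, using the Figure 3 parameter values only as spot checks:

```python
import math

def triangle(x, a, b, c):                  # eq. (8)
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def trapezoid(x, a, b, c, d):              # eq. (9)
    return max(min((x - a) / (b - a), 1.0, (d - x) / (d - c)), 0.0)

def gaussian(x, sigma, c):                 # eq. (10)
    return math.exp(-(((x - c) / sigma) ** 2))

def bell(x, a, b, c):                      # eq. (11)
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))

print(triangle(60, 20, 60, 80))       # 1.0: peak at x = b
print(trapezoid(40, 10, 20, 60, 95))  # 1.0: on the flat shoulder
print(gaussian(50, 20, 50))           # 1.0: at the center c
print(bell(70, 20, 4, 50))            # 0.5: crossover at x = c + a
```

Note that `bell(70, 20, 4, 50)` lands exactly on 0.5, confirming that the crossover points of the bell MF sit at c ± a regardless of b.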
[Fig. 4. Physical meaning of the parameters in a generalized bell function: the MF has value 1 at the center c, value 0.5 at the crossover points c - a and c + a (a width of 2a), and slope -b/2a at the right crossover point.]

Definition 10: Sigmoidal MF's
A sigmoidal MF is defined by

    sigmoid(x; a, c) = 1 / (1 + exp[-a(x - c)]),    (12)

where a controls the slope at the crossover point x = c.

Depending on the sign of the parameter a, a sigmoidal MF is inherently open right or left and thus is appropriate for representing concepts such as "very large" or "very negative." Sigmoidal functions of this kind are employed widely as the activation function of artificial neural networks. Therefore, for a neural network to simulate the behavior of a fuzzy inference system, the first problem we face is how to synthesize a closed MF from sigmoidal functions. There are two simple ways to achieve this: one is to take the product of two sigmoidal MF's; the other is to take the absolute difference of two sigmoidal MF's.

It should be noted that the list of MF's introduced in this section is by no means exhaustive; other specialized MF's can be created for specific applications if necessary. In particular, any type of continuous probability distribution function can be used as an MF here, provided that a set of parameters is given to specify the appropriate meaning of the MF.

B. Fuzzy If-Then Rules

A fuzzy if-then rule (fuzzy rule, fuzzy implication, or fuzzy conditional statement) assumes the form

    if x is A then y is B,    (13)

where A and B are linguistic values defined by fuzzy sets on the universes of discourse X and Y, respectively. Often "x is A" is called the antecedent or premise, while "y is B" is called the consequence or conclusion. Examples of fuzzy if-then rules are widespread in our daily linguistic expressions, such as the following:

- If pressure is high then volume is small.
- If the road is slippery then driving is dangerous.
- If a tomato is red then it is ripe.
- If the speed is high then apply the brake a little.

Before we can employ fuzzy if-then rules to model and analyze a system, we first have to formalize what is meant by the expression "if x is A then y is B", which is sometimes abbreviated as A → B. In essence, the expression describes a relation between two variables x and y; this suggests that a fuzzy if-then rule be defined as a binary fuzzy relation R on the product space X × Y.
Note that a binary fuzzy relation R is an extension of the classical Cartesian product, where each element (x, y) ∈ X × Y is associated with a membership grade denoted by μ_R(x, y). Alternatively, a binary fuzzy relation R can be viewed as a fuzzy set with universe X × Y, and this fuzzy set is characterized by a two-dimensional MF μ_R(x, y).

Generally speaking, there are two ways to interpret the fuzzy rule A → B. If we interpret A → B as A coupled with B, then

    R = A → B = A × B = ∫_{X×Y} μ_A(x) ⊛ μ_B(y) / (x, y),

where ⊛ is a fuzzy AND (or, more generally, T-norm) operator and A → B is used again to represent the fuzzy relation R. On the other hand, if A → B is interpreted as A entails B, then it can be written as four different formulas:

- Material implication: R = A → B = ¬A ∪ B.
- Propositional calculus: R = A → B = ¬A ∪ (A ∩ B).

- Extended propositional calculus: R = A → B = (¬A ∩ ¬B) ∪ B.
- Generalization of modus ponens: μ_R(x, y) = sup{c | μ_A(x) ⊛ c ≤ μ_B(y) and 0 ≤ c ≤ 1}, where R = A → B and ⊛ is a T-norm operator.

Though these four formulas are different in appearance, they all reduce to the familiar identity A → B ≡ ¬A ∪ B when A and B are propositions in the sense of two-valued logic. Figure 5 illustrates these two interpretations of a fuzzy rule A → B. Here we shall adopt the first interpretation, where A → B implies A coupled with B. The treatment of the second interpretation can be found in [34], [49], [50].

[Fig. 5. Two interpretations of fuzzy implication: (a) A coupled with B; (b) A entails B.]

C. Fuzzy Reasoning (Approximate Reasoning)

Fuzzy reasoning (also known as approximate reasoning) is an inference procedure used to derive conclusions from a set of fuzzy if-then rules and one or more conditions. Before introducing fuzzy reasoning, we shall discuss the compositional rule of inference [119], which is the essential rationale behind fuzzy reasoning.

The compositional rule of inference is a generalization of the following familiar notion. Suppose that we have a curve y = f(x) that regulates the relation between x and y. When we are given x = a, then from y = f(x) we can infer that y = b = f(a); see Figure 6 (a). A generalization of the above process would allow a to be an interval and f(x) to be an interval-valued function, as shown in Figure 6 (b). To find the resulting interval y = b corresponding to the interval x = a, we first construct a cylindrical extension of a (that is, extend the domain of a from X to X × Y) and then find its intersection I with the interval-valued curve. The projection of I onto the y-axis yields the interval y = b.

[Fig. 6. Derivation of y = b from x = a and y = f(x): (a) a and b are points and y = f(x) is a curve; (b) a and b are intervals and y = f(x) is an interval-valued function.]

Going one step further in our generalization, we assume that A is a fuzzy set of X and F is a fuzzy relation on X × Y, as shown in Figure 7 (a) and (b). To find the resulting fuzzy set B, again, we construct a cylindrical extension

c(A) with base A (that is, we expand the domain of A from X to X × Y to get c(A)). The intersection of c(A) and F (Figure 7 (c)) forms the analog of the region of intersection I in Figure 6 (b). By projecting c(A) ∩ F onto the y-axis, we infer y as a fuzzy set B on the y-axis, as shown in Figure 7 (d).

Specifically, let μ_A, μ_c(A), μ_B, and μ_F be the MF's of A, c(A), B, and F, respectively, where μ_c(A) is related to μ_A through

    μ_c(A)(x, y) = μ_A(x).

Then

    μ_{c(A)∩F}(x, y) = min[μ_c(A)(x, y), μ_F(x, y)] = min[μ_A(x), μ_F(x, y)].

By projecting c(A) ∩ F onto the y-axis, we have

    μ_B(y) = max_x min[μ_A(x), μ_F(x, y)] = ∨_x [μ_A(x) ∧ μ_F(x, y)].

This formula is referred to as max-min composition, and B is represented as

    B = A ∘ F,

where ∘ denotes the composition operator. If we choose product for fuzzy AND and max for fuzzy OR, then we have max-product composition, and μ_B(y) is equal to ∨_x [μ_A(x) μ_F(x, y)].
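On discretized universes, the max-min composition B = A ∘ F reduces to a max over x of pairwise mins. A small Python sketch (the sample grades are invented):

```python
def max_min_compose(mu_a, mu_f):
    """mu_B(y_j) = max over i of min(mu_A(x_i), mu_F(x_i, y_j))."""
    n_y = len(mu_f[0])
    return [max(min(mu_a[i], mu_f[i][j]) for i in range(len(mu_a)))
            for j in range(n_y)]

mu_a = [0.0, 0.5, 1.0, 0.5]        # fuzzy set A sampled on X
mu_f = [[0.2, 0.9],                # fuzzy relation F sampled on X x Y
        [0.6, 0.4],
        [1.0, 0.1],
        [0.3, 0.8]]
print(max_min_compose(mu_a, mu_f))  # [1.0, 0.5]
```

Swapping `min` for a product inside the comprehension would give the max-product composition mentioned above.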

[Fig. 7. Compositional rule of inference: (a) cylindrical extension of A; (b) fuzzy relation F on x and y; (c) min of (a) and (b); (d) projection of (c) onto the y-axis.]

Using the compositional rule of inference, we can formalize an inference procedure, called fuzzy reasoning, upon a set of fuzzy if-then rules. The basic rule of inference in traditional two-valued logic is modus ponens, according to which we can infer the truth of a proposition B from the truth of A and the implication A → B. For instance, if A is identified with "the tomato is red" and B with "the tomato is ripe," then if it is true that "the tomato is red," it is also true that "the tomato is ripe." This concept is illustrated below.

    premise 1 (fact): x is A,
    premise 2 (rule): if x is A then y is B,
    consequence (conclusion): y is B.

However, in much of human reasoning, modus ponens is employed in an approximate manner. For example, if we have the same implication rule "if the tomato is red then it is ripe" and we know that "the tomato is more or less red," then we may infer that "the tomato is more or less ripe." This is written as

    premise 1 (fact): x is A',
    premise 2 (rule): if x is A then y is B,
    consequence (conclusion): y is B',

where A' is close to A and B' is close to B. When A, B, A', and B' are fuzzy sets of appropriate universes, the above inference procedure is called fuzzy reasoning or approximate reasoning; it is also called generalized modus ponens, since it has modus ponens as a special case.

Using the compositional rule of inference introduced earlier, we can formulate the inference procedure of fuzzy reasoning as the following definition.

Definition 11: Fuzzy reasoning based on max-min composition
Let A, A', and B be fuzzy sets of X, X, and Y, respectively. Assume that the fuzzy implication A → B is expressed as a fuzzy relation R on X × Y. Then the fuzzy set B' induced by "x is A'" and the fuzzy rule "if x is A then y is B" is defined by

    μ_B'(y) = max_x min[μ_A'(x), μ_R(x, y)] = ∨_x [μ_A'(x) ∧ μ_R(x, y)],    (14)

or, equivalently,

    B' = A' ∘ R = A' ∘ (A → B).    (15)

Remember that equation (15) is a general expression for fuzzy reasoning, while equation (14) is an instance of fuzzy reasoning where min and max are the operators for fuzzy AND and OR, respectively.

Now we can use the inference procedure of the generalized modus ponens to derive conclusions, provided that the fuzzy implication A → B is defined as an appropriate binary fuzzy relation.

C.1 Single rule with a single antecedent

For a single rule with a single antecedent, the formula is available in equation (14). A further simplification of the equation yields

    μ_B'(y) = [∨_x (μ_A'(x) ∧ μ_A(x))] ∧ μ_B(y) = w ∧ μ_B(y).

In other words, first we find the degree of match w as the maximum of μ_A'(x) ∧ μ_A(x) (the shaded area in the antecedent part of Figure 8); then the MF of the resulting B' is equal to the MF of B clipped by w, shown as the shaded area in the consequent part of Figure 8.

A fuzzy if-then rule with two antecedents is usually written as "if x is A and y is B then z is C." The corresponding problem for approximate reasoning is expressed as

    premise 1 (fact): x is A' and y is B',
    premise 2 (rule): if x is A and y is B then z is C,
    consequence (conclusion): z is C'.

[Fig. 8. Fuzzy reasoning for a single rule with a single antecedent.]

The fuzzy rule in premise 2 above can be put into the simpler form "A × B → C." Intuitively, this fuzzy rule can be transformed into a ternary fuzzy relation R, which is specified by the following MF:

    μ_R(x, y, z) = μ_{(A×B)×C}(x, y, z) = μ_A(x) ∧ μ_B(y) ∧ μ_C(z).

The resulting C' is then expressed as

    C' = (A' × B') ∘ (A × B → C).

Thus

    μ_C'(z) = ∨_{x,y} [μ_A'(x) ∧ μ_B'(y)] ∧ [μ_A(x) ∧ μ_B(y) ∧ μ_C(z)]
            = ∨_{x,y} [μ_A'(x) ∧ μ_B'(y) ∧ μ_A(x) ∧ μ_B(y)] ∧ μ_C(z)
            = {∨_x [μ_A'(x) ∧ μ_A(x)]} ∧ {∨_y [μ_B'(y) ∧ μ_B(y)]} ∧ μ_C(z)
            = (w_1 ∧ w_2) ∧ μ_C(z),    (16)

where w_1 is the degree of match between A and A'; w_2 is the degree of match between B and B'; and w_1 ∧ w_2 is called the firing strength or degree of fulfillment of this fuzzy rule. A graphic interpretation is shown in Figure 9, where the MF of the resulting C' is equal to the MF of C clipped by the firing strength w, with w = w_1 ∧ w_2. The generalization to more than two antecedents is straightforward.

[Fig. 9. Approximate reasoning for multiple antecedents.]

C.2 Multiple rules with multiple antecedents

The interpretation of multiple rules is usually taken as the union of the fuzzy relations corresponding to the fuzzy rules. For instance, given the following fact and rules:

    premise 1 (fact): x is A' and y is B',
    premise 2 (rule 1): if x is A1 and y is B1 then z is C1,
    premise 3 (rule 2): if x is A2 and y is B2 then z is C2,
    consequence (conclusion): z is C',

we can employ the fuzzy reasoning shown in Figure 10 as an inference procedure to derive the resulting output fuzzy set C'.

[Fig. 10. Fuzzy reasoning for multiple rules with multiple antecedents.]

To verify this inference procedure, let R1 = A1 × B1 → C1 and R2 = A2 × B2 → C2. Since the max-min composition operator ∘ is distributive over the ∪ operator, it follows that

    C' = (A' × B') ∘ (R1 ∪ R2)
       = [(A' × B') ∘ R1] ∪ [(A' × B') ∘ R2]
       = C'1 ∪ C'2,    (17)

where C'1 and C'2 are the inferred fuzzy sets for rules 1 and 2, respectively. Figure 10 shows graphically the operation of fuzzy reasoning for multiple rules with multiple antecedents.

When a given fuzzy rule assumes the form "if x is A or y is B then z is C," the firing strength is given as the maximum degree of match on the antecedent part for a given condition. This fuzzy rule is equivalent to the union of the two fuzzy rules "if x is A then z is C" and "if y is B then z is C" if and only if the max-min composition is adopted.

D. Fuzzy Models (Fuzzy Inference Systems)

The fuzzy inference system is a popular computing framework based on the concepts of fuzzy set theory, fuzzy if-then rules, and fuzzy reasoning. It has been successfully applied in fields such as automatic control, data classification, decision analysis, expert systems, and computer vision. Because of its multi-disciplinary nature, the fuzzy inference system is known by a number of names, such as fuzzy-rule-based system, fuzzy expert system [37], fuzzy model [98], [91], fuzzy associative memory [47],

fuzzy logic controller [60], [49], [50], and simply (and ambiguously) fuzzy system.

The basic structure of a fuzzy inference system consists of three conceptual components: a rule base, which contains a selection of fuzzy rules; a database or dictionary, which defines the membership functions used in the fuzzy rules; and a reasoning mechanism, which performs the inference procedure (usually the fuzzy reasoning introduced earlier) upon the rules and a given condition to derive a reasonable output or conclusion.

Note that the basic fuzzy inference system can take either fuzzy inputs or crisp inputs (which can be viewed as fuzzy singletons that have zero membership grade everywhere except at certain points where the membership grades achieve unity), but the outputs it produces are almost always fuzzy sets. Often it is necessary to have a crisp output, especially in a situation where a fuzzy inference system is used as a controller. Therefore we need a defuzzification strategy to extract a crisp value that best summarizes a fuzzy set. A fuzzy inference system with a crisp output is shown in Figure 11, where the dashed line indicates a basic fuzzy inference system with fuzzy output and the defuzzification block serves the purpose of transforming a fuzzy output into a crisp one. An example of a basic fuzzy inference system is the two-rule two-input system of Figure 10. The function of the defuzzification block will be explained at a later point.

With crisp inputs and outputs, a fuzzy inference system implements a nonlinear mapping from its input space to its output space. This mapping is accomplished by a number of fuzzy if-then rules, each of which describes the local behavior of the mapping. In particular, the antecedent of each rule defines a fuzzy region of the input space, and the consequent specifies the corresponding outputs.

[Fig. 11. Block diagram for a fuzzy inference system: rules 1 through r ("x is A_i, y is B_i") with firing strengths w_i feed an aggregator; the aggregated fuzzy output passes through a defuzzifier to yield a crisp output.]

In what follows, we will first introduce three types of fuzzy inference systems that have been widely employed in various applications. The differences between these three fuzzy inference systems lie in the consequents of their fuzzy rules, and thus their aggregation and defuzzification procedures differ accordingly. Then we will introduce three ways of partitioning the input space for any type of fuzzy inference system. Last, we will briefly address the features and problems of fuzzy modeling, which is concerned with the construction of a fuzzy inference system for modeling a specific target system.

D.1 Mamdani Fuzzy Model

The Mamdani fuzzy model [60] was proposed as the very first attempt to control a steam engine and boiler combination by a set of linguistic control rules obtained from experienced human operators. Figure 12 is an illustration of how a two-rule fuzzy inference system of the Mamdani type derives the overall output z when subjected to two crisp inputs x and y.

[Fig. 12. The Mamdani fuzzy inference system using min and max for the fuzzy AND and OR operators, respectively.]

If we adopt product and max as our choice for the fuzzy AND and OR operators, respectively, and use max-product composition instead of the original max-min composition, then the resulting fuzzy reasoning is shown in Figure 13, where the inferred output of each rule is a fuzzy set scaled down by its firing strength via the algebraic product. Though this type of fuzzy reasoning was not employed in Mamdani's original paper, it has often been used in the literature. Other variations are possible if we have different choices of fuzzy AND (T-norm) and OR (T-conorm) operators.

[Fig. 13. The Mamdani fuzzy inference system using product and max for the fuzzy AND and OR operators, respectively.]
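The whole min/max pipeline of Figure 12 (min for the AND and the implication, max for the aggregation) fits in a few lines once the universes are discretized. In the Python sketch below, all MF's, rules, and crisp inputs are invented for illustration:

```python
def tri(x, a, b, c):
    """Triangular MF, eq. (8)."""
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)

Z = [z / 10.0 for z in range(0, 101)]       # discretized output universe

def mamdani(x0, y0, rules):
    """Two crisp inputs -> aggregated output fuzzy set C' sampled on Z."""
    agg = [0.0] * len(Z)
    for mu_a, mu_b, mu_c in rules:
        w = min(mu_a(x0), mu_b(y0))         # firing strength (min AND)
        for i, z in enumerate(Z):
            agg[i] = max(agg[i], min(w, mu_c(z)))  # clip, then max-aggregate
    return agg

rules = [
    (lambda x: tri(x, 0, 2, 4), lambda y: tri(y, 0, 3, 6),
     lambda z: tri(z, 0, 2, 5)),
    (lambda x: tri(x, 2, 5, 8), lambda y: tri(y, 3, 6, 9),
     lambda z: tri(z, 4, 7, 10)),
]
c_prime = mamdani(3.0, 4.0, rules)          # fuzzy output; defuzzify next
```

The result `c_prime` is still a fuzzy set over Z; obtaining a crisp value from it is exactly the defuzzification problem discussed next.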

In Mamdani's application [60], two fuzzy inference sys-tems were used as two controllers to generate the heat inputto the boiler and throttle opening of the engine cylinder,respectively, in order to regulate the steam pressure in theboiler and the speed of the engine. Since the plant takesonly crisp values as inputs, we have to use a defuzzi�er toconvert a fuzzy set to a crisp value. Defuzzi�cation refersto the way a crisp value is extracted from a fuzzy set asa representative value. The most frequently used defuzzi-�cation strategy is the centroid of area, which is de�nedas zCOA = RZ �C0(z)z dzRZ �C0(z) dz ; (18)where �C0(z) is the aggregated output MF. This formula isreminiscent of the calculation of expected values in prob-ability distributions. Other defuzzi�cation strategies arisefor speci�c applications, which includes bisector of area,mean of maximum, largest of maximum, and smallest ofmaximum, and so on. Figure 14 demonstrate these de-fuzzi�cation strategies. Generally speaking, these defuzzi-�cation methods are computation intensive and there is norigorous way to analyze them except through experiment-based studies. Other more exible defuzzi�cation methodscan be found in [73], [115], [80].Both Figure 12 and 13 conform to the fuzzy reasoningde�ned previously. In practice, however, a fuzzy inferencesystem may have certain reasoning mechanisms that do notfollow the strict de�nition of the compositional rule of infer-ence. For instance, one might use either min or product forcomputing �ring strengths and/or quali�ed rule outputs.Another variation is to use pointwise summation (sum)instead of max in the standard fuzzy reasoning, thoughsum is not really a fuzzy OR operators. 
An advantage of this sum-product composition [47] is that the final crisp output via centroid defuzzification is equal to the weighted average of each rule's crisp output, where the weighting factor for a rule is equal to its firing strength multiplied by the area of the rule's output MF, and the crisp output of a rule is equal to the centroid defuzzified value of its output MF. This reduces the computation burden if we can obtain the area and the centroid of each output MF in advance.
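As a quick numerical sketch of equation (18), the centroid of area can be approximated on a discretized output universe. The aggregated MF used below (a clipped triangle) is a made-up illustration, not one taken from the text:

```python
import numpy as np

# Discretize the output universe Z and an aggregated output MF mu_C'(z);
# this particular clipped-triangle aggregate is purely illustrative.
z = np.linspace(0.0, 10.0, 1001)
mu = np.minimum(0.8, np.maximum(0.0, 1.0 - np.abs(z - 5.0) / 3.0))

# Centroid of area, equation (18); on a uniform grid the dz factors cancel.
z_coa = (mu * z).sum() / mu.sum()
print(z_coa)  # the MF is symmetric about z = 5, so z_coa is approximately 5
```

Since this MF is symmetric about z = 5, the centroid lands at 5, which gives a handy sanity check for the implementation.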


Fig. 14. Various defuzzification schemes for obtaining a crisp output.

D.2 Sugeno Fuzzy Model

The Sugeno fuzzy model (also known as the TSK fuzzy model) was proposed by Takagi, Sugeno, and Kang [98], [91] in an effort to develop a systematic approach to generating fuzzy rules from a given input-output data set. A typical fuzzy rule in a Sugeno fuzzy model has the form

if x is A and y is B then z = f(x, y),

where A and B are fuzzy sets in the antecedent, while z = f(x, y) is a crisp function in the consequent. Usually f(x, y) is a polynomial in the input variables x and y, but it can be any function as long as it can appropriately describe the output of the system within the fuzzy region specified by the antecedent of the rule. When f(x, y) is a first-order polynomial, the resulting fuzzy inference system is called a first-order Sugeno fuzzy model, which was originally proposed in [98], [91]. When f is a constant, we then have a zero-order Sugeno fuzzy model, which can be viewed either as a special case of the Mamdani fuzzy inference system, in which each rule's consequent is specified by a fuzzy singleton (or a pre-defuzzified consequent), or as a special case of the Tsukamoto fuzzy model (to be introduced later), in which each rule's consequent is specified by an MF of a step function crossing at the constant. Moreover, a zero-order Sugeno fuzzy model is functionally equivalent to a radial basis function network under certain minor constraints [32].

It should be pointed out that the output of a zero-order Sugeno model is a smooth function of its input variables as long as the neighboring MF's in the premise have enough overlap. In other words, the overlap of MF's in the consequent does not have a decisive effect on the smoothness of the interpolation; it is the overlap of the MF's in the premise that determines the smoothness of the resulting input-output behavior.
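A two-rule first-order Sugeno model can be sketched in a few lines. The Gaussian antecedent MFs, their centers and widths, and the consequent coefficients below are all hypothetical choices for illustration, not values from the text:

```python
import math

def gauss(x, c, s):
    # Gaussian membership function (one common MF choice)
    return math.exp(-0.5 * ((x - c) / s) ** 2)

def sugeno_two_rules(x, y):
    # Rule 1: if x is A1 and y is B1 then z1 = x + 2*y + 0.5
    # Rule 2: if x is A2 and y is B2 then z2 = -x + 3.0
    # All MF parameters and consequent coefficients are illustrative only.
    w1 = gauss(x, 0.0, 1.0) * gauss(y, 0.0, 1.0)   # product as fuzzy AND
    w2 = gauss(x, 2.0, 1.0) * gauss(y, 2.0, 1.0)
    z1 = x + 2.0 * y + 0.5
    z2 = -x + 3.0
    return (w1 * z1 + w2 * z2) / (w1 + w2)          # weighted average

print(sugeno_two_rules(0.0, 0.0))
```

Because the overall output is a firing-strength-weighted average of the rule outputs, it varies smoothly between z1 and z2 as the input moves from one fuzzy region to the other.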


Fig. 15. The Sugeno fuzzy model.

Figure 15 shows the fuzzy reasoning procedure for a first-order Sugeno fuzzy model. Note that the aggregator and defuzzifier blocks in Figure 11 are replaced by the operation of weighted average, thus avoiding the time-consuming procedure of defuzzification. In practice, sometimes the weighted average operator is replaced with the weighted sum operator (that is, z = w_1 z_1 + w_2 z_2 in Figure 15) in order to further reduce the computation load, especially in training a fuzzy inference system. However, this simplification could lead to the loss of MF linguistic meanings unless the sum of firing strengths (that is, \sum_i w_i) is close to unity.

D.3 Tsukamoto Fuzzy Model

In the Tsukamoto fuzzy model [101], the consequent of each fuzzy if-then rule is represented by a fuzzy set with a monotonic MF, as shown in Figure 16. As a result, the inferred output of each rule is defined as a crisp value induced by the rule's firing strength. The overall output is taken as the weighted average of each rule's output. Figure 16 illustrates the whole reasoning procedure for a two-input two-rule system.
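The Tsukamoto step, inverting a monotonic consequent MF to turn a firing strength into a crisp value, can be sketched as follows. For easy inversion the consequent MFs are taken to be linear, and the firing strengths are supplied directly rather than computed from antecedent MFs; both choices are hypothetical simplifications:

```python
def tsukamoto(w1, w2):
    # Each consequent MF is monotonic, so it can be inverted to map a
    # firing strength back to a crisp value. Illustrative linear MFs:
    #   C1 increasing on [0, 10]: mu_C1(z) = z / 10      -> z1 = 10 * w1
    #   C2 decreasing on [0, 10]: mu_C2(z) = 1 - z / 10  -> z2 = 10 * (1 - w2)
    z1 = 10.0 * w1
    z2 = 10.0 * (1.0 - w2)
    # Overall output: firing-strength-weighted average of the crisp outputs
    return (w1 * z1 + w2 * z2) / (w1 + w2)

print(tsukamoto(0.8, 0.3))
```

Since each rule already yields a crisp value, no separate defuzzification pass is needed, which is exactly the computational appeal noted in the text.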


Fig. 16. The Tsukamoto fuzzy model.

Since each rule infers a crisp output, the Tsukamoto fuzzy model aggregates each rule's output by the method of weighted average and thus also avoids the time-consuming process of defuzzification.

D.4 Partition Styles for Fuzzy Models

By now it should be clear that the spirit of fuzzy inference systems resembles that of "divide and conquer": the antecedents of fuzzy rules partition the input space into a number of local fuzzy regions, while the consequents describe the behavior within a given region via various constituents. The consequent constituent could be an output MF (Mamdani and Tsukamoto fuzzy models), a constant (zero-order Sugeno model), or a linear equation (first-order Sugeno model). Different consequent constituents result in different fuzzy inference systems, but their antecedents are always the same. Therefore the following discussion of methods of partitioning input spaces to form the antecedents of fuzzy rules is applicable to all three types of fuzzy inference systems.

- Grid partition: Figure 17 (a) illustrates a typical grid partition in a two-dimensional input space. This partition method is often chosen in designing a fuzzy controller, which usually involves only several state variables as the inputs to the controller. This partition strategy needs only a small number of MF's for each input. However, it encounters problems when we have a moderately large number of inputs. For instance, a fuzzy model with 10 inputs and two MF's on each input would result in 2^10 = 1024 fuzzy if-then rules, which is prohibitively large. This problem, usually referred to as the curse of dimensionality, can be alleviated by the other partition strategies introduced below.
- Tree partition: Figure 17 (b) shows a typical tree partition, in which each region can be uniquely specified along a corresponding decision tree. The tree partition relieves the problem of an exponential increase in the number of rules. However, more MF's for each input are needed to define these fuzzy regions, and these MF's do not usually bear clear linguistic meanings such as "small," "big," and so on.
- Scatter partition: As shown in Figure 17 (c), by covering a subset of the whole input space that characterizes a region of possible occurrence of the input vectors, the scatter partition can also limit the number of rules to a reasonable amount.

Fig. 17. Various methods for partitioning the input space: (a) grid partition; (b) tree partition; (c) scatter partition.

D.5 Neuro-Fuzzy Modeling

The process for constructing a fuzzy inference system is usually called fuzzy modeling, which has the following features:

- Due to the rule structure of a fuzzy inference system, it is easy to incorporate human expertise about the target system directly into the modeling process. Namely, fuzzy modeling takes advantage of domain knowledge that might not be easily or directly employed in other modeling approaches.
- When the input-output data of a system to be modeled is available, conventional system identification techniques can be used for fuzzy modeling. In other words, the use of numerical data also plays an important role in fuzzy modeling, just as in other mathematical modeling methods.

A common practice is to use domain knowledge for structure determination (that is, to determine the relevant inputs, the number of MF's for each input, the number of rules, the types of fuzzy models, and so on) and numerical data for parameter identification (that is, to identify the parameter values that yield the best performance).
In particular, the term neuro-fuzzy modeling refers to the way of applying various learning techniques developed in the neural network literature to fuzzy inference systems. In the subsequent sections, we will apply the concept of the adaptive network, which is a generalization of the common back-propagation neural network, to tackle the parameter identification problem in a fuzzy inference system.

III. Adaptive Networks

This section describes the architectures and learning procedures of adaptive networks, which are a superset of all kinds of neural network paradigms with supervised learning capability. In particular, we shall address two of the most popular network paradigms adopted in the neural network literature: the back-propagation neural network (BPNN) and the radial basis function network (RBFN). Other network paradigms that can be interpreted as a set of fuzzy if-then rules are described in the next section.

A. Architecture

As the name implies, an adaptive network (Figure 18) is a network structure whose overall input-output behavior is determined by the values of a collection of modifiable parameters. More specifically, the configuration of an adaptive network is composed of a set of nodes connected through directed links, where each node is a process unit that performs a static node function on its incoming signals to generate a single node output, and each link specifies the direction of signal flow from one node to another. Usually a node function is a parameterized function with modifiable parameters; by changing these parameters, we are actually changing the node function as well as the overall behavior of the adaptive network.

In the most general case, an adaptive network is heterogeneous and each node may have a different node function. Also remember that each link in an adaptive network is merely used to specify the propagation direction of a node's output; generally there are no weights or parameters associated with links. Figure 18 shows a typical adaptive network with two inputs and two outputs.


Fig. 18. A feedforward adaptive network in layered representation.

The parameters of an adaptive network are distributed into the network's nodes, so each node has a local parameter set. The union of these local parameter sets is the network's overall parameter set. If a node's parameter set is non-empty, then its node function depends on the parameter values; we use a square to represent this kind of adaptive node. On the other hand, if a node has an empty parameter set, then its function is fixed; we use a circle to denote this type of fixed node.

Adaptive networks are generally classified into two categories on the basis of the type of connections they have: feedforward and recurrent types. The adaptive network shown in Figure 18 is a feedforward network, since the output of each node propagates from the input side (left) to the output side (right) unanimously. If there is a feedback link that forms a circular path in a network, then the network is a recurrent network; Figure 19 is an example. (From the viewpoint of graph theory, a feedforward network is represented by an acyclic directed graph which contains no directed cycles, while a recurrent network always contains at least one directed cycle.)

Fig. 19. A recurrent adaptive network.

In the layered representation of the feedforward adaptive network in Figure 18, there are no links between nodes in the same layer, and outputs of nodes in a specific layer are always connected to nodes in succeeding layers. This representation is usually preferred because of its modularity, in that nodes in the same layer have the same functionality or generate the same level of abstraction about input vectors.

Another representation of feedforward networks is the topological ordering representation, which labels the nodes in an ordered sequence 1, 2, 3, ..., such that there are no links from node i to node j whenever i ≥ j. Figure 20 is the topological ordering representation of the network in Figure 18. This representation is less modular than the layered representation, but it facilitates the formulation of the learning rule, as will be seen in the next section. (Note that the topological ordering representation is in fact a special case of the layered representation, with one node per layer.)

Fig. 20. A feedforward adaptive network in topological ordering representation.

Conceptually, a feedforward adaptive network is actually a static mapping between its input and output spaces; this mapping may be either a simple linear relationship or a highly nonlinear one, depending on the structure (node arrangement and connections, and so on) of the network and the function of each node. Here our aim is to construct a network for achieving a desired nonlinear mapping that is regulated by a data set consisting of a number of desired input-output pairs of a target system. This data set is usually called the training data set, and the procedure we follow in adjusting the parameters to improve the performance of the network is often referred to as the learning rule or learning algorithm. Usually an adaptive network's performance is measured as the discrepancy between the desired output and the network's output under the same input conditions. This discrepancy is called the error measure, and it can assume different forms for different applications. Generally speaking, a learning rule is derived by applying a specific optimization technique to a given error measure.

Before introducing a basic learning algorithm for adaptive networks, we shall present several examples of adaptive networks.

Example 3: An adaptive network with a single linear node.

Figure 21 is an adaptive network with a single node specified by

x_3 = f_3(x_1, x_2; a_1, a_2, a_3) = a_1 x_1 + a_2 x_2 + a_3,

where x_1 and x_2 are inputs and a_1, a_2, and a_3 are modifiable parameters. Obviously this function defines a plane in x_1-x_2-x_3 space, and by setting appropriate values for the parameters, we can place this plane arbitrarily. By adopting the squared error as the error measure for this network, we can identify the optimal parameters via the linear least-squares estimation method.
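The parameter identification in Example 3 amounts to an ordinary least-squares solve. The training pairs below are synthetic, generated from a hypothetical target plane so that the recovered parameters can be checked:

```python
import numpy as np

# Synthetic training pairs (x1, x2) -> d, lying on a hypothetical plane
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]])
d = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5

# x3 = a1*x1 + a2*x2 + a3 is linear in (a1, a2, a3), so append a column
# of ones for the bias a3 and solve the least-squares problem directly.
A = np.hstack([X, np.ones((len(X), 1))])
params, *_ = np.linalg.lstsq(A, d, rcond=None)
print(params)
```

Because the data lie exactly on the plane, the solve recovers the generating parameters (2, -1, 0.5); with noisy data it would return the squared-error-optimal plane instead.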

Fig. 21. A linear single-node adaptive network.

Example 4: A building block for the perceptron or the back-propagation neural network.

If we add another node so that the output of the adaptive network in Figure 21 can take only the two values 0 and 1, then the nonlinear network shown in Figure 22 is obtained. Specifically, the node outputs are expressed as

x_3 = f_3(x_1, x_2; a_1, a_2, a_3) = a_1 x_1 + a_2 x_2 + a_3,

and

x_4 = f_4(x_3) = 1 if x_3 ≥ 0; 0 if x_3 < 0,

where f_3 is a linearly parameterized function and f_4 is a step function which maps x_3 to either 0 or 1. The overall function of this network can be viewed as a linear classifier: the first node forms a decision boundary as a straight line in x_1-x_2 space, and the second node indicates which half plane the input vector (x_1, x_2) resides in. Obviously we can form an equivalent network with a single node whose function is the composition of f_3 and f_4; the resulting node is the building block of the classical perceptron.

Since the step function is discontinuous at one point and flat at all the other points, it is not suitable for learning procedures based on gradient descent. One way to get around this difficulty is to use the sigmoid function:

x_4 = f_4(x_3) = \frac{1}{1 + e^{-x_3}},

which is a continuous and differentiable approximation to the step function. The composition of f_3 and this differentiable f_4 is the building block for the back-propagation neural network in the following example.
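A minimal sketch of the two node functions contrasted in Example 4, with hypothetical parameter values; it shows the step output saturating to 0 or 1 while the sigmoid gives a graded, differentiable response:

```python
import math

def linear_node(x1, x2, a1, a2, a3):
    return a1 * x1 + a2 * x2 + a3

def step(x3):
    # Perceptron building block: discontinuous at 0, flat elsewhere,
    # so it provides no usable gradient information.
    return 1.0 if x3 >= 0.0 else 0.0

def sigmoid(x3):
    # Differentiable approximation to the step function
    return 1.0 / (1.0 + math.exp(-x3))

x3 = linear_node(1.0, 2.0, 0.5, -1.0, 0.25)  # = -1.25 with these parameters
print(step(x3), sigmoid(x3))
```

The step output jumps between the two half planes, while the sigmoid varies smoothly across the decision boundary, which is what makes gradient-based learning possible.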

Fig. 22. A nonlinear single-node adaptive network.

Example 5: A back-propagation neural network.

Figure 23 is a typical architecture for a back-propagation neural network with three inputs, two outputs, and three hidden nodes that do not connect directly to either inputs or outputs. (The term back-propagation refers to the way the learning procedure is performed, that is, by propagating gradient information from the network's outputs to its inputs; details are introduced next.) Each node in a network of this kind has the same node function, which is the composition of a linear f_3 and a sigmoidal f_4 as in Example 4. For instance, the node function of node 7 in Figure 23 is

x_7 = \frac{1}{1 + \exp[-(w_{4,7} x_4 + w_{5,7} x_5 + w_{6,7} x_6 + t_7)]},

where x_4, x_5, and x_6 are outputs from nodes 4, 5, and 6, respectively, and {w_{4,7}, w_{5,7}, w_{6,7}, t_7} is the parameter set. Usually we view w_{i,j} as the weight associated with the link connecting nodes i and j, and t_j as the threshold associated with node j. However, it should be noted that this weight-link association is only valid in this type of network. In general, a link only indicates the signal flow direction and the causal relationship between connected nodes, as will be shown in other types of adaptive networks in the subsequent development. A more detailed discussion of the structure and learning rules of artificial neural networks will be presented later.
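A forward pass through a 3-3-2 network of such nodes can be sketched as follows; all weights and thresholds are made up for illustration:

```python
import math

def node(inputs, weights, threshold):
    # Node function of Example 5: linear combination followed by a sigmoid,
    # e.g. x7 = 1 / (1 + exp(-(w47*x4 + w57*x5 + w67*x6 + t7))).
    s = sum(w * x for w, x in zip(weights, inputs)) + threshold
    return 1.0 / (1.0 + math.exp(-s))

def forward_3_3_2(x, W_hid, t_hid, W_out, t_out):
    # Layer 1: three hidden nodes (4, 5, 6), each fed by all three inputs
    hidden = [node(x, W_hid[j], t_hid[j]) for j in range(3)]
    # Layer 2: two output nodes (7, 8), each fed by all three hidden outputs
    return [node(hidden, W_out[k], t_out[k]) for k in range(2)]

# Hypothetical weights and thresholds, purely for illustration
W_hid = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5], [-0.6, 0.4, 0.2]]
t_hid = [0.0, -0.1, 0.2]
W_out = [[1.0, -1.0, 0.5], [0.7, 0.2, -0.3]]
t_out = [0.1, -0.2]

y = forward_3_3_2([1.0, 0.5, -1.0], W_hid, t_hid, W_out, t_out)
print(y)
```

Every node output stays in (0, 1) because of the sigmoid; the learning procedure of the next section adjusts the weights and thresholds so that these outputs track the desired ones.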


Fig. 23. A 3-3-2 neural network.

B. Back-Propagation Learning Rule for Feedforward Networks

The central part of a learning rule for an adaptive network concerns how to recursively obtain a gradient vector in which each element is defined as the derivative of an error measure with respect to a parameter. This is done by means of the chain rule, and the method is generally referred to as the back-propagation learning rule because the gradient vector is calculated in the direction opposite to the flow of the output of each node. Details follow below.

Suppose that a given feedforward adaptive network in the layered representation has L layers and layer l (l = 0, 1, ..., L; l = 0 represents the input layer) has N(l) nodes. Then the output and function of node i (i = 1, ..., N(l)) of layer l can be represented as x_{l,i} and f_{l,i}, respectively, as shown in Figure 24 (a). Without loss of generality, we assume there are no jumping links, that is, links connecting non-consecutive layers. Since the output of a node depends on the incoming signals and the parameter set of the node, we have the following general expression for the node function f_{l,i}:

x_{l,i} = f_{l,i}(x_{l-1,1}, ..., x_{l-1,N(l-1)}; \alpha, \beta, \gamma, ...),   (19)

where \alpha, \beta, \gamma, etc. are the parameters pertaining to this node.


Fig. 24. Our notational conventions: (a) layered representation; (b) topological ordering representation.

Assuming the given training data set has P entries, we can define an error measure for the p-th (1 ≤ p ≤ P) entry of the training data as the sum of squared errors:

E_p = \sum_{k=1}^{N(L)} (d_k - x_{L,k})^2,   (20)

where d_k is the k-th component of the p-th desired output vector and x_{L,k} is the k-th component of the actual output vector produced by presenting the p-th input vector to the network. (For notational simplicity, we omit the subscript p for both d_k and x_{L,k}.) Obviously, when E_p is equal to zero, the network is able to reproduce exactly the desired output vector in the p-th training data pair. Thus our task here is to minimize an overall error measure, which is defined as E = \sum_{p=1}^{P} E_p.

Remember that the definition of E_p in equation (20) is not universal; other definitions of E_p are possible for specific situations or applications. Therefore we shall avoid using an explicit expression for the error measure E_p in order to emphasize generality. In addition, we assume that E_p depends on the output nodes only; more general situations will be discussed below.

To use the gradient method to minimize the error measure, first we have to obtain the gradient vector. Before calculating the gradient vector, we should observe the following causal chain:

change in parameter \alpha ⇒ change in the output of the node containing \alpha ⇒ change in the output of the final layer ⇒ change in the error measure,

where the arrows ⇒ indicate causal relationships. In other words, a small change in a parameter \alpha will affect the output of the node containing \alpha; this in turn will affect the output of the final layer and thus the error measure. Therefore the basic concept in calculating the gradient vector of the parameters is to pass a form of derivative information starting from the output layer and going backward layer by layer until the input layer is reached.

To facilitate the discussion, we define the error signal \epsilon_{l,i} as the derivative of the error measure E_p with respect to the output of node i in layer l, taking both direct and indirect paths into consideration. In symbols,

\epsilon_{l,i} = \frac{\partial^+ E_p}{\partial x_{l,i}}.   (21)

This expression was called the ordered derivative by Werbos [109]. The difference between the ordered derivative and the ordinary partial derivative lies in the way we view the function to be differentiated. For an internal node output x_{l,i} (where l ≠ L), the partial derivative \partial E_p / \partial x_{l,i} is equal to zero, since E_p does not depend on x_{l,i} directly. However, it is obvious that E_p does depend on x_{l,i} indirectly, since a change in x_{l,i} will propagate through indirect paths to the output layer and thus produce a corresponding change in the value of E_p. Therefore \epsilon_{l,i} can be viewed as the ratio of these two changes when they are made infinitesimal.
The following example demonstrates the difference between the ordered derivative and the ordinary partial derivative.

Example 6: Ordered derivatives and ordinary partial derivatives.

Consider the simple adaptive network shown in Figure 25, where z is a function of x and y, and y is in turn a function of x:

y = f(x),
z = g(x, y).

For the ordinary partial derivative \partial z / \partial x, we assume that all the other input variables (in this case, y) are constant:

\frac{\partial z}{\partial x} = \frac{\partial g(x, y)}{\partial x}.

In other words, we assume the direct inputs x and y are independent, without paying attention to the fact that y is actually a function of x. For the ordered derivative, we take this indirect causal relationship into consideration:

\frac{\partial^+ z}{\partial x} = \frac{\partial g(x, f(x))}{\partial x} = \left.\frac{\partial g(x, y)}{\partial x}\right|_{y=f(x)} + \left.\frac{\partial g(x, y)}{\partial y}\right|_{y=f(x)} \cdot \frac{\partial f(x)}{\partial x}.

Therefore the ordered derivative takes into consideration both the direct and indirect paths that lead to the causal relationship.
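The distinction in Example 6 can be checked numerically with finite differences; the concrete functions f and g below are arbitrary choices, not from the text:

```python
def f(x):          # y = f(x) = x**2 (hypothetical)
    return x * x

def g(x, y):       # z = g(x, y) = x * y (hypothetical)
    return x * y

x0 = 2.0
h = 1e-6

# Ordinary partial derivative: y is held fixed at f(x0)
partial = (g(x0 + h, f(x0)) - g(x0 - h, f(x0))) / (2 * h)

# Ordered derivative: the indirect path through y = f(x) is included
ordered = (g(x0 + h, f(x0 + h)) - g(x0 - h, f(x0 - h))) / (2 * h)

print(partial, ordered)
```

At x0 = 2 the partial derivative is y = 4, while the ordered derivative is 4 + x * 2x = 12, matching the sum of the direct and indirect terms in the formula above.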

Fig. 25. Ordered derivatives and ordinary partial derivatives (see text for details).

The error signal for the i-th output node (at layer L) can be calculated directly:

\epsilon_{L,i} = \frac{\partial^+ E_p}{\partial x_{L,i}} = \frac{\partial E_p}{\partial x_{L,i}}.   (22)

This is equal to \epsilon_{L,i} = -2(d_i - x_{L,i}) if E_p is defined as in equation (20). For the internal (non-output) node at the i-th position of layer l, the error signal can be derived by the chain rule:

\epsilon_{l,i} = \frac{\partial^+ E_p}{\partial x_{l,i}} = \sum_{m=1}^{N(l+1)} \frac{\partial^+ E_p}{\partial x_{l+1,m}} \frac{\partial f_{l+1,m}}{\partial x_{l,i}} = \sum_{m=1}^{N(l+1)} \epsilon_{l+1,m} \frac{\partial f_{l+1,m}}{\partial x_{l,i}},   (23)

where 0 ≤ l ≤ L-1. That is, the error signal of an internal node at layer l can be expressed as a linear combination of the error signals of the nodes at layer l+1. Therefore for any l and i (0 ≤ l ≤ L and 1 ≤ i ≤ N(l)), we can find \epsilon_{l,i} = \partial^+ E_p / \partial x_{l,i} by first applying equation (22) once to get the error signals at the output layer, and then applying equation (23) iteratively until we reach the desired layer l. Since the error signals are obtained sequentially from the output layer back to the input layer, this learning paradigm is called the back-propagation learning rule by Rumelhart, Hinton and Williams [79].

The gradient vector is defined as the derivative of the error measure with respect to each parameter, so we have to apply the chain rule again to find the gradient vector. If \alpha is a parameter of the i-th node at layer l, we have

\frac{\partial^+ E_p}{\partial \alpha} = \frac{\partial^+ E_p}{\partial x_{l,i}} \frac{\partial f_{l,i}}{\partial \alpha} = \epsilon_{l,i} \frac{\partial f_{l,i}}{\partial \alpha}.   (24)

Note that if we allow the parameter \alpha to be shared between different nodes, then equation (24) should be changed to a more general form:

\frac{\partial^+ E_p}{\partial \alpha} = \sum_{x^* \in S} \frac{\partial^+ E_p}{\partial x^*} \frac{\partial f^*}{\partial \alpha},   (25)

where S is the set of nodes containing \alpha as a parameter and f^* is the node function for calculating x^*.

The derivative of the overall error measure E with respect to \alpha is

\frac{\partial^+ E}{\partial \alpha} = \sum_{p=1}^{P} \frac{\partial^+ E_p}{\partial \alpha}.   (26)

Accordingly, the update formula for the generic parameter \alpha is

\Delta\alpha = -\eta \frac{\partial^+ E}{\partial \alpha},   (27)

in which \eta is the learning rate, which can be further expressed as

\eta = \frac{\kappa}{\sqrt{\sum_\alpha \left(\frac{\partial E}{\partial \alpha}\right)^2}},   (28)

where \kappa is the step size, the length of each transition along the gradient direction in the parameter space. Usually we can change the step size to vary the speed of convergence; two heuristic rules for updating the value of \kappa are described in [29].

When an n-node feedforward network is represented in its topological order, we can envision the error measure E_p as the output of an additional node with index n+1, whose node function f_{n+1} can be defined on the outputs of any nodes with smaller indices; see Figure 24 (b). (Therefore E_p may depend directly on any internal nodes.) Applying the chain rule again, we have the following concise formula for calculating the error signal \epsilon_i = \partial^+ E_p / \partial x_i:

\frac{\partial^+ E_p}{\partial x_i} = \frac{\partial f_{n+1}}{\partial x_i} + \sum_{i<j\le n} \frac{\partial^+ E_p}{\partial x_j} \frac{\partial f_j}{\partial x_i},   (29)

or

\epsilon_i = \frac{\partial f_{n+1}}{\partial x_i} + \sum_{i<j\le n} \epsilon_j \frac{\partial f_j}{\partial x_i},   (30)

where the first term shows the direct effect of x_i on E_p via the direct path from node i to node n+1, and each product term in the summation indicates the indirect effect of x_i on E_p. Once we find the error signal for each node, the gradient vector for the parameters is derived as before.

Another simple and systematic way to calculate the error signals is through the representation of the error-propagation network (or sensitivity model), which is obtained from the original adaptive network by reversing the links and supplying the error signals at the output layer as inputs. The following example illustrates this idea.

Example 7: An adaptive network and its error-propagation model.

Figure 26 (a) is an adaptive network, where each node is indexed by a unique number. Again, we use f_i and x_i to denote the function and output of node i.
In order to calculate the error signals at internal nodes, an error-propagation network is constructed in Figure 26 (b), where the output of node i is the error signal of this node in the original adaptive network. In symbols, if we choose the squared error measure for E_p, then we have the following:

\epsilon_9 = \frac{\partial^+ E_p}{\partial x_9} = \frac{\partial E_p}{\partial x_9} = -2(d_9 - x_9),

\epsilon_8 = \frac{\partial^+ E_p}{\partial x_8} = \frac{\partial E_p}{\partial x_8} = -2(d_8 - x_8).

(Thus nodes 9 and 8 in the error-propagation network are only buffer nodes.)

\epsilon_7 = \frac{\partial^+ E_p}{\partial x_7} = \frac{\partial^+ E_p}{\partial x_8}\frac{\partial f_8}{\partial x_7} + \frac{\partial^+ E_p}{\partial x_9}\frac{\partial f_9}{\partial x_7} = \epsilon_8 \frac{\partial f_8}{\partial x_7} + \epsilon_9 \frac{\partial f_9}{\partial x_7},

\epsilon_6 = \frac{\partial^+ E_p}{\partial x_6} = \frac{\partial^+ E_p}{\partial x_8}\frac{\partial f_8}{\partial x_6} + \frac{\partial^+ E_p}{\partial x_9}\frac{\partial f_9}{\partial x_6} = \epsilon_8 \frac{\partial f_8}{\partial x_6} + \epsilon_9 \frac{\partial f_9}{\partial x_6}.

Similar expressions can be written for the error signals of nodes 1, 2, 3, 4 and 5. It is interesting to observe that in the error-propagation net, if we associate each link connecting nodes i and j (i < j) with a weight w_{ij} = \partial f_j / \partial x_i, then each node performs a linear function and the error-propagation net is actually a linear network. The error-propagation network is helpful in correctly formulating the expressions for error signals. The same concept applies to recurrent networks with either synchronous or continuous operations [34].
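The same bookkeeping can be verified on a toy fan-out in the spirit of Example 7: one internal node feeding two output nodes, with invented node functions so that the chain-rule error signal can be checked against a numerical ordered derivative:

```python
import math

# Hypothetical fan-out: node 7 feeds both output nodes 8 and 9,
#   x8 = f8(x7) = x7**2 ,  x9 = f9(x7) = sin(x7),
#   Ep = (d8 - x8)**2 + (d9 - x9)**2  (squared error measure).
d8, d9 = 1.0, 0.0

def Ep(x7):
    x8, x9 = x7 ** 2, math.sin(x7)
    return (d8 - x8) ** 2 + (d9 - x9) ** 2

x7 = 0.7
x8, x9 = x7 ** 2, math.sin(x7)

# Error signals, equations (22)-(23): output nodes first, then propagate back
eps8 = -2.0 * (d8 - x8)
eps9 = -2.0 * (d9 - x9)
eps7 = eps8 * (2.0 * x7) + eps9 * math.cos(x7)   # sum over both successors

# Cross-check against a numerical ordered derivative of Ep w.r.t. x7
h = 1e-6
numeric = (Ep(x7 + h) - Ep(x7 - h)) / (2 * h)
print(eps7, numeric)
```

The error signal of the fan-out node is the sum of its successors' error signals weighted by the local derivatives, exactly the linear combination that makes the error-propagation net a linear network.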


Fig. 26. (a) An adaptive network and (b) its error-propagation model.

Depending on the applications we are interested in, two types of learning paradigms for adaptive networks are available to suit our needs. In off-line learning (or batch learning), the update formula for parameter \alpha is based on equation (26), and the update action takes place only after the whole training data set has been presented, that is, only after each epoch or sweep. On the other hand, in on-line learning (or pattern learning), the parameters are updated immediately after each input-output pair has been presented, and the update formula is based on equation (24). In practice, it is possible to combine these two learning modes and update the parameters after k training data entries have been presented, where k is between 1 and P and is sometimes referred to as the epoch size.

C. Back-Propagation Learning Rule for Recurrent Networks

For recurrent adaptive networks, the back-propagation learning rule is still applicable if we can transform the network configurations to be of the feedforward type. To simplify our notation, we shall use the network in Figure 27 for our discussion, where x_1 and x_2 are inputs and x_5 and x_6 are outputs. Because it has directional loops 3-4-5, 3-4-6-5, and 6 (a self loop), this is a typical recurrent network with node functions denoted as follows:

x_3 = f_3(x_1, x_5),
x_4 = f_4(x_2, x_3),
x_5 = f_5(x_4, x_6),
x_6 = f_6(x_4, x_6).   (31)


Fig. 27. A simple recurrent network.

In order to correctly derive the back-propagation learning rule for the recurrent net in Figure 27, we have to distinguish two operating modes through which the network may satisfy equation (31). These two modes are synchronous operation and continuous operation.

For continuously operated networks, all nodes continuously change their outputs until equation (31) is satisfied. This operating mode is of particular interest for analog circuit implementations, where a certain kind of dynamical evolution rule is imposed on the network. For instance, the dynamical formula for node 3 can be written as

\tau_3 \frac{dx_3}{dt} + x_3 = f_3(x_1, x_5).   (32)

Similar formulas can be devised for the other nodes. It is obvious that when x_3(t) stops changing (i.e., dx_3/dt = 0), equation (32) leads to the correct fixed points satisfying equation (31). However, this kind of recurrent network does pose some problems in software simulation, as the stable fixed point satisfying equation (31) may be hard to find. Here we shall not go into details about continuously operated networks. A detailed treatment of continuously operated networks which use the Mason gain formula [62] as a learning rule can be found in [34].

On the other hand, if a network is operated synchronously, all nodes change their outputs simultaneously according to a global clock signal, and there is a time delay associated with each link. This synchronization is reflected by adding the time t as an argument to the output of each node in equation (31) (assuming there is a unit time delay associated with each link):

x_3(t+1) = f_3(x_1(t), x_5(t)),
x_4(t+1) = f_4(x_2(t), x_3(t)),
x_5(t+1) = f_5(x_4(t), x_6(t)),
x_6(t+1) = f_6(x_4(t), x_6(t)).   (33)

C.1 Back-Propagation Through Time (BPTT)

When using synchronously operated networks, we usually are interested in identifying a set of parameters that will make the output of a node (or several nodes) follow a given trajectory (or trajectories) in a discrete time domain. This problem of tracking or trajectory following is usually solved by using a method called unfolding of time to transform a recurrent network into a feedforward one, as long as the time t does not exceed a reasonable maximum T. This idea was originally introduced by Minsky and Papert [64] and combined with back-propagation by Rumelhart, Hinton, and Williams [79]. Consider the recurrent net in Figure 27, which is redrawn in Figure 28 (a) with the same configuration except that the input variables x_1 and x_2 are omitted for simplicity. The same network in a feedforward architecture is shown in Figure 28 (b) with the time index t running from 1 to 4. In other words, for a recurrent net that synchronously evaluates each of its node functions from t = 1, 2, ..., T, we can simply duplicate all units T times and arrange the resulting network in a layered feedforward manner.
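A sketch of BPTT on a hypothetical single-node recurrent net x_t = tanh(a * x_{t-1}) (not the network of Figure 28): the net is unrolled for T steps, the gradient contributions of the shared parameter a are summed across the unfolded copies, and the result is checked against a finite-difference estimate. The target trajectory is made up:

```python
import math

# Single-node recurrent net x_t = tanh(a * x_{t-1}), unfolded for T steps.
# Every unfolded copy shares the same parameter a, so the gradient of the
# total error E = sum_t (d_t - x_t)**2 accumulates one term per time step.
T, x0 = 4, 0.5
d = [0.1, 0.2, 0.3, 0.4]          # hypothetical target trajectory

def forward(a):
    xs, x = [], x0
    for _ in range(T):
        x = math.tanh(a * x)
        xs.append(x)
    return xs

def bptt_grad(a):
    xs = forward(a)
    grad, delta = 0.0, 0.0        # delta = dE/dx_t flowing backward in time
    for t in reversed(range(T)):
        x_prev = xs[t - 1] if t > 0 else x0
        delta += -2.0 * (d[t] - xs[t])      # direct error at step t
        pre = 1.0 - xs[t] ** 2              # tanh'(a * x_prev)
        grad += delta * pre * x_prev        # shared-parameter contribution
        delta = delta * pre * a             # back through the recurrence
    return grad

a0, h = 0.8, 1e-6
numeric = (sum((dt - xt) ** 2 for dt, xt in zip(d, forward(a0 + h)))
           - sum((dt - xt) ** 2 for dt, xt in zip(d, forward(a0 - h)))) / (2 * h)
print(bptt_grad(a0), numeric)
```

The backward loop is exactly the "parameter node" picture: the error signal of the shared parameter collects contributions from the layers at every time instant.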


Fig. 28. (a) A synchronously operated recurrent network and (b) its feedforward equivalent obtained via unfolding of time.

It is obvious that the two networks in Figure 28 (a) and (b) will behave identically for t = 1 to T, provided that all copies of each parameter at different time steps remain identical. For instance, the parameter in node 3 of Figure 28 (a) must be the same at all time instants. This is the problem of parameter sharing; a quick solution is to move the parameters from nodes 3 and 6 into so-called parameter nodes, which are independent of the time step, as shown in Figure 29. (Without loss of generality, we assume nodes 3 and 6 both have only one parameter, denoted by a and b, respectively.) After setting up the parameter nodes in this way, we can apply the back-propagation learning rule as usual to the network in Figure 29 (which is still feedforward in nature) without the slightest concern about the parameter-sharing constraint. Note that the error signals of the parameter nodes come from the error signals of nodes located at layers of different time instants; thus the BP for this kind of unfolded network is often called back-propagation through time (BPTT).

Fig. 29. An alternative representation of Figure 28 (b) that automatically satisfies the parameter-sharing requirement.

C.2 Real Time Recurrent Learning (RTRL)

BPTT generally works well for most problems; the only complication is that it requires extensive computing resources when the sequence length T is large, because the duplication of nodes makes both memory requirements and simulation time proportional to T. Therefore for long sequences or sequences of unknown length, real time recurrent learning (RTRL) [114] is employed instead to perform on-line learning, that is, to update parameters while the network is running rather than at the end of the presented sequences.

To explain the rationale behind the RTRL algorithm, we take as an example the simple recurrent network in Figure 30 (a), where there is only one node with one parameter a. After moving the parameter out of the unfolded architecture, we obtain the feedforward network shown in Figure 30 (b). Figure 30 (c) is the corresponding error-propagation network. Here we assume E = \sum_{i=1}^{T} E_i = \sum_{i=1}^{T} (d_i - x_i)^2, where i is the index for time and d_i and x_i are the desired and the actual node output, respectively, at time instant i.


Fig. 30. A simple recurrent adaptive network to illustrate RTRL: (a) a recurrent net with a single node and a single parameter; (b) the unfolding-of-time architecture; (c) the error-propagation network.

To save computation and memory, a sensible choice is to minimize Ei at each time step instead of trying to minimize E at the end of a sequence. To achieve this, we need to calculate ∂+E/∂a recursively at each time step i. For i = 1, the error-propagation network is as shown in Figure 31 (a) and we have

∂+x1/∂a = ∂x1/∂a and ∂+E1/∂a = (∂E1/∂x1)(∂+x1/∂a). (34)

For i = 2, the error-propagation network is as shown in Figure 31 (b) and we have

∂+x2/∂a = ∂x2/∂a + (∂x2/∂x1)(∂+x1/∂a) and ∂+E2/∂a = (∂E2/∂x2)(∂+x2/∂a). (35)

For i = 3, the error-propagation network is as shown in Figure 31 (c) and we have

∂+x3/∂a = ∂x3/∂a + (∂x3/∂x2)(∂+x2/∂a) and ∂+E3/∂a = (∂E3/∂x3)(∂+x3/∂a). (36)
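The recursion in equations (34)–(36) keeps only one running quantity per parameter. As an illustration (the node function x_i = tanh(a*x_{i-1} + u_i) with external inputs u_i is our own stand-in; the paper leaves the node function unspecified), the following sketch maintains s = ∂+x_{i-1}/∂a between steps and emits ∂+E_i/∂a at every step:

```python
import math

def rtrl_run(a, x0, us, ds, lr=0.0):
    """Run the single-node recurrent net x_i = tanh(a*x_{i-1} + u_i)
    while recursively maintaining s = d+x_i/da.  Returns the per-step
    gradients d+E_i/da with E_i = (d_i - x_i)^2; lr > 0 updates a
    on line, which is the point of RTRL."""
    x, s = x0, 0.0                # s holds d+x_{i-1}/da (zero for x_0)
    grads = []
    for u, d in zip(us, ds):
        x_new = math.tanh(a * x + u)
        pre = 1.0 - x_new ** 2    # tanh'(net input)
        # d+x_i/da = dx_i/da + (dx_i/dx_{i-1}) * d+x_{i-1}/da
        s = pre * x + (pre * a) * s
        g = -2.0 * (d - x_new) * s     # d+E_i/da
        grads.append(g)
        a -= lr * g                    # optional on-line update
        x = x_new
    return grads
```

With the update switched off (lr = 0) the per-step gradients sum to the exact gradient of the total error E, so nothing is lost by never storing the unfolded network; with lr > 0 the parameter changes while the sequence is still running.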


Fig. 31. Error-propagation networks at different time steps: (a) i = 1; (b) i = 2; (c) i = 3; (d) the general situation, where the thick arrow represents ∂+xi−1/∂a.

In general, for the error propagation at time instant i, we have

∂+xi/∂a = ∂xi/∂a + (∂xi/∂xi−1)(∂+xi−1/∂a) and ∂+Ei/∂a = (∂Ei/∂xi)(∂+xi/∂a), (37)

where ∂+xi−1/∂a is already available from the calculation at the previous time instant. Figure 31 (d) shows this general situation, where the thick arrow represents ∂+xi−1/∂a, which is already available at time instant i − 1.

Therefore, by trying to minimize each individual Ei, we can recursively find the gradient ∂+Ei/∂a at each time instant; there is no need to wait until the end of the presented sequence. Since this is an approximation of the original BPTT, the learning rate η in the update formula

Δa = −η (∂+Ei/∂a)

should be kept small; as a result, the learning process usually takes longer.

D. Hybrid Learning Rule: Combining BP and LSE

It is observed that if an adaptive network's output (assuming only one) or a transformation of it is linear in some of the network's parameters, then we can identify these linear parameters by the well-known linear least-squares method. This observation leads to a hybrid learning rule [24], [29] which combines the gradient method and the least-squares estimator (LSE) for fast identification of parameters.

D.1 Off-Line Learning (Batch Learning)

For simplicity, assume that the adaptive network under consideration has only one output,

output = F(I, S), (38)

where I is the vector of input variables and S is the set of parameters. If there exists a function H such that the composite function H ∘ F is linear in some of the elements of S, then these elements can be identified by the least-squares method. More formally, if the parameter set S can be decomposed into two sets

S = S1 ⊕ S2 (39)

(where ⊕ represents direct sum) such that H ∘ F is linear in the elements of S2, then upon applying H to equation

(38), we have

H(output) = H ∘ F(I, S), (40)

which is linear in the elements of S2. Now, given values of the elements of S1, we can plug P training data pairs into equation (40) and obtain a matrix equation:

A θ = B, (41)

where θ is an unknown vector whose elements are the parameters in S2. This equation represents the standard linear least-squares problem, and the best solution for θ, which minimizes ||Aθ − B||², is the least-squares estimator (LSE) θ*:

θ* = (A^T A)^{-1} A^T B, (42)

where A^T is the transpose of A and (A^T A)^{-1} A^T is the pseudo-inverse of A if A^T A is non-singular. Of course, we can also employ the recursive LSE formula [23], [1], [58]. Specifically, let the i-th row vector of the matrix A defined in equation (41) be a_i^T and the i-th element of B be b_i; then θ can be calculated iteratively as follows:

θ_{i+1} = θ_i + S_{i+1} a_{i+1} (b_{i+1} − a_{i+1}^T θ_i),
S_{i+1} = S_i − (S_i a_{i+1} a_{i+1}^T S_i)/(1 + a_{i+1}^T S_i a_{i+1}), i = 0, 1, ..., P − 1, (43)

where the least-squares estimator θ* is equal to θ_P. The initial conditions needed to bootstrap equation (43) are θ_0 = 0 and S_0 = γI, where γ is a large positive number and I is the identity matrix of dimension M × M. When we are dealing with multi-output adaptive networks (output in equation (38) is a column vector), equation (43) still applies, except that b_i^T is the i-th row of matrix B.

Now we can combine the gradient method and the least-squares estimator to update the parameters in an adaptive network. For hybrid learning applied in batch mode, each epoch is composed of a forward pass and a backward pass. In the forward pass, after an input vector is presented, we calculate the node outputs in the network layer by layer until the corresponding rows of the matrices A and B in equation (41) are obtained. This process is repeated for all the training data entries to form the complete A and B; then the parameters in S2 are identified by either the pseudo-inverse formula in equation (42) or the recursive least-squares formulas in equation (43).
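A minimal sketch of the recursive formula (43) in pure Python (hand-rolled matrix arithmetic rather than a linear-algebra library; the function and variable names are our own):

```python
def rls(rows, targets, gamma=1e6):
    """Recursive least squares, equation (43): theta_{i+1} = theta_i +
    S_{i+1} a (b - a^T theta_i), with theta_0 = 0 and S_0 = gamma*I."""
    n = len(rows[0])
    theta = [0.0] * n
    S = [[gamma if i == j else 0.0 for j in range(n)] for i in range(n)]
    for a, b in zip(rows, targets):
        Sa = [sum(S[i][j] * a[j] for j in range(n)) for i in range(n)]
        denom = 1.0 + sum(a[i] * Sa[i] for i in range(n))
        # S_{i+1} = S_i - (S_i a a^T S_i) / (1 + a^T S_i a); S stays symmetric
        for i in range(n):
            for j in range(n):
                S[i][j] -= Sa[i] * Sa[j] / denom
        err = b - sum(a[i] * theta[i] for i in range(n))
        Snew_a = [sum(S[i][j] * a[j] for j in range(n)) for i in range(n)]
        for i in range(n):
            theta[i] += Snew_a[i] * err
    return theta
```

With θ_0 = 0 and S_0 = γI for a large γ, the final θ_P differs from the batch pseudo-inverse solution of equation (42) only by a term of order 1/γ.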
After the parameters in S2 are identified, we can compute the error measure for each training data entry. In the backward pass, the error signals (the derivative of the error measure w.r.t. each node output; see equations (22) and (23)) propagate from the output end toward the input end, and the gradient vector is accumulated for each training data entry. At the end of the backward pass for all the training data, the parameters in S1 are updated by the gradient method in equation (27).

For given fixed values of the parameters in S1, the parameters in S2 thus found are guaranteed to be the global optimum point in the S2 parameter space because of the choice of the squared error measure. Not only can this hybrid learning rule decrease the dimension of the search

space in the gradient method, but, in general, it will also substantially reduce the time needed to reach convergence.

It should be kept in mind that by using the least-squares method on the data transformed by H(·), the obtained parameters are optimal in terms of the transformed squared error measure, not the original one. In practice, this usually does not cause a problem as long as H(·) is monotonically increasing and the training data are not too noisy. A more detailed treatment of this transformation method can be found in [34].

D.2 On-Line Learning (Pattern Learning)

If the parameters are updated after each data presentation, we have an on-line learning or pattern learning scheme. This learning strategy is vital to on-line parameter identification for systems with changing characteristics. To modify the batch learning rule into an on-line version, it is obvious that the gradient descent should be based on Ep (see equation (24)) instead of E. Strictly speaking, this is not a truly gradient search procedure for minimizing E, yet it will approximate one if the learning rate is small.

For the recursive least-squares formula to account for the time-varying characteristics of the incoming data, the effects of old data pairs must decay as new data pairs become available. Again, this problem is well studied in the adaptive control and system identification literature, and a number of solutions are available [20]. One simple method is to formulate the squared error measure as a weighted version that gives higher weighting factors to more recent data pairs. This amounts to the addition of a forgetting factor λ to the original recursive formula:

θ_{i+1} = θ_i + S_{i+1} a_{i+1} (b_{i+1} − a_{i+1}^T θ_i),
S_{i+1} = (1/λ)[S_i − (S_i a_{i+1} a_{i+1}^T S_i)/(λ + a_{i+1}^T S_i a_{i+1})], (44)

where the typical value of λ in practice is between 0.9 and 1. The smaller λ is, the faster the effects of old data decay. A small λ sometimes causes numerical instability, however, and thus should be avoided.
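The forgetting-factor variant of equation (44) is a small change to the recursive least-squares loop; a sketch (pure Python, hand-rolled matrix arithmetic; names and data layout are our own):

```python
def rls_forgetting(rows, targets, lam=0.95, gamma=1e6):
    """Equation (44): recursive least squares in which old data are
    down-weighted by the forgetting factor lam (typically 0.9..1)."""
    n = len(rows[0])
    theta = [0.0] * n
    S = [[gamma if i == j else 0.0 for j in range(n)] for i in range(n)]
    for a, b in zip(rows, targets):
        Sa = [sum(S[i][j] * a[j] for j in range(n)) for i in range(n)]
        denom = lam + sum(a[i] * Sa[i] for i in range(n))
        # S_{i+1} = (1/lam) * [S_i - (S_i a a^T S_i)/(lam + a^T S_i a)]
        for i in range(n):
            for j in range(n):
                S[i][j] = (S[i][j] - Sa[i] * Sa[j] / denom) / lam
        err = b - sum(a[i] * theta[i] for i in range(n))
        Snew_a = [sum(S[i][j] * a[j] for j in range(n)) for i in range(n)]
        for i in range(n):
            theta[i] += Snew_a[i] * err
    return theta
```

With λ < 1 the estimate can track parameters that drift or switch during the data stream, at the cost of the numerical-stability caveat noted above.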
For a complete discussion and derivation of equation (44), the reader is referred to [34], [58], [20].

D.3 Different Ways of Combining GD and LSE

The computational complexity of the least-squares estimator (LSE) is usually higher than that of the gradient descent (GD) method for one-step adaptation. However, to achieve a prescribed performance level, the LSE is usually much faster. Consequently, depending on the available computing resources and the required level of performance, we can choose from among at least five types of hybrid learning rules combining GD and LSE to different degrees, as follows.
1. One pass of LSE only: Nonlinear parameters are fixed while linear parameters are identified by a one-time application of LSE.
2. GD only: All parameters are updated by GD iteratively.

3. One pass of LSE followed by GD: LSE is employed only once, at the very beginning, to obtain the initial values of the linear parameters; then GD takes over to update all parameters iteratively.
4. GD and LSE: This is the proposed hybrid learning rule, where each iteration (epoch) of GD used to update the nonlinear parameters is followed by LSE to identify the linear parameters.
5. Sequential (approximate) LSE only: The outputs of an adaptive network are linearized with respect to its parameters, and then the extended Kalman filter algorithm [21] is employed to update all parameters. This method has been proposed in the neural network literature [85], [84], [83].
The choice among the above methods should be based on a trade-off between computational complexity and performance. Moreover, the whole concept of fitting data to parameterized models is called regression in the statistics literature, and there are a number of other techniques for either linear or nonlinear regression, such as the Gauss-Newton method (linearization method) and the Marquardt procedure [61]. These methods can be found in advanced textbooks on regression, and they are also viable techniques for finding optimal parameters in adaptive networks.

E. Neural Networks as Special Cases of Adaptive Networks

Some special cases of adaptive networks have been explored extensively in the neural network literature. In particular, we will introduce two types of neural networks: the back-propagation neural network (BPNN) and the radial basis function network (RBFN).
Other types of adaptive networks that can be interpreted as a set of fuzzy if-then rules are investigated in the next section.

E.1 Back-Propagation Neural Networks (BPNN's)

A back-propagation neural network (BPNN), as already mentioned in examples 4 and 5, is an adaptive network whose nodes (called neurons) perform the same function on incoming signals; this node function is usually a composite of the weighted sum and a nonlinear function called the activation function or transfer function. Usually the activation function is of either a sigmoidal or a hyper-tangent type, which approximates the step function (or hard limiter) yet provides differentiability with respect to the input signals. Figure 32 depicts the four different types of activation functions f(x) defined below.

Step function: f(x) = 1 if x ≥ 0, and f(x) = 0 if x < 0.
Sigmoid function: f(x) = 1/(1 + e^{−x}).
Hyper-tangent function: f(x) = tanh(x/2) = (1 − e^{−x})/(1 + e^{−x}).
Identity function: f(x) = x.

When the step function (hard limiter) is used as the activation function for a layered network, the network is often called a perceptron [78], [70], as explained in example 4.

Fig. 32. Activation functions for BPNN's: (a) step function; (b) sigmoid function; (c) hyper-tangent function; (d) identity function.

Fig. 33. A BPNN node.

For a neural network to approximate a continuous-valued function not necessarily limited to the interval [0, 1] or [−1, 1], we usually let the node function for the output layer be a weighted sum with no limiting-type activation function. This is equivalent to the situation where the activation function is an identity function, and output nodes of this type are often called linear nodes.

For simplicity, we assume the BPNN in question uses the sigmoid function as its activation function. The net input x̄ of a node is defined as the weighted sum of the incoming signals plus a threshold. For instance, the net input and output of node j in Figure 33 (where j = 4) are

x̄_j = Σ_i w_{ij} x_i + t_j,
x_j = f(x̄_j) = 1/(1 + e^{−x̄_j}), (45)

where x_i is the output of node i located in the previous layer, w_{ij} is the weight associated with the link connecting nodes i and j, and t_j is the threshold of node j. Since the weights w_{ij} are actually internal parameters associated with each node j, changing the weights of a node will alter the behavior of the node and, in turn, the behavior of the whole BPNN. Figure 23 shows a two-layer BPNN with 3 inputs in the input layer, 3 neurons in the hidden layer, and 2 output neurons in the output layer. For simplicity, this BPNN will be referred to as a 3-3-2 structure, corresponding to the number of nodes in each layer. (Note that the input layer is composed of three buffer nodes for distributing the input signals; therefore this layer is conventionally not counted as a physical layer of the BPNN.)

BPNN's are by far the most commonly used NN structure for applications in a wide range of areas, such as speech recognition, optical character recognition (OCR), signal processing, data compression, and automatic control.

E.2 Radial Basis Function Networks (RBFN's)

The locally-tuned and overlapping receptive field is a well-known structure that has been studied in regions of the cerebral cortex, the visual cortex, and so forth. Drawing on this knowledge of biological receptive fields, Moody and Darken [66], [67] proposed a network structure that employs local receptive fields to perform function mappings. Similar schemes have been proposed by Powell [74], Broomhead and Lowe [7], and many others in the areas of interpolation and approximation theory; these schemes are collectively called radial basis function approximations. Here we shall call this network structure the radial basis function network, or RBFN.

Fig. 34. A radial basis function network (RBFN).

Figure 34 shows a schematic diagram of an RBFN with five receptive field units; the activation level of the i-th receptive field unit (or hidden unit) is

w_i = R_i(x) = R_i(||x − c_i||/σ_i), i = 1, 2, ..., H, (46)

where x is a multi-dimensional input vector, c_i is a vector with the same dimension as x, H is the number of radial basis functions (or, equivalently, receptive field units), and R_i(·) is the i-th radial basis function with a single maximum at the origin. Typically, R_i(·) is chosen as a Gaussian function

R_i(x) = exp[−||x − c_i||²/σ_i²] (47)

or as a logistic function

R_i(x) = 1/(1 + exp[||x − c_i||²/σ_i²]). (48)

Thus the activation level w_i of the radial basis function computed by the i-th hidden unit is maximum when the input vector x is at the center c_i of that unit.

The output of a radial basis function network can be computed in two ways. In the simpler method, as shown in Figure 34, the final output is the weighted sum of the output values associated with each receptive field:

f(x) = Σ_{i=1}^{H} f_i w_i = Σ_{i=1}^{H} f_i R_i(x), (49)

where f_i is the output value associated with the i-th receptive field. A more complicated method for calculating the overall output is to take the weighted average of the outputs associated with the receptive fields:

f(x) = (Σ_{i=1}^{H} f_i w_i)/(Σ_{i=1}^{H} w_i) = (Σ_{i=1}^{H} f_i R_i(x))/(Σ_{i=1}^{H} R_i(x)). (50)

This mode of calculation, though it has a higher computational complexity, possesses the advantage that points in the overlapping area of two receptive fields have a well-interpolated output value between the output values of the two fields. For representation purposes, if we replace the radial basis function R_i(x) in each node of layer 2 in Figure 34 by its normalized counterpart R_i(x)/Σ_i R_i(x), then the overall output is specified by equation (50).

Several learning algorithms have been proposed to identify the parameters (c_i, σ_i, and f_i) of an RBFN. Note that the RBFN is an ideal example of the hybrid learning described in the previous section, where the linear parameters are the f_i and the nonlinear parameters are the c_i and σ_i. In practice, the c_i are usually found by means of vector quantization or clustering techniques (which assume that similar input vectors produce similar outputs), and the σ_i are obtained heuristically (for example, by taking the average distance to the first several nearest neighbors of c_i). Once these nonlinear parameters are fixed, the linear parameters can be found by either the least-squares method or the gradient method. Chen et al. [8] used an alternative method that employs the orthogonal least-squares algorithm to determine the c_i and f_i while keeping the σ_i at a predetermined constant.

An extension of Moody and Darken's RBFN is to assign a linear function as the output function of each receptive field; that is, f_i is a linear function of the input variables instead of a constant:

f_i = a_i · x + b_i, (51)

where a_i is a parameter vector and b_i is a scalar parameter. Stokbro et al.
[89] used this structure to model the Mackey-Glass chaotic time series [59] and found that this extended version performed better than the original RBFN with the same number of fitting parameters.

It was pointed out by the authors that, under certain constraints, the RBFN is functionally equivalent to the zero-order Sugeno fuzzy model. See [32] or [34] for details.

IV. ANFIS: Adaptive Neuro-Fuzzy Inference Systems

A class of adaptive networks that act as a fundamental framework for adaptive fuzzy inference systems is introduced in this section. This type of network is referred to as ANFIS [25], [24], [29], which stands for Adaptive-Network-based Fuzzy Inference System or, semantically equivalently, Adaptive Neuro-Fuzzy Inference System. We will describe primarily the ANFIS architecture and its learning algorithm for the Sugeno fuzzy model,


with an application example of chaotic time series prediction. Note that similar network structures were also proposed independently by Lin and Lee [55] and Wang and Mendel [106].

Fig. 35. (a) A two-input first-order Sugeno fuzzy model with two rules; (b) equivalent ANFIS architecture.

A. ANFIS Architecture

For simplicity, we assume the fuzzy inference system under consideration has two inputs x and y and one output z. For a first-order Sugeno fuzzy model [98], [91], a typical rule set with two fuzzy if-then rules can be expressed as

Rule 1: If x is A1 and y is B1, then f1 = p1 x + q1 y + r1;
Rule 2: If x is A2 and y is B2, then f2 = p2 x + q2 y + r2.

Figure 35 (a) illustrates the reasoning mechanism for this Sugeno model. The corresponding equivalent ANFIS architecture is shown in Figure 35 (b), where nodes of the same layer have similar functions, as described below. (Here we denote the output of node i in layer l as O_{l,i}.)

Layer 1: Every node i in this layer is an adaptive node with a node output defined by

O_{1,i} = μ_{Ai}(x) for i = 1, 2, or
O_{1,i} = μ_{Bi−2}(y) for i = 3, 4, (52)

where x (or y) is the input to the node and Ai (or Bi−2) is a fuzzy set associated with this node. In other words, the outputs of this layer are the membership values of the premise part. The membership functions for Ai and Bi can be any appropriate parameterized membership functions introduced in Section II. For example, Ai can be characterized by the generalized bell function:

μ_A(x) = 1/(1 + [((x − c_i)/a_i)²]^{b_i}), (53)

where {a_i, b_i, c_i} is the parameter set. Parameters in this layer are referred to as premise parameters.

Layer 2: Every node in this layer is a fixed node labeled Π, which multiplies the incoming signals and outputs the product. For instance,

O_{2,i} = w_i = μ_{Ai}(x) × μ_{Bi}(y), i = 1, 2. (54)

Each node output represents the firing strength of a rule. (In fact, any other T-norm operator that performs fuzzy AND can be used as the node function in this layer.)

Layer 3: Every node in this layer is a fixed node labeled N. The i-th node calculates the ratio of the i-th rule's firing strength to the sum of all rules' firing strengths:

O_{3,i} = w̄_i = w_i/(w_1 + w_2), i = 1, 2. (55)

For convenience, the outputs of this layer will be called normalized firing strengths.

Layer 4: Every node i in this layer is an adaptive node with a node function

O_{4,i} = w̄_i f_i = w̄_i (p_i x + q_i y + r_i), (56)

where w̄_i is the output of layer 3 and {p_i, q_i, r_i} is the parameter set. Parameters in this layer will be referred to as consequent parameters.

Layer 5: The single node in this layer is a fixed node labeled Σ, which computes the overall output as the summation of all incoming signals:

O_{5,1} = overall output = Σ_i w̄_i f_i = (Σ_i w_i f_i)/(Σ_i w_i). (57)

Thus we have constructed an adaptive network that has exactly the same function as a Sugeno fuzzy model. Note that the structure of this adaptive network is not unique; we can easily combine layers 3 and 4 to obtain an equivalent network with only four layers. Similarly, we can perform the weight normalization at the last layer; Figure 36 illustrates an ANFIS of this type.
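The five layers translate almost line for line into code. This sketch (pure Python) evaluates the two-rule ANFIS of Figure 35 (b) with the bell membership function of equation (53); all parameter values a caller passes in are free choices:

```python
def bell(x, a, b, c):
    """Generalized bell membership function, equation (53)."""
    return 1.0 / (1.0 + (((x - c) / a) ** 2) ** b)

def anfis(x, y, premise, consequent):
    """Forward pass of the two-rule ANFIS of Figure 35(b).
    premise: bell parameter triples for A1, A2, B1, B2.
    consequent: (p_i, q_i, r_i) for the two rules."""
    A1, A2, B1, B2 = premise
    # layers 1-2: membership grades, then product firing strengths
    w = [bell(x, *A1) * bell(y, *B1),
         bell(x, *A2) * bell(y, *B2)]
    # layer 3: normalized firing strengths
    total = w[0] + w[1]
    wn = [wi / total for wi in w]
    # layers 4-5: weighted first-order consequents, summed
    return sum(wi * (p * x + q * y + r)
               for wi, (p, q, r) in zip(wn, consequent))
```

Since the normalized firing strengths sum to one, choosing identical consequents f1 = f2 makes the output reduce to that common linear function, whatever the premise parameters; this is a quick sanity check on any implementation.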


Fig. 36. Another ANFIS architecture for the two-input, two-rule Sugeno fuzzy model.

Figure 37 (a) is an ANFIS architecture that is equivalent to a two-input first-order Sugeno fuzzy model with nine rules, where each input is assumed to have three associated MF's. Figure 37 (b) illustrates how the 2-D input space is partitioned into nine overlapping fuzzy regions, each of which is governed by a fuzzy if-then rule. In other words, the premise part of a rule defines a fuzzy region, while the consequent part specifies the output within this region.

For ANFIS architectures for the Mamdani and Tsukamoto fuzzy models, the reader is referred to [29] and [34] for more details.
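With a grid partition like that of Figure 37, the rule count is simply the product of the per-input MF counts, and the parameter counts follow directly; a quick sketch (the helper names are ours):

```python
def grid_rule_count(mfs_per_input):
    """Rules in a grid-partitioned fuzzy model: one rule per
    combination of membership functions across the inputs."""
    n = 1
    for m in mfs_per_input:
        n *= m
    return n

def anfis_param_count(n_inputs, mfs_per_input, mf_params=3):
    """Premise and consequent parameter counts for a grid ANFIS with
    first-order Sugeno consequents (n_inputs + 1 parameters per rule);
    mf_params = 3 corresponds to the bell MF of equation (53)."""
    premise = n_inputs * mfs_per_input * mf_params
    rules = mfs_per_input ** n_inputs
    consequent = rules * (n_inputs + 1)
    return premise, consequent
```

Two inputs with three MF's each give the nine rules of Figure 37; four inputs with two MF's each give 16 rules with 24 premise and 80 consequent parameters, the configuration used in the prediction example of Section IV-C.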


Fig. 37. (a) ANFIS architecture for a two-input first-order Sugeno fuzzy model with nine rules; (b) partition of the input space into nine fuzzy regions.

B. Hybrid Learning Algorithm

From the ANFIS architecture shown in Figure 35 (b), we observe that when the values of the premise parameters are fixed, the overall output can be expressed as a linear combination of the consequent parameters. In symbols, the output f in Figure 35 (b) can be rewritten as

f = (w1/(w1 + w2)) f1 + (w2/(w1 + w2)) f2
  = w̄1 f1 + w̄2 f2
  = (w̄1 x) p1 + (w̄1 y) q1 + (w̄1) r1 + (w̄2 x) p2 + (w̄2 y) q2 + (w̄2) r2, (58)

which is linear in the consequent parameters p1, q1, r1, p2, q2, and r2. Therefore the hybrid learning algorithm developed in the previous section can be applied directly. More specifically, in the forward pass of the hybrid learning algorithm, node outputs go forward until layer 4 and the consequent parameters are identified by the least-squares method. In the backward pass, the error signals propagate backward and the premise parameters are updated by gradient descent. Table I summarizes the activities in each pass.

TABLE I
Two passes in the hybrid learning procedure for ANFIS.

                         Forward Pass             Backward Pass
Premise parameters       Fixed                    Gradient descent
Consequent parameters    Least-squares estimate   Fixed
Signals                  Node outputs             Error signals

As mentioned earlier, the consequent parameters thus identified are optimal under the condition that the premise parameters are fixed. Accordingly, the hybrid approach converges much faster, since it reduces the dimension of the search space of the original back-propagation method.

If we fix the membership functions and adapt only the consequent part, then ANFIS can be viewed as a functional-link network [46], [71] where the "enhanced representations" of the input variables are obtained via the membership functions. These "enhanced representations," which take advantage of human knowledge, apparently express more insight than the functional expansion and the tensor (outer product) models [71]. By fine-tuning the membership functions, we actually make this "enhanced representation" adaptive as well.

From equations (49), (50), and (57), it is not hard to see the resemblance between the radial basis function network (RBFN) and the ANFIS for the Sugeno model. In fact, these two computing frameworks are functionally equivalent under certain minor conditions [32]; this can cross-fertilize both disciplines in many respects.

C. Application to Chaotic Time Series Prediction

ANFIS can be applied to a wide range of areas, such as nonlinear function modeling [24], [29], time series prediction [33], [29], on-line parameter identification for control systems [29], and fuzzy controller design [26], [28]. In particular, GE has been using ANFIS for modeling correction factors in steel rolling mills [6]. Here we briefly report the application of ANFIS to chaotic time series prediction [33], [29].

The time series used in our simulation is generated by the Mackey-Glass differential delay equation [59]:

ẋ(t) = 0.2 x(t − τ)/(1 + x^10(t − τ)) − 0.1 x(t). (59)

The prediction of future values of this time series is a benchmark problem that has been used and reported by a number of connectionist researchers, such as Lapedes and Farber [48], Moody [67], [65], Jones et al. [35], Crowder [77], and Sanger [81]. The simulation results presented here were reported in [33], [29]; more details can be found therein.

The goal of the task is to use past values of the time series up to time t to predict the value at some time t + P in the future. The standard method for this type of prediction is to create a mapping from D points of the time series spaced Δ apart, that is, (x(t − (D − 1)Δ), ..., x(t − Δ), x(t)), to a predicted future value x(t + P). To allow comparison with earlier work (Lapedes and Farber [48], Moody [67], [65], Crowder [77]), the values D = 4 and Δ = P = 6 were used.
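Equation (59) can be integrated numerically to reproduce such a series; the sketch below uses simple Euler integration with step 0.1 and a constant initial history (both our own choices, adequate for illustration):

```python
def mackey_glass(n, dt=0.1, tau=17.0, x0=1.2):
    """Euler integration of equation (59),
    xdot = 0.2*x(t - tau)/(1 + x(t - tau)**10) - 0.1*x(t).
    The delayed value is read from a stored history; the constant
    initial history x(t <= 0) = x0 is an assumption for illustration."""
    lag = int(round(tau / dt))
    xs = [x0] * (lag + 1)              # history covering [t - tau, t]
    for _ in range(n):
        xd = xs[-(lag + 1)]            # x(t - tau)
        x = xs[-1]
        xs.append(x + dt * (0.2 * xd / (1.0 + xd ** 10) - 0.1 * x))
    return xs[lag + 1:]                # samples for t = dt, 2*dt, ...
```

The delayed term is read lag = τ/dt samples back in the history buffer; sampling the resulting series every few steps yields input-output pairs of the kind used for training.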
All other simulation settings were arranged to be as similar as possible to those reported in [77].

From the Mackey-Glass time series x(t), we extracted 1000 input-output data pairs of the following format:

[x(t − 18), x(t − 12), x(t − 6), x(t); x(t + 6)], (60)

where t = 118 to 1117. The first 500 pairs (the training data set) were used for training ANFIS, while the remaining 500 pairs (the checking data set) were used for validating the identified model. The number of membership functions assigned to each input of the ANFIS was set to two, so the number of rules is 16. The ANFIS used here contains a total of 104 fitting parameters, of which 24 are premise parameters and 80 are consequent parameters.

Figure 38 shows the results after about 500 epochs of learning. The desired and predicted values for both training data and checking data are essentially the same in Figure 38 (a); the differences between them can only be seen on a much finer scale, such as that in Figure 38 (b).

Fig. 38. (a) Mackey-Glass time series from t = 124 to 1123 and six-step-ahead prediction (which is indistinguishable from the time series here); (b) prediction error. (Note that the first 500 data points are training data, while the remaining ones are for validation.)

TABLE II
Generalization result comparisons for P = 6.

Methods                     Training Data    NDEI
ANFIS                       500              0.007
AR Model                    500              0.19
Cascade-Correlation NN      500              0.06
Back-Prop NN                500              0.02
6th-order Polynomial        500              0.04
Linear Predictive Method    2000             0.55

Table II lists the generalization capabilities of other methods, which were measured by using each method to predict 500 points immediately following the training set. The last four rows of Table II are taken from [77] directly. The non-dimensional error index (NDEI) [48], [77] is defined as the root mean square error divided by the standard deviation of the target series. The remarkable generalization capability of ANFIS is attributed to the following facts:
• ANFIS can achieve a highly nonlinear mapping; therefore it is well suited for predicting nonlinear time series.
• The ANFIS used here has 104 adjustable parameters, far fewer than those used in the cascade-correlation NN (693, the median) and the back-prop NN (about 540) listed in Table II.
• Though not based on a priori knowledge, the initial parameter settings of ANFIS are intuitively reasonable and result in fast convergence to good parameter values that capture the underlying dynamics.
• ANFIS consists of fuzzy rules which are actually local mappings (called local experts in [36]) instead of global ones. These local mappings facilitate the minimal disturbance principle [111], which states that the adaptation should not only reduce

the output error for the current training pattern but also minimize disturbance to the responses already learned. This is particularly important in on-line learning. We also found the use of the least-squares method to determine the output of each local mapping to be of particular importance; without LSE, the learning time would be ten times longer.

Other generalization tests and comparisons with neural network approaches can be found in [29]. The original ANFIS C codes and several examples (including this one) can be retrieved via anonymous ftp in user/ai/areas/fuzzy/systems/anfis at ftp.cs.cmu.edu (CMU Artificial Intelligence Repository).

V. Neuro-Fuzzy Control

Once a fuzzy controller is transformed into an adaptive network, the resulting ANFIS can take advantage of all the NN controller design techniques proposed in the literature. In this section we shall introduce common design techniques for ANFIS controllers. Most of these methodologies are derived directly from their counterparts for NN controllers; however, certain design techniques apply exclusively to ANFIS, and these will be pointed out explicitly.

As shown in Figure 39, the block diagram of a typical feedback control system consists of a plant block and a controller block. The plant block is usually represented by a set of differential equations that describe the physical system to be controlled. These equations govern the behavior of the plant state x(t), which is assumed to be accessible in our discussion. In contrast, the controller block is usually a static function denoted by g; it maps the plant state x(t) into a control action u(t) that can hopefully achieve a given control objective. Thus for a general time-invariant control system, we have the following equations:

ẋ(t) = f(x(t), u(t)) (plant dynamics),
u(t) = g(x(t)) (controller).

The control objective here is to design a controller function g(·) such that the plant state x(t) can follow a desired trajectory x_d(t) as closely as possible.


Fig. 39. Block diagram for a continuous-time feedback control system.

A simple example of a feedback control system is the inverted pendulum system (Figure 40), where a rigid pole is hinged to a cart through a free joint with only one degree of freedom, and the cart moves on rail tracks to its right or left depending on the force exerted on it. The control goal is to find the applied force u as a function of the state variable x = [θ, θ̇, z, ż] (where θ is the pole angle and z is the cart position) such that the pole can be balanced from a given non-zero initial condition.
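The loop of Figure 39 can also be exercised on a toy problem far simpler than the pendulum (the first-order plant ẋ = −x + u and the proportional controller g(x) = k·(x_d − x) are our own illustrative choices, not from the paper):

```python
def simulate(x0, xd, k=4.0, dt=0.01, steps=1000):
    """Euler simulation of the feedback loop of Figure 39 for a
    hypothetical plant xdot = f(x, u) = -x + u and a proportional
    controller u = g(x) = k*(xd - x)."""
    x = x0
    for _ in range(steps):
        u = k * (xd - x)       # controller block
        x += dt * (-x + u)     # plant dynamics block (Euler step)
    return x
```

With pure proportional control this loop settles at k·x_d/(1 + k) rather than at x_d: a classic steady-state offset, and one motivation for richer controller functions g.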


Fig. 40. The inverted pendulum system.

For a feedback control system in a discrete time domain, a general block diagram representation is shown in Figure 41. Note that the inputs to the plant block include the control action u(k) and the previous plant output x(k), so the plant block now represents a static mapping. In symbols, we have

x(k + 1) = f(x(k), u(k)) (plant),
u(k) = g(x(k)) (controller).


Fig. 41. Block diagram for a discrete-time feedback control system.

A central problem in control engineering is that of finding the control action u as a function of the plant output x in order to achieve a given control goal. Each design method for neuro-fuzzy controllers corresponds to a way of obtaining the control action; these methods are discussed next.

A. Mimicking Another Working Controller

Most of the time, the controller being mimicked is an experienced human operator who can control the plant satisfactorily. In fact, mimicking a human expert was the original intention of fuzzy controllers, whose ultimate goal is to replace human operators in controlling complex systems such as chemical reaction processes, subway trains, and traffic systems. An experienced human operator can usually summarize his or her control actions as a set of fuzzy if-then rules with roughly correct membership functions; this corresponds to the linguistic information. Prior to the emergence of neuro-fuzzy approaches, refining the membership functions usually required a lengthy trial-and-error process. Now, with learning algorithms, we can further take advantage of the numerical information (input/output data pairs) and refine the membership functions in a systematic way. Note that the capability to utilize linguistic information is specific to fuzzy inference systems; it is not always available in neural networks. Successful applications of fuzzy controllers based on linguistic information plus trial-and-error tuning include steam engine and boiler control [60], Sendai subway systems [117], container ship crane control [116], elevator control [54], nuclear reactor control [5], automobile transmission control [40], aircraft control [14], and many others [90].

With the availability of learning algorithms, a wider range of applications is expected.

Note that this approach is not restricted to control applications. If the target system to be emulated is a human physician or a credit analyst, then the resulting fuzzy inference system becomes a fuzzy expert system for diagnosis or credit analysis, respectively.

B. Inverse Control

Another scheme for obtaining the desired control action is the inverse control method shown in Figure 42. For simplicity, we assume that the plant has only one state x(k) and one input u(k). In the learning phase, a training set is obtained by generating inputs u(k) at random and observing the corresponding outputs x(k) produced by the plant. The ANFIS in Figure 42 (a) is then used to learn the inverse model of the plant by fitting the data pairs (x(k), x(k+1); u(k)). In the application phase, the ANFIS identifier is copied to the ANFIS controller in Figure 42 (b) for generating the desired output. The input to the ANFIS controller is (x(k), x_d(k)); if the inverse model (ANFIS identifier) that maps (x(k), x(k+1)) to u(k) is accurate, then the generated u(k) should result in an x(k+1) that is close to x_d(k). That is, the whole system in Figure 42 will behave like a pure unit-delay system.

This method seems straightforward, and only one learning task is needed to find the inverse model of the plant. However, it assumes the existence of the inverse of the plant, which is not valid in general. Moreover, minimization of the network error ||e_u(k)||^2 does not guarantee minimization of the overall system error ||x_d(k) - x(k)||^2.

Using ANFIS for adaptive inverse control can be found in [42].
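The two phases can be sketched with an ordinary least-squares model standing in for the ANFIS identifier. Everything below is an illustrative assumption: a deliberately invertible linear plant and a linear inverse-model structure, chosen so the sketch stays self-contained:

```python
import random

def plant(x, u):
    # Illustrative invertible plant x(k+1) = f(x(k), u(k))
    return 0.8 * x + 0.5 * u

# Learning phase: drive the plant with random inputs and collect
# (x(k), x(k+1)) -> u(k) pairs for the inverse model.
random.seed(0)
samples = []
x = 0.0
for _ in range(200):
    u = random.uniform(-1.0, 1.0)
    x_next = plant(x, u)
    samples.append((x, x_next, u))
    x = x_next

# Stand-in for the ANFIS identifier: least-squares fit of
# u ~ a*x + b*x_next + c via the normal equations.
def fit_inverse(samples):
    rows = [[xi, xn, 1.0] for xi, xn, _ in samples]
    targets = [u for _, _, u in samples]
    n = 3
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    Aty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(n)]
    for i in range(n):                      # Gauss-Jordan elimination
        piv = AtA[i][i]
        AtA[i] = [v / piv for v in AtA[i]]
        Aty[i] /= piv
        for r in range(n):
            if r != i:
                factor = AtA[r][i]
                AtA[r] = [vr - factor * vi for vr, vi in zip(AtA[r], AtA[i])]
                Aty[r] -= factor * Aty[i]
    return Aty

a, b, c = fit_inverse(samples)

# Application phase: the copied inverse model receives (x(k), xd(k)).
def inverse_controller(x, xd):
    return a * x + b * xd + c

x0, xd = 0.3, -0.2
u = inverse_controller(x0, xd)
reached = plant(x0, u)   # close to xd when the inverse model is accurate
```

Because this toy plant really is invertible, the one-step tracking error is essentially zero; the text's caveat applies when no such inverse exists.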


Fig. 42. Block diagram for the inverse control method: (a) learning phase; (b) application phase.

C. Specialized Learning

The major problem with the inverse control scheme is that we are minimizing the network error instead of the overall system error. An alternative is to minimize the system error directly; this is called specialized learning [76]. In order to back-propagate error signals through the plant block in Figure 43, we need to find a model representing the behavior of the plant. In fact, in order to apply back-propagation learning, all we need to know is the Jacobian matrix of the plant, where the element at row i and column j is equal to the derivative of the plant's i-th output with respect to its j-th input.

If the Jacobian matrix is not easy to find, an alternative is to estimate it on-line from the changes of the plant's inputs and outputs during two consecutive time instants. Other similar methods that aim at using an approximate Jacobian matrix to achieve the same learning effects can be found in [41], [11], [103]. Applying specialized learning to find an ANFIS controller for the inverted pendulum was reported in [27].
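The on-line estimate mentioned above is essentially a difference quotient taken over two consecutive time instants. A minimal single-input sketch, with an illustrative plant mapping (in practice the state x also changes between instants, which this sketch holds fixed for clarity):

```python
def plant(x, u):
    # Illustrative static plant mapping x(k+1) = f(x(k), u(k))
    return 0.8 * x + 0.5 * u * (1.0 - 0.1 * u)

def estimate_jacobian(x_prev, u_prev, x_curr, u_curr, f):
    # Approximate df/du from two consecutive input/output pairs:
    # the change in plant output over the change in plant input,
    # which is the quantity back-propagation needs.
    dy = f(x_curr, u_curr) - f(x_prev, u_prev)
    du = u_curr - u_prev
    return dy / du

# Around u = 0 the true derivative df/du of this toy plant is 0.5
J = estimate_jacobian(0.0, -0.01, 0.0, 0.01, plant)
```

The symmetric pair of inputs makes the quadratic term cancel, so the estimate here matches the analytic derivative.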

Fig. 43. Block diagram for (a) specialized learning; (b) specialized learning with model reference.

It is not always convenient to specify the desired plant output x_d(k) at every time instant k. As a standard approach in model reference adaptive control, the desired behavior of the overall system can be specified implicitly by a (usually linear) model that is able to achieve the control goal satisfactorily. This alternative approach is shown in Figure 43 (b), where the desired output x_d(k+1) is generated through a desired model.

D. Back-Propagation Through Time and Real-Time Recurrent Learning

If we replace the controller and the plant block in Figure 39 with two adaptive networks, the feedback control system becomes a recurrent adaptive network as discussed in Section III. Assuming that synchronous operation is adopted here (which virtually converts the system into the discrete time domain), we can apply the same scheme of unfolding in time to obtain a feedforward network, and then use the same back-propagation learning algorithm to identify the optimal parameters.

In terms of the inverted pendulum system (pole only), Figure 41 becomes Figure 44 if the controller block is replaced with a four-rule ANFIS and the plant block is replaced with a two-node adaptive network. To obtain the state trajectory, we cascade the network in Figure 44 to obtain the trajectory network shown in Figure 45.

Fig. 44. Network implementation of Figure 41.

Fig. 45. A trajectory network for control application (FC stands for "fuzzy controller").

In particular, the inputs to the trajectory network are the initial conditions of the plant; the outputs are the state trajectory from k = 1 to k = m. The adjustable parameters all pertain to the FC (fuzzy controller) blocks, each implemented as a four-rule ANFIS. Though there are m FC blocks, all of them refer to the same parameter set. For clarity, this parameter set is shown explicitly in Figure 45; it is updated according to the output of the error measure block.

Each entry of the training data is of the following format:

(initial conditions; desired trajectory),

and the corresponding error measure to be minimized is

E = \sum_{k=1}^{m} ||x(k) - x_d(k)||^2,

where x_d(k) is the desired state vector at t = k \cdot T (T is the sampling period). If we take control efforts into consideration, a revised error measure would be

E = \sum_{k=1}^{m} ||x(k) - x_d(k)||^2 + \lambda \sum_{k=0}^{m-1} ||u(k)||^2,

where u(k) is the control action at time step k. By a proper selection of \lambda, a compromise between trajectory error and control effort can be obtained.

Use of back-propagation through time to train a neural network for backing up a tractor-trailer system is reported in [69]. The same technique was used to design an ANFIS controller for balancing an inverted pendulum [28]. Note that back-propagation through time is usually an off-line learning algorithm in the sense that the parameters are not updated until the sequence (k = 1 to m) is over. If the sequence is too long, or if we want to update the parameters in the middle of the sequence, we can always apply RTRL (real-time recurrent learning) introduced earlier.
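The revised error measure can be written down directly; a minimal sketch, with illustrative sample trajectories:

```python
def trajectory_cost(xs, xds, us, lam=0.1):
    # E = sum_k ||x(k) - xd(k)||^2 + lam * sum_k ||u(k)||^2
    # xs, xds: state vectors for k = 1..m; us: control vectors for k = 0..m-1
    track = sum(sum((xi - xdi) ** 2 for xi, xdi in zip(x, xd))
                for x, xd in zip(xs, xds))
    effort = sum(sum(ui ** 2 for ui in u) for u in us)
    return track + lam * effort

E = trajectory_cost(
    xs=[[0.1, 0.0], [0.05, -0.1]],     # actual trajectory (m = 2)
    xds=[[0.0, 0.0], [0.0, 0.0]],      # desired trajectory (regulation)
    us=[[1.0], [0.5]],                 # control actions
    lam=0.1,
)
```

Raising lam penalizes control effort more heavily, trading tracking accuracy for smaller actions, which is exactly the compromise the text describes.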

E. Feedback Linearization and Sliding Control

The equations of motion of a class of dynamic systems in the continuous time domain can be expressed in the canonical form:

x^{(n)}(t) = f(x(t), \dot{x}(t), ..., x^{(n-1)}(t)) + b u(t),   (61)

where f is an unknown continuous function, b is the control gain, and u \in R and y \in R are the input and output of the system, respectively. The control objective is to force the state vector x = [x, \dot{x}, ..., x^{(n-1)}]^T to follow a specified desired trajectory x_d = [x_d, \dot{x}_d, ..., x_d^{(n-1)}]^T. If we define the tracking error vector as e = x - x_d, then the control objective is to design a control law u(t) which ensures e -> 0 as t -> \infty. (For simplicity, we assume b = 1 in the following discussion.)

Equation (61) is a typical feedback linearizable system, since it can be reduced to a linear system if f is known exactly. Specifically, the following control law

u(t) = -f(x(t)) + x_d^{(n)} - k^T e   (62)

would transform the original nonlinear dynamics into a linear one:

e^{(n)}(t) + k_1 e^{(n-1)} + ... + k_n e = 0,   (63)

where k = [k_n, ..., k_1]^T is an appropriately chosen vector that ensures satisfactory behavior of the closed-loop linear system in equation (63).

Since f is unknown, an intuitive candidate for u would be

u = -F(x, p) + x_d^{(n)} - k^T e + v,   (64)

where v is an additional control input to be determined later, and F is a parameterized function (such as ANFIS, a neural network, or any other type of adaptive network) that is rich enough to approximate f. Using this control law, the closed-loop system becomes

e^{(n)} + k_1 e^{(n-1)} + ... + k_n e = (f - F) + v.   (65)

Now the problem is divided into two tasks:
- How to update the parameter vector p incrementally so that F(x, p) \approx f(x) for all x.
- How to apply v to guarantee global stability while F is approximating f during the whole process.

The first task is not too difficult as long as F, which could be a neural network or a fuzzy inference system, is equipped with enough parameters to approximate f.
For the second task, we need to apply the concepts of a branch of nonlinear control theory called sliding control [102], [86]. The standard approach is to define an error metric as

s(t) = (d/dt + \lambda)^{n-1} e(t), with \lambda > 0.   (66)

The equation s(t) = 0 defines a time-varying hyperplane in R^n on which the tracking error vector e(t) = [e(t), \dot{e}(t), ..., e^{(n-1)}(t)]^T decays exponentially to zero, so that perfect tracking can be obtained asymptotically. Moreover, if we can maintain the following condition:

d|s(t)|/dt <= -\eta,   (67)

then |s(t)| will reach zero in a finite time less than or equal to |s(0)|/\eta. In other words, by maintaining the condition in equation (67), s(t) will reach the sliding surface s(t) = 0 in finite time, and the error vector e(t) will then converge to the origin exponentially with a time constant (n-1)/\lambda.

From equation (66), s can be expanded as follows:

s = (\lambda + d/dt)^{n-1} e = [\lambda^{n-1}, (n-1)\lambda^{n-2}, ..., 1] e.   (68)

Differentiating the above equation and plugging in e^{(n)} from equation (65), we obtain

ds/dt = e^{(n)} + [0, \lambda^{n-1}, (n-1)\lambda^{n-2}, ..., \lambda] e
      = f - F + v - [k_n, k_{n-1}, ..., k_1] e + [0, \lambda^{n-1}, (n-1)\lambda^{n-2}, ..., \lambda] e.   (69)

By setting [k_n, k_{n-1}, ..., k_1] = [0, \lambda^{n-1}, (n-1)\lambda^{n-2}, ..., \lambda], we have

ds/dt = f - F + v,

and

d|s|/dt = (ds/dt) sgn(s) = (f - F + v) sgn(s).

That is, equation (67) is satisfied if and only if

(f - F + v) sgn(s) <= -\eta.

If we assume the approximation error |f - F| is bounded by a positive number A, then the above condition is always satisfied if

v = -(A + \eta) sgn(s).

To sum up, if we choose the control law as

u(t) = -F(x, p) + x_d^{(n)} - [0, \lambda^{n-1}, (n-1)\lambda^{n-2}, ..., \lambda] e - (A + \eta) sgn(s),

where F(x, p) is an adaptive network that approximates f(x) and A is the error bound, then the closed-loop system can achieve perfect tracking asymptotically with global stability.

This approach uses a number of nonlinear control design techniques and possesses rigorous proofs of global stability. However, its applicability is restricted to feedback linearizable systems. The reader is referred to [86] for a more detailed treatment of this subject. Applications of this technique to neural network and fuzzy control can be found in [82] and [104], respectively.
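A minimal n = 2 simulation of this scheme, following the standard sliding-control recipe of [86]; all numerical values, the toy nonlinearity f, and the deliberately crude stand-in for the adaptive network F are illustrative. For n = 2 the law reads u = -F + \ddot{x}_d - \lambda\dot{e} - (A + \eta) sgn(s), so that ds/dt = f - F + v:

```python
import math

lam, A, eta = 2.0, 1.0, 0.5   # lambda, error bound, reaching-rate margin
h, T = 0.001, 10.0            # Euler step and simulation horizon

def f_true(x):
    # Unknown plant nonlinearity (illustrative); |f| <= 0.6
    return 0.6 * math.sin(x)

def F_approx(x):
    # Crude approximator standing in for ANFIS; |f - F| <= 0.6 < A
    return 0.0

x, xdot, t = 1.0, 0.0, 0.0
for _ in range(int(T / h)):
    xd = math.sin(t)                       # desired trajectory x_d(t)
    e, edot = x - xd, xdot - math.cos(t)   # tracking error and its rate
    s = edot + lam * e                     # sliding variable (n = 2)
    # u = -F + xd'' - lam*edot - (A + eta)*sgn(s); here xd'' = -sin(t)
    u = (-F_approx(x) - math.sin(t) - lam * edot
         - (A + eta) * math.copysign(1.0, s))
    xddot = f_true(x) + u                  # plant: x'' = f(x) + u
    x, xdot, t = x + h * xdot, xdot + h * xddot, t + h

tracking_error = abs(x - math.sin(t))
```

Despite F carrying no information about f, the switching term drives s to (a small neighborhood of) zero in finite time, after which the tracking error decays with time constant 1/\lambda; the residual chattering is an artifact of the discrete step h.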

F. Gain Scheduling

Under certain arrangements, the first-order Sugeno fuzzy model becomes a gain scheduler that switches between several sets of feedback gains. For instance, a first-order Sugeno fuzzy controller for a hypothetical inverted pendulum system with varying pole length may have the following fuzzy if-then rules:

If pole is short, then f_1 = k_{11}\theta + k_{12}\dot{\theta} + k_{13}z + k_{14}\dot{z},
If pole is medium, then f_2 = k_{21}\theta + k_{22}\dot{\theta} + k_{23}z + k_{24}\dot{z},   (70)
If pole is long, then f_3 = k_{31}\theta + k_{32}\dot{\theta} + k_{33}z + k_{34}\dot{z}.

This is in fact a gain scheduling controller, where the scheduling variable is the pole length and the control action switches smoothly between three sets of feedback gains depending on the value of the scheduling variable. In general, the scheduling variables appear only in the premise part, while the state variables appear only in the consequent part. The design method here is standard in gain scheduling: find several nominal points in the space formed by the scheduling variables and employ any of the linear control design techniques to find appropriate feedback gains. If the number of nominal points is small, we can construct the fuzzy rules directly. On the other hand, if the number of nominal points is large, we can always use ANFIS to fit the desired control actions to a fuzzy controller.

Examples of applying this method to both one-pole and two-pole inverted pendulum systems with varying pole lengths can be found in the demo programs in [31].

G. Others

Other design techniques that do not use the learning algorithms in neuro-fuzzy modeling are summarized here. For complex control problems with perfect plant models, we can always use gradient-free optimization schemes, such as genetic algorithms [22], [19], simulated annealing [44], [45], the downhill simplex method [68], and random search methods [63], [88].
In particular, the use of genetic algorithms for neural network controllers can be found in [113]; for fuzzy logic controllers, see [39], [52], [38].

If the plant model is not available, we can apply reinforcement learning [2] to find a working controller directly. The close relationship between reinforcement learning and dynamic programming was addressed in [3], [110]. Other variants of reinforcement learning include temporal difference methods (TD(\lambda) algorithms) and Q-learning [108]. Representative applications of reinforcement learning to fuzzy control can be found in [4], [51], [12], [56].

Some other design and analysis approaches for fuzzy controllers include cell-to-cell mapping techniques [13], [87], the model-based design method [99], self-organizing controllers [75], [100], and so on. As more and more people are working in this field, new design methods are coming out faster than before.
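Returning to the three-rule scheduler of equation (70) in subsection F, the smooth switching between gain sets is simply a normalized weighted blend of the rule consequents; a minimal sketch, with illustrative membership functions and gain values (the scheduling variable is assumed to lie within the covered range, otherwise the normalizing sum would be zero):

```python
def tri(x, a, b, c):
    # Triangular membership function on [a, c] with peak at b
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Illustrative gain sets for short / medium / long poles (four states)
GAINS = [
    [40.0, 10.0, 2.0, 3.0],   # pole is short
    [60.0, 14.0, 2.5, 3.5],   # pole is medium
    [90.0, 20.0, 3.0, 4.0],   # pole is long
]

def scheduled_force(pole_len, state):
    # First-order Sugeno model: the premise uses only the scheduling
    # variable (pole length); consequents are linear in the state.
    w = [tri(pole_len, 0.0, 0.5, 1.5),   # short
         tri(pole_len, 0.5, 1.5, 2.5),   # medium
         tri(pole_len, 1.5, 2.5, 3.5)]   # long
    f = [sum(k * s for k, s in zip(row, state)) for row in GAINS]
    return sum(wi * fi for wi, fi in zip(w, f)) / sum(w)

u = scheduled_force(1.0, [0.1, 0.0, -0.2, 0.0])
```

At a pole length halfway between the "short" and "medium" nominal points, the returned force is the average of the two corresponding linear feedback laws, which is the smooth interpolation the text describes.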

VI. Concluding Remarks

A. Current Problems and Possible Solutions

A typical modeling problem includes structure determination and parameter identification. We address the parameter identification problem for ANFIS in this paper, which is solved via back-propagation gradient descent and the least-squares method. The structure determination problem, which deals with the partition style, the number of MF's for each input, the number of fuzzy if-then rules, and so on, is now an active research topic in the field. Work along this direction includes Jang's fuzzy CART approach [30], Lin's reinforcement learning method [57], Sun's fuzzy k-d trees [93], Sugeno's iterative method [92], and various clustering algorithms proposed by Chiu [15], Khedkar [43], and Wang [105]. Moreover, advances in the constructive and destructive learning of neural networks [18], [53] can also shed some light on this problem.

Though we can speed up parameter identification by introducing the least-squares estimator into the learning cycle, gradient descent still slows down the training process, and the training time could be prohibitively long for a complicated task. Therefore the need to search for better learning algorithms holds equally true for both neural networks and fuzzy models. Variants of gradient descent proposed in the neural network literature, including second-order back-propagation [72], quick-propagation [17], and so on, can be used to speed up training. A number of techniques used in nonlinear regression can also contribute in this regard, such as the Gauss-Newton method (linearization method) and the Marquardt procedure [61]. Another important resource is the rich literature of optimization, which offers many better gradient-based optimization routines, such as quadratic programming and conjugate gradient descent.

B. Future Directions

Due to the extreme flexibility of adaptive networks, ANFIS can have a number of variants that are different from what we have proposed here.
For instance, we can replace the product (\Pi) nodes in layer 2 of ANFIS with a parameterized T-norm operator [16] and let the learning algorithm decide the best T-norm function for a specific application. By employing the adaptive network as a common framework, we have also proposed other adaptive fuzzy models tailored for different purposes, such as the neuro-fuzzy classifier [94], [95] for data classification and the fuzzy filter scheme [96], [97] for feature extraction. There are a number of possible extensions and applications, and they are currently under investigation.

During the past years, we have witnessed the rapid growth of the applications of fuzzy logic and fuzzy set theory to consumer electronic products, the automotive industry, and process control. With the advent of fuzzy hardware with possibly on-chip learning capability, applications to adaptive signal processing and control are expected. Potential applications within adaptive signal processing include adaptive filtering [21], channel equalization [9], [10], [107], noise or echo cancelling [112], predictive coding [53], and so on.

Acknowledgments

The authors wish to thank Steve Chiu for providing numerous helpful comments. Most of this paper was finished while the first author was a research associate at UC Berkeley, so the authors would like to acknowledge the guidance and help of Professor Lotfi A. Zadeh and other members of the "fuzzy group" at UC Berkeley. Research supported in part by the BISC Program, NASA Grant NCC 2-275, EPRI Agreement RP 8010-34 and MICRO State Program No. 92-180.

References

[1] K. J. Åström and B. Wittenmark. Computer Controlled Systems: Theory and Design. Prentice-Hall, 1984.
[2] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuron-like adaptive elements that can solve difficult learning control problems. IEEE Trans. on Systems, Man, and Cybernetics, 13(5):834-846, 1983.
[3] A. G. Barto, R. S. Sutton, and C. J. C. H. Watkins. Learning and sequential decision making. In M. Gabriel and J. W. Moore, editors, Learning and Computational Neuroscience. MIT Press, Cambridge, 1991.
[4] H. R. Berenji and P. Khedkar. Learning and tuning fuzzy logic controllers through reinforcements. IEEE Trans. on Neural Networks, 3(5):724-740, 1992.
[5] J. A. Bernard. Use of a rule-based system for process control. IEEE Control Systems Magazine, 8(5):3-13, 1988.
[6] P. Bonissone, V. Badami, K. Chiang, P. Khedkar, K. Marcelle, and M. Schutten. Industrial applications of fuzzy logic at General Electric. The Proceedings of the IEEE, March 1995.
[7] D. S. Broomhead and D. Lowe. Multivariable functional interpolation and adaptive networks. Complex Systems, 2:321-355, 1988.
[8] S. Chen, C. F. N. Cowan, and P. M. Grant. Orthogonal least squares learning algorithm for radial basis function networks. IEEE Trans. on Neural Networks, 2(2):302-309, March 1991.
[9] S. Chen, G. J. Gibson, C. F. N. Cowan, and P. M. Grant. Adaptive equalization of finite nonlinear channels using multilayer perceptrons.
Signal Processing, 20:107-119, 1990.
[10] S. Chen, G. J. Gibson, C. F. N. Cowan, and P. M. Grant. Reconstruction of binary signals using an adaptive radial-basis-function equalizer. Signal Processing, 22:77-93, 1991.
[11] V. C. Chen and Y. H. Pao. Learning control with neural networks. In Proc. of International Conference on Robotics and Automation, pages 1448-1453, 1989.
[12] Y.-Y. Chen. A self-learning fuzzy controller. In Proc. of IEEE international conference on fuzzy systems, March 1992.
[13] Y.-Y. Chen and T.-C. Tsao. A description of the dynamic behavior of fuzzy systems. IEEE Trans. on Systems, Man, and Cybernetics, 19(4):745-755, July 1989.
[14] S. Chiu, S. Chand, D. Moore, and A. Chaudhary. Fuzzy logic for control of roll and moment for a flexible wing aircraft. IEEE Control Systems Magazine, 11(4):42-48, 1991.
[15] S. L. Chiu. Fuzzy model identification based on cluster estimation. Journal of Intelligent and Fuzzy Systems, 2(3), 1994.
[16] D. Dubois and H. Prade. Fuzzy Sets and Systems: Theory and Applications. Academic Press, New York, 1980.
[17] S. E. Fahlman. Faster-learning variations on back-propagation: an empirical study. In D. Touretzky, G. Hinton, and T. Sejnowski, editors, Proc. of the 1988 Connectionist Models Summer School, pages 38-51, Carnegie Mellon University, 1988.
[18] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In D. S. Touretzky, G. Hinton, and T. Sejnowski, editors, Advances in Neural Information Processing Systems II. Morgan Kaufmann, 1990.
[19] D. E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading, Massachusetts, 1989.

[20] G. C. Goodwin and K. S. Sin. Adaptive Filtering, Prediction and Control. Prentice-Hall, Englewood Cliffs, N.J., 1984.
[21] S. S. Haykin. Adaptive Filter Theory. Prentice Hall, Englewood Cliffs, NJ, 2nd edition, 1991.
[22] J. H. Holland. Adaptation in Natural and Artificial Systems. The University of Michigan Press, 1975.
[23] T. C. Hsia. System Identification: Least-Squares Methods. D.C. Heath and Company, 1977.
[24] J.-S. Roger Jang. Fuzzy modeling using generalized neural networks and Kalman filter algorithm. In Proc. of the Ninth National Conference on Artificial Intelligence (AAAI-91), pages 762-767, July 1991.
[25] J.-S. Roger Jang. Rule extraction using generalized neural networks. In Proc. of the 4th IFSA World Congress, pages 82-86 (in the Volume for Artificial Intelligence), July 1991.
[26] J.-S. Roger Jang. A self-learning fuzzy controller with application to automobile tracking problem. In Proc. of IEEE Roundtable Discussion on Fuzzy and Neural Systems, and Vehicle Application, paper no. 10, Tokyo, Japan, November 1991. Institute of Industrial Science, Univ. of Tokyo.
[27] J.-S. Roger Jang. Fuzzy controller design without domain experts. In Proc. of IEEE international conference on fuzzy systems, March 1992.
[28] J.-S. Roger Jang. Self-learning fuzzy controllers based on temporal back-propagation. IEEE Trans. on Neural Networks, 3(5):714-723, September 1992.
[29] J.-S. Roger Jang. ANFIS: Adaptive-network-based fuzzy inference system. IEEE Trans. on Systems, Man, and Cybernetics, 23(3):665-685, May 1993.
[30] J.-S. Roger Jang. Structure determination in fuzzy modeling: a fuzzy CART approach. In Proc. of IEEE international conference on fuzzy systems, Orlando, Florida, June 1994.
[31] J.-S. Roger Jang and N. Gulley. The Fuzzy Logic Toolbox for Use with MATLAB. The MathWorks, Inc., Natick, Massachusetts, 1995.
[32] J.-S. Roger Jang and C.-T. Sun. Functional equivalence between radial basis function networks and fuzzy inference systems. IEEE Trans.
on Neural Networks, 4(1):156-159, January 1993.
[33] J.-S. Roger Jang and C.-T. Sun. Predicting chaotic time series with fuzzy if-then rules. In Proc. of IEEE international conference on fuzzy systems, San Francisco, March 1993.
[34] J.-S. Roger Jang and C.-T. Sun. Neuro-fuzzy modeling: a computational approach to intelligence, 1995. Submitted for publication.
[35] R. D. Jones, Y. C. Lee, C. W. Barnes, G. W. Flake, K. Lee, and P. S. Lewis. Function approximation and time series prediction with neural networks. In Proc. of IEEE International Joint Conference on Neural Networks, pages I-649-665, 1990.
[36] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Technical report, M.I.T., 1993.
[37] A. Kandel, editor. Fuzzy Expert Systems. CRC Press, Boca Raton, FL, 1992.
[38] C. L. Karr. GAs for fuzzy controllers. AI Expert, 6(2):26-33, February 1991.
[39] C. L. Karr and E. J. Gentry. Fuzzy control of pH using genetic algorithms. IEEE Trans. on Fuzzy Systems, 1(1):46-53, February 1993.
[40] Y. Kasai and Y. Morimoto. Electronically controlled continuously variable transmission. In Proc. of International Congress on Transportation Electronics, Dearborn, Michigan, 1988.
[41] M. Kawato, K. Furukawa, and R. Suzuki. A hierarchical neural network model for control and learning of voluntary movement. Biological Cybernetics, 57:169-185, 1987.
[42] D. J. Kelly, P. D. Burton, and M. A. Rahman. The application of a neural fuzzy controller to process control. In Proc. of the International Joint Conference of the North American Fuzzy Information Processing Society Biannual Conference, the Industrial Fuzzy Control and Intelligent Systems Conference, and the NASA Joint Technology Workshop on Neural Networks and Fuzzy Logic, San Antonio, Texas, December 1994.
[43] P. S. Khedkar. Learning as Adaptive Interpolation in Neural Fuzzy Systems. PhD thesis, Computer Science Division, Department of EECS, University of California at Berkeley, 1993.
[44] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Research Report 9335, IBM T. J. Watson Research Center, 1983.

[45] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671-680, May 1983.
[46] M. S. Klassen and Y.-H. Pao. Characteristics of the functional-link net: a higher order delta rule net. In IEEE Proc. of the International Conference on Neural Networks, San Diego, June 1988.
[47] B. Kosko. Neural Networks and Fuzzy Systems: A Dynamical Systems Approach. Prentice Hall, Englewood Cliffs, NJ, 1991.
[48] A. S. Lapedes and R. Farber. Nonlinear signal processing using neural networks: prediction and system modeling. Technical Report LA-UR-87-2662, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, 1987.
[49] C.-C. Lee. Fuzzy logic in control systems: fuzzy logic controller, part 1. IEEE Trans. on Systems, Man, and Cybernetics, 20(2):404-418, 1990.
[50] C.-C. Lee. Fuzzy logic in control systems: fuzzy logic controller, part 2. IEEE Trans. on Systems, Man, and Cybernetics, 20(2):419-435, 1990.
[51] C.-C. Lee. A self-learning rule-based controller employing approximate reasoning and neural net concepts. International Journal of Intelligent Systems, 5(3):71-93, 1991.
[52] M. A. Lee and H. Takagi. Integrating design stages of fuzzy systems using genetic algorithms. In Proc. of the second IEEE International Conference on Fuzzy Systems, pages 612-617, San Francisco, 1993.
[53] T.-C. Lee. Structure Level Adaptation for Artificial Neural Networks. Kluwer Academic Publishers, 1991.
[54] Fujitec Company Limited. FLEX-8800 series elevator group control system, 1988. Osaka, Japan.
[55] C.-T. Lin and C. S. G. Lee. Neural-network-based fuzzy logic control and decision system. IEEE Trans. on Computers, 40(12):1320-1336, December 1991.
[56] C.-T. Lin and C.-S. G. Lee. Reinforcement structure/parameter learning for neural-network-based fuzzy logic control systems. In Proc. of IEEE International Conference on Fuzzy Systems, pages 88-93, San Francisco, March 1993.
[57] C.-T. Lin and C.-S. G. Lee.
Reinforcement structure/parameter learning for neural-network-based fuzzy logic control systems. IEEE Trans. on Fuzzy Systems, 2(1):46-63, 1994.
[58] L. Ljung. System Identification: Theory for the User. Prentice-Hall, Englewood Cliffs, N.J., 1987.
[59] M. C. Mackey and L. Glass. Oscillation and chaos in physiological control systems. Science, 197:287-289, July 1977.
[60] E. H. Mamdani and S. Assilian. An experiment in linguistic synthesis with a fuzzy logic controller. International Journal of Man-Machine Studies, 7(1):1-13, 1975.
[61] D. W. Marquardt. An algorithm for least squares estimation of nonlinear parameters. Journal of the Society of Industrial and Applied Mathematics, 2:431-441, 1963.
[62] S. J. Mason. Feedback theory: further properties of signal flow graphs. Proc. IRE, 44(7):920-926, July 1956.
[63] J. Matyas. Random optimization. Automation and Remote Control, 26:246-253, 1965.
[64] M. Minsky and S. Papert. Perceptrons. MIT Press, MA, 1969.
[65] J. Moody. Fast learning in multi-resolution hierarchies. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems I, chapter 1, pages 29-39. Morgan Kaufmann, San Mateo, CA, 1989.
[66] J. Moody and C. Darken. Learning with localized receptive fields. In D. Touretzky, G. Hinton, and T. Sejnowski, editors, Proc. of the 1988 Connectionist Models Summer School. Carnegie Mellon University, Morgan Kaufmann Publishers, 1988.
[67] J. Moody and C. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1:281-294, 1989.
[68] J. A. Nelder and R. Mead. A simplex method for function minimization. Computer Journal, 7:308-313, 1964.
[69] D. H. Nguyen and B. Widrow. Neural networks for self-learning control systems. IEEE Control Systems Magazine, pages 18-23, April 1990.
[70] N. J. Nilsson. Learning Machines: Foundations of Trainable Pattern Classifying Systems. McGraw-Hill, New York, 1965.
[71] Y.-H. Pao. Adaptive Pattern Recognition and Neural Networks, chapter 8, pages 197-222.
Addison-Wesley Publishing Company, Inc., 1989.
[72] D. B. Parker. Optimal algorithms for adaptive networks: second order back propagation, second order direct propagation, and second order Hebbian learning. In Proc. of IEEE International Conference on Neural Networks, pages 593-600, 1987.
[73] N. Pfluger, J. Yen, and R. Langari. A defuzzification strategy for a fuzzy logic controller employing prohibitive information in command formulation. In Proc. of IEEE international conference on fuzzy systems, pages 717-723, San Diego, March 1992.
[74] M. J. D. Powell. Radial basis functions for multivariable interpolation: a review. In J. C. Mason and M. G. Cox, editors, Algorithms for Approximation, pages 143-167. Oxford University Press, 1987.
[75] T. J. Procyk and E. H. Mamdani. A linguistic self-organizing process controller. Automatica, 15:15-30, 1978.
[76] D. Psaltis, A. Sideris, and A. Yamamura. A multilayered neural network controller. IEEE Control Systems Magazine, 8(4):17-21, April 1988.
[77] R. S. Crowder, III. Predicting the Mackey-Glass time series with cascade-correlation learning. In D. Touretzky, G. Hinton, and T. Sejnowski, editors, Proc. of the 1990 Connectionist Models Summer School, pages 117-123, Carnegie Mellon University, 1990.
[78] F. Rosenblatt. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan, New York, 1962.
[79] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1, chapter 8, pages 318-362. The MIT Press, 1986.
[80] T. A. Runkler and M. Glesner. Defuzzification and ranking in the context of membership value semantics, rule modality, and measurement theory. In European Congress on Fuzzy and Intelligent Technologies, Aachen, September 1994.
[81] T. D. Sanger. A tree-structured adaptive network for function approximation in high-dimensional spaces. IEEE Trans. on Neural Networks, 2(2):285-293, March 1991.
[82] R. M. Sanner and J. J. E. Slotine. Gaussian networks for direct adaptive control. IEEE Trans.
on Neural Networks, 3:837-862, 1992.
[83] S. Shah, F. Palmieri, and M. Datum. Optimal filtering algorithms for fast learning in feedforward neural networks. Neural Networks, 5(5):779-787, 1992.
[84] S. Shah and F. Palmieri. MEKA: a fast, local algorithm for training feedforward neural networks. In Proc. of International Joint Conference on Neural Networks, pages III 41-46, 1990.
[85] S. Singhal and L. Wu. Training multilayer perceptrons with the extended Kalman algorithm. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems I, pages 133-140. Morgan Kaufmann Publishers, 1989.
[86] J.-J. E. Slotine and W. Li. Applied Nonlinear Control. Prentice Hall, 1991.
[87] S. M. Smith and D. J. Comer. Automated calibration of a fuzzy logic controller using a cell state space algorithm. IEEE Control Systems Magazine, 11(5):18-28, August 1991.
[88] F. J. Solis and J. B. Wets. Minimization by random search techniques. Mathematics of Operations Research, 6(1):19-30, 1981.
[89] K. Stokbro, D. K. Umberger, and J. A. Hertz. Exploiting neurons with localized receptive fields to learn chaos. Complex Systems, 4:603-622, 1990.
[90] M. Sugeno, editor. Industrial Applications of Fuzzy Control. Elsevier Science Pub. Co., 1985.
[91] M. Sugeno and G. T. Kang. Structure identification of fuzzy model. Fuzzy Sets and Systems, 28:15-33, 1988.
[92] M. Sugeno and T. Yasukawa. A fuzzy-logic-based approach to qualitative modeling. IEEE Trans. on Fuzzy Systems, 1(1):7-31, February 1993.
[93] C.-T. Sun. Rulebase structure identification in an adaptive-network-based fuzzy inference system. IEEE Trans. on Fuzzy Systems, 2(1):64-73, 1994.
[94] C.-T. Sun and J.-S. Roger Jang. Adaptive network based fuzzy classification. In Proc. of the Japan-U.S.A. Symposium on Flexible Automation, July 1992.
[95] C.-T. Sun and J.-S. Roger Jang. A neuro-fuzzy classifier and its applications. In Proc. of IEEE international conference on fuzzy systems, San Francisco, March 1993.
[96] C.-T. Sun, J.-S. Roger Jang, and C.-Y. Fu.
Neural networkanalysis of plasma spectra. In Proc. of the International Con-ference on Arti�cial Neural Networks, Amsterdam, September1993.

[97] C.-T. Sun, T.-Y. Shuai, and G.-L. Dai. Using fuzzy filters as feature detectors. In Proc. of IEEE International Conference on Fuzzy Systems, pages 406–410 (Vol. I), Orlando, Florida, June 1994.
[98] T. Takagi and M. Sugeno. Fuzzy identification of systems and its applications to modeling and control. IEEE Trans. on Systems, Man, and Cybernetics, 15:116–132, 1985.
[99] K. Tanaka and M. Sugeno. Stability analysis and design of fuzzy control systems. Fuzzy Sets and Systems, 45:135–156, 1992.
[100] R. Tanscheit and E. M. Scharf. Experiments with the use of a rule-based self-organizing controller for robotics applications. Fuzzy Sets and Systems, 26:195–214, 1988.
[101] Y. Tsukamoto. An approach to fuzzy reasoning method. In Madan M. Gupta, Rammohan K. Ragade, and Ronald R. Yager, editors, Advances in Fuzzy Set Theory and Applications, pages 137–149. North-Holland, Amsterdam, 1979.
[102] V. I. Utkin. Variable structure systems with sliding mode: a survey. IEEE Trans. on Automatic Control, 22:212, 1977.
[103] K. P. Venugopal, R. Sudhakar, and A. S. Pandya. An improved scheme for direct adaptive control of dynamical systems using backpropagation neural networks. Journal of Circuits, Systems and Signal Processing, 1994. (Forthcoming.)
[104] L.-X. Wang. Stable adaptive fuzzy control of nonlinear systems. IEEE Trans. on Fuzzy Systems, 1(1):146–155, 1993.
[105] L.-X. Wang. Training fuzzy logic systems using nearest neighborhood clustering. In Proc. of the IEEE International Conference on Fuzzy Systems, San Francisco, March 1993.
[106] L.-X. Wang and J. M. Mendel. Back-propagation fuzzy systems as nonlinear dynamic system identifiers. In Proc. of the IEEE International Conference on Fuzzy Systems, San Diego, March 1992.
[107] L.-X. Wang and J. M. Mendel. Fuzzy adaptive filters, with application to nonlinear channel equalization. IEEE Trans. on Fuzzy Systems, 1(3):161–170, 1993.
[108] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992.
[109] P. Werbos. Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University, 1974.
[110] Paul J. Werbos. A menu of designs for reinforcement learning over time. In W. Thomas Miller, III, Richard S. Sutton, and Paul J. Werbos, editors, Neural Networks for Control, chapter 3. The MIT Press, Bradford, 1990.
[111] B. Widrow and M. A. Lehr. 30 years of adaptive neural networks: Perceptron, madaline, and backpropagation. Proceedings of the IEEE, 78(9):1415–1442, 1990.
[112] B. Widrow and D. Stearns. Adaptive Signal Processing. Prentice-Hall, Englewood Cliffs, N.J., 1985.
[113] Alexis P. Wieland. Evolving controls for unstable systems. In D. Touretzky, G. Hinton, and T. Sejnowski, editors, Proc. of the 1990 Connectionist Models Summer School, pages 91–102, Carnegie Mellon University, 1990.
[114] R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1:270–280, 1989.
[115] R. R. Yager and D. P. Filev. SLIDE: a simple adaptive defuzzification method. IEEE Trans. on Fuzzy Systems, 1(1):69–78, February 1993.
[116] S. Yasunobu and G. Hasegawa. Evaluation of an automatic container crane operation system based on predictive fuzzy control. Control Theory and Advanced Technology, 2(2):419–432, 1986.
[117] S. Yasunobu and S. Miyamoto. Automatic train operation by predictive fuzzy control. In M. Sugeno, editor, Industrial Applications of Fuzzy Control, pages 1–18. North-Holland, Amsterdam, 1985.
[118] L. A. Zadeh. Fuzzy sets. Information and Control, 8:338–353, 1965.
[119] L. A. Zadeh. Outline of a new approach to the analysis of complex systems and decision processes. IEEE Trans. on Systems, Man, and Cybernetics, 3(1):28–44, January 1973.

Jyh-Shing Roger Jang was born in Taipei, Taiwan, in 1962. He received the B.S. degree in electrical engineering from National Taiwan University in 1984, and the Ph.D. degree from the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley, in 1992.

During the summer of 1989, he was a summer student at NASA Ames Research Center, working on the design and implementation of fuzzy controllers. Between 1991 and 1992 he was a research scientist at the Lawrence Livermore National Laboratory, working on spectrum modeling and analysis using neural networks and fuzzy logic. After obtaining his Ph.D. degree, he was a research associate in the same department, working on machine learning techniques using fuzzy logic. Since September 1993, he has been with The MathWorks, Inc., working on the Fuzzy Logic Toolbox used with MATLAB.

His interests lie in the areas of neuro-fuzzy modeling, system identification, machine learning, nonlinear regression, optimization, and computer-aided control system design.

Dr. Jang is a member of the IEEE.

Chuen-Tsai Sun received his B.S. degree in electrical engineering in 1979 and his M.A. degree in history in 1984, both from National Taiwan University, Taiwan. He received his Ph.D. degree in computer science from the University of California at Berkeley in 1992. His Ph.D. research advisor was Professor Lotfi A. Zadeh, the initiator of fuzzy set theory.

Between 1989 and 1990 he worked as a consultant with the Pacific Gas and Electric Company, San Francisco, in charge of designing and implementing an expert system for protective device coordination in electric distribution circuits. Between 1991 and 1992 he was a research scientist at the Lawrence Livermore National Laboratory, working on plasma analysis using neural networks and fuzzy logic. Since August 1992, he has been on the faculty of the Department of Computer and Information Science at National Chiao Tung University. His current research interests include computational intelligence, system modeling, and computer-assisted learning.

Dr. Sun is a member of the IEEE. He won the Arthur Gould Tasheira Scholarship in 1986, and was honored with the Phi Hua Scholar Award in 1985 for his publications in history.