Artificial Neural Networks: An Econometric Perspective


INTRODUCTION

Artificial neural networks are a class of models developed by cognitive scientists interested in understanding how computation is performed by the brain. These networks are capable of learning through a process of trial and error that can be appropriately viewed as statistical estimation of model parameters.

Although inspired by certain aspects of the way information is processed in the brain, these network models and their associated learning paradigms are still far from anything close to a realistic description of how brains actually work. They nevertheless provide a rich, powerful and interesting modeling framework with proven and potential application across the sciences. To mention just a handful of such applications, artificial neural networks have been successfully used to translate printed English text into speech (Sejnowski and Rosenberg, 1986), to recognize hand-printed characters (Fukushima and Miyake, 1984), to perform complex coordination tasks (Selfridge, Sutton and Barto, 1985), to play backgammon (Tesauro, 1989), to diagnose chest pain (Baxt, 1991), and to decode deterministic chaos (Lapedes and Farber, 1987; White, 1989; Gallant and White, 1991).

Successes in these and other areas suggest that artificial neural network models may serve as a useful addition to the tool-kits of economists and econometricians. Areas with particular potential for application include time-series modeling and forecasting, nonparametric estimation, and learning by economic agents.

The purpose of this article is two-fold: first, to review the basic concepts and theory required to make artificial neural networks accessible to economists and econometricians, with particular focus on econometrically relevant methodology; and second, to develop theory for a leading neural network learning paradigm to a point comparable to that of the modern theory of estimation and inference for misspecified nonlinear dynamic models (e.g., Gallant and White, 1988a; Pötscher and Prucha, 1991a,b).

As we hope will become apparent from our development, not only do artificial neural networks have much to offer economics and econometrics, but there is also considerable


potential for economics and econometrics to benefit the neural network field, arising to a considerable degree from economic and econometric experience in modeling and estimating dynamic systems. Thus, a larger goal of this article is to provide an entry point and appropriate background for those wishing to engage in the fascinating intellectual arbitrage required to fully realize the potential gains from trade between economics, econometrics and artificial neural networks.

PART I: OVERVIEW AND HEURISTICS

I.1. ARTIFICIAL NEURAL NETWORK MODELS

The simplest general artificial neural network (ANN) models draw primarily on three features of the way that biological neural networks process information: massive parallelism, nonlinear neural unit response to neural unit input, and processing by multiple layers of neural units. Incorporation of a fourth feature, dynamic feedback among units, leads to even greater generality and richness. In this section, we describe how these features are embodied in now standard approaches to ANN modeling, and some of the implications of these embodiments. Because of the very considerable breadth of ANN paradigms, we cannot do justice to the entire spectrum of such models; instead, we focus our attention on those most easily related to and with greatest relevance for econometrics.

Although not usually thought of in such terms, parallelism is a familiar aspect of econometric modeling. A schematic of a simple parallel processing network is shown in Figure 1. Here, input units ("sensors") send real-valued signals (x_i, i = 1, ..., r) in parallel over connections to subsequent units, designated "output units" for now. The signal from input unit i to output unit j may be attenuated or amplified by a factor γ_ji ∈ ℝ, so that signals x_i γ_ji reach output unit j, i = 1, ..., r. The factors γ_ji are known as "network weights" or "connection strengths."

In simple ANN models, the receiving units process parallel incoming signals in typically simple ways. The simplest is to add the signals seen by the receiver, in which case the output unit produces output


$$\sum_{i=1}^{r} x_i \gamma_{ji}, \qquad j = 1, \ldots, v.$$

If, as is common, we permit an input, say x_0, to supply x_0 = 1 to the network (a "bias unit" in network jargon), output can be represented as

$$f_j(x, \gamma) = x'\gamma_j, \quad j = 1, \ldots, v, \qquad \text{or} \qquad f(x, \gamma) = (I_v \otimes x')\,\gamma,$$

where f = (f_1, ..., f_v)', x = (1, x_1, ..., x_r)', and γ = (γ'_1, ..., γ'_v)'.


    the first possibility, proposing networks with output unit activity given by

$$f_j(x, \gamma) = G(x'\gamma_j), \qquad j = 1, \ldots, v,$$

where G(a) = 1 if a > 0 and G(a) = 0 if a ≤ 0. This choice for G implements the "Heaviside" or unit step function. Output unit j thus turns on when x'γ_j > 0, i.e., when input activity Σ_{i=1}^r x_i γ_ji exceeds the threshold −γ_j0. For this reason the Heaviside function is said to implement a "threshold logic unit" (TLU). G is called the "activation function" of the (output) unit.

Networks with TLUs are appropriate for classification and recognition tasks: the study of such networks exclusively pre-occupied the ANN field through the 1950's and dominated the field through the 1960's. In retrospect, a major breakthrough in the ANN literature occurred when it was proposed to replace the Heaviside activation function with a smooth sigmoid (s-shaped) function, in particular the logistic function, G(a) = 1/(1 + exp(−a)) (Cowan, 1967). Instead of switching abruptly from off to on, "sigmoidal" units turn on gradually as input activity increases. The reason why this constituted a breakthrough from the ANN standpoint will be discussed in the next section.

With this modification, however, we observe that f_j(x, γ) = G(x'γ_j) = 1/(1 + exp(−x'γ_j)) is precisely the familiar binary logit probability model (e.g., Amemiya, 1981; 1985, p. 268). Other choices for G yield other models appropriate for classification or qualitative response modeling; for example, if G is the normal cumulative distribution function, we have the binary probit model, etc. As Amemiya (1981) documents in his classic survey, such models have great utility in econometric applications where binary classifications or decisions are involved.

Although biological networks with direct connections from input to output units are well-known (e.g., the knee-jerk reflex is mediated by direct connections from sensory receptors in the knee onto motoneurons in the spinal cord that then activate leg muscles), it is much more common to observe processing occurring in multiple layers of units. For example, six distinct processing layers are at work in the human cortex. Such multilayered structures were introduced into the ANN literature by Rosenblatt (1957, 1958) and by Gamba and his associates (Palmieri


and Sanna, 1960; Gamba et al., 1961). Figure 2 shows a schematic diagram of a network containing a single intermediate layer of processing units separating input from output. Intermediate layers of this sort are often called "hidden" layers to distinguish them from the input and output layers.

Processing in such networks is straightforward. Units in one layer treat the units in the preceding layer as input, and produce outputs to be processed by the succeeding layer. The output function for such a network with a single hidden layer (as in Figure 2) is thus of the form

$$f_h(x, \theta) = F\Big(\beta_{h0} + \sum_{j=1}^{q} G(x'\gamma_j)\,\beta_{hj}\Big), \qquad h = 1, \ldots, v. \tag{I.1.1}$$

Here F: ℝ → ℝ is the output activation function, and β_hj, j = 0, 1, ..., q, h = 1, ..., v, are connection strengths from hidden unit j (j = 0 indexes a bias unit) to output unit h. The vector θ = (β'_1, ..., β'_v, γ'_1, ..., γ'_q)' (with β'_h = (β_h0, ..., β_hq)) collects together all network weights. Note that we have q hidden units.

As originally introduced, the hidden layer network activation functions F and G implemented TLUs. However, modern practice permits F and G to be chosen quite freely. Leading choices are F(a) = G(a) = 1/(1 + exp(−a)) (the logistic) or F(a) = a (the identity) and G(a) = 1/(1 + exp(−a)). Because of its notational simplicity and considerable generality, we adopt the latter choice, and for further simplicity set v = 1. Thus we shall pay particular attention to "single hidden layer" networks with output functions of the form

$$f(x, \theta) = \beta_0 + \sum_{j=1}^{q} G(x'\gamma_j)\,\beta_j. \tag{I.1.2}$$

Although we have seen econometrically familiar models emerge in our foregoing discussion of ANN models (e.g., seemingly unrelated regression systems and logit models), equation (I.1.2) is not so familiar. It does bear a strong resemblance to the projection pursuit models of modern statistics (Friedman and Stuetzle, 1981; Huber, 1985) in which output response is given by


$$f(x, \theta) = \beta_0 + \sum_{j=1}^{q} G_j(x'\gamma_j)\,\beta_j.$$

However, in projection pursuit models the functions G_j are unknown and must be estimated from data (permitting β_j to be absorbed into G_j), whereas in the hidden layer network model (I.1.2), G is given. The hidden layer network model is thus somewhat simpler than the projection pursuit model.

A variant of the single hidden layer network that is particularly relevant for econometric applications is depicted in Figure 3. This network has direct connections from the input to the output layer as well as a single hidden layer. Output for this network can be expressed as

$$f(x, \theta) = F\Big(x'\alpha + \beta_0 + \sum_{j=1}^{q} G(x'\gamma_j)\,\beta_j\Big), \tag{I.1.3}$$

where α is an r × 1 vector of input-to-output weights, and θ is now taken to be θ = (α', β_0, ..., β_q, γ'_1, ..., γ'_q)'. By suitable choice of G, α and β = (β_0, β_1, ..., β_q)', we nest as special cases all of the networks discussed so far.

In particular, with F(a) = a (the identity) we have a standard linear model augmented by nonlinear terms. Given the popularity of linear models in econometrics, this form is particularly appealing, as it suggests that ANN models can be viewed as extensions of, rather than as alternatives to, the familiar models. The hidden unit activations can then be viewed as latent variables whose inclusion enriches the linear model. We shall refer to an ANN model with output of the form (I.1.3) as an "augmented" single hidden layer network. Such networks will play an important role in the discussion of subsequent sections.
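To make the functional form concrete, the following is a minimal sketch (in Python with NumPy; the function and variable names are ours for illustration, not part of any established library) of the output of an augmented single hidden layer network (I.1.3) with logistic G and identity F.

```python
import numpy as np

def logistic(a):
    """Logistic activation G(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def augmented_net(x, alpha, beta, gamma):
    """Output of an augmented single hidden layer network, eq. (I.1.3), F = identity.

    x     : (r,) input vector (without the bias term)
    alpha : (r,) direct input-to-output weights
    beta  : (q+1,) hidden-to-output weights; beta[0] is the bias beta_0
    gamma : (q, r+1) hidden unit weights; each row gamma[j] multiplies (1, x)
    """
    x_tilde = np.concatenate(([1.0], x))   # prepend the bias unit x_0 = 1
    hidden = logistic(gamma @ x_tilde)     # hidden unit activations G(x'gamma_j)
    return x @ alpha + beta[0] + hidden @ beta[1:]

# Illustrative usage: r = 2 inputs, q = 3 hidden units, arbitrary weights
rng = np.random.default_rng(0)
x = rng.normal(size=2)
alpha = rng.normal(size=2)
beta = rng.normal(size=4)
gamma = rng.normal(size=(3, 3))
print(augmented_net(x, alpha, beta, gamma))
```

Setting alpha to zero recovers the plain single hidden layer form (I.1.2), and setting beta[1:] to zero recovers the linear model, illustrating the nesting described above.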

What originally commanded the attention and excitement of a diverse range of disciplines was the demonstrated successes that models of the form (I.1.1) and (I.1.2) had in solving previously intractable classification, forecasting and control problems, or in producing superior solutions to difficult problems in orders of magnitude less time than traditional approaches. Until recently, a theoretical basis for such successes was unknown -- artificial neural networks just


seemed to work surprisingly well.

Motivated by a desire either to delineate the limitations of network models or to

understand their diverse successes, a number of researchers independently produced rigorous results establishing that functions of the form (I.1.2) can be viewed as "universal approximators," that is, as a flexible functional form that, provided with sufficiently many hidden units and properly adjusted parameters, can approximate an arbitrary function g: ℝ^r → ℝ arbitrarily well in useful spaces of functions. Results of this sort have been given by Carroll and Dickinson (1989), Cybenko (1989), Funahashi (1989), Hecht-Nielsen (1989), Hornik, Stinchcombe and White (1989, 1990) (HSWa, HSWb) and Stinchcombe and White (1989), among others.

The flavor of such results is conveyed by the following paraphrase of part of Theorem 2.4 of HSWa.

THEOREM I.1.1: For r ∈ ℕ, let Σ^r(G) be the class of hidden layer network functions Σ^r(G) = {f: ℝ^r → ℝ | f(x) = β_0 + Σ_{j=1}^q G(x'γ_j)β_j, x ∈ ℝ^r; γ_j ∈ ℝ^{r+1}, β_j ∈ ℝ, j = 1, ..., q; β_0 ∈ ℝ, q ∈ ℕ}, where G: ℝ → [0, 1] is any cumulative distribution function. Then Σ^r(G) is uniformly dense on compacta in C(ℝ^r), i.e., for every g in C(ℝ^r), every compact subset K of ℝ^r, and every ε > 0, there exists f ∈ Σ^r(G) such that sup_{x ∈ K} |f(x) − g(x)| < ε.

Thus, the biologically inspired combination of parallelism, nonlinear response and multilayer processing leads us to a class of functions that can approximate members of the useful class C(ℝ^r) arbitrarily well.

Similar results hold for network models with general (not necessarily sigmoid) activation functions approximating functions in L_p spaces with compactly supported measures, and, as HSWb and Hornik (1991) show, in general Sobolev spaces. Thus, functions of the form (I.1.2) can approximate a function and its derivatives arbitrarily well, and in this sense are as flexible as Gallant's (1981) flexible Fourier form. Indeed, Gallant and White (1988b) construct a sigmoid choice for G (the "cosine squasher") that nests Fourier series within (I.1.2), so that the flexible Fourier form is a special case of (I.1.3) even for sigmoid G.


The econometric usefulness of the flexible form (I.1.2) has been further enhanced by Hu and Joerding (1990) and Joerding and Meador (1990), who show how to impose constraints ensuring monotonicity and concavity (or convexity) of the network output function. The interested reader is referred to these papers for details.

An issue of both theoretical and practical importance is the "degree of approximation" problem: how rapidly does the approximation to an arbitrary function improve as the number of hidden units q increases? Classic results for Fourier series are provided by Edmunds and Moscatelli (1977). Similar results for ANN models are only beginning to appear, and so far are not as sharp as those for Fourier series. Barron (1991a) exploits results of Jones (1991) to establish essentially that ‖f − g‖_2 = O(1/q^{1/2}) (‖·‖_2 denotes an L_2 norm) when f is an element of Σ^r(G) having q hidden units and continuously differentiable sigmoid activation function, and g belongs to a certain class of smooth functions satisfying a summability condition on its Fourier transform. An important open area for further work is the extension and deepening of results of this sort, especially as such results may provide key insight into advantages and disadvantages of ANN models compared to standard flexible function families. Degree of approximation results are also necessary for establishing rates of convergence for nonparametric estimation based on ANN models.

Our focus so far on networks with a single hidden layer is justified by their relative simplicity and their approximative power. However, if nature is any guide, there are advantages to using networks of many hidden layers, as depicted in Figure 4. Output of an l-layer network can be represented as

$$a_{hi} = G_h(A_{hi}(a_{h-1})), \qquad i = 1, \ldots, q_h; \; h = 1, \ldots, l, \tag{I.1.4}$$

where a_h is a q_h × 1 vector with elements a_hi, A_hi(·) is an affine function of its argument (i.e., A_hi(a) = ã'γ_hi for some (q_{h−1} + 1) × 1 vector γ_hi, with ã = (1, a')'), G_h is the activation function for units of layer h, a_0 = x, q_0 = r, and q_l = v. The single hidden layer networks discussed above correspond to l = 2 in this representation.
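As an illustration, here is a minimal NumPy sketch (names are ours, for illustration only) of the l-layer forward recursion (I.1.4): each layer applies an affine map to the augmented activations of the previous layer and then that layer's activation function.

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, weights, activations):
    """Forward pass of an l-layer network, eq. (I.1.4).

    x           : (q_0,) input vector (a_0 = x)
    weights     : list of l arrays; weights[h] has shape (q_{h+1}, q_h + 1),
                  each row being a gamma vector applied to (1, a_h)
    activations : list of l activation functions G_1, ..., G_l
    """
    a = np.asarray(x, dtype=float)
    for W, G in zip(weights, activations):
        a_tilde = np.concatenate(([1.0], a))   # prepend bias unit
        a = G(W @ a_tilde)                     # a_h = G_h(A_h(a_{h-1}))
    return a

# Usage: r = 2 inputs, one hidden layer of 3 units, one output unit (l = 2)
rng = np.random.default_rng(1)
weights = [rng.normal(size=(3, 3)), rng.normal(size=(1, 4))]
activations = [logistic, lambda a: a]          # identity output activation
print(forward(rng.normal(size=2), weights, activations))
```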


An interesting open question is to what extent networks with l ≥ 3 layers may be preferable to networks with l = 2 layers. Specifically, for what classes of functions can a three layer network achieve a given degree of accuracy with fewer connections (free parameters) than a two layer network? Examples are known in which a two layer network cannot exactly represent a function exactly representable by a three layer network (Blum and Li, 1991), and it is known that certain mappings containing discontinuities relevant in control theory can be uniformly approximated in three but not two layers (Sontag, 1990). HSWa (Corollary 2.7) have shown that additional layers cannot hurt, in the sense that approximation properties of single hidden layer networks (l = 2) carry over to multi-hidden layer networks. Further research in this interesting area is needed.

A further generalization of the networks represented by (I.1.4) is obtained by replacing the affine function A_hi(·) with a polynomial P_hi(·) with degree possibly dependent on i and h. This modification yields a class of networks containing as a special case the so-called "sigma-pi" (ΣΠ) networks (Maxwell, Giles, Lee and Chen, 1986; Williams, 1986). Stinchcombe (1991) has studied the approximation properties of networks for which an arbitrary "inner function" I_hi replaces A_hi in (I.1.4).

The richness of this class of network models is now fairly apparent. However, we still have not exploited a known feature of biological networks, that of internal feedback. Returning to the relatively simple single hidden layer networks, such feedbacks can be represented schematically as in Figure 5. In Figure 5(a), network output feeds back into the hidden layer with a time delay, as proposed by Jordan (1986). In Figure 5(b), hidden layer output feeds back into the hidden layer with a time delay, as proposed by Elman (1988). The output function of the Elman network can thus be represented as

$$f_t(x^t, \theta) = \beta_0 + \sum_{j=1}^{q} a_{tj}\,\beta_j,$$
$$a_{tj} = G(x_t'\gamma_j + a_{t-1}'\delta_j), \qquad j = 1, \ldots, q; \; t = 0, 1, 2, \ldots, \tag{I.1.5}$$


where a_t = (a_{t1}, ..., a_{tq})'. As a consequence of this feedback, network output depends on the initial value a_0 and the entire history of system inputs, x^t = (x_1, ..., x_t).
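A minimal sketch of the Elman recursion (I.1.5) in NumPy (our own illustrative names; for simplicity the bias terms inside G are absorbed into the input), showing how the hidden state a_t carries the history of inputs forward:

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def elman_outputs(X, beta, gamma, delta, a0):
    """Outputs of an Elman recurrent network, eq. (I.1.5).

    X     : (T, r) array of inputs x_1, ..., x_T
    beta  : (q+1,) hidden-to-output weights; beta[0] = beta_0
    gamma : (q, r) input-to-hidden weights
    delta : (q, q) hidden-to-hidden (feedback) weights
    a0    : (q,) initial hidden state
    """
    a = np.asarray(a0, dtype=float)
    outputs = []
    for x_t in X:
        a = logistic(gamma @ x_t + delta @ a)    # a_tj = G(x_t'gamma_j + a_{t-1}'delta_j)
        outputs.append(beta[0] + a @ beta[1:])   # f_t = beta_0 + sum_j a_tj beta_j
    return np.array(outputs), a

# Usage with q = 3 hidden units and r = 2 inputs
rng = np.random.default_rng(2)
X = rng.normal(size=(5, 2))
y, a_last = elman_outputs(X, rng.normal(size=4), rng.normal(size=(3, 2)),
                          rng.normal(size=(3, 3)), np.zeros(3))
print(y)
```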

Such networks are capable of rich dynamic behavior, exhibiting memory and context sensitivity. Because of the presence of internal feedbacks, these networks are referred to in the literature as "recurrent networks," while networks lacking feedback (e.g., with output functions (I.1.3)) are designated "feedforward networks."

In econometric terms, a model of the form (I.1.5) can be viewed as a nonlinear dynamic latent variables model. Such models have a great many potential applications in economics and finance. Their estimation would appear to present some serious computational challenges (see e.g. Hendry and Richard, 1990, and Duffie and Singleton, 1990), but in fact some straightforward recursive estimation procedures related to the Kalman filter can deliver consistent estimates of model parameters (Kuan, Hornik and White, 1990; Kuan and White, 1991). We discuss this further in the next section.

Although we have covered a fair amount of ground in this section, we have only scratched the surface of the modeling possibilities offered by artificial neural networks. To mention some additional models treated in the ANN literature, we note that fully interconnected networks have been much studied (with applications to such areas as associative memory and solution of problems like the traveling salesman problem; see e.g. Xu and Tsai, 1990, and Xu and Tsai, 1991), and that networks running in continuous rather than discrete time are also standard objects of investigation (e.g. Williams and Zipser, 1989). Although fascinating, these network models appear to be less relevant to econometrics than those discussed so far, and we shall not treat them further.

As rich as ANN models are, they still ignore a host of biologically relevant features. Neural systems that have taken perhaps billions of years to evolve will take humans a little more time to model exhaustively than the five decades devoted so far! To mention just a few items, biological neurons communicate over multiple pathways, chemical as well as electrical -- the single communication dimension ("activation") assumed in most ANN models is quite incomplete.


Also, biological neurons respond to input activity stochastically and in much more complicated ways than as modeled by the sigmoid activation function -- neurons output complex spike trains through time, and are in fact not simple processing units. Of course, these and other limitations of ANN models are daily being challenged by ANN modelers, and we may expect a continuing increase in the richness of ANN models as the diverse interdisciplinary talents of the ANN community are brought to bear on these issues.

Despite these limitations as descriptions of biological reality, ANN models are sufficiently rich as to present a potentially attractive set of tools for econometric modeling. Given models, the econometrician wants estimators. We take up estimation in the next section, where we encounter additional interesting tools developed by the ANN community in their study of learning in artificial neural networks.

I.2. LEARNING IN ARTIFICIAL NEURAL NETWORKS

The discussion of the previous section establishes ANN models as flexible functional forms, extending standard linear specifications. As such, they are potentially useful for econometric modeling. To fulfill this potential, we require methods for finding useful values for the free parameters of the model, the network weights.

To any econometrician versed in the standard tools of the trade, a multitude of relevant estimation procedures for finding useful parameter values present themselves, typically dependent on the behavior of the data generating process and the goals of the analysis.

For example, suppose we observe a realization of a random sequence of s × 1 vectors {Z_t = (Y_t, X_t')'} (assumed stationary for simplicity), and we wish to forecast Y_t on the basis of X_t. The minimum mean-squared error forecast of Y_t given X_t is the conditional expectation g(X_t) = E(Y_t | X_t). Although the function g is unknown, we can attempt to approximate it using a neural network with some sufficient number of hidden units. If we adopt (I.1.3) with F the identity, we obtain a regression model of the form


$$f(x, \theta) = x'\alpha + \beta_0 + \sum_{j=1}^{q} G(x'\gamma_j)\,\beta_j,$$

where θ = (α', β_0, β_1, ..., β_q, γ'_1, ..., γ'_q)' and for simplicity we choose q and G a priori.

Because this model is only intended as an approximation, we must acknowledge

from the outset that it is misspecified. Nevertheless, the theory of least squares for misspecified nonlinear regression models (White, 1981; 1992, Ch. 5; Domowitz and White, 1982; Gallant and White, 1988a) applies immediately to establish that a nonlinear least squares estimator θ̂_n solving the problem

$$\min_{\theta \in \Theta} \; n^{-1} \sum_{t=1}^{n} [Y_t - f(X_t, \theta)]^2$$

exists and converges almost surely under general conditions as n → ∞ to θ*, the solution to the problem

$$\min_{\theta \in \Theta} \; E([Y_t - f(X_t, \theta)]^2) = \sigma_*^2 + \min_{\theta \in \Theta} \; E([g(X_t) - f(X_t, \theta)]^2),$$

where σ²_* = E([Y_t − g(X_t)]²). (See Sussman (1991) for discussion of issues relating to identification.)

Further, under general conditions √n(θ̂_n − θ*) converges in distribution as n → ∞ to a multivariate normal distribution with mean zero and consistently estimable covariance matrix (White, 1981; 1992, Ch. 6; Domowitz and White, 1982).
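As a rough illustration, the following sketch (NumPy only; the simulated data, the finite-difference gradients and the simple gradient-descent optimizer are our own choices, not a prescription from the text) fits the augmented network regression by nonlinear least squares.

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def net(X, theta, q):
    """Augmented single hidden layer network f(x, theta), eq. (I.1.3), F = identity.
    theta packs (alpha, beta_0..beta_q, gamma_1..gamma_q) for r inputs."""
    n, r = X.shape
    alpha = theta[:r]
    beta = theta[r:r + q + 1]
    gamma = theta[r + q + 1:].reshape(q, r + 1)
    Xt = np.column_stack([np.ones(n), X])          # augment with bias unit
    H = logistic(Xt @ gamma.T)                     # hidden activations, n x q
    return X @ alpha + beta[0] + H @ beta[1:]

def nls_fit(X, y, q, steps=2000, lr=0.05, seed=0):
    """Minimize n^{-1} sum_t [y_t - f(X_t, theta)]^2 by plain gradient descent
    with numerical gradients (a sketch, not an efficient estimator)."""
    n, r = X.shape
    rng = np.random.default_rng(seed)
    theta = 0.1 * rng.normal(size=r + q + 1 + q * (r + 1))
    mse = lambda th: np.mean((y - net(X, th, q)) ** 2)
    for _ in range(steps):
        grad = np.array([(mse(theta + 1e-6 * e) - mse(theta - 1e-6 * e)) / 2e-6
                         for e in np.eye(theta.size)])
        theta -= lr * grad
    return theta, mse(theta)

# Simulated example: nonlinear conditional mean plus noise
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)
theta_hat, final_mse = nls_fit(X, y, q=3)
print(final_mse)
```

Because the model is treated as an approximation, the fitted theta_hat estimates the pseudo-true parameter θ* of the discussion above rather than parameters of a "true" network.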

Although least squares is a leading case, the properties of the dependent variable Y_t will often suggest the appropriateness of a quasi-maximum likelihood procedure different from least squares. For example, if Y_t is a binary choice indicator taking values 0 or 1 only, it may be assumed to follow a conditional Bernoulli distribution, given X_t. A network model to approximate g(X_t) = P[Y_t = 1 | X_t] = E(Y_t | X_t) can be specified as

$$f(x, \theta) = F\Big(x'\alpha + \beta_0 + \sum_{j=1}^{q} G(x'\gamma_j)\,\beta_j\Big), \tag{I.2.1}$$


where F(·) is now some appropriate c.d.f. (e.g., the logistic or normal). The mean quasi-log likelihood function for a sample of size n is then

$$L_n(Z^n, \theta) = n^{-1} \sum_{t=1}^{n} [Y_t \log f(X_t, \theta) + (1 - Y_t) \log(1 - f(X_t, \theta))].$$

A quasi-maximum likelihood estimator θ̂_n solving the problem

$$\max_{\theta \in \Theta} \; L_n(Z^n, \theta)$$

can be shown under general conditions to exist and converge to θ*, the solution to the problem

$$\max_{\theta \in \Theta} \; E[Y_t \log f(X_t, \theta) + (1 - Y_t) \log(1 - f(X_t, \theta))].$$

(See White, 1982; 1992, Ch. 3-5.)

The solution θ* minimizes the Kullback-Leibler divergence of the approximate probability model f(X_t, θ*) from the true g(X_t). As in the least squares case, √n(θ̂_n − θ*) converges in distribution as n → ∞ to a multivariate normal distribution with mean zero and consistently estimable covariance matrix (White, 1982; 1992, Ch. 6).
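For concreteness, a short sketch (our own illustrative code, not from the text) of the mean Bernoulli quasi-log likelihood L_n for the network model (I.2.1), taking both F and G to be logistic:

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def net_prob(X, alpha, beta, gamma):
    """f(x, theta) of eq. (I.2.1) with F and G both logistic: an approximation
    to P[Y_t = 1 | X_t]."""
    Xt = np.column_stack([np.ones(len(X)), X])
    H = logistic(Xt @ gamma.T)
    return logistic(X @ alpha + beta[0] + H @ beta[1:])

def bernoulli_quasi_loglik(y, p, eps=1e-12):
    """Mean quasi-log likelihood n^{-1} sum_t [y_t log f + (1 - y_t) log(1 - f)]."""
    p = np.clip(p, eps, 1 - eps)
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Evaluate at arbitrary illustrative parameter values
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))
y = (rng.uniform(size=100) < logistic(X[:, 0])).astype(float)
alpha, beta, gamma = rng.normal(size=2), rng.normal(size=4), rng.normal(size=(3, 3))
print(bernoulli_quasi_loglik(y, net_prob(X, alpha, beta, gamma)))
```

Maximizing this objective over (alpha, beta, gamma) with any generic optimizer gives the quasi-maximum likelihood estimator discussed above.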

If Y_t represents count data, then a Poisson quasi-maximum likelihood procedure is natural (e.g. Gourieroux, Monfort and Trognon, 1984a,b), where f is as in (I.2.1) with F chosen to ensure non-negativity (e.g. F(a) = exp(a)), so as to permit f(X_t, θ) to plausibly approximate g(X_t) = E(Y_t | X_t). If Y_t represents a survival time, then a Cox proportional hazards model (e.g. Amemiya, 1985, pp. 449-454) is a natural choice, with hazard rate of the form λ(t) f(X_t, θ).

From an econometric standpoint, then, ANN models can be used anywhere one would ordinarily use a linear (or transformed linear) specification, with estimation proceeding via appropriate quasi-maximum likelihood (or, alternatively, generalized method of moments) techniques. The now rather well-developed theory of estimation of misspecified models (White, 1982, 1992; Gallant and White, 1988a; Pötscher and Prucha, 1991a,b) applies immediately to provide interpretations and inferential procedures.


The natural instincts of econometricians are not the instincts of those concerned with artificial neural network learning, however. This is a double blessing, because it means not only that econometrics has much to offer those who study and apply artificial neural networks, but also that econometrics may benefit from novel techniques developed by the ANN community.

In considering how an artificially intelligent system must go about learning, ANN modelers from the outset viewed learning as a sequential process. Viewing learning as the process by which knowledge is acquired, it follows that knowledge accumulates as learning experiences occur, i.e. as new data are observed.

In ANN models, knowledge is embodied in the network connection strengths, θ. Given knowledge θ̂_t at time t, knowledge θ̂_{t+1} at time t + 1 is then

$$\hat{\theta}_{t+1} = \hat{\theta}_t + \Delta_t,$$

where Δ_t embodies incremental knowledge (learning). A successful learning procedure must therefore specify some appropriate way to form the update Δ_t from previous knowledge and current observables, Z_t = (Y_t, X_t')'. Thus we seek an appropriate function ψ_t for which

$$\Delta_t = \psi_t(Z_t, \hat{\theta}_t).$$

Current leading ANN learning methods can trace their history from seminal work of Rosenblatt (1957, 1958, 1961) and Widrow and Hoff (1960). Rosenblatt's learning network, the α-perceptron, was concerned with pattern classification and utilized threshold logic units. Widrow and Hoff's ADALINE networks do not require a TLU, as they are not restricted to being classifiers. As a consequence, the Widrow-Hoff (or "delta") learning law could be generalized in just the right way to permit application to nonlinear networks.

For their linear networks (with output for now given by f(x, θ) = x'θ) Widrow and Hoff proposed a version of recursive least squares (itself traceable back to Gauss, 1809 -- see Young, 1984),

$$\hat{\theta}_{t+1} = \hat{\theta}_t + \alpha\, X_t (Y_t - X_t'\hat{\theta}_t). \tag{I.2.2}$$


Here ε̂_t = Y_t − X_t'θ̂_t is the "network error" between computed output X_t'θ̂_t and the "target"

value Y_t. The scalar α > 0 is a "learning rate" to be adjusted by trial and error. This recursion was motivated explicitly by consideration of minimizing expected squared error loss.
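A minimal sketch of the Widrow-Hoff delta rule (I.2.2) in NumPy (the data and learning rate below are arbitrary illustrations):

```python
import numpy as np

def delta_rule(X, y, alpha=0.01, theta0=None):
    """Widrow-Hoff (delta) recursion, eq. (I.2.2):
    theta_{t+1} = theta_t + alpha * X_t * (y_t - X_t' theta_t)."""
    theta = np.zeros(X.shape[1]) if theta0 is None else np.array(theta0, dtype=float)
    for x_t, y_t in zip(X, y):
        error = y_t - x_t @ theta            # network error at time t
        theta = theta + alpha * x_t * error  # constant learning rate alpha
    return theta

# Usage: approximately recover a linear relation in a single pass over the data
rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)
print(delta_rule(X, y, alpha=0.05))
```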

For networks with nonlinear output f(x, θ) the direct generalization of the delta rule is

    " " " "Ot+1 = Ot + a Vf(Xt. ) (yt -f(Xt. . (1.2.3)

where ∇f(x, ·) is the gradient of f(x, ·) with respect to θ (a column vector). In the ANN literature, this recursion is called the "generalized delta rule" or the method of "backpropagation" (a term invented for a related procedure by Rosenblatt, 1961). Its discovery is attributable to many (Werbos, 1974; Parker, 1982, 1985; Le Cun, 1985), but the influential work of Rumelhart, Hinton and Williams (1986) is perhaps most responsible for its widespread adoption.

This apparently straightforward generalization of (I.2.2) in fact caused a revolution in the ANN field, spurring the explosive growth in ANN modeling responsible for its vigor today and the appearance of an article such as this in a journal devoted to econometrics. The reasons for this revolution are essentially two. First, until its discovery, there were no methods known to ANN modelers for finding good weights for connections into the hidden units. The focus on threshold logic units in multilayer networks in the 1950's and 1960's led researchers away from gradient methods, as the derivative of a TLU is zero almost everywhere, and does not obviously lend itself to gradient methods. This is why the introduction of sigmoid activation functions by Cowan (1967) amounted to such a significant breakthrough -- straightforward gradient methods become possible with such activation functions. Even so, it took over a decade to sink into the collective consciousness of the ANN community that a solution to a problem long considered intractable (even impossible, viz. Minsky and Papert, 1969) was now at hand. The second reason is that once feasible methods for training hidden layer networks were available, they were applied to a vast range of problems with some startling successes. That this should be so is all the more impressive given the considerable difficulties in obtaining convergence via (I.2.3). For


a period, ANN models coupled with the method of backpropagation came to be viewed as magic, with considerable accompanying hype and extravagant claims.

In 1987 one of us (White, 1987a) pointed out that (I.2.3) is in fact an application of the method of stochastic approximation (Robbins and Monro, 1951; Blum, 1954) to the nonlinear least squares problem (as in Albert and Gardner, 1967). The least squares stochastic approximation recursions are in fact a little more general, having the form

    " " " "9t+l = 9 t + at Vf(Xt, 9 t) (yt -f(Xt, 9 t)), t= 1, 2, 00 (!.2.4)The difference is that here the learning rate at is indexed by t, whereas n (1.2.3) t is a constant.

This is quite an important difference. With a constant learning rate, the recursion (I.2.3) can converge only under extremely stringent conditions (there must exist θ_0 such that Y = f(X, θ_0) almost surely, where Z_t has the distribution of Z = (Y, X')', t = 1, 2, ...). When this condition fails, the recursion of (I.2.3) generally converges to a Brownian motion (see Kushner and Huang, 1981; Hornik and Kuan, 1990), not an appealing behavior in this context. However, whenever a_t depends on t appropriately (e.g. a_t > 0, Σ_{t=1}^∞ a_t = ∞, Σ_{t=1}^∞ a_t² < ∞, for which it suffices that a_t ∝ t^{-κ}, 1/2 < κ ≤ 1), standard results from the theory of stochastic approximation can be applied (e.g., White, 1989a) to establish the almost sure convergence of θ̂_t in (I.2.4) to θ*, a local solution of the least squares problem

$$\min_{\theta \in \Theta} \; E([Y - f(X, \theta)]^2).$$

Repeated initialization of the recursion (I.2.4) from different starting values θ̂_0 (e.g., following the parameter space partitioning strategy of Morris and Wong, 1991) can lead to rather good local solutions.
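The following sketch (ours, in NumPy) implements the stochastic approximation recursion (I.2.4) for the single hidden layer network (I.1.2) with logistic G, using a learning rate a_t ∝ t^{-κ} and the analytic gradient; it is a single-pass estimator in the sense discussed below.

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def sa_learn(X, y, q, kappa=0.75, a0=0.5, seed=0):
    """Stochastic approximation / backpropagation recursion, eq. (I.2.4),
    for the single hidden layer network (I.1.2) with logistic G.
    Learning rate a_t = a0 * t^{-kappa}, with 1/2 < kappa <= 1."""
    n, r = X.shape
    rng = np.random.default_rng(seed)
    beta = 0.1 * rng.normal(size=q + 1)        # beta_0, ..., beta_q
    gamma = 0.1 * rng.normal(size=(q, r + 1))  # rows gamma_j act on (1, x)
    for t, (x, y_t) in enumerate(zip(X, y), start=1):
        x_tilde = np.concatenate(([1.0], x))
        h = logistic(gamma @ x_tilde)          # hidden activations
        err = y_t - (beta[0] + h @ beta[1:])   # Y_t - f(X_t, theta_t)
        a_t = a0 * t ** (-kappa)
        # gradient of f with respect to gamma (evaluated at current beta), times the error
        grad_gamma = err * (beta[1:] * h * (1 - h))[:, None] * x_tilde[None, :]
        beta[0] += a_t * err                   # gradient w.r.t. beta_0 is 1
        beta[1:] += a_t * err * h              # gradient w.r.t. beta_j is G(x'gamma_j)
        gamma += a_t * grad_gamma
    return beta, gamma

# Single pass through simulated data
rng = np.random.default_rng(6)
X = rng.normal(size=(5000, 2))
y = np.tanh(X[:, 0]) - 0.5 * X[:, 1] + 0.1 * rng.normal(size=5000)
beta_hat, gamma_hat = sa_learn(X, y, q=3)
print(beta_hat)
```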

This fact is significant. The recursion (I.2.4) provides a computationally very simple algorithm for getting a consistent estimator for a locally mean square optimal parameter vector in a nonlinear model with just a single pass through the data. Multiple passes through the data (which can be executed in parallel) permit exploration for a global optimum. Thus, in addition to


where the sieve Θ_n(G) is given by Θ_n(G) = T(G, q_n, Δ_n),

$$T(G, q, \Delta) = \Big\{\theta \in \Theta \;\Big|\; \theta(\cdot) = f^q(\cdot, \delta^q),\; f^q(x, \delta^q) = \beta_0 + \sum_{j=1}^{q} G(x'\gamma_j)\,\beta_j,\; x \in \mathbb{R}^r,\; \sum_{j=0}^{q} |\beta_j| \le \Delta,\; \sum_{j=1}^{q}\sum_{i=0}^{r} |\gamma_{ji}| \le q\Delta\Big\},$$

where G is a given hidden layer activation function, {q_n ∈ ℕ} and {Δ_n ∈ ℝ_+} are sequences tending to infinity with n, Θ is the space of functions square integrable with respect to the distribution of X_t, and now δ^q = (β_0, β_1, ..., β_q, γ'_1, γ'_2, ..., γ'_q)'. Given this setup, the estimation problem (I.2.7) is equivalent to the constrained non-

linear least squares problem

    nmill n-1 L [f, -f1'(X" ~')f , n = 1,2, """,8"' e D, '=1 (1.2.8)

where D_n = {δ^{q_n} : Σ_{j=0}^{q_n} |β_j| ≤ Δ_n, Σ_{j=1}^{q_n} Σ_{i=0}^{r} |γ_ji| ≤ q_n Δ_n}. The idea is that for a sample of size n, one performs a constrained nonlinear least squares estimation on a model with q_n hidden units, satisfying certain summability restrictions on the network weights. By letting the number of hidden units q_n increase gradually with n, and by gradually relaxing the weight constraints, the network model becomes increasingly flexible as n increases. Proper control of q_n and Δ_n eliminates overfitting asymptotically, allowing consistent estimation of θ_0, θ_0(X_t) = E(Y_t | X_t), to result, i.e. ‖θ̂_n − θ_0‖_2 → 0 in probability. White (1990a) shows that for bounded i.i.d. {Z_t}, consistency is achieved with Δ_n, q_n → ∞ as n → ∞, Δ_n = o(n^{1/4}) and q_n Δ_n² log(q_n Δ_n) = o(n). For bounded mixing processes of a specific size, Δ_n = o(n^{1/4}) and q_n Δ_n² log(q_n Δ_n) = o(n^{1/2}) suffice for consistency.

In practice, determining appropriate network complexity is precisely analogous to determining how many terms to include in a nonparametric series regression. As in that case, either cross-validation or information-theoretic methods can be used to determine the number of hidden units optimal for a given sample. Information-theoretic methods in which one optimizes a complexity-penalized quasi-log likelihood (closely related to the Schwartz Information Criterion,


Sawa, 1978) have been shown to have desirable properties by Barron (1990). Extension of analysis by Li (1987) as applied by Andrews (1991a) to cross-validated selection of the number of terms in a standard series regression may deliver appropriate optimality results for cross-validated selection of network complexity, and is an interesting area for further research.
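As an illustration of information-theoretic complexity selection, the following sketch (ours; the particular criterion used here, n log(MSE) + k log n, is one common Schwarz-type variant and is an assumption rather than a formula taken from the text) compares fitted networks with different numbers of hidden units.

```python
import numpy as np

def sic(n, mse, n_params):
    """A Schwarz-type information criterion: n * log(mse) + n_params * log(n).
    Smaller is better; the penalty grows with the number of free parameters."""
    return n * np.log(mse) + n_params * np.log(n)

def select_hidden_units(fit_results, n, r):
    """fit_results: dict mapping q -> in-sample mean squared error of the
    fitted augmented network with q hidden units (q = 0 is the linear model).
    The augmented network (I.1.3) has r + 1 + q + q*(r + 1) free parameters."""
    scores = {q: sic(n, mse, r + 1 + q + q * (r + 1))
              for q, mse in fit_results.items()}
    return min(scores, key=scores.get), scores

# Hypothetical in-sample MSEs for q = 0, ..., 4 hidden units (illustration only)
fit_results = {0: 0.52, 1: 0.21, 2: 0.09, 3: 0.085, 4: 0.084}
best_q, scores = select_hidden_units(fit_results, n=200, r=2)
print(best_q, scores)
```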

Also an open question is that of the asymptotic distribution of nonparametric neural network estimators. Results of Andrews (1991b) for series estimators may also be extendable to treat nonparametric estimators based on ANN models. Additional interesting insights should arise from this analysis.

I.3. SPECIFICATION TESTING AND INFERENCE

Consider the nonlinear regression model based on (I.1.3) with F the identity function,

$$Y_t = X_t'\alpha + \beta_0 + \sum_{j=1}^{q} G(X_t'\gamma_j)\,\beta_j + \varepsilon_t.$$

The standard linear model occurs as the special case in which β_1 = β_2 = ... = β_q = 0. Thus, correct specification of the linear model can be tested as

$$H_0: \beta = 0 \qquad \text{vs.} \qquad H_a: \beta \ne 0,$$

where β = (β_1, ..., β_q)'. A moment's reflection reveals an interesting obstacle to straightforward application of the usual tools of statistical inference: the "nuisance parameters" γ_j, j = 1, ..., q, are not identified under the null hypothesis, but are identified only under the alternative. Fortunately, there is now available a variety of tools that permits testing of H_0 in this context.

The simplest, most naive procedure is to avoid treating the γ_j as free parameters, instead choosing them a priori in some fashion (e.g., drawing them at random from some appropriate distribution) and then proceeding to test H_0 using standard methods, e.g. via Lagrange multiplier or Wald statistics, conditional on the values selected for the γ_j. A procedure of


precisely this sort was proposed by White (1989b), and the properties of the resulting "neural network test for neglected nonlinearity" were compared to a number of other recognized procedures for testing linearity by Lee, White and Granger (1991). (See White, 1989b, and Lee, White and Granger, 1991, for implementation details.) The network test was found to perform well in comparison with other procedures. Though no one test dominated the others considered, the network test had good size, was often most powerful, and when not most powerful, was often one of the more powerful procedures. It thus appears to be a useful addition to the modern arsenal of specification testing procedures.
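The following sketch (ours; a simplified LM-type variant of the random-γ idea rather than an exact transcription of the procedure in White, 1989b) regresses OLS residuals on randomly generated hidden unit activations and uses nR² as an asymptotically chi-squared statistic.

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def neglected_nonlinearity_stat(X, y, q=2, seed=0):
    """LM-type statistic for neglected nonlinearity with randomly drawn gammas.

    Step 1: OLS of y on (1, X); keep the residuals.
    Step 2: regress the residuals on (1, X) and q random hidden unit
            activations G(X_tilde @ gamma_j); the statistic is n * R^2,
            asymptotically chi-squared(q) under linearity (conditional on gamma)."""
    n, r = X.shape
    Xt = np.column_stack([np.ones(n), X])
    resid = y - Xt @ np.linalg.lstsq(Xt, y, rcond=None)[0]
    rng = np.random.default_rng(seed)
    gamma = rng.normal(size=(q, r + 1))            # random directions, held fixed
    W = np.column_stack([Xt, logistic(Xt @ gamma.T)])
    fitted = W @ np.linalg.lstsq(W, resid, rcond=None)[0]
    r2 = 1.0 - np.sum((resid - fitted) ** 2) / np.sum((resid - resid.mean()) ** 2)
    return n * r2                                   # compare with a chi2(q) critical value

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 2))
y = X[:, 0] + X[:, 1] ** 2 + rng.normal(size=300)   # nonlinear in the second regressor
print(neglected_nonlinearity_stat(X, y))
```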

A more sophisticated procedure is to choose γ values that optimize the direction in which nonlinearity is sought. Bierens (1990) proposes a specification test of precisely this sort. First, the model is estimated under the null hypothesis (linearity), yielding residuals ε̂_t = Y_t − X̃_t'θ̂_n, where θ̂_n is an estimator of (β_0, α')' and X̃_t = (1, X_t')'. For given γ one can show under general conditions that

$$\sqrt{n}\,\hat{M}_n(\gamma) \;\overset{d}{\longrightarrow}\; N(0, \sigma^2(\gamma)), \qquad \hat{M}_n(\gamma) = n^{-1}\sum_{t=1}^{n} \hat{\varepsilon}_t\, G(\tilde{X}_t'\gamma),$$

under the linearity hypothesis, where with {Z_t} i.i.d. we have

$$\sigma^2(\gamma) = \operatorname{var}([G(\tilde{X}_t'\gamma) - b^*(\gamma)' A^{*-1} \tilde{X}_t]\,\varepsilon_t),$$
$$b^*(\gamma) = E(G(\tilde{X}_t'\gamma)\,\tilde{X}_t), \qquad A^* = E(\tilde{X}_t \tilde{X}_t'),$$

where θ̂_n →^p θ*. Bierens (1990) specifies G(·) = exp(·), but as we discuss below, this is not the only possible choice.

    It follows that


$$\hat{W}(\gamma) = n\,\hat{M}_n(\gamma)^2 / \hat{\sigma}_n^2(\gamma) \;\overset{d}{\longrightarrow}\; \chi_1^2$$

under correct specification of the linear model, where σ̂_n²(γ) is a consistent estimator of σ²(γ). Under the alternative, Ŵ(γ)/n → η(γ) > 0 a.s. for essentially every choice of γ, as Bierens (1990,

Theorem 2) shows.

To avoid picking γ at random, Bierens proposes maximizing Ŵ(γ) with respect to

γ ∈ Γ (an appropriately specified compact set), yielding Ŵ(γ̂), say. As Bierens notes, this maximization renders the χ²₁ distribution inapplicable under H_0. However, a χ²₁ statistic can be constructed by the following device: choose c > 0, λ ∈ (0, 1) and γ_0 independently of the sample and put

$$\tilde{\gamma} = \gamma_0 \;\; \text{if } \hat{W}(\hat{\gamma}) - \hat{W}(\gamma_0) \le c\,n^{\lambda}, \qquad \tilde{\gamma} = \hat{\gamma} \;\; \text{if } \hat{W}(\hat{\gamma}) - \hat{W}(\gamma_0) > c\,n^{\lambda}.$$

Bierens (1990, Theorem 4) shows that Ŵ(γ̃) →^d χ²₁ under correct specification while Ŵ(γ̃)/n → sup_{γ ∈ Γ} η(γ) > 0 a.s. under the alternative. Bierens' result holds essentially regardless of how γ_0 is chosen.

In recent related work, Stinchcombe and White (1991) show that Bierens' conclusions are preserved if G is chosen to belong to a certain wide class of functions including G(·) = exp(·). Other members of this class are G(a) = 1/(1 + exp(−a)) and G(a) = tanh(a).

The choice of c, λ, and γ_0 in Bierens' construction is problematic. Two researchers using the same data and models but using different values for c, λ and γ_0 can arrive at differing conclusions in finite samples regarding correctness of a given specification. One way to avoid such difficulties is to confront the problem head-on and determine the distribution of Ŵ(γ̂). Some useful inequalities are given by Davies (1977, 1987), but these are not terribly helpful when r ≥ 3 (recall r is the number of explanatory variables). Recently, Hansen (1991) has proposed a computationally intensive procedure that permits computation of an asymptotic distribution for Ŵ(γ̂) under H_0.


An interesting area for further research is a comparison of the relative performance and computational cost of the procedures discussed here: the naive procedure of picking the γ_j's at random; Bierens' Ŵ(γ̃) procedure; and use of Hansen's (1991) asymptotic distribution for Ŵ(γ̂).

The specification testing procedures just described extend to testing correctness of

nonlinear models, as well as testing the specification of likelihood or method of moments-based models. For testing correct specification of a nonlinear model, say Y_t = h(X_t, α) (which for convenience includes an intercept), one can test H_0: β = 0 vs. H_a: β ≠ 0 in the augmented model

    q -v, = h(X" a) 4- ~ G(X','Yj)f3j 4- I;,j=l

    (1.3.2)

If α̂_n is the nonlinear least squares estimator under the null (with α̂_n →^p α* under H_a; see White, 1981), then with ε̂_t = Y_t − h(X_t, α̂_n) we have

$$\sqrt{n}\,\hat{M}_n(\gamma) \;\overset{d}{\longrightarrow}\; N(0, \sigma^2(\gamma)),$$

    where now

    (J"2Cr) = var([GCXtr) -b*Cr) A *-1 VahCXt. a*)]e;)

    b*(r) = E(G(X'tr) V'ah(Xt, a*))

    *) V'ahCXt. a*)). = E(Vah(Xt. a" " 2 d "

We again have Ŵ(γ) = n M̂_n(γ)²/σ̂_n²(γ) →^d χ²₁ under H_0, while Ŵ(γ)/n → η(γ) > 0 a.s. under H_a

(misspecification) for essentially all γ. A consistent specification test is therefore available. Optimizing Ŵ(γ) over choice of γ leads to considerations regarding asymptotic testing identical to those arising in the linear case.

For testing correct specification of a likelihood-based model, a consistent m-test (Newey, 1985; Tauchen, 1985; White, 1987b, 1992) can be performed. The starting point is the fact that if l(Z_t, θ) is a correctly specified conditional log-likelihood for Y_t given X_t (i.e. for some


θ_0, exp l(Z_t, θ_0) is the conditional density of Y_t given X_t), then

$$E(s(Z_t, \theta_0) \mid X_t) = 0,$$

where s is the k × 1 log-likelihood score function, s(Z_t, θ) = ∇_θ l(Z_t, θ). It follows from the law of iterated expectations that with correct specification

$$E(s(Z_t, \theta_0)\, G(X_t'\gamma)) = 0$$

for all γ ∈ Γ. Under standard conditions (e.g. White, 1992, Ch. 9) it follows that with θ̂_n the (quasi-)maximum likelihood estimator consistent under misspecification for θ*, we have

    where

$$\Sigma(\gamma) = \operatorname{var}([(G(X_t'\gamma) \otimes I_k) - b^*(\gamma) A^{*-1}]\, s_t^*),$$
$$b^*(\gamma) = E([G(X_t'\gamma) \otimes I_k]\, \nabla_\theta' s_t^*), \qquad A^* = E(\nabla_\theta' s_t^*),$$

where s_t^* = s(Z_t, θ*) and ∇_θ' s_t^* = ∇_θ' s(Z_t, θ*).

Consequently, Ŵ(γ) = n M̂_n(γ)'[Σ̂_n(γ)]^{-1} M̂_n(γ) →^d χ²_k under correct specification. Argument analogous to that of Bierens (1990, Theorem 4) delivers Ŵ(γ)/n → η(γ) > 0 a.s. under

misspecification for essentially all γ, given an appropriate choice of G, e.g. G(a) = exp(a) as in Bierens (1990), or G(a) = 1/(1 + exp(−a)), G(a) = tanh(a), as in Stinchcombe and White (1991).

    "A consistent m-test is thus available. Optimizing W(r) over choice ofr leads to considerationsregarding asymptotic testing identical to those arising in the linear model.

    Because ANN models must be recognized from the outset as misspecified, one


cannot test hypotheses about estimated parameters of the ANN model in the same way that one would test hypotheses about correctly specified nonlinear models (e.g. as in Gallant, 1973, 1975). Nevertheless, one can test interesting and useful hypotheses within the context of inference for misspecified models (White, 1982, 1992; Gallant and White, 1988a). In this context, two issues arise: the first concerns the interpretation of the hypothesis itself; and the second concerns construction of an appropriate test statistic. Both of these issues can be conveniently illustrated in the context of nonlinear regression, as in White (1981).

The nonlinear least squares estimator θ̂_n solves

$$\min_{\theta \in \Theta} \; n^{-1} \sum_{t=1}^{n} [Y_t - f(X_t, \theta)]^2,$$

where, for concreteness, we take f(X_t, θ) to be of the form (I.1.3) with F the identity function. White (1981) provides conditions ensuring that θ̂_n → θ* a.s., where θ* is the solution to

$$\min_{\theta \in \Theta} \; E([E(Y_t \mid X_t) - f(X_t, \theta)]^2).$$

Thus θ* is a parameter vector of a minimum mean squared error approximation f(X_t, θ*) to E(Y_t | X_t). One can therefore test hypotheses about the parameters of the best approximation.

A leading case is that in which a specified explanatory variable (say the rth variable, X_tr) is hypothesized to afford no improvement in predicting Y_t, within the class of approximations permitted by f. This hypothesis and its alternative are specified as

$$H_0: S_r\, \theta^* = 0 \qquad \text{vs.} \qquad H_a: S_r\, \theta^* \ne 0,$$

where S_r is a (q + 1) × k selection matrix that picks out the appropriate elements of θ* (i.e., α*_r, γ*_{1r}, ..., γ*_{qr}). Testing H_0 against H_a in the context of a misspecified model can be conveniently

done using either Lagrange multiplier (LM) or Wald-type test statistics, but not likelihood ratio statistics, for reasons described in Foutz and Srivastava (1977), White (1982, 1992) and Gallant


and White (1988a). The likelihood ratio statistic requires for its convenient use as a χ²_{q+1} statistic the validity of the information matrix equality (White, 1982, 1992), which fails under misspecification. The classical LM or Wald statistics also require the validity of the information matrix equality, but can be modified by replacing classical estimators of the asymptotic covariance matrix of θ̂_n with specification robust estimators (White, 1981, 1982, 1992; Gallant and White, 1988a). Thus, a test of H_0 against H_a can be conducted using the Wald statistic

    .." ..Wn = n O 'n S'r(Sr Cn S'r)-l Sr O n ,

where

$$\hat{C}_n = \hat{A}_n^{-1} \hat{B}_n \hat{A}_n^{-1}.$$

    "The covariance estimator Cn given here is consistent when {4} is i.i.d., but modificationspreserving consistency are available in other contexts. Under the hypothesis that Xtr is irrelevant

    " d(and with consistent Cn), one can show that Wn ~ X~+l' and that the test is consistent for thealternative. Similar results hold for the LM test statistic. Details can be found in Gallant andWhit~ (19&&a,CQ 7) and White (19&2; 1992, Ch. 8).

I.4. CHAOS-MODELING EXAMPLES

In this section we illustrate methods for estimating ANN models by fitting single hidden layer feedforward networks to time series generated by three deterministic chaos processes. The generating equations for these time series are:


    (a) The logistic map (Thompson and Stewart, 1986, p. 162):

$$Y_{t+1} = 3.8\, Y_t (1 - Y_t)$$

    (b) The circle map (Thompson and Stewart, 1986, pp. 164, 285-6):

$$Y_{t+1} = Y_t + (2.2/\pi)\, \sin(2\pi\, Y_t + \cdots)$$

(c) The Bier-Bountis map (Thompson and Stewart, 1986, p. 171):

$$Y_{t+1} = -2 + 28.5\, Y_t / (1 + Y_t^2)$$

Chaos (a) is by now a familiar example to economists and econometricians. Chaos (b) and chaos (c) are less familiar, but these three examples, representing polynomial, sinusoidal and rational polynomial functions, provide a modest range of different functions with which to demonstrate ANN capabilities. Time-series plots of the three series are given in Figures 6, 7 and 8.
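For reference, a short sketch (ours) generating two of these series, the logistic map (a) and the Bier-Bountis map (c), whose recursions are fully specified above:

```python
import numpy as np

def iterate_map(step, y0, n):
    """Iterate a one-dimensional map y_{t+1} = step(y_t) for n periods."""
    y = np.empty(n)
    y[0] = y0
    for t in range(n - 1):
        y[t + 1] = step(y[t])
    return y

logistic_map = lambda y: 3.8 * y * (1.0 - y)              # chaos (a)
bier_bountis = lambda y: -2.0 + 28.5 * y / (1.0 + y**2)   # chaos (c)

series_a = iterate_map(logistic_map, y0=0.5, n=200)
series_c = iterate_map(bier_bountis, y0=0.1, n=200)
print(series_a[:5], series_c[:5])
```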

Because we shall not be adding observational error to the chaotic series, our examples will provide direct insight into the approximation abilities of single hidden layer feedforward networks.

    In each case, we fit ANN models of the form

$$f(X_t, \theta) = X_t'\alpha + \beta_0 + \sum_{j=1}^{q} G(X_t'\gamma_j)\,\beta_j \tag{I.4.1}$$

to the target chaos, Y_t, where G(a) = 1/(1 + exp(−a)), the logistic. Several models are examined in each instance. Specifically, the input X_t is a single lag of the target series Y_t, while the number of hidden units (q) varies from zero to eight. The best model is chosen from these alternatives using the Schwartz Information Criterion (SIC).

For each network configuration, we estimate model parameters by a version of the method of nonlinear least squares, i.e., we attempt to solve


Optimization proceeds in two stages. First, the parameter estimates α̂_n are obtained by ordinary least squares, with parameters β constrained to zero. (Note that α̂_n contains an intercept.) Then if q > 0, second stage parameter estimates β̂_n and γ̂_n are obtained in such a way as to exploit the structure of (I.4.1); the α̂_n estimates are not subsequently modified, forcing the hidden layer to extract any available structure from the least-squares residuals.

Inspecting (I.4.1), we see that for given γ_j's, ordinary least squares gives fully optimal estimates of β. Thus, we choose a large number of random values for the elements of γ_j, j = 1, ..., q, and compute the least squares estimates for β. This implements a form of global

random search of the parameter space. The best fitting values of β and γ are then used as starting values for local steepest descent with respect to β and γ. Within steepest descent, the step size is dynamically adjusted to increase when improvements to mean squared error occur, and otherwise to decrease until a mean squared error improvement is found. Convergence is judged to occur when (mse(k) − mse(k − 1))/(1 + mse(k − 1)) is sufficiently small, where mse(k) denotes sample mean squared error on the kth steepest descent iteration. Once a local minimum is reached, the procedure terminates. This algorithm has been found to be fast and reliable across a variety of applications investigated by the authors.
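A compressed sketch of the two-stage idea (ours; it implements only the OLS first stage and the random search over γ with β estimated by OLS on the residuals, omitting the steepest descent refinement described above):

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def two_stage_init(y_lag, y, q, n_draws=500, seed=0):
    """Stage 1: OLS of y on (1, y_lag) with beta = 0 gives alpha_hat.
    Stage 2: draw many random gammas, and for each compute the OLS estimate
    of beta from the stage-1 residuals; keep the best-fitting draw."""
    n = len(y)
    Xt = np.column_stack([np.ones(n), y_lag])               # (1, X_t)
    alpha_hat = np.linalg.lstsq(Xt, y, rcond=None)[0]
    resid = y - Xt @ alpha_hat
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_draws):
        gamma = rng.normal(scale=2.0, size=(q, 2))          # random hidden weights
        H = np.column_stack([np.ones(n), logistic(Xt @ gamma.T)])
        beta = np.linalg.lstsq(H, resid, rcond=None)[0]     # OLS is optimal given gamma
        mse = np.mean((resid - H @ beta) ** 2)
        if best is None or mse < best[0]:
            best = (mse, beta, gamma)
    return alpha_hat, best

# Fit to a logistic-map series, using a single lag as input
y_series = np.empty(300)
y_series[0] = 0.5
for t in range(299):
    y_series[t + 1] = 3.8 * y_series[t] * (1 - y_series[t])
alpha_hat, (mse, beta_hat, gamma_hat) = two_stage_init(y_series[:-1], y_series[1:], q=4)
print(mse)
```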

The results of least squares estimation of a linear model are given in Table 1. The simple linear model explains only 12% of the target variance for the circle map, while explaining 84% of the target variance for the Bier-Bountis map. The logistic map is intermediate at 36%.

Results for the single hidden layer feedforward network are given in Table 2. In each case the hidden layer network chooses to take as many hidden units as are offered (8), and with this number of hidden units, nearly perfect fits are obtained. Because the relationships studied here are noiseless, the SIC starts to limit the number of hidden units chosen essentially only when machine imprecision begins to corrupt the computations. This limit was not reached in these examples.

    Our examples show that single hidden layer feedforward networks do have


|ψ(z, θ)| ≤ b(θ) h_1(z) + h_2(z); and

there exist functions ρ_1: ℝ_+ → ℝ_+ and h_3: ℝ^s → ℝ_+ such that ρ_1(u) → 0 as u → 0, h_3 is measurable-𝔅^s, and for each (z, θ_1, θ_2) in ℝ^s × Θ × Θ

$$|\psi(z, \theta_1) - \psi(z, \theta_2)| \le \rho_1(|\theta_1 - \theta_2|)\, h_3(z),$$

where |·| denotes the Euclidean norm.

ASSUMPTION A.3: E|ψ(Z_t, θ)| < ∞ for each θ in Θ, and there exists a function ψ̄: Θ → ℝ^k continuous on Θ such that for each θ in Θ, ψ̄(θ) = lim_{t→∞} E ψ(Z_t, θ).

ASSUMPTION A.4: {a_t} is a sequence of positive real numbers such that a_t → 0 as t → ∞ and Σ_{t=0}^{n} a_t → ∞ as n → ∞.

    ASSUMPTION A.5:

(a) For each θ in Θ, Σ_{t=0}^{∞} a_t [ψ(Z_t, θ) − Eψ(Z_t, θ)] converges a.s.-P; and

(b) For j = 1, 2, 3, there exist bounded non-stochastic sequences {η_jt} such that Σ_{t=0}^{∞} a_t [h_j(Z_t) − η_jt] converges a.s.-P.

Assumption A.1 introduces the data generating process, and Assumption A.2 imposes some

suitable and relatively mild restrictions on the growth and smoothness properties of the measurement function ψ. Assumption A.3 is a mild asymptotic mean stationarity requirement. In Assumption A.4, the condition a_t → 0 ensures that the effect of error adjustment eventually vanishes; the condition Σ_{t=1}^{n} a_t → ∞ allows the adjustment to continue for an arbitrarily long time, so that the eventual convergence of (II.1.1) is always plausible.

Assumption A.5 imposes mild convergence conditions on the processes depending on Z_t.

Below we consider more primitive mixingale conditions that ensure the validity of this assumption.

Let π: ℝ^k → Θ be a measurable projection function (for θ ∈ Θ, π(θ) = θ). We then have that for all RM estimates θ̂_t, π(θ̂_t) ∈ Θ. In what follows, θ̂_t will also denote the projected


This result generalizes classical results (e.g., Blum, 1954) in several respects. First, Z_t is not required to enter the function ψ additively. Second, the learning rate a_t is not required to be square summable. Most importantly, general behavior for Z_t is allowed, provided that Assumption A.5 holds. As examples, we consider martingale difference sequences and moving average processes.

A general class of stochastic processes satisfying the convergence conditions of Assumption A.5 is the class of mixingales (McLeish, 1975). Let ‖·‖_p denote the L_p-norm, ‖X‖_p = (E|X|^p)^{1/p}. When ‖X‖_p < ∞ [...] φ- and α-mixing processes, finite and certain infinite order moving average processes, and sequences of near epoch dependent functions of infinite histories of mixing processes (discussed further in the next section). Mixingales thus constitute a rather broad class of dependent heterogeneous processes.

In our applications, we always assume that the relevant random variables are measurable-𝔽_t, so that the second mixingale condition holds automatically. This avoids anticipativity of the RM algorithm.
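Since the display of the recursion (II.1.1) is not reproduced in this excerpt, the following is only a generic sketch (ours) of a projected Robbins-Monro recursion of the kind discussed here, with a measurement function ψ, learning rates a_t, and a projection π onto a compact parameter set:

```python
import numpy as np

def projected_rm(psi, Z, a, theta0, radius=10.0):
    """Generic projected Robbins-Monro recursion (a sketch, not eq. (II.1.1) verbatim):
    theta_{t+1} = pi(theta_t + a_t * psi(Z_t, theta_t)),
    where pi projects onto the closed ball of the given radius."""
    theta = np.array(theta0, dtype=float)
    for a_t, z_t in zip(a, Z):
        theta = theta + a_t * psi(z_t, theta)
        norm = np.linalg.norm(theta)
        if norm > radius:                      # projection pi onto {|theta| <= radius}
            theta = theta * (radius / norm)
    return theta

# Example: least-squares measurement function psi(z, theta) = x (y - x'theta)
rng = np.random.default_rng(8)
T = 5000
X = rng.normal(size=(T, 2))
y = X @ np.array([1.0, -0.5]) + 0.1 * rng.normal(size=T)
Z = list(zip(y, X))
psi = lambda z, th: z[1] * (z[0] - z[1] @ th)
a = 1.0 / (np.arange(T) + 1.0)                 # a_t = (t + 1)^{-1}
print(projected_rm(psi, Z, a, theta0=np.zeros(2)))
```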


The following conditions permit application of McLeish's mixingale convergence theorem (McLeish, 1975, Corollary 1.8) to verify the conditions of Assumption A.5.

ASSUMPTION A.4': {a_t} is a sequence of positive real numbers such that Σ_{t=1}^{∞} a_t² < ∞ and Σ_{t=1}^{n} a_t → ∞ as n → ∞.

ASSUMPTION A.5':

(a) For each θ in Θ, sup_t ‖ψ(Z_t, θ)‖_2 ≤ Δ_θ < ∞ and {ψ(Z_t, θ) − Eψ(Z_t, θ), 𝔽_t} is a mixingale of size −1/2, where 𝔽_t = σ(Z_1, ..., Z_t);

(b) For j = 1, 2, 3, sup_t ‖h_j(Z_t)‖_2 ≤ Δ < ∞ and {h_j(Z_t) − Eh_j(Z_t), 𝔽_t} is a mixingale of size −1/2.

Assumption A.4' implies Assumption A.4. Note also that sup_t ‖ψ(Z_t, θ)‖_2 ≤ Δ_θ < ∞ is implied by Assumptions A.5'(b) and A.2(b.i), and that we may take η_jt = Eh_j(Z_t). We have the following result.

COROLLARY II.2.3: Given Assumptions A.1-A.3, A.4' and A.5', let {θ̂_t} be given by (II.1.1) with θ̂_0 chosen arbitrarily. Then the conclusions of Theorem II.2.1 hold. □

This provides general and fairly primitive conditions ensuring the convergence of θ̂_t. Only Assumption A.5' is a reasonable candidate for further specialization to achieve additional simplification. This is most conveniently done by placing conditions on h_1, h_2, h_3 and {Z_t} sufficient to ensure that the mixingale property is valid. We give examples of this in the next section.

The present result gives a very considerable generalization of a convergence result of White (1989a, Proposition 3.1). There Z_t is taken to be an i.i.d. uniformly bounded sequence. Corollary II.2.3 also generalizes results of Englund, Holst and Ruppert (1988), who assume that {Z_t} is a stationary mixing process and that ψ is a bounded function.

Asymptotic normality follows as a consequence of Theorem 2 of KH. As KH show, the fastest rate of convergence obtains with a_t = (t + 1)^{-1}; we adopt this rate for the rest of this section.


For given θ* ∈ ℝ^k we write

$$U_t = (t + 1)^{1/2}\, (\hat{\theta}_t - \theta^*).$$

Straightforward manipulations allow us to write

$$U_{t+1} = [I_k + (t + 1)^{-1} H_t]\, U_t + (t + 1)^{-1/2} q_t^*, \tag{II.2.4}$$

where

$$H_t = \nabla_\theta' \psi_t^* + [((t + 2)/(t + 1))^{1/2} - 1]\, \nabla_\theta' \psi_t^* + I_k/2 + O((t + 1)^{-1})\, I_k \tag{II.2.5}$$

and

$$q_t^* = ((t + 2)/(t + 1))^{1/2}\, \psi_t^*,$$

    with 111;= 1II(Zt, 8*), V 9 111;= V 9 1II(Zt, (1*). The piecewise constant interpolation of Ut on [0, 00)with interpolation intervals { at} is defined as

    UO('l") = ut, 'r E ['rt, 'rt+l)'

    and the leftward shifts are defined as

    't"?o.t('t") = UO('t"t+'l"),AThe asymptotic distribution of et is found by showing that Ut( converges to the solution of a

    stochastic differential equation (SDE) and then characterizing the weak limit of Ut ( .).We adopt tlle following conditions:

    ASSUMPTION B.l: Assumption A.1 holds and {Zr, t = 0, :!:1, :!:2, ...} is a stationary sequenceon (.0., IF , P).

    ASSUMPTION B.2:(a) Assumption A.2(a) holds; and

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    40/98

    40-(b) For each z E IRS, 1f/(z, .) is continuously differentiable such that there exist functions

    P2 : /R+ -? /R+ and h4 m.s-:; m.+ such that P2(U) -:; as u -:; 0, h4 is measurable-1Bs,and for some 0 interior to e and each (z, 0) in IRs x eo, eo an open neighborhood in e of 0

    v 9 1jf(Z,e) -V 9 1jf(Z, eo ) ~P2( 10-0 ) h4(z).ASSUMPTION B.3: There exists e* E int e such that e* = eo in Assumption B.2, E1f/; = 0,1Jf; E L6(P), V 91Jf; E L2(P), and the eigenvalues of H =H" + Ik /2 (with H" = E(V 91/';)) havenegative real parts.

    ASSUMPTION noS:(a) Let IFo = a(Zt) t :5 0) and suppose

    IFo)lk -* *' .IF )-O"jI12, O"j=E(1JIt1Jlt+j),

    (b) For some 17 4 E 1R,L;=o (t + 1)-1 [h4(Zt) -114] converges a.s. -P; and(c) L;=o (t+ 1)-1 [V91f/; -H*

    where h * = E I V 9 1f/; I.*Vo1flt -h* converge a.s. -P,

    The stationarity imposed in Assumption B.l is extremely convenient; without this, theanalysis becomes exceedingly complicated. Assumption B.2(b) imposes a Lipschitz conditionon v B 11/analogous to that of A.2(b.ii) for 11/AssumDtion B.3 imposes additional momp;nt rondi-tions and identifies (}* as a candidate asymptotically stable equilibrium. As we takeat = (t + 1)-1, there is no analog to Assumption A.4 or A.4'. Finally, Assumption B.5 imposessome further convergence conditions beyond those of A.5. Assumption B.5(a) restricts the localfluctuations (quadratic variation) induced by (t + l)-Y'q; in (II.2.4) to be compatible with those ofa Wiener process. Assumption B.5(b,c) (together with B.2) ensures hat the effects of the secondterm and the last term in (II.2.5) eventually vanish.

    The asymptotic normality result can be stated as follows.ATHEOREM 11.2.4: Suppose Assumptions B.I-B.3 and B.5 hold, and that et ~e* a.s.-P,

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    41/98

    -41 -A Awhere {(J t} is generated by (II.1.1) with (J o arbitrary, at = (t + 1)-1, and (J* is an isolated element

    Then:{ Ut} is tight in IRk,a)

    (b) I=~C:O a.

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    42/98

    -42

    (b) .V81f1th4(Z,)-E(h4(Zt))' IFt}, {V1f/; -H*, IFt} and -h*, IFf} are mix-ingales of size -1/2.We have the following result.

    ACOROLLARY II.2.5: Suppose Assumptions B.I-B.3 and B.5' hold and that et -1e* a.s.-P

    where {et} is generated by (1.1) with eO arbitrary, at = (t+ 1)-] and ()* is an isolated element ofEJ+ "men the conclusions of'lheorem ll.2.4 hold 0This considerably generalizes an analogous result of White (1989a, Proposition 4.1) from thei.i.d. unifornlIy bounded case to the stationary dependent case. EngIund, HoIst and Ruppert(1988) also give a result for i.i.d. observations.

    11.3. RECURSIVE NONLINEAR LEAST SQUARES ESTIMATION

    Suppose the nonlinear model I(Xt, 8) (I: IRr x D -:; IR, Xt a random r x 1 vector,8 E D C IRk) is to be used to forecast the random variahlp. y. It i.l: ~nmmon to seek 8* , a solu.tion to the problem

    mill E([Yt -f(Xt. 8)r).oe D

    and foml a forecast f(Xt, 8*). The solution 8* is also a solution to the problem' (8) = E(Vo f(Xt, 8) [yt -f(Xt, 8)]) = 0,

    where V b' s the gradient operator with respect to 8 yielding a k x 1 column vector. The simpleRM algorithm for this problem in nonlinear least squares regression is the algorithm (II.l.l) with

    1/f(Zt, () = v b' f(Xt, 8) [yt -f(Xt, 8)],

    where Zr = (yt, X;)' and e = 8. The updating equation isA A " "8 t+1 =8 t + at Vo!t[yt -It] , {II.3.1)

    .--where we have written it = f(Xt. 8,). Vo ft = Vo f(Xt. 8,). This is known as a "stochastic gra-went method." In tl1is section we consider the properties of tl1is algorithm and two useful

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    43/98

    43variants, the "quick" and the "modified" RM algorithms.

    A disadvantage of the simple RM algorithm is that it may converge very slowly (e.g.White, 1988). To improve the speed of convergence, a natural modification is to take an approxi-mate Gauss-Newton step at each stage. This yields the modified RM algorithm, also known asthe "stochastic Newton method" The algorithm is given by (1.1) with

    [0/1 (Zt,1fI2(Zt,

    1JI(Zt, 8) =

    1fIl (Zt. 8) = vec [Vo f(Xt. 8) Vo f(Xt. 8)' -G].

    '/f2(Zt. 0) = G 1 Vo f(Xt. 0) [rt -f(Xt. 0 )J

    where e = vec G)', 8')' The updating equations are then

    (II.3.2a)1 ---

    8t+1 =8t+atGt+1 Vo!t[ft-ft]. (II.3.2b)AWe take Go to be an arbitrary positive-definite symmetric matrix.

    "The difficulties of applying this algorithm are: (1) the inversion of Gt+l is computation allyAdemanding, and (2) the updating estimates Gt need not be positive-definite, pointing the algo-

    rithm in the wrong direction.The first problem can be solved by use of the rank one updating formula for the matrix

    inverse. Let Pt+l = Gt"ll and }.t = (1- at)/ at. The modified RM algorithm is algebraicallyequJvalem 10

    (II.3.3a)A A " " "Ot+l =Ot + at Pt+l Volt [ft-It], (II.3.3b)

    Aof. Ljung and Sodorstrom (1983, Ch. 2 & 3). Thc; c;hoil;c; Po = Ik j1) uflclll;ullvcnlem.

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    44/98

    44-ATo ensure that Gt is positive-definite, we may use the following modification of (1I.3.2a):" " ", "

    Gt+l = Gt + at [V 8ft V 8ft -Gt], (II.3.4a)

    (II.3.4b)"Gt+l =

    Awhere E is some predetermined positive number, and Mt+l (E) is chosen so that Gt+l -El ispositive-semidefinite. Some practical implementations of this can be found in Ljung and Soder-

    Astrom (1983, Ch. 6). A similar device can be applied to Pt. Implementation of this algorithm willA Abe understood to employ a projection device restricting 8 t to a compact set D and Gt to a com-

    .--pact convex set r such that the max-imum and minimum eigenvalues of Gt lie in a boundedstrictly positive interval.

    A simplification of the modified RM algorithm is to choose G to be a diagonal matrix. Inparticular, we take G = e Ik, where e is a positive scalar, so that matrix inversion is avoided.This yields the quick RM algorithm, the algorithm (II.l.l) with 1[/ = [1[/~ 1[/;]', where now

    V'1 (Zt, f)) = Vof(Xt' 8)' Vo f(Xt, 8) -e,

    1JI2(Zt, ()) = e-l Vo f(Xt, 8)[Yt -f(Xt, 8)],

    so that the updating equations becomeA A A, A Aet+l = et + at[Vc5ft Vc5f, -et] (II.3.5a)

    (II.3.5b)The scalar et can be easily modified to be positive in a manner analogous to (3.4); we also restrictet to be bounded. The quick RM algorithm is a compromise of the other two algorithms in that ittakes a negative gradient direction with a scaling factor utilizing some local curvature informa-tion. Consequently, the quick algorithm ought to converge more quickly than the simple algo-rithm but more slowly than the modified algorithm. When al = (t + 1)-1, the quick algorithm

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    45/98

    -45 -

    then reduces to the "quick and dirty" algorithm of Albert and Gardner (1967, Ch. 7).It is straightforward to impose conditions ensuri~g the validity of all assumptions required

    for the convergence results of the preceding section. Only the mix.ingale assumptions A.5' andB.5' require particular attention. We make use of a convenient and fairly general class of mix-ingales, near epoch dependent (NED) functions of mixing processes Billingsley, 1968, McLeish,1975, Gallant and White, 1988a).

    Let { Vt} be a stochastic process on (.0., IF , P) and define the mix.ing coefficients

    P(G F) -P(G)1> = SUp't" SUP{F E 1F' , G E IF;.m : P(F) > 0}

    P(G ('\ F) -P(G) P(F)m = SUp't" SUP{FeF-. GeJF;.m}where /F~ =a(V'C, ..., Vt). Whentpm ~ O or am ~ O as m ~ ~ we say that {Vt} istp-mixing (uni-fonn mixing) ora-mixing (strong mixing). Whenlf>m = O(mA.) for some)., < -a we say that {Vt}is tf>-rnixing of size -a, and sirnilarly for amo We use the following definition of near epoch

    1F~.!,':.').ependence. where we ~d()pt the n()t~ti()n P~:!:,'::() ~ E( .

    DEFINITION 11.3.1: Let {Zt} be a sequence of random variables belonging to L2(P), and let{ vt } be a stochastic process on (0, IF 1 P). Then {Zt} is near epoch dependent (NED) on { Vt} of

    Dize -a ifv m = SUpt II Zt -E~~:::(Zt) 112 s of size -a.

    The following three results make it straightforward to impose conditions sufficing forAssumptions A.5' and B.5'. The first is obtained by following the argument of Theorem 3.1 ofMcLeish (1975). The second simplifies a result of Andrews (1989). The third allows simpletreatment of products of NED sequences.

    PROPOSITION 11.3.2: Let {Zt E Lp(P)}, p ?2 be NED on {Vt} of size -a, where {Vt} is amixing sequence with If>m of size -ap /(P -1) or am of size -2ap / (p -2), p > 2.

    0Zr -E(Zr) } is a mixingale of size -a.PROPOSmONII.3.3: Let {Zr} satisfy the conditions ofProposition II.3.2. Let 9 : IRs

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    46/98

    -46-satisfy a Lipschitz condition, g(Zl)-g(ZV I ~L ,L < oo,Zl,Z2, E lRs. Thenl-Z2{g(Zt) E Lr(P) } is NED on { Vt} of size -a. If { Vt} satisfies the conditions of Proposition II.3.2,

    Dhen {g(Zt)-E(g(ZJ)} is amixingale of size -a.

    PROPOSITION 11.3.4: Let {Ut} and {Wt} be two sequencesNED on {Vt} ofsize-a.(a) If SUpt wt $ il < 00 and SUpt II u t 114 $ il < 00, then SUpt II Ut wt 114 $ il2 and { Ut Wt} is

    NhU on t Vt} of size -a /2.(b) If SUpt I w t 118 ~ < 00 and SUpt I u t 118 ~ < 00 then SUpt II u t w t 114 ~ 2 and { U W } is

    NED on { Vt} of size -a /2.(c) If SUpt I u t 118 L\ < 00 and { Vt } satisfies tile conditions of Proposition 3.2, tilen tilere

    D'Qxist KX) and a sequence of real numbers {bt} such that supj~oIIE(UtUt+jE(Ut Ut+j)l!2 ::; Kbt and bt is of size -all. 0Our subsequent results will make use of Proposition ll.3.4(a), requiring SUpt II ft 114$: !!. and abound on tho olomonts of Xto Part (b ) i11ustra.tcs usc of thc Ca.uchy-Schwart.Z; incqu0.1ity to cclaxthe boundedness condition; the price for this is a corresponding strengthening of moment condi-tions on ut (corresponding to yt). Here we sha11 dopt boundedness conditions on Xt to minim-ize moment conditions placed on yt and facilitate verification of the Lipschitz condition of Pro-position II.3.3. Part (c) permits verification of Assumption B.5' (a.ii).

    We impose the following conditions.

    ASSUMPTION C.l: Assumption A.l holds, and {Zt} is NED on {Vt} of size -1, whereZr=(Yt,X;)' with Xt bounded and suPtIlYtIlp~L1

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    47/98

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    48/98

    48 -

    methods thus coincide, so that the RM estimators tend to the same imit(s) as the nonlinear leastsquares estimator (cf. Ljung and Soderstrom, 1983).

    Corollary 11.3.5 s more general than the i.i.d. case treated by White (1989a) and the exam-pIes given in KC (Ch. 2), as we allow the data to be moderately dependent and heterogeneous.This result differs from those of Metivier and Priouret (1984) in that we require neither "condi-tional independence" nor stationarity .

    Corollary II.3.5 also generalizes a result of Ruppert (1983). Ruppert assumes hat for someO yt = f(Xt, 8*) + Et and that (Xt, EJ is strong mixing of size -p / (p -2), a condition that mayfail when Xt contains lagged ft, because ft need not be mixing when it is generated in thismanner, even when Et and other elements of Xt are mix.ing. Indeed, this fact partially motivatesour usage of near epoch dependence. Also, we do not require that Ye s generated n the mannerassumed by Ruppert (Le., we may be estimating a "rnisspecified" model). Compared to the resultof Ljung and Soderstrom (1983), we allow more dependence n the data, as the data need not bege:ne:r~te:d y ~ line'Jr filter.

    The modified RM algorithm can be identified with the extended Kalman filter for the non-linear signal model

    yt = f(Xt. 8t) + Et

    8t = 80 for all t." "The Kalman gain is at Pt+l V 6 t. Corollary 11.3.5 hus provides conditions more general than

    previously available ensuring consistency of the filter. In particular, the model can bemisspecified and the data can be NED on some underlying mixing sequence.

    Because the quick RM algorithm includes Albert and Gardner's quick and dirty algorithm,Corollary 11.3.5 directly generalizes their consistency result to the case of dependent observa-tiODS.

    To obtain asymptotic normality results for the case of nonlinear regression, we impose thefollowing conditions.

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    49/98

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    50/98

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    51/98

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    52/98

    -52-

    For this we impose appropriate conditions. In particular, we adopt Assumption C.l. Theassumption of uniformly bounded XI causes no loss of generality in the present context. This is aconsequence of the fact that E(Yt I Xt) = E(Yt Xt) where Xti = ~(Xti)' i = 1, ..., r and~ IR ~ [0, 1] is a strictly increasing continuous function. If Xt is not unifonnly bounded thenit is, and we seek an approximation to g(iJ = E(Yt I it). We revert to our original notation inwhat follows, with the implicit understanding that Xt has been transformed so that AssumptionC.l holds. Note, however, that yt is not assumedbounded, providing the desired generality.

    ASSUMPTION E.l: I: ]R' x D ~ IR is given by (4.1) where D = A x B x r, with A, B and rcompact subsets of IRr, IRq+l and IRq(r+l) respectively, and with G: IR ~ IR a boundedfunction continuously differentiable of order 3

    The conditions on G are readily verified for the logistic c.d.f. and hyperbolic tangent "squashers"commonly used in neural network applications.

    ACOROLLARY 11.4.1: Given Assumptions C.l, E.l, C.3 and A.4', let {()t} be given by (II.3.1),A(II.3.2) or (II.3.5) (the simple, modified and quick algorithms, respectively) with (} o chosen arbi-

    trarily. Then the conclusions of Theorem II.2.1 hold. 0Thus the method ofback-propagation and its generalizations converge to a parameter vector giv-ing a locally mcan 3quarc optimal approximation to thc (;ond1Llonal cAllcl,;laLlull [Ulll,;l!UllE(Yt Xt) under general conditions on the stochastic process {2, }. This result considerably gen-eralizes Theorem 3.2 of White (1989a),

    For the asymptotic distribution results, we impose the following condition.

    ASSUMPTION F.l: Assumption E.l holds with G continuously differentiable of order 4.ACOROLLARY 11.4.2: Suppose Assumptions D.l, D.2 and F.l hold and that et ~e.a.s.-P

    where {ot} is generated by (II.3.1), (II.3.2) or (II.3.5) with 00 chosen arbitrarily, at = (t + 1)-1,and 0. is an isolated element of e. .Then the conclusions of Theorem 11.2.4hold.

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    53/98

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    54/98

    -54 -

    considered here. For many choices of 1/' the analysis parallels that for the least squares caserather closely. These results are within relatively easy reach for estimation procedures.

    For neural network models, it is desirable to relax the assumption that q is fixed. Lettingq -7 00 as the available sample becomes arbitrarily large permits use of neural network modelsfor purposes of non-parametric estimation. Off-line non-parametric estimation methods for thecase of mixing processes are treated by White (1990a) using results for the method of sieves(Grenander, 1981, White and Wooldridge, 1991). On-line non-parametric estimation methodsappear possible, but will require convergence to a global optimum of the underlying least squaresproblem, not just the local optimum that the present methods deliver. Results of Kushner (1987)for the method of simulated annealing provide hope that convergence to the global optimum isachievable for the case of dependent observations with appropriate modifications to the RM pro-cedure.

    Finally, it is of interest to consider RM algorithms for neural network models that general-ize the feedforward networks treated here by a11owing certain intema1 feedbacks. Such"recurrent" network models have been considered by Jordan (1986), Elman (1988) and Williamsand Zipser (1989). For example, in the Elman (1988) set up, hidden layer activations feed back,so that network output is Ot = F(At' /3), Atj = G(Xt' Oj + At-l ' Oj), j = 1, ..., q, where At =CAto.Atl. ",.Atq)-.Ato = 1 This a11ows or intema1 network memory and for rich dynamicbehavior of network output. Learning in such models is complicated by the fact that at any stageof learning. network output depends not only on the entire past history of inputs Xt. but also on

    ..the entire past history of estimated parameters e t. Results of KC are relevant for treating suchinternal feedbacks. Convergence of RM estimates in recurrent networks is studied by Kuan(1989) and Kuan, Hornik and White (1990).

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    55/98

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    56/98

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    57/98

    -57

    II.2.1(b) follows from Theorem II.2.1(c).Finally, we show that cycling between two asymptotically stable equilibria is impossible.

    It is easy to see that points in e. must be isolated. Let O ~ and 0; be two isolated points in e. tand let NEI and NEz be neighborhoods of 8~ and 8;, respectively, such that NEl ~ dCe*),

    " "N el ~ d(e* ), and N el f'\ N el = 0. If the path of e t cycles between e ~ and e;, e t must movefrom, say, N 1 to N z infinitely often. Let {ti} be an infinite subsequence of {t} such thate ti E N El Then (}t; ( .) is a subsequence of (}t ( .) and has limit 8 ( .) satisfying the

    ---e = 7t'['(8)] .But for every T there is a t > T such that 8 (t) E N 1. Hence 8 (0) E N I but 8 ('r)caWlot conycrgc; to O i as .-:. ~. 'I1ll~ vlo1atc~ IllC; ~y11lpt.Ul1l; ~t.(1billt.y uf{J i and proves Theorem2.1(d). D

    PROOF OF COROLLARY 11.23: The result follows from Theorem ll.2.1 because thesummability condition of at in Assumption A.4' implies at ~ O as t ~ ~ and Assumption A.5'implies Assumption A.5 by the mixingale convergence theorem (McLeish, 1975, Corollary1.8). D

    PROOF OF THEOREM 11.2.4: We verify the condition." for Theorem 2 nf KH

    We first observe that the conditions [A1], [A4], [A7] and [AS] of KH are directly assumed,and that [A3] ofKH is ensured by Assumption B.5(c) and Lemma Al.

    Second, we show that the consequence of [A2] of KH holds under Assumptions B.2(b) andB.5(b, c). This amounts to showing that the second assertion in Lemma 1 of KH holds. ByAssumption B.2(b) we have

    L:learly, the integral on the RHS of (a. 10) converges to zero a.s. because e t -7 () .a.s. Let {Ek} bea sequence of positive real numbers such that LkEk < 00, and let {Nk} be a sequence ofintegers

    tending to infinity as k -7 00. Define measurable sets Ak, Bk, Ck, Dk and F k as:

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    58/98

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    59/98

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    60/98

    60 -

    PROOF OF COROLLARY 11.2.5: Only Assumption B.5 needs to be verified. We observe thatAssumption B.5'(b) is a mixingale condition ensuring Assumption B.5(b, c) by the mixingaleconvergence theorem. To establish Assumption B.5(a), we see that Assumption B.5'(a.i) ensuresthat for K < 00

    ](t =IIE(l/f; lFQ ) 112 $: K ~ IL",t . (a.15)where; ~,' is the mixinsale memory coefficient. The fact that; K;,r s of size -2 implies that

    ~

    This establishes AssumDtion B.5(aj). Similarly, AssumDtion B.5' Ca.ii) mDoses; t = SUPj?:O IIE(1f/; 1f/;:j -E(1f/; 1f/;+j ) IIFo )112 ~ Khto

    = 0hat bt is of size -2 ensures hat Lt=o ~r < 00. This establishes Assumption B.5(a.ii).PROOF OF PROPOSITION 11.3.2: See Gallant and White (1988a, Lemma 3.14). D

    PROOF OF PROPOSITION 11.3.3: See Andrews (1989, Lemma 1). 0PROOF OF PROPOSITION 113.4: (a) We first observe that

    EIUtWt-E~.::::(UtWt) 12

    where Ut.m= E~.:!:.:::(Ut)nd Wt.m= E~.:!:.:::(Wt). ere we employ the fact that E~.:!:.:::(Ut t) is thebest L2-predictor of Ut Wt among all IF~~::: measurable functions. Hence,

    II Ut Wt-E~::::(Ut WJI12

    ~IIUtWt-Ut,mWt,mI12

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    61/98

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    62/98

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    63/98

    -63 -

    are Lipschitz continuous in x. Therefore,11jf(z, ())I = I Vof(x, li)[Y-f(x, li)]1

    ~Q2(x)[lyl +Ql(X)],so that Assumption A.2(b.i) holds for b(())=l, h1(z)=1, and h2(z)=Q2(X)[lyl +Ql(X)].

    11//(z, ()l)-1//(z,()vl

    Vof(x, 81)[y -f(x, 81)] -Vof(x, 82)[y -f(x, 8z)] I1

    ~ I Vof(x, 81)y -Vof(x, 8vy I + I Vo!(x, 82)!(x, 8v -Vof(x, b'l)!(X, 81) I (a.18)It follows from Assumption C.2 that

    IVof(x, 81)y-Vof(x, 8vyl ~ lyIL2(X)181-821 ,

    I Vof(x, 82)f(x, 8v -Vof(x, 81 )f(x, 81) I

    ~ I Vof(x, ~)f(x, 82) -Vof(x, 8vf(x, 81) I + I Vof(x, 8vf(x, 81) -Vof(x, 81)f(x, 81) I

    :s: v c5f(x,82) L1 (X) 181-821+ If(x, 81) L2(x) 181-82

    Hence (a.18) becomes

    Thi;s c;;sta.bli;shc;~ A~~UUlVUUU A.2(lJ.li).

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    64/98

    64Because Iy I, L1(x), L2(x), Ql(X) and Q2(X) satisfy Lipschitz conditions, Proposition

    11.3.3ensures hat IYtl , Ll(Xt)' L2(Xt) Ql(XJ and Q2(Xt) are NED on {Vt} of size -1. BecauseXt is bounded, Ql(Xt), Q2(X,), L1(X,) and L2(Xt) are bounded. Because IlYtl14 ~Ll, it followsfrom Proposition 1I.3.4(a) and Corollary 4.3(a) of Gallant and White (1988a) (i.e., sums of ran-dom variables NED of size -a are also NED of size -a) that h3(ZJ is NED on {Vt} of size -1/2.The mixing conditions of Assumption C.l then ensure that { h 3 (Zt) -Eh 3 Zt) } is a mixingale ofsize -112 by Proposition 3.2. Similarly, {h2(ZJ -Eh2(ZJ} is a mixingale of size -112, establish-ing Assumption A.5'(ii).

    We next verify that for each 8 E e, {1jI(Zt, 8) } is a mixingale of size -1/2. Fix 8 ( = 8).Observe that the Lipschitz condition on f( .,8) and the conditions on {2,} imply by Proposition

    Vt} of size -1 The triangle inequality implies thatl.3.3 that {f(Xt. 8) } is NED on{Yt-fCXt, 8)} is NED on {Vt} of size -1, and the boundedness ofXt, the continuity offC ., 8),and the fact that II yt 114~ 11 < 00 implies that II yt -f CXt. 8~14 ~ 11 < 00. The Lipschitz condition on{Vol( ..8)} and the conditions of {Zt} imply by Proposition 11.3.3 that {Vol(Xt. 8)} is alsoNED on {Vt} of size -1. Further, the elements ofV f(Xt, 8) are bounded, so that by PropositionII.3.4(a) {1f/(Zt. ()) = Vof(Xt. 8)[ft- f(Xt. 8)] } is NED on Vt} of size -1/2. It follows from Pro-position ll.3.2 that { Vo!(Xt. 8)[yt -!(xt, 8)] } is a mixingale of size -lll, given the mixing con-ditions imposed on {Vt} by Assumption C.l. Thus, Assumption A.5'(i) holds, and the result forthe simple RM procedure follows.

    For the modified RM estimates we first note that every element of 0-1 is bounded above sothat I G-1 I < 11 for some 11.

    Now.

    ~ IG-l Vof(x, 8)[y -f(x, 8)] I

    ~~ Q2(x)[\yl + Ql(X)]

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    65/98

    65 -

    I Vll(Z, 8)1 = \vec(Vof(x, 0) Vof(x, O)'-G)\

    vec (Vof(x, 8) Vof(x, 8)') I + Ivec GI

    = I Vof(x, 8) 12 + I vec G 1

    where we use the fact that I vec A = [tr(A ' A)]Y%. Hence Assumption A.2(b.i) holds, as

    = h:1(7)

    We now establish a mean-value expansion result for G-1. Recall that G is restricted to aconvex compact set r, so the mean value theorem applies. A matrix differentiation result showsthat when c is symmetric and nonsingular, dC-l/dg/J a-l Sij a-1 , whccc gij is tl"lC j-U.element of G and Sij is a selection matrix whose every element is zero except that the ij-th andji-th elements are one; see Graybi11 (1983, p. 358). Hence we can write

    rl ("0... r:,-l'l-l 1 0 -1 )\vec '-' J = -vec (0- Sij ,a~;j

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    66/98

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    67/98

    The first term of (a.19) is less than

    +

    VofCx, 81)VofCx, 8v' -vec [ v ofCx, 8z) vofCx, 8z)'ec

    .Vof(x, 81)-Vof(x, 8V]I@VoJ(x,81)) +

    v 6 f (x, 81) -V 6 f (x, 8V]Vof(x, 82) ~ 1)

    ~ Vof(x, 81) -Vof(x, 8v

    ~ VofCx, O2) @ Iwhere we used the fact that vec (ARC) = (C !8> ) vec B. It can be verified that

    and Vof(x, c5z)@ I, where k is the dimension of 8. Thus, (a.19) becomes

    1Jf1 (z, (J1) -1Jf1 (z, (Jv

    ~ 2K Q2 (X) L2 (X) 01-021 + I vec(G2-G1)

    ~ h3' (z) I f)1 -f)2

    where hJ (z) = 2K Q2 (X) L2 (X) + 1 .Hence Assumption A.2(b.ii) holds, as

    $; 11fIl(Z, 91)-V'1(Z, 92)1 + IV'2(Z, 91)-V'2(Z, 9vl

    ~ h3(z) 181 -821

    hj(z) = h; (~) + h3' (;).ith Using thc befored1nc (11; Wll~ll~ as we nave that{h2(Zt) -Eh2(Zt) }, {h3(Zt) -Eh3(Zt) }, and {1/f(Zt, 0) -E1/f(Zt' 0) } are mixingales of size -112.

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    68/98

    -68-

    Hence Assumption A.5' also holds. This yields the desired results for the modified RM esti-mates.

    The conclusions for the quick RM estimates follow because he quick algorithm is a specialcase of the modified algorithm. D

    PROOF OF COROLLARY 11.3.6: We verify the conditions of Corollary ll.2.5. For the simpleKM estimates we neea to ShOw hat Assumptions B.2(b) ana B.5' hold In this case

    v 9 1jI(Z,e) = V 0(\70 (x, 8) [y -f (x, 8)])

    = Voof(x, 8) [y- f(x, 8)] -Vof(x, 8) Vof(x, 8)'hence for 9 in int G and 9 in Go

    I v 6 1jf(Z, (}) -V 6 1jf(Z, (}a) I

    Voof(y-f)-VofVof'-Voor(y-r) + vor Vor'

    ~ IVqo!Y-Voof'y\ + I(Voof')f'-(Voof)!1 + lVof'Vof"-VofVof'1

    where we have written f = f(x, 8), r = f(x, ~ ), etc. By Assumption D.2, 0 = ~ = 8* .Apply-Voo!Y-Voary\ ~ ly\L3(X)18-50 I,andng Assumption D.3 we get

    I (Voof' )I' -(Vool)11 ~ I (Voof' )I' -(Voof' )iI + I (Voof' )1- (Voof>ll

    ~ lVoorl Ll(X) 18-00 I + Ql (X)L3 (x) 18-00 I

    18-~1

    Voor I .$: Q3(X), with Q3 Lipschitz-continuous in x by straightforward arguments.ince

    Funher,I Vorvar' -Vof Vof' I ~ I varvar' -VorVof' I + I vor Vof' -Vof Vof' I

    ~ 2Q2(X) L2(x) 18 -l)O I,

    so hat

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    69/98

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    70/98

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    71/98

    -71-

    I ( G-1 -( Go)-l [Voor (y -r) -Vor Vor'] I (a.22)It follows from Ca.20) that the fir.1;t enn in (a22) i.l; f';.1;.I;h~n

    Iy I L3 (x) + Q3(X) Ll (X) + Ql (X) L3 (X) + 2 Q2(X) L2 (X)] 10-00 I1

    It can also be verified that the second tenn in (a.22) is less thanQ3(X) ( Iy I + ~1 tX + tQ2(XZG 1 -(G~) 1 I

    Iy I + Ql (X)) + (Q2(X))2 vec(G-GO)

    Thus (a.22) becomes

    0-1 [Voof(y- .n -VofVof'] -(00)-1 [Voor (y-r) -Vor Vor']$: h;' (z) 18-8 I,

    whereIy I L3(X) + Q3(X) L1(x) + Ql(X) L3(X) + 2Q2(X) L2(x);' (z) E 11

    + Q3(X) ( Iy I + Ql (X + (Q2 (X2We also note the fact that IA I ~ I vec A I ~};j}; j I ajj I, where A is a square matrix and ajj

    are its elements. Combining these results we immediately get~ h4 (z) 18-81,8 1jI (Z, e) -V 8 1jI (z, eO)

    where h4 (z) = h~ (z) + h; (z) + h;' (z) .This establishes Assumption B.2(b). A11 other con-ditions can be verified as the proof for the simple RM algorithm. Thus the asymptotic distribu-

    ..tion result of 8 t follows from Corollary 1I.2.5 with

    H; = E(V 91f1; ,where v 611' (z, e) IS given by (a.:Zl), ana

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    72/98

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    73/98

    -73

    where the first equality follows from the fact that1xp[(-Ik/2)c] = exp(-c/2) 1 = [exp(-c/2)]Ik,

    For the quick RM algorithm,z E (Val; .Vaol;)-I

    H * -3 - 0 *-1 a *-e

    is also block triangular, and the lower ~ght kxk block ofI3 is

    It follows that the lower right kxk block of F; is

    so that...d -.(t + 1)Y; (8t -8.) ~ N (0, F3)

    We now show that F~ -G *-1 i~ G *-1 is a positive semidefinite matrix. From Theorem1I.2-4(c) we 2et

    -};1 =HIFl +F1 HI =(HI +I/2)Fl +F1(Hl +1/2)

    =HI FI +FI HI +FI .Hence,

    -(G* )-1 I1(G* )-1

    =(G*)-I(H~F~ +F~H~ +F~)(G*)-1

    -F~(G.)-l -(G*)-IF~ + (G*)-IF~(G*)-1

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    74/98

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    75/98

    -75 -

    is positive semidefinite, where (F; )y, is such that (F; )Vi (F; )Vi = F; .Since i~ = }; 1, the resultholds. D

    PROOF OF COROLLARY 11.4.1: Owing to the compactness of the relevant domains, the spe-cia} structure of f in (II.4.1) and the continuous differentiability of G, it is straightforward to verify the domination and Lipschitz conditions required for application of Corollary 11.3.5. D

    PROOF OF COROLLARY 11.4.2: Direct application of Corollary II.3.6. D

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    76/98

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    77/98

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    78/98

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    79/98

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    80/98

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    81/98

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    82/98

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    83/98

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    84/98

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    85/98

    77 -

    TABLE 2

    DETERMINISTIC CHAOS APPROXIMAlED BYSINGLE IllDDEN LAYER FEEDFORW ARD NETWORK

    Logistic Map Circle Map Bier-Bountis Map8 8 8

    N 250 250 2502.68 x 10-4 1.35 x 10-3 2.34 x 10-2

    R2 .9999 .9999 .9999SIC -7.93 -6.32 -3.46

    t q. = SIC-optimal number of hidden units; remaining symbols as in Table 1.

  • 7/27/2019 Artificial Neural Networks an Econometric Perspective

    86/98

    -78-

    REFERENCES

    Albert, A.E., and L.A. Gardner (1967): Stochastic Approximation and Nonlinear Regression,Cambridge: M.I. T. Press.

    Amemiya, T. (1981): "Qualitative Response Models: A Survey," Journal of Economic Literature19, 1483-1536.

    Amemiya, T. (1985): Advanced Econometrics. Cambridge: Harvard University Press.

    Andrews, D. W.K. (1989): "An Empirical Process Central Limit Theorem fo


Recommended