TOPICS IN ADVANCED TIME
SERIES ANALYSIS
Victor Solo
Preface
These notes were prepared for a series of 20 lectures given at
the invitation of Professors Rolando Rebolledo and Guido del Pino in
July 1982 at the Primera Escuela de Invierno de Probabilidad y
Estadistica held at the Universidad Catolica de Chile. I thank them
for an enjoyable and useful time in Santiago.
One aim of these lectures is to provide a discussion of some
issues not easily found elsewhere. It is hoped some parts of the
notes will benefit both theoreticians and practitioners. I have
tried as much to put ideas across as to give technical detail: thus
where appropriate scalar and bivariate time series have been used to
illustrate issues and motivate the general multivariate case.
I would like to thank Narie Sheehan who did an
amazing job of typing an enormous amount of material
in a very short time.
i.
CON £ENTS
INTRODUCTION
I.A Motivation
I.B Some Ideas on Time Series Modelling
I.C A Selection of Time Series and Econometric
Models
APPENDIX A.I Matrix Calculations
page
171
2. SOME PROPERTIES OF TIME SERIES MODELS
2.A Some Aspects of the Spectrum
2.B Minimum Phase and Minimum Delay
2.C Spectral Factorization
179
3. REVIEW OF PREDICTION I
3.A Prediction: General Ideas
3.B The Wiener Filter
3.C The Levinson, Whittle Algorithm
APPENDIX A.3 Projection
185
REVIEW OF PREDICTION II
4.A The Kalman Filter
4.B The Output Statistics Kalman Filter
4.C The Kalman Filter for an ARMA Model
4.D Likelihood Functions for Time Series
192
5.
6,
THEORY OF IDENTIFIABILITY I
5.A Introduction and Examples
5.B Basic Ideas
5.C Identifiability and Estimability
IDENTIFIABILITY II
6.A Simultaneous Equation Models
6.B Examples
6.C Local Identifiability
APPENDIX A°6 A Basic Lemma of Identifiability
197
201
7.
8,
9.
10.
11.
12.
13.
168
IDENTIFIABILITY III
7.A Identifiability and Kullback-Liebler Information
7.B Consistency and Identifiability
IDENTIFIABILITY OF SOME TIME SERIES MODELS
8,A ARMA Models
8,B Transfer Function Models
APPENDIX A.8 Basic Lemmas in Time Series Identifiability
LAGS IN MULTIVARIATE ARMA MODELS I
9.A Lags, Kronecker Indices and McMillan Degrees
9.B Determination of AR Parameters
9oC Construction of a State Space Form
APPENDIX A.9 Matrix Polynomials
LAGS IN MULTIVARIATE ARMAMODELS II
10.A Matrix Fraction Descriptions
APPENDIX A°IO The Predictable Degree Property of Row Reduced Polynomial Matrices
MARTINGALE LIMIT THEOREMS
II.A Martingales
II.B Martingale Central Limit Theorem
APPENDIX A.II Basics of Martingales
APPENDIX B.II Some Useful Results in Probability
APPENDIX C.II A Strong Law of Large Numbers
LEAS TSQUARES ASYMPTOTICS IN REGRESSION MODEL~
12.A Introduction
12.B Regression Without Feedback
12.C Regression With Feedback
12,D A General Central Limit Theorem
APPENDIX A.12 Convergence of a Series
LEAST SQUARES ASYMPTOTICS IN AR AND ARX MODELS
13.A Consistency in AR Models
13.B Central Limit Theorem in AR Models
13.C Consistency in ARX Models
210
214
224
237
243
255
263
14.
15.
169
LEAST S~UARES ASYMPTOTICS FOR ARMA MODELS
14.A Preliminaries
14.B Consistency of Least Squares
14.C A Central Limit Theorem
APPENDIX A.14 Consistency of Sample Autocovariances
ASYMPTOTICALLY EFFICIENT ESTIMATORS IN ARMA MODELS
16.
17.
18.
19.
20.
15.A The One-Step Gauss Newton Scheme
15.B Adjoint Variables for Quick Gradient Calculation
HYPOTHESIS TESTS IN TIME SERIES
16.A Lagrange Multiplier Tests
16.B Testing for Autocorrelation in a Regression
16.C Choice of Order and AIC
IDENTIFIABILITY OF CLOSED LOOP SYSTEMS I
17.A Introduction
17.B Basic Issues in Closed Loop Identifiability
17.C Identifiability of the Forward Loop
IDENTIFIABILITY OF CLOSED LOOP SYSTEMS II
18.A Identifiability of the Closed Loop
APPENDIX A,18 Closed Loop Models from Spectral Factors
APPENDIX B.18 Nongeneric Pole Zero Cancellations
LINEARLY FEEDBACK FREE PROCESSES
19.A Introduction
19.B Linear Exogeneity
19.C Linear Feedback Free Processes
TESTS FOR ABSENCE OF LINEAR FEEDBACK
20.A Tests for Linear Feedback
20.B Identifiability and Weak Linear Feedback
270
280
288
295
303
308
315
References 320
NOTATION
AR
ARX
ARMA
ARMAX
a(L), ~(L)
b(e), ~(L)
~,b
c(L)
6
~k
e k
f
f (~), ~(~)
F
L
m
P
Pi
R -S
SEM
Z
S k
r(e)
v k, w k
x k
-Yk' Yk
z k , Z k
z~ Z
autoregression
autoregressive, exogenous
autoregressive moving average
autoregressive moving average exogenous
AR polynomials; a(L) = E~ a i L i etc.
exogenous polynomials
coefficients on regressors
noise or disturbance polynomial
McMillan degree
white noise
prediction error, innovations, residual
frequency; ~ = 2~f
spectrum
coefficients on endogenous variables
lag or backwards operator
dimension of Z, Y
dimension of ~,
order of AR; maximum Kronecker index
Kronecker indices
state covariance matrix; reduced term parameters
autocovariance sequence
simultaneous equation model
covariance matrix
s igna 1
rational transfer function
correlated noise sequence
state vector; regression vector
observed sequence
exogenous or input sequence
z transform or forwards operator
LECTURE i: INTRODUCTION
IA. Motivation
In many areas of statistical application such as economics, business,
engineering, earth sciences, environmental sciences, data are collected
over time; typically quarterly, daily, hourly, in seconds etc. Often sucb
data is available on several related variables. An example is the daily
measurement of dissolved oxygen concentration and biochemical oxygen demand
in a river as indicators of water quality. Another example concerns the
study of the relation between in-migration and out-migration from a region
of a country as they relate to economic variables such as capital growth.
There are two basic reasons for analyzing the data jointly as opposed
to singly:
(i) To understand the dynamic relations among the series. The effect of
a change in one series may show in another series simultaneously or
over a period of time and with varying amplitude of effect. Some
series may lead others and there may be feedback connections between
them.
(2) To provide more accurate forecasts. The joint modelling allows the
use of joint information; intuitively this enables forecast errors
to be reduced.
Note.
IB.
See also Tiao and Box (1979).
Some Ideas on Time Series Modelling
There seem to be three threads in the time domain modelling of
time series. One old one, one more recent and one just developing.
(i) A traditional view
(ii) The view popularized by Box and Jenkins
(iii) A very new idea
172
(i)
form
In the traditional view the data YI' "''' Yn is modelled in the
= + ~t + vt Yt Tt
where T t is a trend such as a mean or a polynomial; ~t is a seasonal com-
ponent e.g., a sum of sinusoids of various frequencies; v t is a correlated
noise sequence or disturbance sequence of bounded variance. The trend is
supposed to capture the "smooth" part of Yt; the seasonal component captures
the "regular" part; while the noise captures the "rough" or unpredictable
part. The model is fitted by a combination of regression and smoothing to
remove the trend and the seasonal component. The noise is modelled as a
stationary process. In general, the noise Vl, ...,v n is one realization
(i.e. a sample of size I) from a multivariate distribution with covarianee
1 matrix having ~n(n+ i) elements. The stationarity assumption reduces this
to n elements (which puts the analysis in the realm of classical nonpara-
metric statistics). Finally, finite parameter models (AR, MA, ARMA) reduce
this to a fixed dimension. A careful discussion of this methodology is the
book of T. W. Anderson (1971).
(ii) The idea here is to model time series in terms of random
wandering (RW) models: this term which connotes random walk is preferred
to nonstationary since confusion has resulted with time varying parameter
models of bounded variance. The data is differenced to produce a stationary
model which is then fit by an AR, ~ or A~MA scheme. Typical differencing
operators are 1 - L, (I-L) 2, 1 - L4; this last one removes quarterly
effects from monthly series. Fuller (1976) has provided tests for deter-
mining how many if any factors of the form 1 - L should be employed. At
first this approach seemed to contradict the older one. More recently
(Abraham and Box (1978)) it has been realized that data containing deter-
ministic components such as sinusoids can be handled by allowing cancella-
tion of AR and MA factors with roots on the unit circle viz. the model
173
Yt = A cos wt + gt i.i
is equivalent to the ARMA (z,2) model
(I-2L cosw+L2)Yt = (I-2L cosw+L2)~ t 1.2
L is the backward or lag operator. If such a cancellation is found
(fitting must be done with an exact likelihood routine) then it was
recommended to fit the model (i.i). The fact that this is not necessary
is the substance of some very recent ideas. (Some related ideas to this
section are in Parzen (1981)).
(iii) Recently Harvey (1981) pointed out that if a unit root cancel-
lation is observed in fitting an ARMA model then the Kalman filter can be
used to produce forecasts without refitting (a model such as (i.i)). The
idea (see Lecture 4A) is that the Kalman filter applies to RW models.
Actually, the standard stability theory of Kalman filtering does not
apply to unit root cancellation models. Fortunately, recently Goodwin
and coworkers working independently (Goodwin et al. (1982), Chan et al.
(1982)) have extended the stability theory to deal with just such an issue.
There are still some issues to be resolved. There is another related
issue, since the Kalman filter does not require a stable model for its
validity there is no reason why general AR~ models cannot be fitted with
unit root factors in the AR, MA terms (some cancelling, some not). In
particular orders of differencing would not need to be specified, just an
overall model order. These ideas need some sorting out: some recent
pertinent work is Tiao and Tsay (1981).
IC. A Selection of Time Series and Econometric Models
Some models are listed here togehter with some miscellaneous comments.
It is sometimes useful to write the models in the general form
data = signal + noise.
(i) Autoregressive Model of Order i AR(p).
174
Yk = Sk + gk
s k = a(L)Y k
where L is the lag or backwards operator LY k Yk-l; a(L) = ~ aiLi = ; C k
is a white noise sequence or uncorrelated sequence E(gkS t) = ~kt ~2,
E(C k) = O. A minimal additional requirement to allow statistical inference
is that E(eklgk_ 1 ..... Cl) = O. This means that gk is uncorrelated with,
or orthogonal to, any function of the past y's; it is called a martingale
difference sequence or nonlinear innovations process. The behavior of
the process Yk is determined by the roots of the polynomial equation
zP(I+a(Z-I)) = O
zP + ~ aizP-i = 0
-i where Z = L is the Z transform or forwards operator ZY k = Yk+l" The
basic behaviors are captured by the second order model
1 + a(L) = 1 + all + a2L2.
If the roots of the equation Z 2 + alZ + a 2 = 0 are real the behavior is
geometric decay after each new "shock" gk; if the roots are a complex
conjugate pair the behavior is oscillatory. Readable accounts are given
in Fuller (]976, Chapter 2) and Box and Jenkins (1976).
(ii) Moving Average Model of Order ~ MA(q).
q L i where C(L) = ~ici "
in Lecture 2B.
Yk = Sk + gk
S k = C(L)~ k
Some issues relating to this model are discussed
Remark. Roughly speaking we can describe the difference between AR and ~
models as follows. AR models are smooth, MA models are "rough". To see
what is meant here consider the signal + noise decomposition Yk = Sk ÷ gk"
Intuitively we think as follows.
175
S k = signal = the part of Yk predictable from the past
Sk = noise = unpredictable part of v k.
This "predictability" or smoothness can be measured by the signal to noise
ratio (SNR)
SNR = var(Sk)/var(g k)
= (var(Yk)/var(Ek)) - I
viz. we have the comparison
MA(1) : Yk = Sk + 0£k-I
AR(1) : Yk = eYk-i + Ek
: SNR = 02
: SNR = 82/(1-82).
For .5 < le] < i the SNIR for AR(1) is much higher than that for the MA(1).
For higher order models it is harder to discriminate on this basis but the
idea is a useful intuitive one. Box and Tiao (1977) have mentioned SNR.
It is a co~on measure in the engineering literature.
<iii) Autoregressive Exogenous Mode=]_ ARX
Yk = Sk + ek
Sk = a(L)Yk + ~'~k
Here ~k is a sequence of deterministic trends such as a constant, polynomials,
sinusoids. Alternatively, ~'~k is of the form (or contains components of the
form b(L)Z k where Z k is an exogenous or input sequence.
(iv) Autoregressive Moving Avera&eModel ARMA
Yk = Sk + Ck
S k = a(L)Y k + C(L)g k
The appropriate method for generating the autocovariances of this model
seems to be the scheme suggested by Hwang (1978) and Wilson (1979).
176
(v) ARMA + Exogenous_ (ARMAX)
Yk = Sk + Ek
Sk = a(L)Yk + ~'~k + c(e)~k-
Two related models are the TFg (transfer function white noise) model
= (I +a(L))-Ib(L)Z k S k
and the TFARMA (TF + A~MA noise) model
Yk = Sk + Vk
= (i +a(L))-ID(L)Z k S k
v k = (l+d(L))-l(l +c(L))Sk.
One of the advantages of the ARMAX form is its linearity.
(vi) State Space or Markov Model
Yk = Sk + Vk
Sk = ~'~k
~k+l = ~k + ~k
where Vk, ~k are white noises with
Provided M # 0 there is a i -i relation between Markov models and ARMA
models. Now while the A~MA model can be written in state space or Markov
form in infinitely many ways only a handful turn out to be of practical
ase.
One particular representation is the observer form
Yk = Sk + Ek
177
S k = h'_x k
~k+l = [0Xk + ~gk
F_O =
al10 -a k 0 0
k 1
k= kp
c I - a 1
Cp - ap
Notice that v k = ~k and ~k = ~gk are contemForaneously correlated. Other
forms are discussed in Kailath (1980, Chapter 2).
(vii) Simultaneous Equation Model (SEM)
+ = ~t' i = i, ..., N
where Y is an £ × 1 vector of endogenous variables i.e. variables whose --t
relations among themselves are determined within the system (whether it be
a market, group of markets, or whole economy) being studied• Then Z is -t
an m × 1 vector of exogenous variables i.e. variables whose values are
determined outside the system being studied. Finally, ¢ is an £ × 1 -t
vector of uncorrelated noises or disturbances with covariance matrix E:
N is the sample size•
These equations are called structural equations because they represent
the structure of the system being studied• They might for example be pro-
vided by economic theory. An excellent account of econometrics exhibiting
the links between the economic models and the econometric equations is the
book by Intriligator (1978).
The equations can be collected together as
YF + ZB = E
where Y = (~i .... , YN )' is N × £, etc. If the matrix of endogenous co-
efficients F is of full rank the equations can be written as a multivariate
regression in "reduced" form
Y = Z~+W
178
R :-~[-i w:K -1
The covariance matrix of Y is ~ ~N (where ~is the Kronecker product:
see Appendix AI). The dynamic simultaneous equation model allows lagging
on the Y and Z sequences viz.
~'(L)Yt + B'(L)z = 8 -t -t
_ FI L i F'(L) = ~8 -l etc.
Further background for this model is available in Intriligator (1978) or
Theil (1971).
APPENDIX AI: MATRIX CALCUALTIONS
To deal easily with multivariate models it is useful to employ the
"vec" or stacking operator. If M is a p × q matrix with columns
t
~i ..... Mq then vec M = vec(M) = (~i ..... ~p)' ~s'pq'× 1 i.e. vec stacks
the columns of M one under another. We also need the Kronecker product
of two matrices. If A is p × q and B is r x s then A~B is the pr x qs
block matrix with i, j block entries [aijB] , 1 ! i ! p, 1 ! j ! q.
The vec operator has the following easily verified properties (we
write vec' X = (vec X)').
vec ABC = C' ~A vec B
tr(_XY_) = trace(_XY_) = vec' X vec X'
tr(_A~_Cp = vec' ABC vec D' = vec'B C QA vec D'
LECTURE 2
SOME PROPERTIES OF TIME SERIES MODELS
2A. Some Aspects3f the Spectrum
A p dimensional time series of multivariate time series is a collec-
= X p t ) ' tion of random vectors ~t (Xlt , ..., where t belongs to an index
set T. We call X discrete if T is discrete. If T = 0, i, 2, ... (where -t
the sampling interval A has been suppressed) the series is equispaced.
Otherwise it is unequispaced; if then T = {tl, t2, ...} often Xt I is de-
noted ~1 e t c .
A multivariate time series X is strictly stationary if the joint -t
distributions
[XI...X (~I ..... ~n ) = P(~l!~l ..... ~n!~n ) -n
(where e.g., ~i ! ~i means Xjl ! Xjl, j = i, ..., p) obey the shift in-
variance property
•X "'"
_tl+h'''~tn+h (~I' '~n ) = rx ...x (~l ..... Xn)
-t I -t n
for all t~,± ..., t and h c T. A multivariate time series _X t n
weakly (or wide sense) stationary if for all t,h c T
is called
c°v(~t' ~t+h ) = ~h"
Then _~ is called the autocovariance matrix of ~t" Observe that
T
_R h = R_h. (2.1)
The block matrix [_Ri_j]l!i,j< m is positive semidefinite for any m. (2.2)
To see t h i s c a l c u l a t e t h e v a r i a n c e o f an a r b i t r a r y l i n e a r c o m b i n a t i o n
m E 1 -:o~LX i . _ A p a r t of s t a t i s t i c a l t ime s e r i e s m o d e l l i n g i n v o l v e s b u i l d i n g
parametric modelling for autocovariance sequences.
180
Some further comments are worth making. To keep things clear we
work with the scalar case. The vector arguments are nearly identical.
We introduce the covariance generating function
¢(z) = ~:~' z -h ~-oo ~ (2.3)
In view of (2.1), (2.2) this has the properties
~(z) = ~(z -I) (2.4a)
~(z) is positive on i zl = I. (2.4b)
The function
oo ¢(e i~) = R 0 + 2 ~iRkCOSk~ (2.5)
is the spectrum of X t. We now investigate conditions under which the
series is finite. In fact the following square summability condition
is enough
Zt < o~. (2.6)
To see this regard ~k(C0) = cos k~ as a sequence of random variables on
the sample space {[-~, z), Borel o fields, Lebesgue measure}. They are
I uncorrelated with variance ~ : i.e.
e(~k~ ~) = F l cos i~ cos k~0 df =- 6ki; f = ~/2"~. -r 2
Thus we deduce mean square convergence of
~n(CO) = R 0 + 2 ~,~R k c o s k ~
i . e .
F [ _ j-~ I* n ¢I 2df -' 0 as n ÷ ~o.
Further this is enough to establish the inversion formula
= lim /~_~,n(L0)e-i°Ohdf : /n_~¢(C°)e-i°°hdf. n-+Oo
We simply observe that
I f~ (~n- ~)e-iWhdf[ ~ f i~n- ¢[df ~ ( / i¢ n-¢I2df)½.1 -+ 0.
2B.
181
Minimum Phase and Minimum Delay
Let gk be a unit variance white noise sequence and consider the two
MA(1) models
Yk = cOCk + Clek-1 = (Co +elL)sk
Yk = ClCk + C0Sk-i = (Cl +c0L)sk"
First observe that Yk' Yk have the same autocovariances
var(Yk) = C2o + c~ = var(Y k)
cov(Y k, Yk_l ) = c0c I = cov(Yk, Yk_l ).
The ordered pair (Co, c I) is called a wavelet (or an impulse response).
The energy in the wavelet is defined to be cg + c~. Thus the wavelets
(c o , Cl), (c I, c o ) have the same energy and same autocovariances. They
are distinguished in other ways.
A minimum delay wavelet is one which delivers its energy as early
as possible. A maximum delay wavelet delivers its energy as late as
possible. So if we take ic01 > iCll then (c o , c I) is minimum delay while
(Cl, c o ) is maximum delay.
Now consider the amplitude and phase properties of these two wavelets.
We write
The amplitude is
Ao(eiW ) = e 0 + cleiW
= c o + c I cosw + i c Isinw.
2 2 ½ A0(W) : ~IA0 (eiw) I = (c o +c I +2c0c I cos to)
• + c I cos w c I sin 0~ => ~ + i A0(elW) = X0(~)[ c0 A0(w ) ~ J
: ~o(~)e i~o6°)
where TO(W) = tan-l[cl sin~/(c 0 +c I cosw)].
182
Similarly we find
2 cos 2 AI(~) = (c0+2c0e I ~+Cl) 2 = A0(~)
• i(~) = tan-l[c0 sin~/(c l+c 0 cos~)].
Thus the two wavelets have the same amplitude but different phases. The
two phase spectra are plotted on (0, ~) (sufficient since T(~) is an odd
1 function) for the case (c0, Cl) = (i, ~)
(~) degrees
180
135
90
45
0
~/2 "~
t0-+
We observe that TI(~) exceeds T0(w). Thus the wavelet (Co, c I) is
also the minimum phase wavelet.
Finally, observe that the minimum delay, minimum phase wavelet is
characterized by being the one with all its zeros inside Izl < i. Thus
its inverse is a stable filter. When MA models are fitted it is the
minimum phase or minimum delay or stably invertible models that are fit.
An interesting application of these ideas is revealed by the all pass
filter
(c O + Clh)/(c I + CoL)
which produces the ARMA(I,I) model
(c I +c0L)Y k = (c 0+clL)C k.
Then Yk is stationary and is itself a white noise. The issue of minimum
phase/delay is important in feedback detection.
Notes. The discussion here is adapted from Robinson and Treitel (1980).
183
See also Robinson (1962) and Robinson (1967, Appendix i). The ideas here
readily extend to higher order MA models.
2C. Spectral Factorization
As mentioned earlier the covariance generating function
-k ~(Z) = ~_~E(YoYk)Z has two basic properties (2.4). The spectral
factorization problem is to represent the process Yk as the output of
a linear system driven by white noise. That is to show
THEOREM 2.1. If ~(Z) is a real rational full rank covariance generating
function then there exists a unique factorization
¢(z) = 9(z)~[*(z)
where W(Z) is real, rational, stable, minimum phase i.e.
W-I(z) is analytic in !Z 1 ~ i (2,7)
with W(~) = I. Q is symmetric positive definite. W(Z) is called the
normalized minimum phase spectral factor (NMSF) of ~(Z). Further
1 deg W(Z) = ~ deg #(Z).
Proof. A very readable (multivariate) proof is given by Whittle (1953).
With ¢(Z) as a rational function an (easier) proof is given by Robinson
and Treitel (1980, p. 228), Hannah (1970). See also Ng et al. (1977).
Remark. The significance of (2.7) is the following. If and only if (2.7)
holds then by a standard result in complex variables W-I(z) has a (one-sided)
Taylor series expansion valid in [Z 1 ~ 1 (so W-I has no poles there)
W-I(z) = E0~ CkZ-k'
to put this in more familiar terms. If and only if (2.7) holds then
W-I(z) must be the Z transform of a stable causal (=one-sided) filter.
This is the relevance of analycity; it enables us to calculate the white
noise causally stably as gk = W-I(Z)Yk"
184
If we remark that this problem is the same as finding a Cholesky
factor for the infinite Toeplitz matrix
R 0 R 1 R 2 °.•
R_ 1 R 0 ...... Too =
R_2 .........
then some of the methods used to find the polynomial of Theorem 2.1 are
understandable.
If we look at Cholesky factorization as generalized square rooting
it is not surprising that there is a Newton Raphson scheme for finding
W(Z) from @(g) which generalizes exactly the well known scheme for finding
the square root of a number namely
1 an+ 1 = ~ (a n+a/a n ) ÷ a2
see e.g., G. T. Wilson (1972) and B. D. O. Anderson (1978)•
We might also expect to find a spectral factor by doing Cholesky
factorizations of increasing order on the Toeplitz matrices
T n
R 0 ..... Rn_ 1
=
K n+ I ..... R 0
The celebrated Levinson, Whittle algorithm (Lecture 3C) finds Cholesky
factors of T-I• A similar algorithm that finds Cholesky factors of T -I n n
is given by Justice (1972).
If @(Z) is rational then the (output statistics) Kalman filter per-
forms spectral factorization (see Lectures 4A, 4B). Alternatively, we
1 ~ -k can write ~(Z) g(Z) + g(z-l), g(Z) ~ R 0 + = = EI~Z • Then g(Z) is
called positive real since Re(g(ei~)) > O. It is possible to do spectral
factorization of ~(Z) from an ARMA formula or state space formula for g(Z)°
This is accomplished via the so called positive real lemma (see Anderson
et al. (1974)). All the above discussion extends naturally to the multi-
variate case•
LECTURE 3: REVIEW OF PREDICTION I
3A. Prediction and Smoothing: General Ideas
Consider the problem of estimating one time series x from measure- -t
ments on a r e l a t e d t i m e s e r i e s I t : c o l l e c t t h e m e a s u r e m e n t s ~ i = ~ t . ' 1
i = i, ..., n (which need not be equispaced) into a vector
n ! v Y = ~1 = ( Y l ' " ' ' ' Yn )" I f t > t t h e p r o b l e m i s c a l l e d p r e d i c t i o n ; i f - - - n
t < t n it is called smoothing. The basis of most time series estimation
f o r m u l a s i s l i n e a r l e a s t s q u a r e s ( l t s ) e s t i m a t i o n . We c o n s i d e r l i n e a r
c o m b i n a t i o n s ( w e i g h t e d a v e r a g e s ) o f t h e o b s e r v a t i o n s s u c h a s
and choose the weights ~ti
estimate is then given by
^ n ~ t l n = ~I ~ t i X i
tO ensure EIIxt-~tln I12 is minimized. The lls
-Xt!n = E(xtl-Yl)
E-I = E(_xtx') y Y-
E = E(Z~'). (The wide sense projection or conditional expectation symbol -y
is discussed in Appendix A3. If the data is Gaussian it is conditional
expection .) Thus
(~tl .... Htn) = E(~tX') E-I _ -y
= {E(xtZl) ..... E(XtZ~)} ~yl.
Observe that to solve the estimation problem we need only to know the
covariance of the data y i.e. ~ and the cross covariance between the process - -y
x and the process yt. This basic point is often forgotten. There is a -t
problem with the solution as it stands in that it involves a matrix inversion.
There is however a second approach.
We use the data one step at a time as follows
186
where
~t]n = E(~tlZl ) = E(~t lZn ' n-1 X1 )
= E(xtlS n, Z~ -1)
n-i
en : Yn - E(YnlY-I ) = Y-n - -Yn[n-l"
This follows by the basic property of (wide sense) conditional expectation.
The computation of e is discussed in a moment. -n
n-i uncorrelated with v_ so
hl
Now by definition e is -n
_~tl n : ~(xtl_en ) + ~(xt n-l) ly I •
Applying this projection repeatedly yields
ei = -Yi - E(vilY[ -I); el = -YI"
(3.1)
The sequence e. is called the innovation sequence since by definition it --I
contains the new information in the data Yi not predicted from previous data
i-i -Yl Finally, note that since the innovations sequence is an uncorrelated
sequence (3.1) is
n
xtl n = E 1 E(_xtei)E_~.l e i
v
z. =E(e_e ) -z i i "
(3.2)
Actually, (3.2) is a generalized finite data Wold decomposition: we
see that it terminates if ever E. = O. -I -
It is not clear in the above calculation how the ~i are to be produced.
In fact they are produced by a Cholesky factorization of the covariance
matrix ~ as -y
E = LDL' --y -----
where L is a block lower triangular matrix with identities on its diagonal;
D is block diagonal with entries E . The e. sequence is generated from the - -i -i
y by
From this we verify that
187
(el, ...,e ) = L -I -n -Y °
E(_e I ..... _e n)(e I ..... en)' = _D = diag (~I' "''' -nE ).
The inversion of triangular matrices is very easy. Also since the
inverse of a triangular matrix is triangular we see that e I , ..., e i are
linearly equivalent to -YI' "''' ~i i.e. the e i and Yi are causally equiva-
lent. Further consequences of these ideas are expanded on in various
articles by Kailath (see e.g., (1974)). It should be emphasized that
these formulae apply whether x evolves in continuous or discrete time. -t
Now given a covariance matrix there are standard numerical techniques
for generating Cholesky triangular decompositions (see e.g., Stewart (1973)).
However, the special structure of Y~ that arises in time series models means -y
that more efficient methods can be found.
There are basically two ideas involved.
(i) Stationarity. If the -Yt process is stationary then the covariance
matrix E is a Toeplitz matrix -y
~0 ~n- 1 =
-y
~-n+l "'" 20
The Levinson, Whittle algorithm (Lecture 3C) generates th~ inverse
-i triangular factor L by using this Toeplitz structure.
(ii) Markov Structure. The Kalman filter (Lecture 4A) generates the
inverse factor L -I directly even for nonstationary models by using
the Markov structure of Zt.
Finally, it is possible to combine these structures and produce an
algorithm for stationary Markov processes that improves on (i), (ii).
This is the so-called Chandrasekhar algorithm (see Lindquist (1974) and
Sidhu and Kailath (1974)) and Rissanen (1973a, 1973b). The aut:~or has
recently used this algorithm to construct likelihood functions (Solo (1982)).
There is one other general point worth making here. If we deal with
188
processes ~t' Zt having a joint Markov structure then the problem of
minimizing E[Ixt-~tlnll 2 can be converted to a variational problem. This
point which enables the estimation problem to be cast as a regression
problem has been discovered by a number of authors: see e.g. Rauch et al.
(1965), Sage (1968), Kimeldorf and Wahba (1970), Duncan and Horn (1972),
Paige and Saunders (1977), also Silverman (1976). The variational view
underlies the relation between splines and time series.
3B. Wiener Filter
Using the above ideas it is possible to give a very simple derivation
of the discrete time Wiener filter formulae (cf. Anderson and Moore, 1979,
p. 255). For simplicity we take the scalar case. Suppose ~, Yk are
jointly wide sense stationary. Consider the infinite data problem where
the infinite past of Yk is available. Introduce the autocovariance gen-
erating function of Yk by
~y(Z) = ~7~E(YoYk)Z-k"
Now the quantity that corresponds to the Cholesky factor of ~ is the -y
spectral factor of ~y(Z). According to Theorem 2.1 ~y(Z) has a factoring
~y(Z) = W(Z)W(Z-I)d 2 where W-I(z) is analytic in IZl ~ i (W(m) = l) so
that W-I(z) is a causal operator. Thus we can introduce the innovations
sequence
~k = W-I(Z)Yk"
We rapidly check that the autocovariance generating function for V is
E ~ z-k ~(Z) = -~E(VkV 0)
= ~ W-I(z)E(YkY0)W-I(z-I)z -k
= W-I(Z)~y(Z)W-I(z -I) = 02 "
This confirms that ~k is a white noise sequence. Next by the generalized
189
Wold decomposition
~I k = £(XklYkYk_ 1 -..) = E(XklVkVk_ I ...)
= E 0 E(Xk~k_j)Vk_ j
oo
= E 0 E(Xo~ j)~k_ j by stationarity.
Introduce Dj = E(Xo~ j) then
co co --"
~Ik_ I = Z OD j~Jk_ j = Z 0 DjZ J~k = D+(Z-I)vk
where the generating function D(Z) has been written as a sum of causal
D+(Z -I) and non causal D_(Z) parts viz.
D(Z) = Z °o D.Z -i = E°°D Z -j + Zl D .Z j - ~ ] 0 j - . l
= D+(Z -1) + D_(Z).
^ D+(Z-I) That is to say the Z transform linking Dk to Y~ik_ 1 is . Thus
the transform linking Yk to ~ik_ I is
Xklk_ 1 = D+(z-l)~jk
= D+(Z-I)w-I(Z)yk ,
Finally, we need to obtain D+(Z -I) in terms of more easily available
quantities: observe that
co -j D(Z) = E_~E(XoV j)e
= Z °o E(XoW-I(z)y_j)z-J -co
= w-l(z)¢xy(Z -I)
Cxy(Z -l) ~ -j where = Z_o E(XoY j)Z is the cross covariance generating function.
Thus we obtain the celebrated formula
^ c ~x (Z-l) Yk
Xklk-i = I i~W(Z) }+ W(Z) °
t90
3C. Levinson, Whittle Algorithm
The Levinson algorithm is a means of producing the inverse Cholesky
factor of the data covariance matrix when the process is stationary.
There are a number of ways of deriving the algorithm (each having their
advantage). The idea is that we can produce the triangular factor by
fitting increasing order autoregressions.
An important aspect of the Levinson algorithm is the fact that the
AR polynomial it produces is stable. In fact this idea can be used in
reverse to produce a stability test for an arbitrary polynomial. It is
the Cohn criterion (see Veiera and Kailath (1981)).
A derivation of the Levinson algorithm using properties of Toeplitz
matrices is available in Anderson and Moore (1979, p. 258); also Robinson
and Trietel (1980, p. 1963). A derivation using projection arguments is
given by Findley (1981). A discussion of the algorithm from the lattice
point of view is given by Makhoul (1978).
APPENDIX A3: WIDE SENSE CONDITIONAL EXPECTATION
A basic role is played in the linear estimation theory by the following
elementary considerations. Let X, Z be two zero mean correlated random
vectors. Suppose initially they are jointly Gaussian. Then the following
conditional mean expression is well known (and easily derived)
E(ZIX) -I = EZX ~XX X
ZZX = E(ZX'); ~XX = E(XX').
Now suppose X, Z are not Gaussian. We introduce the wide sense conditional
expectation
~(zlx) -1 = ~ZX ExxX"
The label conditional expectation is justified in view of the basic prop-
erties of
E(Z-E(ZIX))X' = 0 (A3.1)
191
This is called the orthogonality condition.
~(zlx, Y) : ~(zlx) + ~(z[~)
~(z]x, Y,w) = ~(z[x, Y,~y)
= Y - E(YIX); Wy = W - E(WIY).
Also E can be viewed as a projection. Denote by R(X) the column space of
X i.e. R(X) consists of all vectors that are linear combinations of columns
of X. Then
EI]Z-qll 2 = E(ZIX) : r]C R(X). (A3.4) arg m~n
That is E(ZIX ) is the projection of Z onto R(X). We can reinterpret
(A3.2) as showing that the projection of Z onto R(X, Y) can be accomplished
by first projecting onto ~(X) and then onto that part of R(Y) that is
orthogonal to R(X).
Remark I. It is a useful exercise to derive the following well known
matrix inversion formulae by using (A3.2)
-i
~>
Z = A + BCD
= A -I _ A-IB(DA-IB +C-I)-IDA -I
= A -I _ Z-IBcDA -I
= A -I _ A-IBcDZ -I
C -I _ DZ-IB = (CDA-IB +I) -I.
If Y, W are other random vectors
(A3.2)
(A3.3)
Remark 2. If X, Z are not stochastic we can still define E as
E(XIZ ) = (XZ')(ZZ')-Iz
provided the inverse exists. The properties derived above continue to
hold, This definition of E is useful in regression analysis,
Notes. Discussions of wide sense conditional expectation are given by
Doob (1953) and Parzen (1967).
LECTURE 4: REVIEW OF PREDICTION II
4A. The Kalman Filter
where
Here we suppose that x, y are related through a joint Markov model
_Xk+ 1 = Fk_X k + w k
-Yk = h k ~ + Vk
[ _wk] _v k
] _N k -Nk
(4.1)
Denote the data Y-I .... 'Y-k by _yk. Now apply the basic projection lemma
to see
~ k-I = E(-~+ii-ek'-Yl )
-~k ~k ~klk-1 -Yk ~Zkl k-1 . . . . Yl )"
(4.2)
We deduce then
-~+llk = -~+11e-i + ~(~+iiek )
T
= E(_~k+le k)
~k = E(eke k).
To proceed further we need to use the Markov property
-Xk+11k-1 -k+l Yl )
~ k-i = E(Fx, +w, Yl ) --E -K I ±
= FkE(XklY k-l)
(4.3)
(4.4)
(4.5)
(4.6)
193
= Fk_Xk i k_l
Thus we obtain the first part of the Kalman filter
(4.7a)
since m~Yl Wk ) = O.
k+llk = rk Ik-i +
In the sequel replace the subscript k+llk by k+l. Introduce the error
covarianee matrix
!k = E(~k- ~k ) (~k- ~k )'
= ~k - ~k
= A ^, ~k (4.8) ~k E(~kXk )' = E(XkXk )
Take covariances in (4.7a) to see
^ ! --1 T
~k+l = ~k~kgk + -~k -~" (4.9)
On the other hand taking covariances in (4.1) gives
v
~k+l = ~k~k~k + ~k" (4.10)
Thus from (4.9), (4.10) we find on subtraction
--1!
~k+l = ~k~k~k + ~k - ~k ~k -~" (4.7b)
We need to add that
!
~k = E(-Xk+l~k)
!
= E(FkXk+Wk)(Zk + (_Xk-_~)'_~)
= ~k~k)k + ~k (4.11a)
as well as deducing
!
~k = ~k~k)k + -~ (4.11b)
The Kalman filter is the equation set (4.7), (4.11) (Kalman (1960).
Remark i. If ~ obeys a continuous time model rather than the discrete
one (4.1) the same argument will produce the so-called continuous discrete
194
filter (see Jazwinski (1970)).
Remark 2. An important aspect of this setting is that stationarity is
not assumed so unstable models are allowed. A very simple discussion of
the behavior of the filter in such cases, using only mean square ideas is
given by Aasnes and Kailath (1974). This contrasts with complicated tradi-
tional analyses (Jazwinski (1970)).
Remark 3. It appears that to operate this filter it is necessary to know
the quantities Q, N, R and much work has been devoted to methods for esti-
mating these off-line as well as on-line. This is unfortunate since these
quantities are not identifiable nor in fact are they needed to operate the
filter. This is discussed in the next section.
4B. The Output Statistics Kalman Filter
Return to equation (4.5) and observe
T
~k = E(~k+lek)
v ~ v
= FX_k+l(Yk -_Xk~ k)
v
= E(Xk+iZk) - Fk~khk.
Similarly we find
(4.12a)
v ~A
-k % = E([kZk) - _hkPkh k._ _ (4.12b)
The so-called output statistics Kalman filter consists of equations (4.7a),
(4.9), (4.12). It is clear that to operate the filter what is necessary is
the variance sequence of yk, the cross variance between the state x and the
output or observation Z and the transition matrix F. These quantities can
be obtained in a model fitting exercise (see Solo (1982)).
Notes. The output statistics Kalman filter is due to Son and Anderson (1971).
The use of covariance information to produce a filter is due to Rissanen and Barbosa (1964).
195
4C. Kalman Filter for an ARMA Model
Now suppose the Markov model comes from a scalar ARMA model. Suppose
in particular F = ~0' the observer form matrix. It follows immediately
by inspection that if we join
Yk = ~'~k + ek
to (4.7a) we can write
- - ... = e k + + ,.. + c e (4.13) Yk alYk-i - apYk-p el,k-lek-i p,k-p k-p
where c.j,k_j (Kj,k_j -aj)/Zk_j. This exhibits the Kalman filter as a
time varying ARMA model. It is an interesting exercise to show that
Ci,k_ i = E(Wkek_ i) ~ E(Wkgk- i) = c i
w k = (i -a(l,))y k.
Notes. This filter form was first derived by Rissanen and Barbosa (1969).
A simple, direct derivation is given by Aasnes and Kailath (1973).
4D. Likelihood Functions for Time Series Models
Assuming a Gaussian distribution for the data [, the log likelihood
function is
1 1 y, -i log L n = -~ log l~y I - ~ _ ~y
= E(yy'). Now we can simplify this by using the Cholesky n where y : Yl; ~y
factorization to see
i ~n 1 n ~ -i log L = -~ L 1 log I~il ~ Z 1 - !i~i ~i" (4.14) n
Another way to obtain this is to observe that since the likelihood
is just the joint distribution of the data we have by the definition of
conditional density
L n = f(Zl ..... Yn ) = f(Y~)
196
f(ynly~-l) n-I = - f(Yl )"
However since yn is Gaussian log f(ynl n-i - Yl ) must have the form
1 ] e' E-le -~ log ]~nl - ~ -n-n -n
mince -n ~ = E(~n~ ) and ~n = In - E(!nlY . Now iterating the conditional
calculation gives the form (4.14). The advantage of (4.14) is that we can
generate ~i' -iZ" from the Kalman filter: see Gardner et al. (1980); an alter-
native method using Chandrasekhar equations is discussed in Solo (1982).
LECTURE 5: THEORY OF IDENTIFIABILITY I
5A. Introduction and Examples
The identifiability problem refers to the fact that two different
parameter values in a probability model may give rise to the same distri-
bution and hence be indistinguishable. The issue arises especially in
econometric modelling of simultaneous relations (such as demand and supply
relations) and in statistics in the analysis of variance. The problem is
illustrated with two simple examples.
Example (i). Econometrics
Consider a pair of equations to describe the demand and supply of a
qt = 60 + ~iPt + gt demand
qt = $0 + 6'iPt + g't supply
qt = quantity sold at time t
Pt = price of the good at time t
~t~ ~ are white noises. Now clearly an attempt to estimate the ~'s by t
linear regression will return only two values b0, b I. The problem is that
the two equations
E(qt) = ~0 + ~l(Pt )
T t
E(qt) = 60 + BI(P t)
are identical. So DO, BI; BO, B 1 cannot be distinguished. The way to
resolve the problem is to build a larger model incorporating additional
exogeneous variables i.e. variables describing goods whose prices are
determined in other markets vizo,
E(qt) = ~0 + ~l(Pt) + ~2(rt )
single good
198
I t v
E(qt) = 80 + BI(p t) + ~2(ct)
where c is the price of a raw material used to produce the good while r t t
is the price of a complementary good. In larger systems of equations we
can see that the basic issues involve the existence of linear dependencies
between the structural equations. A readable discussion (on which this
example was based) is in Theil (1971, p. 446-449).
Example (ii). Analysis of Variance
Consider the following one way classification model
Y.. = ~ + ~. + g.., i = i, 2; j = i, ..., n 13 I l j
where Y.. denotes the response to treatment i on the jth trial, ~. measures lj l
the effects of treatments, ~.. are noises. The aim is to compare the treat- lj
ments through ~I - a2" The overall mean effect ~ is not identifiable or
estimable from the data.
5B. Basic Ideas
In the ensuing discussion it may be helpful to think about the identi-
fiability problem form two points of view. (i) What conditions are needed
on the parameters of a model to ~nsure it can be simulated with prescribed
properties, (ii) How much can the data reveal about parameter values.
To formalize the identifiability issue recall the definition of a
model. Suppose y denotes some time series data. A model M for ~ is a
family of probability distribution Po(y) or P(ylO) indexed by a d-dimensional
_ c ~d; IR@ is an parameter Q. The parameter belongs to a parameter space IR e
open set. A member of M i.e, a particular probability distribution is
called a structure or (a parameter value) and denoted M@ (or ~). (The
bar under y, @ will be dropped when no confusion results.)
DEFINITION i. Two structures M6, M@, (or parameter values Q, 0') are
called observationally equivalent if
Po(y) = e ~ , ( y ) .
199
DEFINITION 2. (Global Identifiability). The model M is identifiable
at O 0 (or MOO i s i d e n t i f i a b l e ) (or 00 i s i d e n t i f i a b l e ) i f
P0(y) = Pg0(y) => 0 = 80
i . e . t h e r e i s no o t h e r 9 ¢ IR e which i s o b s e r v a t i o n a l l y e q u i v a l e n t to 90 .
DEFINITION 3. (Local Identifiability). The parameter value O 0 is locally
identifiable if there is an open neighborhood of @ 0 containing no other
9 c IR 9 which is observationally equivalent to 80 .
Often we are only interested in specific functions of the parameters
(e.g., ~I - ~2 in the analysis of variance).
DEFINITION 4. A function g(0) is identifiable if
PgI(y) = P92(y)
=> g(91) = g(92)
i.e. observationally equivalent structures give the same value to g(!).
There are now two basic approaches to identifiability, one via the
theory of estimable functions, the other through Kullback-Liebler information.
5C. Identifiability and Estimabili~
DEFINITION 5. A function g(0) is called estimable if there exists a
function of the data ~g(y) with g(9) = Eg(~g(X)) = f~g(y)Pg(dy). The
use of this notion appears in the following result.
LEMMA 5.1. A function g(9) is identifiable if it is estimable.
Proof. Let (y) (y). Then Pgl = P92
g(01 ) = E g l ( ~ g ( y ) ) = f ~ g ( y ) P 0 1 (dy)
f~g(Y)P92(dY) = g (92) .
200
Remark i. This le~aa forms the basis of the common intuitive method of
investigating identifiability.
Remark 2. In linear models the converse holds as follows
LEMMA 5.2. In the linear model y = ~ + 8, a ~ N(O, I), ~ = XS. Then if
g(~) = £'~ is identifiable it is estimable.
Proof. The function c'@ is clearly identifiable iff it is uniquely deter-
mined by ~ since ~ determines P@(Z). That is there exists a with £'2 = a'q.
Now however ~'G is estimable by @g(Z) = ~'l"
Notes. Lepta 5.1 generalizes a lemma in Rothenberg (1971). Estimable
functions were introduced into econometrics by Richmond (1974). Lemma 5.1
provides a very powerful method for generating identifiability results.
Lemma 5.2 is in an unfortunately forgotten paper by Riersol (1963). The
theory of estimability in linear models is clearly discussed in Scheffe
(1959). Some application is now given to SEM that will show how these
le~m~as are used as well as motivate some general theorems later on.
Remark 3. To apply the above theory we will usually require that all
distributions are multivariate Gaussian (MVG). Since a MVG is determined
by its mean vector and covariance matrix we can avoid the MVG assumption
by instead defining observationally equivalent structures as those whose
means and covariances agree. With this in mind the above formulation
will be still used.
LECTURE 6: IDENTIFIABILITY OF ECONOMETRIC MODELS
6A. Simultaneous Equations Mode~l
Consider the SEM
YF + ZB = E
where Y is N x %, F is i x %, Z is N × m, B is m x ~. E has zero mean,
covariance ~Q I. The parameters are ~' = (vec' [, vec' B). The reduced
form is the multivariate regression
Y= ZI~+U
!:-Bf-I [= ~f-l.
Now apply Lemma 5.1 to the equalities
E(Z,Z)-Iz,y =
E(! [)(x-i)' = z = E(<U')
where Y = Z~ = Z(Z'Z)-Iz'Y. We deduce that the reduced form parameters
If, E are identifiable. Also observe that they determine P@(Y). Now the
idea is to obtain part or all of [, B from ~. We may need restrictions
on @ to ensure this can be done. We write the restrictions generally as
_~_~ = $, @, g known. (6.1)
The restrictions might typically be that 8. = 0 i.e. e! 8 = 0, l 0 -I 0 -
e. = vector of O's with 1 in i position. To (6.1) we add the relation -i
HF = -B or
I~®~ vecr :-vec B. (6.2)
To summarize (6.1), (6.2) we can write
202
[A] where A = A(]I) = (-19~ ~ ~ : Igm).
ability of a particular linear function say t'_@ = g(_G).
THEOREM 6.1. The function t'9 is identified iff
Rank(A' : 0' : t) = Rank(_A' :_~').
Now let us investigate the identifi-
(6 .4)
Proof. By definition t'@ is identified if
PoI(Y) = P02(v) :> t'61 = t'~ 2.
Now P01(Y) = P~2(Y) ~=> ~i = ~2 => ~(~i ) = ~(~2 ) = ~ say. Thus we have
0 !2"
Now by Lemma A6.3 ~ '~1 = £ '~2 i . e . ~ '~ i s unique i f f (6 .4) ho lds .
Remark. This theorem is given in terms of reduced form parameters. It
is more useful (especially for simulation) to restate it in terms of
structural parameters.
THEOREM 6.2. Introduce M = (I£ ~If' : I~ ~B'). Then t'~ is identified
iff
Rank(M~' : M~) = Rank(M~'). (6.5)
Proof. Observe that, on partitioning 0 appropriately
W =
01 0 2 I 0M'
:l£m I T +2
(6.6)
T =
~ ® r -1 ] _ : 0
J
203
Also, T is nonsingular and
T-i: M' £m
Now by Theorem 6.1 t'0 is identified iff
Rank(W':t) = Rank(W')
i.e.
Rank = Rank(W). T
Now multiply on the rhs by T -I (this does not change the ranks). Thus
(6.4) holds iff
Rank
0 : I
¢H' : 9 2
t'M' : t 2
= Rank 0 : I I.
[~M' @2 )
That is iff
I @M'
Rank i t'M' = Rank <¢M')
i.e. iff (6.5) holds.
These results can now be specialized to obtain identifiability of
v individual parameters. The ith element of 0 is 0 i = _ei0, i = i, ..., £(£+m).
From W form the matrix W. obtained by deleting the ith column; call the ith 1
column w.. Then -I
THEOREM 6.3. The ith element of @ is identified iff
Rank (W)_ = Rank (W i)_; + i.
t Proof. By Theorem 6.2, @i = -l-e'0 is identified iff ]%_ ) (A' :~')I_ = ~i'
! t i.e. (Wilwi)A- = -le" i.e. iff~'.l_~ = 0 and w_'il = I. But this occurs iff there
is no solution to Wi_ ~ = w i i.e. iff the ith column is linearly independent
of the remaining columns.
204
If we call m. the ith column of M we have
THEOREM 6.4. The ith element of ~ is identified iff
Rank (M}' : mi) = Rank (M~').
Proof. Straightforward.
From Theorem 6.3 it follows by contradiction
THEOREM 6.5. The whole vector @ is identified iff W has full column rank
i.e. Rank (W) = ~(~+m).
For structural parameters the result is
THEOREM 6.6. The whole vector ~ is identified iff }j~' has full column
rank i.e. Rank (M4') = ~2.
Proof. It follows from equation (6.6) that:
Rank (W) = Rank [ 0
[ < 4M'
l~m] .
4 2 ]
Then @ is identified iff Rank (W) = ~2 + ~m i.e. iff
Rank (M~') = ~2.
Notes• The above discussion is based on Richmond (1974). See also
Hsiao (1980).
Remark. Derivation of classical result where the constraints are on the
parameters of a particular equation. Here ~ takes the form
4 = ii 0 °°
~g
• 0
0 0 I • . ~g is mg x
Sg is m x m 1 g
0 " O~
then
205
W =
0
~i 0 I
0 ~r 0
0
Cg " ;g
" 0
0
I
SO
Rank = Rank ¢ ¢g Sg
Now apply Theorem 3 by removing % + m columns from W corresponding to the
gth column of v, gth column of B. We find that the (parameters of) gth
equation (are) is identified if
Rank W = Rank "~ : I
leg : ~g =0+~+m.
In terms of structural parameters this is
Rank (F'~)g + B'~'g : I) = Raak (F'¢'g + B' )
<=> Rank ~g ~ = f,
6B. Examples
The following example is based on Hsiao (1980).
equation system (m = ~ = 2)
[Ylt Y2t]I BII ~:i11 + [Zlt z2t]I vFYII
$21 B22-I L'2i
Consider the two
~I~ = (Clt E2t) Y22 ~
The parameters are
206
(vec' g :vec' B) = [711 721 712 722 : BII 821 912 ~22 ].
Clearly normalization restrictions are ~ii = 1 = ~22" For the identifi-
ability of the gth equation we require
[Sg g] x 4 matrix, Clearly we must have where :$ is an mg × (~+m) = mg
m ~ 2 to allow identifiability at the gth equation. This entails adding g
one constraint to the normalization constraint: e.g., 721 = 0 => equation i
is not identified; equation 2 maybe, We rapidly find
(as expected)
Rank [@i :@l]I'~ = Rank [0010] IBI = I < 2
o p Rank [@2 :$2 ] = Rank 0 0 ~
= Rank ~ 721 T221 = Rank ~ 0 7221 = 2
hg21 g22 i Lg2] I j
So equation 2 is identified. We can also consider doing this through
cross-equation restrictions. We have
M = F ' 0 : B' 0 I
F' : 0 B'
and require Rank (MS') = Rank (SM') = L2 = 4o For this we need at least
4 restrictions; there are 2 normalization restrictions leaving 2 independent
conditions viz., try YI2 = 0; Yll + 721 = 0. Then
S =
li ° 0
0
i
0 0 I i
0 0 0
I 0 0
0 0 0 °°il 0 0
0 0
0 0
Then
207
Rank (}M') = Rank
I ~!i 612 0 0 ;
0 621 822
0 YII YI2
LYII+Y21 YI2 +Y22 0 0
0 621 = Rank = 4
0 YII
Y12 +Y22 0
Thus both equations are identified.
6C. Local Identifiability
Guided by the theorems of 6A we can formulate some general theorems
for nonlinear situations. Suppose we are interested in a function t(~)
of parameters 0 describing a probability model P@(Y). Suppose there are
some "reduced form" parameters ~ that determine Pe(Y) and a relation
a(9, ~) = 0 is known. Also, 9 is known to obey some restrictions ~(2) = O.
Then the identifiability problem becomes one of finding under what condi-
tions t(2) is unique for each ! satisfying a(0, N) = 0, ~(~) = 0. The
following result intuitively agrees with Theorem 6.1.
THEOREM 6.7. (Richmond (1976)). Suppose
Rank -- :- De ~
is constant in a neighborhood of 20.
at g0 iff
~a' ~' Rank -- :--
~e 0 ~_0 0
Then t(0) is locally identifiable
= Rank -- : - -
Proof. See Richmond (1976).
208
Remark i. From this result, results analogous to Theorems 6.2 -6.6 can be
developed. See Richmond (1976).
Remark 2. Hsiao (1980) shows how Theorem 6.7 can be used to deal with
identifiability for errors in variables problems.
APPENDIX A6: A BASIC LEMMA OF IDENTIFIABILITY
First recall some basic results in the solution of linear equations.
LEPTA A6.1. (Fredholm Alternative).
p × 1 vector b
but not both.
For any p × q matrix A and any
either
or
Ax = b has a solution x
~X with ~'Z = 0 and b'y = 1
Proof. This is a classical result. An excellent intuitive discussion of
this and many other topics in applied linear algebra is given in the book
of Strang (1974). A short proof is now given.
q xia i = b. This Call ~I' "'''-qa the columns of A_ then Ax = b says E 1 _
says b is a linear combination of columns of A. That is Ax = b has a
solution iff b c R(A) = column space of A = space spanned by columns of A.
Now by the fundamental theorem of linear algebra (Strang (1974, p. 69, p. 81))
either b c R(A) or b c N(A') = null space of A' = {v :A'y = 0} in the second
case choose y e N(A') and scale it so ~'Z = i.
There is another classical condition for solution of equations
LEMMA A6.2. Ax = b has a solution x iff
Rank (A : b) = Rank (A) = dim (R(A)).
Proof. Ax = b has a solution iff b e R(A) which is what the condition says.
209
Finally, the following result, which is used in the theory of
estimability of linear models in another guise, is basic to the identi-
fiability results.
LEMMA A6.3. (Richmond, 1974, 1976).
x satisfying Ax = b iff
The function t'x is unique for each
i.e. iff
~_y ) A'y = t
Rank (A' : t) = Rank (A').
Proof.
(i)
(ii)
If 3 Y ~ ~'! = ~ then for each x~ Ax = b, ~'~ = !'~ = ~'~ which
does not depend on x.
If ~'x is unique observe that if ~0 is a particular solution of
_Ax_ = b then all solutions are of the form x = ~0 + ~ where
E N(A) i.e. AZ = 2" But then ~'x = ~'~0 + ~'~ so ~'x unique
=> t'Z = 0. Thus the equation AZ = 0 and t'Z = 1 has no solution
A ~ so by Lemma A6.1 ~ _y~ _ y = t.
LECTURE 7: IDENTIFIABILITY AND KULLBACK-LIEBLER INFOrmaTION
7A. Identifiability b~_Kullback-Liebler Information
It is possible to express the identifiability issues in a form that
does not involve the fu]l distribution function by using Kullback-Liebler
information. Introduce
H(8, ~') = E0, [ In(Ps(dY)/P 8,(dY))]
= f l u ( P s ( d y ) / p g , ( d Y ) ) P e , ( d Y ) .
Then we have
LEMNA 7.1. pc(y) ¢ Ps,(y) iff H(e, 8') < O.
Proof. By Jensen's inequality
with equality if
H(8, 8') ! In I = 8
We can thus say
(i)
(ii)
P@(Y) = Pg,(Y).
9, 9' are observationally equivalent iff H(8, 8') = 0.
80 is identifiable iff H(9, 80 ) = 0 ~> 8 = @ 0 •
Now since 0 is the maximum value of H(8, 80 ) if H(8,80) is differenti-
able near 60 we can check the local identifiability of @ 0 by looking for
maxima of H(8, 80). In particular the following result is reasonable
THEOREM 7.1
(i) The parameter O 0 is locally identifiable if the information matrix
I N In ps(Y) ~ in ps(Y) ]
92t{/3838 IO 0 I80eo ~8~ , , = -Es0 980
is of full rank (Ps(Y) is the density of Ps(Y)).
211
(ii) If ~2H/~839' has constant rank in a neighborhood of ~0 and ~0
is identified then 7~000 has full rank.
Remark, The second result is useful in studying central limit theorems
for efficient estimators.
Proof.
here.)
density ps(Y). We have to investigate solutions of the equation
0 = ~H(@, @0)/~@ = / poo(Y)8 In p@(Y)/~@dY
= f (pso(Y)/p@(Y))~ps(Y)/~8 dY.
Note that
(This proof lacks a little rigor but all the essential ideas are
For ease of discussion take Ps(Y) to be absolutely continuous with
Thus, ~H(@, 80)/28180 0.
$2H(8,@0)/~@$8'[90
/p@(Y)dY = i => / l~p@(Y)/$@ dY = 0.
Also it follows easily
= -E(:~ ]np@(Y)/98)($ Inp0(Y)/~8')I@ 0
= _78088 (the information matrix).
is negative definite then H(8, 80 ) has a unique local Hence if ~@090
maximum at !0"
(7.1)
Now expand ~H/~@ in a Taylor series to find
~H/~e = 0 + ~2H/~0' (9-e 0)
where I18-0011 ! 119-9011. Now provided ~2H/~8~@' has constant rank in a
neighborhood of 80; if 80 is identified so that ~H/$8 = 0 for 8 # 90 cannot
occur we must have ~2H/$93~' of full rank. By assumption this entails
~2H/$@~@'180 = 18080 of full rank.
Constraints such as i(~) = $ can be allowed by searching for constrained
maxima of H(@, 80). We do this by constructing the Lagrangian function
Lie, e o) = H(e, e o) + _z'(}(o) - g ) .
212
Again we must investigate conditions for unique local solution to the
equation
~L/~O : 0 = :~H/~ ! + ( ~ i ' / ~ O ) ~ = 0
- ~ L / ~ = g = i ( ~ ) "
THEOREM 7.2. The parameter 9 is locally identifiable at 00 subject to
constraints ~(0) = g if
!
Le0e 0 ~0 0
~eo : 2
has full rank (20 = ~/~') or equivalently if
Rank (I 0 ~' : (Ieoeo)- _ 000 _00) = Rank
Proof. Similar to proof of Theorem 7.1.
We can also obtain a result for the identifiability of a specified
function t(e). We look for conditions on the solutions to ~L/~O = 0 that
make t(0) locally unique.
THEOREM 7.3. The function t(0) is locally identifiable at 00 subject
to constraints ~(0) = g if
Rank (19090
where 30 = St/D0.
Proof. Left as an exercise.
Lemma A6.3.)
Remark.
:_e o¢' :~e o) =Ra~k(fOOSO:¢~O )
(Use a joint Taylor series on L, t(0) plus
Using this result we can reproduce Theorem 6.Z beginning from
H(9, 00 ) = in I_~I - tr(~-l(H-~0)(_~ -~_0 )')
= in I_~I - tr(_~-I(_F-B~0)(_F-B_~0)').
213
Notes. The idea of using Kullback-Liebler information is due to Bowden
(1973). A result like Theorem 7,2 has been given by Kohn (1977).
Theorem 7.3 is new but closely related to ideas of Richmond (1976)
(cf. Theorem 6.7),
7B. Consistency and Identifiabilit~
There is a straightforward connection between weak consistency and
asymptotic identifiability.
TIIEOREM 7,4. If ~n (n= dim (y)) is a weakly consistent estimator of e 0 then
e 0 is asymptotically identifiable i.e. H(e, 00) + -~ as n + ~.
Proof. Consider that
H(e, %) = Hn(9,% ) = fln[ pe(Y) ] p%%j p%(~) dY
= f I~n_ @ l>g In(Po(Y)/Pg0(Y))Pg0(Y)dY
] I~-o'>~)+o. 21nP9 ( n
But P0 (l~n- O I > g) + 0 s o H(@, e 0) .... .
Remark. In time series models often the following limit exists
n-iHn(0, 00) * H(~, 00).
Then we say
(i) The parameter values are (asymptotically) observationally equivalent
if H(e, e') = o.
(ii) The parameter value 90 is identifiable if H(9, 90) = 0 => 9 = 90 •
Often H(9, 90) takes the form H(@, @0) = L(e) - L(90). Then e.g., 90 is
identifiable if L(0) = [(90) => 9 = 90 etc.
Remark. A result like lheorem 7.4 is quoted in Wu (1979).
LECTURE 8: IDENTIFIABILITY OF SOME TIME SERIES MODELS
8A. ARMAModels
Consider the scalar stationary AgMA (p, q) model (q!p)
(l+a(e))y k = (I+C(L))c k. (8.1)
Define _0 = (a I, ..., apc I, ..., Cq)'. Observe that
E0(yiYi+ k) = ~(0), k = 0, ±i
Thus by the estimability Lemma 5.1, the autocovariances are identifiable.
Further they determine the covariance matrix E~(yy') = [R (8) u -- [i-j] - ]lJi,j<_n
and hence, under Gaussian assumptions the full distribution. Thus in time
series the autocovariances are often taken as the starting point for
identifiability considerations.
Now take covariances through (8.1) with Yk-j
following equations
Hp(0)ap = ~p(!)
a = (a I ..... ap)' -p
RI(0) R2(0) ... R (8) -- -- p
Re(0) . . . . . . . . . . Hp(0) = :
for j ~ p to form the
% ( 8 ) . . . . . . . . . . R2p_l(0)
; r p ( O ) = [ Rp+l(0)
R2p(0)
(8.2)
The matrix H (e) is called a Hankel matrix; it is characterized by having -p
the same entries on cross diagonals. Now suppose p is unknown we might
expect to recover a by solving equation (8.2) for increasing p. -p
THEOREM 5.1. In the model (8.1), a is identifiable iff Rank (H (8) = p. -p -p -
Proof. The parameters a (O) are identifiable iff --p --
215
P@I (y) = P~2(Y) ~> ~pl = ~p2"
Thus, since ~(@) determines P (y), ap is identifiable iff
P,k(@l) = P,k(82) Vk-> ~pl = ~p2"
Now we have
Hp(@l)apl = rp(e I)
Hp(~2)~p2 = ~p(82 ).
However, ~(@i ) = ~(@2 ) Vk --> H_p(@l ) = Hp(O 2)_ = Hp say and rp(@l ) = ~p(e 2)
= r say. Thus -p
~p~pl = ~p = ~p~p2"
We deduce ~pl = ~p2 iff Rank (Hp) = p.
Remark i. A standard numerical analysis method for investigating Rank is
the Singular Value Decomposition (SVD) (see Strang (1974), Stewart (1973)).
From a statistical point of view it is natural to scale H by using -p
= R -½H R -½' where R ½ is a square root of R = [RIi- j I In -p -p -p-p -p -p . i ] l~i,J <_P"
practice then we replace ~ by
~= N-I N-k E 1 YiYi+k
and perform a SVD of H where ~ is an upper bound on the ARMA order.
Now observe that if we introduce the vectors
-'~ - = _ _ , ) v .
-Yi = (YiYi+l ' ""'Yi+~-I )' Yi (Yi lYi 2' "'" Yi-~
= +--I Then H E(ZO~ 0 ). This reveals that the SVD is equivalent to a canonical
correlation analysis of the two vectors y+, y (a clear discussion of
canonical correlation and its relation to SVD is given by Kshirsagar (1974)).
The ~proach to finding ARMA order through canonical correlation analysis
is due to Akaike (1976) who discussed it for multivariate ARMA models.
Surprisingly the scalar version given here is little known. Actually,
216
Akaike proposes to choose the order by using
AIC(p) ~ -2 %~+i in(l -°2)i - £n (v- p)2 where Oi are the canonical correla-
tions. The canonical correlations are obtained through the SVD.
Remark 2. In view of Theorem 5.1 we can define the order of the scalar
ARMA model as the minimum p such that
Rank (Hp) = Rank (Hp+i), i ~ i. (8.3)
There is also another way to view the order that is useful for multivariate
time series. Denote
Yk+i[k-i = E(Yk+ilEk-i ..... gl )"
For the ARMA model 8.I we find that
Yk+j+plk_ I # alYk+j+p_llk_ 1 + ... + apYk+jlk_ I = 0, j ~ 0. (8.4)
Thus we can say
The order is the smallest index p such that ]
Yk+p]k-i is linearly dependent on its pre- I
decessors Yk+ilk_l, 0 < i < p -i.
(8.5)
that
Now this links up with the defintion (8.3) as follows.
+ -i I Yk
H = E(y0y 0 ) = E[ " (Yk-i ..... Yk-~ )
Yk+v-i
Simply observe
Thus H
Ykik-l
= E " (Yk-i ..... Yk-~ )
Yk+~-llk-i
!
E(Yklk-i ~k )
=
• !
E(Yk+v-IIk-I ~k )
has rank p iff it has only p linearly independent rows which
clearly occurs iff (8.5) holds.
217
Remark 3. Given the autocovariances of an ARMA process we can then find
its order. Suppose instead we are given two polynomials 1 + a(L), 1 + c(L)
to be used in a simulation. We should like to be sure they have no common
factors before we can know the order. Usually in simulation the problem
is avoided by building polynomials in factored form. Two polynomials A(L),
C(L) are called relatively prime if they have no common factor. The greatest
common divisor gcd of two polynomials A, C is the common factor of highest
degree. We should like to have a test for primeness as well as a means of
extracting the gcd (so at least we can check someone else's simulation!).
The classical method for testing two polynomials for relative primeness is
the Sylvester Resultant. The classic method for finding a gcd is the
Euclidean division algorithm. These and newer techniques are discussed
in Kailath (1980).
Remark 4. In connection with the last point is the issue of stability
of a polynomial A. Again this is often avoided by building models in
factored form. There are many methods available, one of the best known
being the Schur-Cohn criterion (see Kailath (1980)).
8B. Transfer Function Models
Consider the scalar TFARMA model
Yk = Sk + Vk
s k = (l+a(L))-ib(L)Zk
v k = (l+d(L))-l(l+c(L))~k
where s k is a white noise sequence of zero mean variance i.
The model can also be written
ek(e) = (l+d)-l(l+e)(Yk - (l+a)-ibz k)
_0 = (~ :~.) = (al...ap bl...bp : d l...d m Cl...c m).
We suppose there is a true value @ 0 so that ek(@ O) = Sk" Actually, a
2t8
similar analysis can be made even without this assumption.
m are assumed known. We consider only asymptotic identifiability.
the l+a, l+d, l+c polynomials are stable i.e.
the roots of zP(l+a(z-l)) = 0 etc. ] (8 .6 )
are inside the unit circle.
Also suppose z k is wide sense stationary i.e. the following limits exist
E(zizi+ k) = lim n-i ~l-t~-kzizi+k = <, k = O~ fl, ..,
(In a late~i lecture some models are discussed where these last two condi-
If the Sk were Gaussian we should be led to consider tions do not hold.)
the function
The order p,
Suppose
- 2 H ( e , e o) = L(e) - L(e o)
L(9) = E (e~ (9 ) ) = n-~oolim n-i ~ne~(9)'}"l
It is straightforward to see (provided Zk, £k are uncorrelated)
1 + c 2 * ,, L(9) = /_~ i-~ [¢y - ToCyz - r0¢yz + ITgI2¢z ]df
where Cy = Cy(ei~); Cz = Cz (ei~) are the spectra of y, z; Cyz = Cyz (ei~)
is the cross spectrum; T 9 = Tg(e1~ ) where
Te(L) = (I+a(L))-~(L); the asterisk denotes complex conjugate.
By adding and subtracting ICyzl2/¢z to the term inside the square bracket
we can reorganize the expression as
L(9) = Ls(9) + Lv(O) (8 .7a)
Cv(ei~Ig) Cz - TeI2df (8.7b)
Cy Lv(9) = f~% (ei~ (I - Iyyzi2)df (8.7e)
Cv e)
219
where ~v(L]@) = [i + c(L)[2/ll + d(L) l 2 is the noise spectrum and
Iyyz 12 = I ~ y z ( e i ~ ) 1 2 / ~ y ( e i ~ ) ~ z ( e i W ) i s t h e s q u a r e d c o h e r e n c y be tween i n p u t
arid o u t p u t . Note t h a t ( s u b s c r i p t z e r o d e n o t e s a t r u e v a l u e )
Tso(eim ) = ~sz(e )/$z (eIW) = %yz/%z
Ls(8) _> O, /v(8) _> i = Lv(90)
since @y(e i~) : Cv(ei~)(l - ]yyzl2).
LEMMA 8.1. The two parameter values 0, ~0 are asymptotically observationally
equivalent iff Lv(8 ) = i, L (9) = O. s
Proof. L(8) = L(8 O) <:~> Ls(8 ) + 6v(8 ) = Ls(@0) + Lv(80) = i. But
Lv(9) ~ i ~> Ls(8) = O, /v(9) = i since i (8), L (9) are >0. Converse is S V --
obvious
LEMMA 8.2. Lv(0) = Lv(8 0) = i :> -~ = ~0"
Proof. Lv(6) = l iff ~v(eZ~I0) = ~v(eiC°)
LEMMA 8.3. Ls (@) = Ls(@O) = 0 => _B = 50 i.e. 50 is a-identifiable if
z k is persistently exciting of order 2p.
(See Appendix A8 for a definition of this.)
(8.8)
Proof. Theorem 7.1 could be used but it is quicker to proceed directly.
Let us observe
where
Ls(@) = lim ! n ((TGo(L) T@(L))~k)2 n+oo n E1 - = 0
l+d(L) Zk or z k l+c(L) ~k i+c(L) = i¥d(L) ~k"
Now in view of (8.6) and Lemma A8.3 we conclude
n+oolim ~n Eln ((T~o(L) _ T0(L))Zk )
Introduce Sk(@ ) = (i +a(L))-ib(L)Zk so
2 = 0.
Thus we have
It follows that
Now see that
s k
220
= Sk(e 0) = (I + s0(L))-ibo(L)z k.
lim n-i Eln (Sk(e) _ Sk)2 = O.
lim n-i Eln [(i + a(L))(Sk(0) - Sk)]2 = O. (8.9)
(l+a(L))(Sk(8) - s k) = -(a(L) -ao(L))s k + (b(L) -bo(L))z k.
= (~ - ~o ) ' ~k where we have i n t r o d u c e d
_x k = (Sk_l...Sk_ p Zk_l'''Zk_p)'
Now call
N t (X'X) = E 1 XkX k. n - -
(8.10)
In view of (8.10), (8.9) we have
lira (~- G0 )'(n-l(X'X)n )(~- 00) = 0.
Hence by the corollary to persistent excitation Lemma A8.2 ~ = ~0 if (8.8)
holds.
Remark. These calculations go through in the multivariate case with little
change. The arguments above can be used to establish equivalence between
the discussion of Kohn (1978 b) and Deistler (1976). A related discussion
of identifiability is given by Grewal and Glover (1976).
APPENDIX A8. BASIC LEMMAS IN TIME SERIES IDENTIFIABILITY
Let ~, x k be two sequences related by a filtering operation (not
necessarily stable)
A0(L) ~ = Xk,
Consider the lagged vectors
_x k = (Xk_l...Xk_p)',
A 0(L) = Zg aiLi.
& = (~_l...~_p)'
(AS. i)
221
and the cross product matrices
~ ~ n ~ ~ , n ,
(X'X)n = ZI XkXk, (X'X) n = ~iXk_Xk .
Let ~ be an arbitrary fixed vector.
LEMMA A8.1. If %, x k are related by (AS.I) then
(ii) if ~(X'X) ÷ ~ => g(X'X) °+ n
(iii) if ~(X'X) ÷ 0 :> g(X'X) + 0 n n
where O(M) = smallest eigenvalue of M.
P ~i Li Proof. Introduce ~(L) = l I so we can write ~'x k =
= ~'_~. Then
n ~'(X'X) n~ = Z I (~(L)~K) 2
~(L)Xk; ~(L)%
n (c~(L)A0 (L)%) 2 = Z 1
n 2 p2 _< cE 1 (<~(L)x k) , c = p E a i
This establishes (i).
a(x'x) n ±. a(x'x) n.
= c~'(x'b ~. - n
Now (ii), (iii) follows since clearly
Corollary. Suppose n-l(x'X)n + ~ which is positive definite. Then if -- -xx
A is stable n-l(x'X)n ÷ I~ which is positive definite.
Proof. If ~~ exists it is clearly positive definite. That it exists -xx
follows by an argument similar to that used in Theorem 14.2.
Remark. Lemma A8.1 also clearly holds for vector sequences.
Suppose Sk, z k are two sequences related by
Ao(L)s k = Bo(L)z k (A8.2)
222
q i Ao(L ) = ~ aiLi, go(L) = 2obiL
Ao(L) could be unstable. Introduce the lagged vectors
~k (Sk-l'''Sk-p) ~k (Zk'''Zk-q
~k = (Zk-l"'Zk-p-q-l)'
and the cross product matrices
(Z'Z)n Eln ZkZk"
Let ~ be a fixed arbitrary p +q+ I vector. LEMMA A8.2. Persistent excitation lemma.
~ :> ~(W'W) + ~ (i) ~(z'z) n n
(ii) o(W'W) ÷ 0 => o(Z'Z) + O. n n
Remark. Since e.g., ~(Z'Z) + oo iff ~'(Z'Z) ~ ÷ ~o the condition o(Z'Z) + oo n -- n-- n
is called generalized persistent excitation of order p +q +I meaning via (i)
that the system A8.2 is excited sufficiently to ensure o(W'W) ~ oo. n
_ = ! ~v v Proof. Partition ~ (_Tp :_q+l ) and observe that _~'_s k can be written
_~'s k = ~(L)s k + $(h)z k
P TiLi; B(L) q $i Li. T(L) = E 1 = Z 0
_p+q+l L i Now given _~ consider the p+q+l vector 7__ defined through 7(L) = L I Yi
= T(L)~o(L) + B(L)Ao(L). Then
_ n )2 n y'(Z'Z)n ] = E 1 (Y(L)z k = Z 1 [T(L)Bo(L) + B(L)Ao(L))Zk ]2
2 n = E 1 (T(L)Ao(L)s k + B(L)A0(L)z k)
n (T(L)s k + B(L) Zk)2 _<cZ I
223
c ~'(W'W) ~. -- n-
~¢7'7~,~ ~, ÷ ~ yet o(W'W) + D < ~ . So ~ ~'(W'W) ~ + E < m => ~ T 71 Suppose n n - ~ - n- -
y'(Z'Z) y + F < m which contradicts U(Z'Z) ÷ m. Hence (i) follows. n- n
Similarly, (ii) can be established,
Corollar X. We say z k is persistently exciting of order p+q +I if
n-l(z'Z)n + -zzl which is positive definite. In this case, if Ao(L) is
-i stable then n (W'W) ÷ I which is positive definite.
n -ww
Proof. Straightforward.
3. if Sk ~kukl~_s = E~h L s LEMMA A8. = z and H(L) is stable then s 'i s
,n 2 < c~ 2 H(ei0b g ls k _ Zk, c = essup I 1 2.
Proof. Call Zk = Zk' k ~ n; Zk = 0, k > n. Then set Sk = H(L)~k for all
k. So Sk = Sk' k ~ n. Then
n 2 n-2 co -2 = Ii Sk < E1 Sk E 1 s k
= fl~ IZ~keik~12df; f = ~/2~
~ ik~k - 12df = fl~I~l ° ~lhk-s~s
= f~ i~l eiS~zs E~s ei(k-S)~hk-s 12df
m 2 ~ it~ 2 fZ~Iz leis~-z sl l~l e htl df
co isc0- _< cf_~ l~le ZsI2df
oa_2 = c E 1 z s
= c 2 1 z 2 s"
Remark. A similar result holds in the multivariate case. If H(L) is
rational the essential supremum is a maximum.
LECTURE 9: LAGS IN MULTIVARIATE ARMAMODELS I
9A. Kronecker Indices and MeMillan Degree
From the discussion in Lecture 8 of the order of the scalar ARMA model
(l+a(L))y k = (l+c(L))g k
there are three ways of considering the order p
(a) If 1 + a, 1 + c are coprime
p = deg(l+a).
(b) p = Rank H~ where ~ is an upper bound on p and H
matrix
H
R I R 2 R 3
R 2 R 3 ... =
R 3 ...
m~ "."
• . . m
R2~- i
is the Hankel
(c) The order is the minimum index p ~ y(k+plk- i) is linearly
dependent on y(k+p-ilk-l), 1 ! i ! P.
The interrelations among these three were discussed in Lecture 8.
these three notions will be pursued for the multivariate case. Here we
discuss (b), (c); (a) is discussed in Lecture i0.
Let ~k be an ~ × 1 vector ARMA process
A(L)Ik = C(L)!k
A(L) = ~0 p ~i LI; _C(L) = I pO~i LI; ~k is white noise.
From definition (b) we introduce
(B) The multivariate order or McMillan degree
Now
225
= Rank (H v)
where H is the block Hankel matrix
H - ' 0
v
_a 1
-~2 =
I
R
-~2 "" " R' -,,)
_R 3
~2V-I
where -iRi = E(y0y' i)_ -- = ~-i" (Suppose throughout we have infinite data.)
Note that 6 need not be a multiple of ~: if yk is white noise ~ = 0•
Intuitively there should be more than one order for the vector process.
Rather there should be % orders, one for each of the % variates in ~k"
These are found from definition (C).
(C) The ith Kronecker index or output index or output lag is the minimum
index Pi such that Yi(k+Pilk-l) is linearly dependent on its predecessors
[Yi_l(k+Pilk-l)...Yl(k+Pilk-l) :!'(k+Pi- Ilk-l) : ... :y'(klk-l)]
where
y(k+jlk- i) = !k+jlk_l = E(!k+jl~k_l, ~k-2 .... )"
There is apparently a difficulty here in that the definition depends on
the ordering of the Yik in ~k: it will be seen later that this is not a
problem; in fact, the set {pi } is unique but Pl is not associated with Yl
etc.
Now the relation between definitions (B), (C) is explored by considering
the Hankel matrix more carefully. Observe that the ~i + j row of H is
! _v
r%i+j = E(yj(k+ilk-l)Yk_ I)
, , , )
-Yk = (-Yk-]•''!k-v "
Then definition (C) says to search the rows of ~v for linear dependencies•
The positions of these linear dependencies determines the indices of output
226
lags Pi" (Linear dependence can be investigated by singular value decomposition:
see Strang (1974).)
To clarify all this consider the following example of how the linear
dependencies among the rows of H might occur when ~ = 4. We suppose always
the process ~k is of full rank i.e. the covariance generating function or
spectrum is positive definite: otherwise the dimension can be reduced
appropriately.
Exhibit 9.1. Example of linear dependencies in the rows of the Hankel
matrix for a 4-vector A~MA model.
X = independent row # of linearly
Row # 0 = dependent independent rows
(on previous rows)
1 X 1 2 X 2 3 X 3 4 X 4
5 X 5
6 0 P2 = 1 - 7 X 6 8 X 7
9 0 Pl = 2 - i0 0 - ii X 8
12 0 P4 = 2 -
13 0 - 14 0 - 15 X 9 16 0
17 0
18 0 19 0 p3 = 4
Notice that if row j is dependent on previous rows then rows
j +4, j+ 8, ... (can be and so) are omitted from further consideration. Also
rows 1 2 3 4 are given first consideration.
Note that there are 9 linearly independent rows in the Hankel matrix
227
(clearly all subsequent rows are linearly dependent on these). Thus
6 = Rank (H) = 9.
Also, clearly by construction
~ p_ = 2 + i + 4 + 2 = 9 = Rank (H) = 6, i -I U
This relates the McMi!lan degree to the output indices. If ~ = pg so
Pi = p' i = l...g the process is called block identifiable.
9B. Determination of AR parameters
Now working from the Kronecker index calculations (i.e. linear depen-
dencies) we can determine the AR coefficients of the A~MA scheme by solving
Take cross covariances with ~k-j Hankel equations.
to find
in equation (9.1), j Z P
~ v = (~p ' "~o ) Ev = P"
Each row of A specifies that a linear combination of rows of H - -O
this amounts to ~g equations. So these equations must match up with the
order calculations of Exhibit 9.1. In particular it is clear that
P = maxi Pi"
In the example p = 4. The index p is called the observability index, it is
the maximum lag in the ARMA process. Now we can write out equations (9.2a)
symbolically to correspond to Exhibit 9.1 as follows (now X is a generic
non-zero quantity; 0 is zero).
(9.2a)
is zero:
Exhibit 9.2. AR parameters from Hankel equations.
Kron. lag 4 3 2 1 0 index variable
0 =
F O 0 0 0
I O 0 0 0
[XXXX
~ 0 0 0 0
0000
0000
XOXX
0000
XXXX
0000
00X0
XXXX
XOXX
XXXX
00XO
XOXX
X0001 2 a
XX00 I b H
00X01 -~ 4 c
00XX~ 2 d
(9.2b)
228
To see how the exhibit is constructed note viz. the second row of A entails
a linear dependence among
rows (13, 14, 15, 16, 17, 18) of H .
This is the same as a dependence among
rows (I, 2, 3, 4, 5, 6) of H
which is what is depicted in Exhibit 9.1. This makes it clear that the
pattern of O's, X's in the A. is determined by the linear dependencies of -i
the rows of H . To keep subsequent calculations clear let us agree to
interchange the equations so that the Pi'S are ordered decreasing top to
bottom; we find
Kron. 4 3 2 i 0 index variable
0 = XXXX 0000
0000
0000
XOXX
0000
0000
0000
00X0
XXXX
XXXX
0000
00X0
XOXX
XOXX
XXXX
00x i 4 a 00 xx H 2 b
X00 ~0 2 c
XX 0 i d
(9.2c)
Next observe that permuting the positions occupied by (Ya' Yb, Yc' Yd ) in
Z merely interchanges columns in ~i (and corresponding rows in H ) and does
not affect the lag structure. Let us do this to make AO lower triangular
index variable
0 =
~X X XX[X XX 0
0000 0000
0000 0000
L0000i0000
X000
XXXX
XXXX
0000
XO00
XXX0
XXXO
XXXX
10001H 4 c XIO0 2 d
0010 ~ 2 a
00XI~ 1 b
(9,2d)
The diagonal entries of ~0 have been made unity by scaling each row. We
do not scale by since this destroys the lag structure. Notice how the
apparent association Pi to i is altered. The Pi are rather associated
with the equations. In Lecture I0 it will be seen that the set {pi } is
229
unique. Note that it is clear from (9.2d) that block identifiability
occurs iff Rank (Ap) = 4.
As mentioned earlier each row of (9.2d) entails ~£ equations. However,
Rank (%) = 6 means that there are only 6 independent equations (column rank
= row rank). There are £ rows so for each row we have 6 independent equations
available (a total of 64). Now count the number of unknowns i.e. parameters.
(In our example ~ = 9, £ = 4 so 6£ = 36.) We have
row 1, row 2, row 3, row 4
9 8 7 5 (=> 29 total).
So we can certainly solve for these. Actually the zeroes in the lower half
of AO are non-generic (aside from the 3,2 entry: since number of indices = 2
is 2). So adding these gives (9, 8, 8, 7)(=> 32). The generic case corre-
sponding to Exhibit 9.1 would have
row, 1234 67819101112113141S1617 dependency X X X X X X X 0 X 0 0 0 X 0 0 0 I 0
In general there are 6£ - i parameters, i depends on the Pi" Repeating the
argument of Theorem 8.1 shows that there are 64 identifiable parameters.
What happened to the I parameters will be clarified in 9C.
Because of the definition (C) of the Pi the MA matrices C. must have -l
the same structure as the A. except for two things. -i
(i) ~0 is lower triangular and so has £(%+1)/2 parameters. Alter-
natively, ~0 = ~ and there are 4(£+1)/2 parameters in the innovations
covariance matrix.
(ii) If the bottom a. rows of A. are null then (so are the bottom i -i
a i rows of -IC" but) the top £ - a.1 rows of -IC" have all non-zero entries,
viz. in the example we have
230
A 2 C 2 rx0001 xxx i I XXXX XXXX
XXX X X XX
0000 000
The number of parameters in C "''~i is then -p
1 %(~+i) + ~ Rank (Ci) = ~ + ~ %(~+i)
2 -- "
Thus the total number of identifiable parameters is
~_](~+i) + ~ + ~6 - k = 2~6 - ~ + ~(~+i) 2 2
The determination of the MA parameters is not discussed since they are not
needed for modelling. This point, which is a consequence of the output
statistics Kalman filter is elaborated on elsewhere. In this connection
though it is worth seeing how a state space model may be constructed.
9C. Construction of a State Space Form
The idea is fairly straightforward. Firstly the dimension of the
state vector has to be the system dimension 6. It is easiest to proceed
via the example. The first <5 ( = 9) linearly independent rows were
(i, 2, 3, 4, 5, 7, 8, ii, 15) so take as the state vector
Y-k+l I k
Yl(k+21k)
Y3(k +21k)
_Xk+l] k = Y4(k + 21k)
Y3(k+31 k)
Y3(k + 41 k )
23t
Then we see e.g.~ that E(_Xk+ll k y k) are these very rows. The state space
model will be
~k+llk = ~-Xklk-i + ~k
Zk = Zklk-i + ~k = (1%0...0) _Xklk_ I + ~k
where F, K are to be determined. First observe that F, K are defined by
the relations
T
E(-Xk+lle -Xklk_ I) = F E(_Xklk_ I ~klk-i )
v
E(_Xk+ll k ek ) = K E(e k ek ).
We concentrate on F. By iterated conditional expectation we see
T
E(-xk+llk-1 ~klk~l ) = FE(~k lk_ l ~kle-1 )"
However, the linear dependencies spelled out in calculations such as
Exhibit 9.1 entail relations of the form
-Xk+l I k-i = FO -Xk I k-l"
This immediately yields F 0 = _F. Thus in the example we find
Xk+lLk_ 1 =
-Yl(k+IIk- l)--
Y2(k+llk- I)
Y3(k+llk- I)
Y4(k+llk- i)
Yl(k+21k- i)
YB(k+21k - i)
Y4(k+elk- i)
yz(k+31k-l)
YB(k+41k-l)_
0000 i0000-
XXXXX0000
000001000
000000100
XXXXXXX00
000000010
XXXXXXXX0
0 0 0 0 0 0 0 0 1
XXXXXXXXX
- Yl(klk- i)
Y2(klk- i)
Y3(klk- i)
Y4(klk- I)
Yl(k+llk-l)
j
Y3(k +llk - I)
Y4(k+llk-l)
Y3(k+21k-l)
~Y3(k+3 k-l)
(9.3)
Again the pattern of O's, l's, X's is specified by the linear dependencies.
Also the X's in F are the same ones appearing in the A.. The parameter count --i
232
is especially easy yielding
5, 7, 8, 9 (==> 29 total).
Again there are 3 non-generic O's mai~ing 32 parameters. There are then
~% - % parameters in F, ~% in K, ~(% + 1)/2 in the innovations covariance
matrix; a total of 2~% - % + ~(%+1)/2 as before. Now we can see that the
rows with X's can be filled out without changing the lag structure. This
gives 4 x 9 (=%~) = 36 parameters in F.
Remark i. Identifiability. It has already been proven the ~ - % parameters
in F are identifiable: clearly the %(%+1)/2 parameters in the zero lag
covariance matrix are identifiable. It follows from the form of the output
statistics Kalman filter in 4B that observationally equivalent structures
I have the same value for G = E(Xl y0 ) . It is straightforward to show the
parameters G are identifiable. This is further discussed elsewhere.
Remark 2. If the ordered distinct values assumed by the Pi are ~i' ""'~m
(viz. in the example m = 3, ~i~2~3 = 1 24), then % = ~(Pl .... 'P~)
= ~m-lj=l jn(~m_j,) where n(~ i) = number of indices equal to ~i" This is readily
verified by a simple counting argument in equation (9.3) (take the generic
case).
Notes. The discussion of this lecture is the author's. However, in various
forms most of the ideas here are in (especially) Akaike (1974a, 1974b, 1975,
1976) (who shows how to apply these ideas to analyze data); Wertz et al.
(1981), Ljung and Rissanen (1976), Rissanen and Ljung (1975), Tse and Weinert
(1975), Glover and Williams (1974). Early work on identifiability stressing
block identifiability is Hannan (1969). Related work for econometric models
is Hannan (1971, 1976), Deistler (1975, 1976, ]978), Kohn (1978b, 1979b).
An interesting note relating state space calculations to Hankel matrices is
Koussiouris et al. (1981): see also Rosenbrok (1970). The form obtained
in (9.2d) is related to the so-called polynomial echelon form: see Kailath
(1980). The idea of periodic autoregression is to attack lag structure too
(Pagano (1978), Newton (1982)). The present discussion is its logical completion.
233
APPENDIX A9: MATRIX POLYNOMIALS
A polynomial matrix (PM) is a matrix whose entries are polynomials
in L (or Z = L -I)
p k A(L) = [alj (L) ] = [ E0 aijk L ]l<_i<_~;l_%jf~m"
We can also write A(L) as a polynomial with matrix coefficients
_A(L) = ~A iLI.
A PM is called nonsingular if det(A(L)) # 0.
as two PM's
If a PM A(L) can be factored
A(L) = C(L)A(L)
then C(L) is called a left divisor (Id) of A(L). Two PM's A(L), D(L) have
a common left divisor if there is a PM C(L) and PM's A(L), D(L) with
A(L) = C(L)~(L)
D(L) = C(L)D(L).
A greatest common id (gcld) of two PM's A, D is a common id that is left
divisible by any other common Id.
A unimodular matrix is a square PM whose determinant is constant.
Elementary row and column operations or PM's can be represented by uni-
modular matrices. Thus, e.g., if %(L) is a polynomial the matrix
1 0
U(L) : 0 1
0 0
~(L)
0
i
has determinant i; it adds %(L) times row 3 to row 1 when it multiplied
on the left.
LEMMA A9-1. A PM matrix U(L) is unimodular if and only if its inverse is
also a PM.
Proof. If U(L) is unimodular then clearly
234
is a PM.
we deduce
U-I(L) = adjoint(U(L))/det(U(L))
On the other hand, if U-I(L) is a PM then from U(L)U-I(L) = !
~>
det U(L) det U-I(L) = 1
det U(L) = constant
LEMMA A9-2. If C(L) is a gcld of A(L), D(L) then every gcld of A(L), D(L)
has the form C(L)U(L) where U(L) is unimodular.
Proof. If F is another gcld there exists PM's P, Q
C = FP, F = CQ
- > c = CQP ~ > QP = ~ = > I Q l i P I = 1 .
But Q, P are PM's => IQI = constant, [Pl := constant.
Corollary. If one geld is nonsingular then all are. If one gcld is uni-
modular, all are. Thus elementary row and column operations cannot change
the rank of a PM.
Two PM's are called relatively ]eft prime or left coprime if all their
gcld's are unimodular.
If H(L) is a matrix of rational functions and A, B are PM's with A
nonsingular and with H = A-IB then A-IB is called a left matrix fraction
description (MFD) of H. If A, B are left coprime A-IB is called an irre-
ducible MFD. If 8, ~ are square H = A-IB is irreducible the poles of
are the zeros of det A, the zeros of H are the zeros of det 2"
LEMMA A9-3. Construction of a gcld. Given the ~ × ~ PM 8, the ~ x m PM
B find elementary row operations so that at least m of the bottom rows of
the rhs of the following expression are zero
m ~
m U21 B' 0 m
235
Then R' is a gcld of A, B. (U.. are FM's.) -i 3
Remark. Kailath (1980, p. 401) mentions an alternative via Sylvester's
Resultant.
Proof. Call
so R is a cld of A, B.
i Ill I u-1 -u11 ~12 = Vll -v12 ~ v
u21 _u22 _v21 _v22
:> A = ~Ii ' B = ~Y21
But also
R' = [l~' + [129'
so if ~i is another cld with say
A = ~l~l
T T
=> R = RI(§IUI I +~IU12)
so that ~i is aid of R; so R is a gcld.
Now we observe that from equation (A2.1) we can obtain an irreducible
right MFD from the left MFD A-IB.
LEMMA A9-4. If A is nonsingular
(a) [22 is nonsingular
<h> A IB =
(C) ~21' [22 are left coprime
(d) If 9, 9 are eoprime then deg ]~I = deg ]U22 ] .
Proof. Recall A' = VIIR' so A nonsingular => YlI' g' are nonsingular.
Next, by a well known formula for partitioned matrices
(A2. I)
236
t~22 t / tY l lL = [~t ~ o => L~22I ~ o
so (a) holds. Next (b) holds since
U2tA' + U22B' = O.
For (d) if A, B are coprime then R is unimodular and
deg l~I = deg lYll I.
But U is unimodular so (A2.2)
deg I~l = deg IVII I.
Finally, (c) follows since U is unimodular thus any subset of its rows
is irreducible. For further details see Kailath (1980, Section 6.3).
(A2.2)
LECTURE i0: LAGS IN MULTIVARIATE ARMA MODELS II
10A. Matrix Fraction Descriptions
Consider the multivariate ARMA model
A(L)y k = ~(L) ~k (i0.i)
where A(L), C(L) are matrices of polynomials (PM's) such as A(L) = [~Aiel;
-- ! = -- ~! ) ~k is a white noise sequence with E(g k ~k ) = I. The sequence ~t E(Yk Zk-t
is called the impulse response sequence for the A~MA model and the relation
Z0~H'Li-I = H(L)_ = A-I(L)C(L)_ _ = A-Ic
is called a left matrix fraction description (MFD) of H in terms of A, C.
A brief discussion of }~D's was given in Appendix A9 (see also Rosenbrock
(1970), Wolovich (1974), Kailath (1980)). The causal nature of H(L) is
expressed by the fact that IIH(0)!I < ~.
Recalling Definition A (Lecture 9) of the order of a scalar ARMA model
as the degree of A(L) (deg A(L) or deg A) in a coprime description A-Ic,
we might expect an equality between McMillan degree ~ and deg det A. First
howe~er observe that if A, C have a cld G(L) so
A = GA, C = GC
where A, C are PM's then deg det A = deg det G + deg det A ~ deg det A.
Extraction of a gcld of A, C will reduce the deg det A to a minimum value.
If all the gcld's of A, C are unimodular then the MFD H = A-Ic is irre-
ducible and any other irreducible MFD of H has the same value for deg det A.
Hence the definition
(A') The order of the multivariate ARMA model (i0.i) is 6' = deg det A
where H = A-Ic is any irreducible MFD of H. We have still to argue that
6' = ~.
238
Introduce Pi degree of the ith row of A(L) ( =deg rowi(A)) = maximum
of the degrees of the polynomials in the ith row of A(L). Then deg det A
5_ E 1 Pi" This is most easily seen with the help of an example. Take p = 4,
Exhibit i0.i.
A(L) =
(x is a generic nonzero value).
F x+ xL+ xL 2 x+ xL I
x+xL x
x + x L + x L 2 + x L 3 x + x L 2
x + xL + xL 2
x + x L + x L 2 + x L 3
L 3 L 2 L 1 L 0
0 0 0 X 0 X X XX XX X7
0 00 iO 00I I X 0 X i X X Oj
Lxo x I x x x I x o x i x x x
!
Pi
2
i
3
i
P i
2
1
3
We can write
I
P. _A(L) = _Ahc diag (L i) + _R(L)
where Ahc is the matrix of highest degree coefficients
XOX
XOX
XOX
L 20 0
0 L 0
0 0 L 3
+ R(L)
(i0.3)
and R(L) is a PM whose row degrees are less than those of A(L).
E ~ det A(L) = (det %c ) L Ipi + lower degree terms.
Thus deg det A(L) = E lpi iff
-~c is of full rank.
If (10.2) holds we say A(L) is row reduced. Thus in the example A(L) is
% , not row reduced so deg det A(L) < E lpi = 6. Further it is clear that if
A-Ic is irreducible then iff A is row reduced we have 6' ~ ' = E1 Pi" Any PM
Then clearly,
(i0.2)
239
can be row reduced by elementary row operations (i.e. left multiplications
by unimodular matrices). An example makes this clear
A(L) =
4+I+L2 2L+L 2 ]
3+2L+2L 3 5+2L 3
SO
11] = 0. det (~c) = det 22
It helps to list the coefficients A. --I
L 3 L 2 L 1 L 0
2 00 20 3
Clearly the operation row 2 * row 2 - 2L row I i.e. left multiplication by
Ii -2L) L 3 U(L) = 0 1 i removes the terms. The ~ew ~c matrix is
[i l det HA~c = det ~ 0. - -4
v
So the new matrix is row reduced. Before we can link 6' to 6 and Pi to Pi
we need a lemma.
LEMMA i0.i. Invariance of row degrees. All row reduced PM's related by
unimodular matrices have the same row degrees i.e. if A, A are row reduced
with row degrees arranged in say increasing order and
A = UA, U unimodu]ar
then A, A have the same row degrees.
Corollary. All irreducible left MFD's H = A-Ic with A row reduced have
the same row degrees and the same order.
Proof of Lemma i0.I, See Appendix AIO.
Proof of Corollar X. If A-Ic, ~-i~ are two irreducible left MFD's then for
! -!
some unimodular U A = UA; but A, A are supposed row reduced. Thus Pi = Pi'
240
i = i, ...,~ and 5' = Z 1 Pi
Remark. Thus the set {pi} and the order are unique. To establish the
connection to Pi' 6 simply observe that A(L) in (9.2d) is row reduced.
Thus 6' = Z ~ 1 Pi = 6.
There is one other result that is of interest.
LEMMA 10.2.
proper) iff
If A(L) is row reduced then H(L) = A-I(L)C(L) is causal (or
deg fOWl(C) ! deg rowi(A). (10.5)
Proof.
(i)
(ii)
(cf. Kailath (1980, p. 385).)
If H is causal then from AH = C we find
row.(C) = (row.(A))'H. i I
But every element of H(L) is causal (i.e. no terms L -I, L -2, ...)
so result follows.
If (10.5) holds (with <) consider that the i, j element of H(L)
is
h..(L) = det AiJ(L)/det A(L) l]
where A lj is the matrix obtained by replacing the j th column of
A(L) by the ith column of C(L). We can write as in (10.3)
AiJ(L) = (AiJ)he diag (L pi) + RiJ(L)
(AiJ)h c agrees with ~e except for the jth column which is where
zero since each entry in the jth column of C(L) has degree lower
than the corresponding entry in the j th column of A(L). So
but
deg det A = Z I Pi
deg det A lj < Z lpi.
241
Thus hij(L) is strictly proper: so then is H(L).
in (i0,5) the argument is similar.
When = holds
Remark. This result relates to the comment in Lecture 9 that the pattern
of O's, X's in the _ Ci's matches that in the _ Ai's.
APPENDIX AI0. THE PREDICTABLE DEGREE PROPERTY
The discussion here of row reduced polynomial matrices follows Kailath
(1980, p. 387-388).
LEMMA AI0.1, Let A(L) be a PM of full row rank. Let p(L) be a polynomial
L-vector; let q'(L) = p'(L)A(L). Then A(L) is row reduced iff
deg q(L) = max {deg Pi(L)+Pi } i=Pi(L)#0
where Pi(L) is the ith entry of p(L); Pi = deg rowi(A).
Proof. Call
d = max {deg Pi(L)+Pi }. i=Pi(e)~0
Obviously deg q(s) ! d. We have to show equality. Now {Pi(L)} must have
the form
d-P i Pi(L) = ~i L + lower degree terms,
also not all the ~, can be zero. Now note i
q-0 =-Ahc-°' -~= (~l ..... ~L )
where q(L) = gO Ld + ... + ~d" So that if ~he has full rank then qo # O.
Conversely, (AI0.1) must hold for arbitrary ~ with q0 = 2 => ~hc has full
rank.
Proof of Lemma i0.I. Invariance of row degrees. Proof by contradiction.
Suppose the row degrees of A, A are respectively pl!P2~.,.~p~;
(AI0.1)
242
Pl!P2 !'''!p% and that for some j
Pi = Pi' i < j but pj
Now write out A = UA as
> p j-
A =
j -I %-j+l
J IUll u12] ~-j U21 U22
The predictable degree property of Lema~a AIO.I => UI2 = 0. Hence,
Rank lull UI2] cannot exceed j- i: thus Rank U cannot exceed
j -i + ~-j = %-I i.e. U is singular hence cannot be unimodular.
LECTURE ii: MARTINGALE LIMIT THEOREMS
IIA. Martingales
The contemporary approach to deriving consistency and central limit
results for time series models is based on the use of martingale (MG)
convergence theorems. From a time series point of view this is rather
natural since the basic quantities in the martingale calculations are non-
linear innovation sequences (called martingale differences) that are ortho-
gonal to all functions of the past as opposed to linear innovations which
are only uncorrelated with linear functions of the past. The past or history
is naturally represented by increasing sequences of o-fields.
The other aspect of martingale calculations is that they often use a.s.
(almost sure) or a.e. results. While for most statistical inference it is
sufficient to prove weak consistency (convergence in probability) there are
some time series applications where strong consistency (a.s. or a.e. con-
vergence) is appropriate e.g., analysis of algorithms operating in real
time. Further it is often easier or just as easy to prove a.s. convergence
than convergence in probability.
An informal discussion of martingales is now given. A formal presenta-
tion is included in Appendix All.
Let Y be a sequence of random variables. Denote the history of the n
sequence namely the values YI' "'''Yn-I by Sn-i ~y (these are the increasing
G-fields generated by {Yn}~).± Then Y is a martingale (MG) if n
E(YnI~-I) ° Yn-i"
Now we introduce the nonlinear innovations
gn = Yn - E(YnI~Y-I)~ n
that is for any random variable Zn_ I that is a function of the history of
244
Yn i.e. a function of YI' "'''Yn-i (i.e. ~Yn_l measurable) we have
E(enZn_ I) = O.
Now we can write by the MG property
y =Y +g n n-i n
: vn
~i gs"
Thus a MG is an accumulated nonlinear innovation.
The importance of MG's stems from the Martingale Convergence Theorem
MGCT.
THEOREM Ii.i. If Y n
EIY I < ~ and Y with
is a MG and suPn EiYnl < oo then J a random variable
Y ÷y a.s. n
Proof. The proof is rather involved: see Doob (1953) or Billingsley
(1979). However, a weaker result is easy to prove.
THEOREM Ii. 2. If Y n
Y with E(Y 2) < co and
is a MG and SUPn E(Y~) < ~ then J a random variable
Y -+Y a.s. n
Proof. See Appendix All.
Remark. The following version of the MGCT is often useful in time series.
THEOREM 11.4 (MGCT*). If Jn are increasing o-fields on a probability
space ~, 3, P and if Qn-l' ~n' ~n are 7n_ 1 measurable non-negative random
variables and
Then if E 1 B n
further E la n < ~a.s.
0 _< E(Qn[~n-I ) <-- Qn-i - °'n + ~n'
< ~ a.s. ] Q < ~ a.s. such that E(Q) < ~ and Q ÷ Q a.s. n
Proof. See Neveu (1975). Let S = E 1 ~n if we suppose additionally that
245
E(S) < ~ then a very simple proof is available. Introduce S
note that
E(Sn+l lJ p = E(E(S.I~n+ 1) LJ n) = E(SI~ ' ) = S n
= E(SLTn),
so by MGCT Sn ÷ S. Also, Sn E~ @k £ 0. Then consider that
n n _ -Z1 -I @k + Zl-l~k- E(qn+Sn-El #k + El~kl]n-l) < qn-i + Sn-i
Hence by the MGCT (still true with an inequality)
Qn + s - ,n n n E1 Bk + El~k ÷ L a.s., E(L) < co.
n n But Sn - E1 ~k ÷ 0 a.s. and EI~ k is increasing and bounded by L so
E~ ~k + ~ ! L. Thus Qn ÷ L - ~ £ 0.
Remark. " In applications Qn is often a quadratic form, Then this theorem
is often referred to a stochastic Lyapunov function theorem.
liB. Martingale Central Limit Theorem
Suppose we investigate the asymptotic behavior of a least squares
regression estimator in
Yn = ~ + ~n
namely
n = Zl s s ;
so
i (Z'Z)n = ElnZs_Z s
To f i n d a c e n t r a l l i m i t t h e o r e m we l o o k a t an a r b i t r a r y l i n e a r c o m b i n a t i o n
s a y
~'(~ - ~ 0 ) = ~ ( z ' z ) ~ ~ n - - n - -' 21}s~s"
n T h i s h a s t h e f o r m Z i Z n s w h e r e E ( Z n s t g s _ l . . . c 1 ) = O. We a r e n a t u r a l l y l e d
then to formulate a central limit theorem for an array.
246
k An array {Xnk , ~nk}l n is called a martingale difference array (mda)
k {X n } n is aMG or nonlinear innovations array if for each fixed n k' ~nk 1
difference i.e. (fix n, let k vary) E(Xnk I ~n,k_l) = O; ~nk =~n,k-l"
Introduce the conditional variance
k
Vk2 = ZlnE(X2nj]~n,j_ I) n
There are a number of different MG central limit theorems available (the
connections are completely worked out in Hall and Heyde (1981)). Since
much time series theory is done under finite variance assumptions the
following version proves quite useful.
k THEOREM 11.5 (MGCLT). If A, B, C hold then ~i n Xnk '~'> N(0, i)
2_2+ 1 (A) V k n
(B) E(Vk2 ) ÷ 1 n
k
(c) Z l n E ( X ~ i I ( [ X n i ! > c) l J n , i _ l ) -~+ 0 Vg
Proof. See Hall and Heyde (1981) or Scott (1973). For a very short proof
of a similar theorem see McLeish (1974).
Remark. Often X . has the form X = Z .g. where Z . are measurable nl ni nl I nl
~n,i-i = %-1 = ~(~l'''gi-i )" For simplicity suppose c.l are independent
and uni formly i n t e g r a b l e ( see Appendix B l l ) . Ca l l g i (x ) = E[s~ I ( g ~ > t / x ) ]
so g i (x ) a re un i fo rmly bounded, a l s o n o n - d e c r e a s i n g in x, and sup i g i (x ) + 0
as x + ~. Finally, introduce
Z 2 Z 2 = max = max E(X l~n,i_l ).
n l<i<n ni l<i<n i
Then we can deal with (C) as follows
k k 2 ( 2
ZlnE(X2i~(Ixil > ~)13n,i 1) = Zln Znig i Zni)
247
_< k Z2.g .(z2) _< suPi gi(Z2n )(~KnLI Lni)~2. Eln nl I n
= suPi gi(Z2n)V 2 • n
Thus by u.i. and (A), we deduce (C) if Z 2 P-~ 0. n
C o r o l l a r y . I f X . = Z .~ . w h e r e ~. a r e i n d e p e n d e n t , u n i f o r m l y i n t e g r a b l e Hi Hi 1 1
and ~n,i = ~i = °(~l'''~i ) while Zni are ~n,i-i measurable then
k n X => N(0 i)
Z1 n i '
if A, B, C' hold
(c') max E(X~ I ~ P 0. l~_i<_n i ~n,i-i )
APPENDIX All. BASICS OF MARTINGALES
Let {Yn}l be a sequence of random variables on a probability space
(~,3, P). Let {3 } be a sequence of sub o-fields of 7. Often J will be n n
t h e i n c r e a s i n g o - f i e l d s g e n e r a t e d by {Y }. However , t h e y c o u l d be l a r g e r n
e.g., generated by {Yn' Z } where Z is another sequence of interest. If n n
~n = ~y = O ( Y I ' ' ' Y n ) a r e g e n e r a t e d by Y t h e n i n t u i t i v e l y we can t h i n k o f n
~Y as containing the "history" of Y . Another way of thinking of this is n n
t h a t any m e a s u r a b l e f u n c t i o n ( l i n e a r o r n o n l i n e a r ) o f Y I ' " ' ' ' Y n i s
m e a s u r a b l e w i t h r e s p e c t to J Y . A r e a d a b l e i n t r o d u c t i o n t o MG's i s g i v e n n
by Heyde ( 1 9 7 2 ) .
The pair (Y , ~ ) is called a MG if n n
(1) ~n+1 = ~n"
(2) Yn is Jn measurable; then we say Yn is adapted to In"
(3) EIYnl < ~: this ensures conditional expectations exist.
(4) E(Yn+llSn) = Yn"
Often there is no confusion as to which history or o-fields are being used.
248
Then we say Y is a MG. If {Y ] ~ is a MG and sY is the increasing o-
fields generated by Y then {Y ,~Y} is a MG. The proof is a simple exercise n n n
in iterated conditional expectation. Note (4) is equivalent to
E(YmI-%) = Yn' m An.
LEMMA AII.I. {Yn ']y)n is a MG iff E(Yn+I[YI...Y n) = Yn a.s.
This equality is often taken as the definition of a MG. Before giving
the proof we recall the projection definition of conditional expectation.
If G is a sub o-field in ~ and Y is an ~-measurable random variable then
E(YIG) is defined by the orthogonality condition
fA (Y -E(YIG))dP = 0 for all A e G
or equivalently
f (Y-E(YIG))ZdP = 0 for all C-measurable functions Z.
Proof of Lemma AII.I. Let A e O(YI'''Yn ) - = ~Yn then A must be the inverse
i m a g e u n d e r Y = ( Y 1 . . . Y ) ' o f a B o r e l s e t H e ]]{ n i . e . A = Y - I ( H ) - n
= {m : Y ( ~ ) e H}. Then by a c h a n g e o f v a r i a b l e i f py i s t h e j o i n t d i s t r i b u t i o n
of Y we have
[A Yn+l dP = fYgHYn+l dP = fH E(Yn+I IY = y)dbtf (11.i)
fAYn dP= fHE(Ynl Y =Y)d~Y = fHYndPy" (11.2)
THEOREM AII.I. If {Yn' In } is a MG and sup n E(Y~) < ~ then ~ a random
variable Y with E(Y 2) < ~ and Y + Y a.s. To prove this we need to use n
THEOREM AII.2.
Y > 0 then n
Proof. Let A I
Then
Kolmogorov's Maximal Inequality. If Yn' % is a MG,
] E(Yn) P max Y > o'. <
l~i~n i -- -- ~, "
= {~ :YI(~) >__ C~}, A k = {max Y. < ~ < Yk i<k i _ }, k = 2, ...,n.
249
n A = h A k = h
l<_i<_n
Also ~ 6 ~k = o(YI'''Yk )" Then consider that (since ~ are disjoint)
E(IEn) = E(Y n EiIak) (I A is the indicator a function)
n = E 1 E (YnI%)
= Eln E(E(Y n I e kl 3k))
= Eln E(I~Yk ) by the MG property.
But on %, Yk > ~ so
nE(l ) =~E(IA) =o.P(A) E(IAY n) >_ ~ E 1 A k
i.e.
max
l<iin 1 n
Proof of Theorem AII.2. Recall the following necessary and sufficient
condition for a.s. convergence (see e.g., Chung (1974)). A sequence Y n
converges a.s. iff
lim lim P{IYn+k-Ynl > g for some i <k <m} = 0. n+oo n+oa
This holds by upper bounding if
lim lim P ( max IYn+k-Ynl2 > 2] = n -~°° n -~° 1%k<m
By the maximal inequality this is bounded by
0.
-2 )2 lim lim E(Yn+ m-Yn (11.3) n-+oo m->~o
However, by the MG property
2 E (Y2n+m) E(y2) E (Yn+m - Yn ) = - "
But E(Y2n ) converges to a limit. To see this observe SUPn E(Yn 2) is bounded
by assumption and E(Y 2) is non-decreasing. This is because if we call
250
an = Yn - E(Ynl~n-i ) = Yn - Yn-i then Yn = an + Yn-i .... > E(y2) =n E(Y2n-I ) + E(~2)
>_ E(Y2n_I ). Thus the limit in (11.3) is zero.
APPENDIX BII. SOME USEFUL PROBABILITY RESULTS
i. Dominated Convergence Theorems
Let Y be a sequence of random variables (RV's) on a probability space n
Q, ], P. Let f be a sequence of Borel measurable functions on a measure n
space ~, ~, .
LEMMA BII.I.
(i) If f n
Dominated Convergence Theorem.
÷ f in measure or a.e. and 3 g ~ [fnl ~ g, fgd~ < ~ then
f fndD ÷ /fd~ or f if n - fld~ + 0 as n ......
(ii) On a probability space this reads
If Yn ~ Y and ] a RV Z ) I Y n I ! Z
and E(Z) < ~ then EIYn-Y I -+ O.
Proof. See Billingsley (1980).
Now in the probability setting we can improve on the dominated con-
vergence theorem. In fact we can give an iff condition for
Yn ~ Y to imply ElY n - Y I ÷ O.
The condition is uniform integrability.
A sequence of random v a r i a b l e s Y i s c a l l e d un i formly i n t e g r a b l e ( u . i . ) n
if
(a) sup n EIYnl <
(b) given g 3 (~ ) P(E) < 6 => a
fE IYn IdP < a Vn i.e. lim SUPn fA [Yn idP = 0 or lim P (A) ÷0 P (A) ÷0
= O.
sup n E(IAiYnl)
251
Remark. If Yn is u.i. and EIYI,: < ~ then ~IY n-Yl is u.i.
LEMMA BII.2. If Y ~ Y then E - n [Yn Y1 + 0 iff Y is u.i. n
Proof. The u.i. condition drops out naturally from the classical proof of
dominated convergence as follows.
EIYn-Y I = f IYn-YI dP
= flYn-YI>a: [Yn- YIdP + fIYn-YI<¢ IYn-YIdP
! sup m flyn_yl>g IYm -YI dP + g
+ g by u.i. "" P(IYn-YI > g) ÷ O.
But g is arbitrary so EIY n-YI ÷ 0. The converse is a little harder.
very simple proof i s given in B i l l i n g s l e y (1979).
Corollary.__ If Yn ~ Y and SUPn E(Y~) < ~ then EIY n-Yl ÷ 0.
A
Proof. Clearly SUPn EIYnl ! sup n (E(Y~)) ½ < ~ also SUPn E(IAIYnl)
! /P(A) /SUPn E(Y~) so Y is u.i. n
Remark. Lemma AII.2 also entails EIYnl ÷ EIY I since ElY - EIY-Yn
EIYnl ! EIY I + EIY-Ynl.
2. Kronecker's Lemma and the Toeplitz Lemma
Suppose Z is a sequence of numbers Z ~ Z. If w is an array of n n n:
n weights with Wni A 0; E lwni = i; Wn:. + 0 each i (i.e. no weight dominates)
n + Z. That this is then intuitively we expect the weighted average Z 1 WniZ i
true is the Toeplitz Lemma. A proof is straightforward.
LEMMA BII.3. Kronecker's Lemma. Let b n
a non-decreasing sequence V + oo. Then n
be a sequence of numbers and V ii
Vnl n nbk/V k ÷ S < oo Z 1 b k ÷ 0 if I 1
Proof. Call S = n bk/V k n E1 so Sn + S; S O = 0.
252
V-I .n V-I n n ~i bk n - • = Z 1 Vk(S k - S k i )
V-I n = n [EiVkSk - Vk-iSk-i + (Vk-l- Vk)Sk-l]
n = S - V~ I E 1 (V k - n Vk-l)Sk-i
+ S - S = 0 by the Toeplitz lemma.
Remark.
3.
(i)
(ii)
(iii)
The Kronecker lemma is much used in a.s. convergence proofs.
Some Useful Lemmas
If X__> 0, E(Ix> ~ X) = tiP(X> ~) + f~P(X > t)dt.
Proof: straightforward (see BillJngsley (1979)).
oo co EIP([XI >_ k) < E]X 1 ! i + >21P(]XI > k). This is called the
moments lemma.
Proof :
~o _ ~ ~ . . . . /i+l lIP(IX I > k) = llfkdP = El>~i=k ~ i dP
co co = Z li ~i+idp < ~i+l ~i ~I -i IxldP -< EIXI
_ oo ~i+]dP 1 + °°P(IX I > k). < El (i+l) ~i = E1 --
oe co n If ~ _> 0 and E 1E(~) < ~ then L I~ < oo a.s.i.e. >]i ~ converges
a.s.
Proof: see Billingsley (1979).
APPENDIX CII. STRONG LAW OF LARGE NUMBERS
The following result, which generalizes Khinchin's strong law of large
numbers (SLLN) is known but there does not seem to be a published proof.
THEOREM CII.I. Generalized Khinchin SLLN (KSLLN*). If X is a sequence n
of independent random variables with E(X ) = m and ] a RV Z with n
e(txl >~) <_cF(Izl >_~), Elzl <
then
253
Remark.
-i n CII i n EIXi + m a.s,
Condition CII.I will be called stron~uniform integrability (SUI)
since it is about the minJ~al condition that ensures uniform integrability.
It can be interpreted as saying that while the X are not identically n
distributed the different mechanisms that produce them yield outliers at
about the same rate.
Proof,
(i)
This is very similar to the usual proof. Let Yk = l(IXkl ! k).
ZIP(~Y k) = ~IP(I~I > k) ! C ZIP(IZ I ~ k) ! CEIZ I < ~ so
P(~#Yk i.o.) = 0 by the Borel Cantelli lemma. So it is enough
to show
-i n y + n ~i k m.
(ii)
(iii)
Now n -I n ZIE(Y k) ÷ m as follows.
m(Yk) = m k = f[Xkl<k~dP = m- fiXkl>_k~dP
But
IflXkl>kXkdP[ = kP(I~ [ > k) + /kP(~>t)dt
i C~[e[>_k IZIHP ÷ 0 as k -~ co
since EIZ I < 0% Thus E(Y k) ÷ m so n -I n Z I E (Yk) + m.
So we need only show n -I n Z I (Yk-E(Yk: -* 0 a.s. By Kronecker's
oo lemma this occurs if Z I (Yk-E(Yk))/k < co. Since Yk are independent
oo E(Yk)) 2/k2 this occurs (by MGCT) if Z I (Yk- < oo which occurs if
EiE(Y k- E(Yk))2/k2 < o~ It will do to show Z E(Y )/k 2 < o~.
k 2 k 2 10 t>dt < t0 >t>dt
k 2 <C tO P(Z2>t)dt = C fk p(iz I >_ u)Eudu.
Thus
oo 2 o~ -2 k ZIE( Y )/k2 < 2cZi k f0uP(IZl > u)du
2 5 4
~ k-2 _k i 2 ~ 1 ~1 7i-1 ~p(!zl z u)du
~k-2 k i 2o~I z 1 i / i _ l P ( I z l z u)d~
~ ~ - 2 . i 2cZi= I ~k=i k i/i-I P(IZI ~ u)du
i 4cZi: I /~_iP(iZl ~ u)du
= 4cEiZi < ~.
Remark. It is possible to give a theorem with ~ a martingale difference
sequence but the added generality is not useful since the conditions could
hardly be checked in practice.
LECTURE 12: ASYMPTOTICS OF REGRESSION MODELS
12A. Introduction
The aim in the next few lectures is to consider the asymptotic theory
(consistency and central limit theorem) of least squares estimators (Ise)
for the two basic linear (in parameters) time series models. The AR
where
and the ARX
Yn = ~-O-Yn-I + ~n' n h i
-Yn-i = (Yn-i .... ' Yn-p )'
v T
Yn = -~0Yn-i + -$0Zn-i + gn
where Z is an m-vector of exogenous variables: s is a white noise sequence -n n
(specific properties are specified below).
These two models allow us to see how general an analysis can be given.
These models can be included in the regression model
Yn = 20Xn + ~ ' (12.1) n
If we allow x to be ~g,z -n ~n-I measurable (i.e. we suppose ~n is a function of
[n-i ..... [0' ~n-i ..... ~0 ) i.e. yn , ~n are fed back into -nX : ~0 is an
r-vector. The exogenous variables are taken to be independent of the a . n
As a sort of bench mark it is first worth reviewing the asymptotic
theory of Ise for the regression model with x independent of s • -n n
12B. Regression Without Feedback
The Ise is
T = (x,x)[1 n (x'x) = n~s~s. -n E1 ~sYs ' n E1
Thus subtracting off the true value ~0 gives
256
~ 20 (x,x)~l n = - = ~1 Xs~s" - ~ - n -
T Then the variance of 8 is
var(2n) = (X'X)-in 02.
We have then
(12.2)
THEOREM 12.1. In the regression model (12.1) if x are independent of -n
gn: Sn are uncorrelated zero mean with E(~) = 02 then -n ~ ~ ~0 if
O(X'X) ÷ oo n
(12.3)
where o(X'X) n = smallest eigenvalue of (X'X) . n
D Proof. 0 _5+ 0 if for an arbitrary fixed vector ~, ~'@ ÷ 0 in mean
n . . . . n
square and this follows via C h e b s h e v ' s i n e q u a l i t y f rom ( 1 2 . 2 ) , ( 1 2 . 3 ) .
Since a'(X'X) -I ~ ÷ 0 for an arbitrary ~ iff o(X'X) ÷ ~. - n - - n
Remark. The condition (12,3) can be called a generalized persistently
exciting condition.
The question of a.s. convergence of -n
has only recently been obtained.
is much harder. The basic result
THEOREM 12.2. In (12.1) if -nX are independent of gn; sn are a nonlinear
innovations sequence or martingale difference sequence i.e. E(enI6n_l...el)
= 0 and E(~I~n_I...£1) 2 = a.s, then -n ~ ÷ 20 a.s. if (12.3) holds.
Proof. See Lai et al. (1979). In the Gaussian case there is a very easy
proof as follows. Consider that
-n n-i + (X,X)nl = (X'X)nl E 1 _XsY s XnY n
(X'X)n I A = (X'X)n_I@n_ 1 + (X'X)-I x y n -n-n
= - n-1 + (X X)nl n (Yn 1 )
257
=> ~n = ~n-i + (X'X)nlxne n- (12.4)
= -- ~ en Yn ~n~n-i"
It is easy to show the residuals e are uncorrelated and have variance n
°2(I+x'(X'X)71~n ) ' - n n Also it is easy to see the en are linear combinations
of the Sn hence if ~n are Gaussian so are en," but en uncorrelated => en
independent . Now l e t ~ be an a r b i t r a r y f ixed vec to r and l e t ~n = d ( e l " ' e n ) "
Then if T = ~'6 we have from (12.4) n - -n
E(T21~n-I ) = T2n-i + 0"2(1 +-xn(x'X) n l x n ) X'-n (X'X>n2-Xn"
Hence by MGCT* T converges a.s. if n
E l~ (l+_nX' (X'X)n I Xn)(Xn (X'X)n 2xn ) < ~°. (12.5)
Once (12.5) is established we employ (12.3) and Theorem 12.1 to see that
T -~P~ 0 hence T + 0 a.s. Now (12.5) is proved in Appendix AI2. n n
= Remark i. In the scalar case (r = i) Yn Xn@O + g ) n
12.2 is well known. We simply observe that
the result of Theorem
n (90 On = (~i x2)-IEns ± = _ x ~ + 0 a.s.
$ s
oo n 2 if E lxsgs/V s < oo a.s. (Vn= ElXs) by Kronecker's lemma. Now by MGCT
T = n Xsgs/Vs converges a.s. if SUPn E(T2n ) < EiX2s/V2s < o% But n El
1 ~ 2s 2 2s/VsVs_l = ZTVs I _ V-I = Vl I ÷ o% The difficulty in x /V _< Z x i s if V n
the vector case is that the Kronecker lemma holds only under restrictive
conditions,
Remark 2. Actually the result of Lai et al. (1999) allows g n
correlated noise. If g has the property n
to be a
2 ~ (12.6) ZI a n < ~ ==> E lane n converges a.s.
Then Theorem (12.2) holds. Condition (12.6) is satisfied by certain types
of strictly stationary linear processes (see Solo (1981)).
258
Remark 3. If g do not satisfy (12.6) but say are only uncorrelated or n
stationary with bounded spectrum then the natural result is
÷ ~0 a.s. if ~ -n p+2 log s/(sq(X'X)s ) < ~.
This result is proved in Solo (1981).
We turn now to the CLT.
THEOREM 12.3. In (12.1) if x are independent of E • Suppose g inde- -n n n
pendent and 2 s t r o n g l y u n i f o r m l y i n t e g r a b l e . L e t (X'X) ½ be a s y m m e t r i c n n
positive definite square root of (X'X) . Then n
if
) l)
max , -Ix. x. (X'X) ÷ O. --i n --l
(12.7)
Remark. If g. are iid the result is true. 1
Corol!ary. Condition (12.7) is implied by
o(X'X) + oo n
T -i x (X'X) x + O. -n n -n
(12.3)
(12.8)
Proof.
that
Let ~ be an arbitrary fixed vector (but ~'~ = i). Then consider
~ ~ n ~' (x'x)~(en-e0) ~' x'x 2 n =
= - ( )n Z1 5sea ~iZns~s
where Zns = ~' (X'X) 2Xn-s" Set ]n,s = ~s = d(Sl'''s- s )" Then Zns are trivially
]s-i measurable. We need then only check the conditions of the corollary
to Theorem 11.5. The conditional variance is
while
V 2 = En Z 2 = i n i ns
Z 2 , -I max < max x. (X'X) x. * O. l<i<n ni -- l<i<n -l n -l
The proof of the corollary is straightforward.
259
12C. Regression With Feedback
Now we return to the case where the regressors -nX may be -n~-i measur-
able i.e. depend on the past. A result like Theorem 12.2 is not possible.
We begin with a basic result.
~IEOREM 12.4. In the regression (12.]) with x -n
÷ ~0 a.s. if (12.3) holds and -n
being <-i measurable then
x' ~I ~r+l-n (X'X) Xn/O(X'X) n < ~. (12.9)
Proof. Begin by observing
~o ~ (x'x)[ 1 ~ - = = E I XsC s --n -n -
Also then
If we introduce
n-I (X,X)~I Xngn = ~n ~ (x'x)~ I El ~s~ +
(X'X) ~n = ,n-i n E1 ~sgs + ~ngn"
n = -n [' (X'X)n-Qn
Xs s )
=> E(QnI3 _i ) = (Zl-lXsSs)'(X'X) + (X'X) x --n -n
Qn-i + O2-nx' (X'X)n I }n"
Now divide throughout by ~ = o(X'X) ; call Qn = Qn/°(x'X)n n n
dcn = ~ - => =O(X'X) ) n °n-i (On n
and
E(QnI~_ I) ! Qn-I - Qn-ld~n/On + 2_nX, (X'X)n l_xn/o(X'X) n.
T h u s b y ( 1 2 . 9 ) a n d MGCT e we d e d u c e
(i2.1o)
(12.ii)
also
Qn converges a.s. to Q < ~ a.s,
co
E1 Qn_idOn/On < o%
260
Thus to avoid contradicting (12.3) we must have Q = 0 a.s.
Finally, the result follows since Qn - > Ii~n-~0 [12"
THEOREM 12.4'. Call i(X'X) = % = largest eigenvalue of (X'X) n. If n n
then
r+l (log ln/%n_l)/O(X'X)n < co
÷20 a.s. -1%
Proof. Introduce V = det(X'X) . From the identity n n
x' n = (V n- Vn_ l)/v n" (X'X)nl x n
We find (call o = o(X'X) n) n
ENr+I -nX' (X'X)nl -Xn/~n = ENr+I (Vn - Vn-l)/Vn°n
n dx < O n d~x = Zr+l n n -- Zr+l - x
= EN o -I (in V - in V n i ) r+l n n -
In V N EN in Vn_ I
< O N + r+l On_ I
du n (sum by parts).
(7 n
However,
In V < r in 1 n n
so
i in l N ~N in In_ I don]
_< r o N + r+l On_ 1 O n
= ~N in(%n/%n-l) + c. r ~i o
n
The result follows from Theorem 12.4 on letting N ÷ ~.
(12.12)
Remark.
that
From the identity (12.12) we also deduce the interesting fact
lln_sx, (X'X)s l~s = Eln(V s_Vs_l)/V s ~ Enl in(Vs/Vs-l)
~ in V ~ in %(X'X) n. n
261
12D. A General Central Limit Theorem
Consider the quantity
= (XX> ½ _ _ ~iXs~s •
2 THEOREM 12.5. In 12.1 if ~ independent, 6 strongly uniformly integrable;
n n
if for each n there is a positive definite symmetric non-random matrix B -n
such that
~n 2(X'X)~ ~p ! (12.13)
E(B n2(x'x)nBn½ ) ----+ ~ (12.14)
and if
max i~" <iSm
x: (X'X)-Ix. P--~ O. (12.15) --l n -I
Then (X'X)~ (@n- 00) => N(9' !)"
Proof. Let us write
(X'X)~ (in--00) = (X'X)½B-½n-n -n B½ Eln~sgs"
Thus in view of (12.13) we consider B½-nEln-Xsgs" Let ~ be an arbitrary fixed
vector (but ~'~ = i). Introduce the array X = Z Es, Z = ~'B -½x , - - ns ns ms - -n -s
Fns = Fs = o(~ l...Es). Then we can apply the corollary to Theorem 11.5.
The conditional variance is
V 2 = ~'B -½ (X'X) B -½~ P-~+ 1 by (12,13). --n n - n -
Also E(V~) + i by (12.14). Finally
, 1 xi -~ max Z 2. _< max x i (X'X) n ~'B-~i(X'X) Bn2~ ~ 0 by (12.15) and (12.13). l<i<_<_<_n n: l<i<n - - -n n - -
Remark. If ~(X'X) n ÷ ~ then (12.15) may be replaced by
x' (X'X)~ Ix n P_Z+ O. -n
Notes. Theorem (12.1) is due to Eicker (1963). Theorem 12.2 is due to
262
Lai et al. (1979). The Gaussian result was proved (by a much longer argument)
by Anderson and Taylor (1976). The proof given is basically due to Sternby
(1977); see also Solo (1981). Theorem 12.4 is due to Solo (1978). With another
proof Theorem 12.4' is due to Lai et al. (1981). Theorem 12.4' with an extra
condition was proved in Solo (1978). Theorem 12.3, 12.5 are well known. A
theorem like 12.5 is quoted in Lai et al. (1981).
APPENDIX AI2. CONVERGENCE OF A SERIES
We show if (X'X)n = Eln_xsX~_ then
z I (i+ ~' (x'x)~ 1 x' ~2 -n ~n ) -n (X'X) ~n < ~"
We begin from
= (X'X)n_ 1 (X'X) n x x' -n-n
(AI2. i)
~> I = (X'X)n(X'X)nl I - XnX n (X'X)n_I 1
=> (X'X)nl : (X'X)n_I 1 - (X'X)nlxnx n (X'X)nl 1
~> tr(X'X)nl = tr(X'X)nl I - _x n (X'X)nI(X'X)n_IlXn
, => ~r+l-n (X' i Xn -< tr(X'X)-in
(AI2.2)
(AI2.3)
On the other hand from (A12.2)
x; (x X)n l :
Using this in (A12.3) gives (AI2.1).
x' (X'X) -I --n n
x < x ' x ) x 1 + - n n - n
LECTURE 13: LEAST SQUARES ASYMPTOTICS IN AR AND ARX MODELS
13A. Consistency in ARModels
Consider once more the AR model
Y = ~ ' ~ n - 1 n
where Sn i s a mds w i t h E ( ~ I F n _ I ) = o2:
+ c n
t
F n = <~(Sl...¢n ) :
(13.1)
= (~l'"c~). P
of the poly- The behavior of (13.1) depends critically on the roots ~i
nomial equation
zP(I-~(Z-I)) Z p -P a Z p-i 0. (13.2) = - L1 i =
If all I~il < 1 the process has bounded variance and is as~totically
stationary. If some ~i are on the unit circle the process has growing
variance and it will be called a random wanderinl model (RW_) (this derives
from random walk - the simplest case). If some roots are outside the unit
circle it will have geometrically growing variance and will be called an
e__xplosive model. From a time series point of view the ease l~il ! 1 seems
to be most interesting so only this will be treated.
We can prove the strong consistency of the least squares estimate (ise)
= (y,y)~l n _ n , -n EiYs_l~s; Y'Y = Z IYs_IYs_l using Theorem 12.4'. To do this we
have to investigate the relative growth rates of I(Y'Y) n and °(Y'Y)n the
largest and smallest eigenvalues of (Y'Y) . n
Suppose then that the roots of (13.2) are inside or on the unit circle.
Let A = number of roots on the unit circle and factor
1 - ~(L) = (I+A(L))(I+a(L)) where zA(I+A(Z-I)) = 0 has only roots on the
unit circle and zP-A(I +a(z-l)) = 0 has only roots inside the unit circle.
Introduce two auxiliary sequences
u t = (I+A(L))Yt, v t = (I+a(L))Y t
=> (l+a(L))ut = ~t' (l+A(L))vt = ~Tt" (13.3)
264
Now i~troduce u -t
Now we can write
where
= (Yt-l" " "Ut-A) ' T
, V t = (Vt_ l...vt_p+A)'', W t = (Ut, Vt)"
= Tit
T =
p-A
-i A 1 ... A A 0 0
0 1 A 1 ..... A A
0 ..... 0 I...A A
1 a I 0 0 .... ap_ A
_0 ......... 0 1 a I. .ap_ A
Futher T is nonsingular.
So we need only deal with ~ . -n
observe firstly that
So if we introduce (W'W) = n n E 1 Wt_lWt_l we find
n ! -1 ~ : !-RY'Y)~IzI !s-l%
(T'(Y'Y)nT)-IE I [[s_iEs
(W,W)~I ~n = ~ say. = Z1 ~s-lSs -n
To calculate the eigenvalues of (W'W) n
(i) Largest eigenvalue
A(W,W) n < p(Ei u 2 + n 2 -- t ZlVt). (13.4)
Since i + a(L) is stable it follows by the basic Lemma A8.3 that for some c
n 2 nu2 _< c s t Z1 t E1 "
±@ Now consider 1 + A(L), all the roots have the form e
two sequences d t, b t related by
(l_e-i@L)dt = b t
= eit8 t -is@~ => d t E 1 e o s
(13.5)
Consider then
265
idtl2 t 2 => ~ t ~1 Ibsl
n idti2 -> I I n .t [bs[2 E1 Ibsl2 zn --< ZI t ~I = s t
n 12 ~n 1 ~n ibsl2 --< Y~] Ibs ~i t = ~ n(n+ i) 1
=> ZI idtl2 n 2 n 12. n ~ El ib s
Continuing we see that since 1 + A(L) has A roots
2A vn 2 n v 2 < n ~i ZI t -- ~t"
Thus from (13.4) we find
%(W,W) n ~ n2A Zln 2_t ~ n2A+l
(ii) Smallest eigenvalue.
k (E'E)n = Z l~s_l~s_l •
A8.1 that
clearly
= ), Introduce ~t-I (St-l'''St-n '
In view of (13.3) it follows from the basic Lemma
c o(W'W) ~ o(E'E) n n
n 2 ~7(E'E) ~ p - ~ pn.
n ZI -t (13. Sb)
(13.6)
(13.7)
(13.8a)
THEOREM 13.1. In the AR model (13.1) if all the roots of (13.2) are !l
then ~n ÷ ~0 a.s.
Proof. In view of (13.7), (13.8) the result follows from Theorem 12.4'
(with a little extra work).
Remark i. An obvious question related to the AR model is the consistent
estimation of the autocovariances when all the roots of (13.2) are <i.
This can be proved quickly by an argument similar to that used in Theorem
14.2 (see Appendix AI4).
266
13B. Central Limit Theorem
The situation here is rather complicated but interesting. 7~o cases
will be discussed firstly where all roots of (13.2) are < i; secondly,
where one root =i.
THEOREM 13.2. In the AR model (13.1) if all roots of (13.2) are < i and
g. i n d e p e n d e n t , g2. s t r o n g l y u n i f o r m l y i n t e g r a b l e and (Y'Y)½ i s a s y m m e t r i c 1 1 n
positive definite square root of (Y'Y) then n
Proof.
n ½ (Y'Y)~ (~ -~0 ) => N(0, I ) -n - -p
It f o l l o w s f rom Remark 1 a b o v e t h a t
-i n (Y'Y)n ÷ [Ri-j]l<_i,j<n = R (13.9)
where ~ = lira E(YiYi+k). i-~o
apply Theorem 12.5 with B -n
theorem we need only show
Also R is positive definite. Thus we can
= R-in ½ . In view of the remark following that
y, (y,y)-I y p~ O. -n n -n
From (13.9) this holds if Y' R-Iy /n P+ 0 i.e. if II!nII2/n p-E+ 0 which -n - -n
holds since EII!nll2/n ÷ 0.
When at least one unit root is allowed the limit theory is deeper:
non Gaussian limit distributions are obtained. The idea is briefly indi-
cated.
Consider the simple model
Yt = ~Yt-i + gt; ~ = i (i3.10)
where ~ = i, Y0 = O, E(gt~s) = 6ts , E(ct) = O. The ise is
(E 1 2 -i .n = Yt_l ) >21 YtYt_l
-I => d- I= V U
n n
U ~n V n y2 n = EiYt-lgt ; n = E1 t-l"
267
t 2 Since Yt = E 1 a s so E(Y ) = to we see from Un = Un-1
V = + y2 that n Vn-i n-i
1 4 1 2 2 E(U ) = ~ n(n+l) ~ ~'o n
+ enYn_l and
1 2 2 s(v n) ~ 7o n .
Clearly the usual Gaussian limit is not obtained. The natural quantities
to consider are n-fUn , n-2Vn" First observe that Wnt = El gs/~f~n behave s t
like a Wiener process on [0, i] viz. E(WntWns) = min (t/n, s/n). So writing
n-2V as n
n-2Vn = n-I ~i vn(E~ -I Cs/~)2
suggests the Riemann sum converges to
n-2Vn ~ f~W2(s)ds
when W(s) is standard Brownian motion (see Billingsley (1979)). Similarly
we are led to the calculation
t-I n-iU n ~t E1 as ~,n fl 0 = -- L I dW W W(t)dW(t). n E1 /~ /~ nt nt
This has to be interpreted as an Ito integral to give a value
1 n-iUn ~ ~ (W2(1) - i).
Alternatively, we can calculate directly that
t o t l (E st)2 = E g + 2 EI~; t E 1 a s
1 -~ n 1 n-i n 2 => n-Iu n = ~(n ~E la t) 2 _ ~ llgt
--> n-iU i n ~ 2 (W2(1) - i)
as before. Thus
268
The consequences of these calculations are discussed further by Dickey and
Fuller (1979, 1981).
13C. Consistency in ARX Models
The model being considered is
Yn = -~'Yn-i + -B'Z-n-I + gn (13.11)
where Z is an exogenous sequence independent of g ; ~ is a p-vector; -n n -
is an m-vector. Again the roots of (13.2) determine the behavior of
(13.11). Again only the case with roots ~i seems of interest in time
series. Consider the case _B'Z_n_I = ~(L)Z n where ~(L) = Elm BiL i. Introduce
the m+p-vector Z = (Zn_ I. .Z )' and (Z'Z) = n ~ ~ The -n " n-m-p n E1 ~s-I ~s-l"
following result is available.
THEOREM 13.3. In the ARX model (13.11) suppose all the roots of (13.2) are
<i. Call t = n ~2 E 1Z k + n. Call _e = (~'$)'_ _ , let -n ~ be the ise of _0" n
(i) If E 1 in(tn/tn-i )/~(~'~)n < ~ then -n ~ ÷ ~0 a.s.
(ii) If in tn/O(Z'Zn) + 0 then -n ~ -~p+ ~0"
Proof. See Solo (1982) (the vector case and the case roots !l are also
dealt with).
Remark i. In either (i) or (ii) we must have in t /o(Z'Z) n n
implies in n/O(Z'Z)n ÷ O,
÷ 0 which
n 2 c Remark 2. Suppose E I Z k ~ n c > i. Then in t
n
convergence we need
~ c in n so for a.s.
Z 1 (no(Z'Z))-i < ~. n
For weak convergence o(Z'Z) /in n ÷ ~ will do. If n
n 2 c ZlZk ~ n , c < i.
Then in t ~ in n so we still need (13.12). n
(13.12)
269
Remark 3. Another obvious question here is the issue of identifiability.
It is clear from Lecture 7B that -n ~ consistent for ~0 => ~0 is a-identifiable.
The question is whether the conditions of Theorem 13,3 are minimal in any
sense for a-identifiability. The K-L information function is
-2Hn(0, 60) = E(e~(O)) - E(~)
where en(e) = Yn - ~(L)Y n - $(L)Z n. Thus
-2Hn(e, e O) = (e- ~0)'(W'W)n( 2 - 20)
n Er~k-i n ~k-l] , ' (W'W)n = E 1 ~Zk_ 1 (!k-l~k-l)' > Z 1 Zk_lJ(~k-l~k-i )
where S k = (i - ~o(L))-I$o(L)Zk o Now if ~ # 20, Hn(8 , 80) ÷ -~ iff
d(W'W)n ÷ ~ this occurs if ~ a p +m vector %_ = (T_ : with
~ (L)s k + %$(L)Z k = 0
(or T (L)y k + ~(L)Z k = 0).
Alternatively by Lemma A8.2, ~(W'W) n ÷ ~ if d(Z'Z) n
Notes, Theorem 13.1 is due to Kawashima (1980) with a different, very
tedious proof. This result has also been proved by Lai et al. (1982 b)
(not available at the time of writing). Some central limit theorem results
related to Theorem 13.3 are given by Fuller et al. (1981). Lai et al. (1982 a)
proved a.s. convergence of the ise but the conditions they give are not
easily checked. The ones in Theorem 13.3 are. When some roots are < i, some
are >i, Stigum (1976) proved a.s. convergence: see also Lai et al. (1982 c)
(who seem to have overlooked Kawashima and Stigum). Dickey and Fuller (1979,
1981) discuss distributional theory associated with models like (13o10).
LECTURE 14: LEAST SQUARES ASYMPTOTICS FOR ARMA MODELS
14A. Preliminaries
We turn now to consider parameter estimation in the ARMA model
(l+a(L))y k = (l+c(L))~k, k ~ 1 (14.1)
~k is white noise, E(~) = d 2. Now the presence of the MA terms makes
the estimation and analysis of this model technically complicated. The
basic reason is that the ~ parameters enter non-linearly into the least
squares function. The desirable way to estimate the parameters in (14.1)
is to maximize the likelihood function. However, the analysis of the
asymptotic properties of the maximum likelihood estimator is surprisingly
technically complicated. Consequently, only least squares estimates will
be considered here.
Before continuing it will be useful to list some assumptions. Denote
= (al...a p bl...bq) , Q £ IR8 which is a compact subset of the open set
on which
zP(I+a(Z-I)) = O, Zq(l+c(z-l)) = 0 have all roots <i in modulus. (14.2)
Under Assumption (14.2) we have a Taylor series
(l+c(L))-l(l+a(L)) = ~0 hs (~)Ls' h0(~) = 1
under Assumption (14.2) h (8) will comsist of geometrically damped sines s
and cosines. Thus the following will hold
% < 1 ~ V~ e ~@, lhs(~)] < %s Vs. (14,2a)
Also it follows that
H(ei~l~) = Eohs(e)elS(~ is continuous in 6 e ~6" (14.2b)
Since ]R 9
2 ~k are independent; ~k
(see Appendix BII).
271
compact H(ei~10) is uniformly continuous on ~, is
are strongly uniformly integrable
Yk is second order stationary.
(14.3)
In place of (14.3) typical assumptions are that ~k is a strictly stationary
ergodic martingale difference sequence. Conditions (14.3) have the advan-
tage that they are not placed on the joint distributions of the ~k'S only
on the marginal distributions. Also introduce the spectrum
~(~19) = 2 Ii + c(ei~,)12/11+ a(ei~)12 (14.5)
and note that
¢-i(~I0 ) = a-2 [Z0hs(0)ei~Sl2. (14.6)
The obvious way to generate an error sequence for a sum of squares
function is from
ek(~) = (l+c(L))-l(l+a(L))y k. (14.7)
However, this involves the infinite past. If we start (14.7) up from
some initial conditions then we get
ek(0)- = ~0~'khk-s(0)Ys- + gk(~)_ y_ 1 (14.8)
where gk(O ) are geometrically decaying initial conditions and in view of
(14.2a) can be neglected in the ensuing discussion (actually they can be
subsumed in the sum). Also y_ 1 : (y_l...y_p) are initial conditions (e.g.,
k-I for ARMA(I,I) gk(9) = c a, [-i = Y-1 )"
The sum of squares function is then
1 -i so~n Sn(~) = n e (0).
Since Yk is stationary there is, for each 0 an initial condition y_l(e)
which makes ek(!) stationary. If this initial condition is used rename
ek(0) as $k(9). Then we assume
(14.4)
272
t h e r e i s a t r u e v a l u e -~0 ) ek(-~O ) = ~:k" ( 1 4 . 9 )
The ensuing analysis can however be done without this assumption.
The least squares estimate (Ise) is found by minimizing S (8) over n
IR e . This is also called the prediction error method since ek(@) in (14.7)
is a prediction error. To analyze the behavior of the ise we have to look
at the asymptotics of S (0). From (14.8) n
= = e ¢ ( c / ) d f ; ¢ (w) = ¢ ( c o l e o ) . Rk E(yiYi+k) /_~ iko0
THEOREM 14.1. Under conditions (14.1) to (14.4)
lira E(e2(_@)) = E 0 ~ohs(~)hj(e)Rs_j
= f_~ ¢(uo)/¢(wte)df = s(e)
uniformly in _0 c IR~.
Remark. It follows from the remark after (14.8) that S(@) is uniformly
continuous in e c IR e .
Proof. The proof is a consequence of the following observations, the
dominated convergence theorem and (14.6). We have
.k, (~. iscol2 k~khs(e)hj(8)R s j f_~ IEon s v ) e ¢(0J)df Z 0 _ _ _ =
khs(e)eiS°~ 1 < ~O %s = (l-l) -I V e ~ IRe {Zo
and
E k hs(e)eiS~ + Zohs(8)eiS~° uniformly in e.
We now prove the same result for S (8). n
THEOREM 14.2. Under conditions (14.1) to (14.4)
1 1 sn(e) ÷~ s(_e) = ~ f_~ ¢(o~)/¢(~o[e)df
uniformly in _8 e IR e .
a.s.
(14.10a)
Proof.
now agree that y_j
273
n-i n 2 n-I n Z0ek(_@) = E0ek(_@) Zkhs(_0)Yk_ s
n-i ~ n hs(9 ) n = s= 0 - Ek= sek(-9)yk- s
=0, j >_0
n = n -I Zn=ohs(e ) Zk=Oek(_e)Yk_ s
= n n k -i xn=0hs(9) Zk=0Yk_s EoNg(-6)Yk-%
n n
= n -I zn=0hs(9 ) E~=0N£(@) E£=kYk_sYk_ ~"
Now recall the convention above so
1 in n hs(O)h£(9) n-i ~n Sn(-@) = 2 s=0 Z2=0 - - max(s,£) Yk-sYk-£"
Now call
Rs,~(n) = (n -I Z n max(s,£) Yk-sYk-£ ) I (max (s,£) ! n).
Then for each s, % a.s. Rs,£(n) ÷ Rs_ ~ (see Appendix AI4). Further
iRs,£(n) I < (n-I En 2 -IEn 2 -- max(s,£) Yk-s n max(s,£) Yk-~ )
< n-i n 2 -- Z0Yt = R0(n) say,
Now R0(n) ÷ R 0 a.s. so given g 3 n0(g) ~ V n > n0(E)
Now set
R 0 = max IR0(n) I. n<_n0(O
rt
Then Vn IR0(n) I ! R 0
Thus Vn
Now rewrite
. v
R 0 = max [R 0, IR01 + c].
~v
IRs,~(n) l _< R 0
tRo(n)- RO] < ~.
i ~ ~ -s -£ AsA~ Sn(e) = ~ Z£=0 Es= 0% hs(0)% h£(~)Rs,£(n) •
(14.10b)
274
Then by the dominated convergence theorem and (14.2a) the result follows.
Remark. For future reference it is worth noting that the following results
can be similarly established.
Remark.
where I (W) [ n y = Z 1 Yj
From (14.10b) follows the approximation (cf. (14.10a))
i f~ly(C~)/$( w Sn(-~) = 2- [8)df
eiWJ ! 2 I •
uniformly in
dS de k n + d S (ek(8) ~- d@ d@ = lim E ) a.s.
k-+oo
= -L% @(w) d log @ df
uniformly in O
I (8) = - - n
d2Sn d2S
dSdQ' dSdS'
d2ek I dek dek] EIek d~d, ] lim E ~ - - + lim
k-~o dS'J k-~o
~ d 2 log @
= - f-g O(wl@) dOdO' df + /~ @(w) d log ~ dlog @ df
.d@ dO'
In(8) is the sample information matrix.
Note that since f~ -~ ~(wl@)df = 1 we deduce dS/d8 0 = 0 and
16080 = i(@0) d2S , f~_ $(w) d log $ d log, $ df.
d@od@ 0 d~ d@
(14.ii)
(14.12)
275
14B. Consistency of Least S~uares
Now strong consistency of the ise can be proved.
as the value such that Sn(9 n)_ ~ Sn(8)_ _9 c
The ]se 9 is defined -n
THEOREM ]4.3. If the model (]4,8) is a-identified at ~0 i.e.
S(!) = S(20) :> ! = ~0 then with conditions (14.1) to (14.4), ~n ÷ ~0 a.s.
Remark. The model orders p, q are assumed known: this eases the a-identi-
fiahility issue considerably.
Proof. Now -n ~ belongs to 11%@ so is a bounded sequence. Let 0* be a limit
point of -n ~ " We show S(~*) = S(90) which proves the result.
Consider the following We have S(£*) Z S(90) and Vn Sn(~n) ! Sn(90).
sequence of i nequa l i t i es (n to be chosen)
0 ! S(9") - S(80) ! S(8") - S(@n) + S(gn) - Sn(0n)
+ Sn(6 n) - Sn(00) + Sn(00) - S(Q0)
S(8*) - S(gn) + S(~n) - Sn(~n) + Sn(90) - S(90).
£ Vn > n 1
< [: Vn > n 2
< £ Vn > n 2.
Now take moduli
by uniform continuity i S(0* ) -s(e n) I
by uniformity of limit IS(~ )- S (9) n n n
by uniformity of limit ISn(90)-S(90)
Thus take n > max (nl, n2) giving
0 <_ S(O*) - S(90) _< 2s
but s is arbitrary => S(@*) = S(90): hence result.
14C. A Central Limit Theorem
The basic idea is the classic one for producing central limit theorems.
Expand the equation dSn/dRn = 0 in a Taylor series about 00
276
0 = dSn/d~ n = dSn/d80 + In(8:)(6 n-80 )
where Q* is an intermediate value with II_~:- e_0!l _< []0n--°011 provided ]IR@ is -rl
convex. The idea then is to invert the Taylor series to see
dS #-n (-~n --GO) = [In(8:) ]-i ~nn d8 n
n
However we need In(e:) full rank to do this. If we suppose
l(B) is continuous for e e IR 8 (hence uniformly continuous) (14.13)
then we can show In(8:) + l(Oo) a.s. Then Vn > n o say, In(8*)n is full
rank (since 1(80) is). Then it is enough to provide a CLT for
n -½ dSn/d@ 0.
THEOREM 14.4. If conditions (14.1) - (14.4) hold, and (14.8) is identified
at 80 then if (14.13) holds and IR% is convex
(~n-~O) => N(O, Z-I(80)).
8" Proof. Now -n ~ ÷ ~0 a.s. so -n ÷ ~0 a.s. Further
IlZn(e : ) - t (eo:li < II I n ( 8 : ) - I (8:) i l + II t (8 : ) - I (90)ii
by uniformity of limit II~n(e:)-~(e~)ll < ~ Vn > n 1
by uniform continuity
Thus
In(e: ) ÷ !(80 ) = Z8080
so we have to show n ½ dSn/d~ 0 ~> N(O I~l~ ) . -' -¥oUo
lli(e~)-Z(eo)ll < ~ vn > n 2
a.s.
Now
n½ dSn/dS0 _½ n = n I I ek(eo)dek/d8 0
= n-½~Igkdek/d@ 0 + n -½ Zln (ek(e0) _gk)dek/d@o.
Now the second term-~P~ 0 if
(14.14)
n -~2 E 1Elek(~0 ) - 8kllldek/d_0011 ÷ O.
By the Toeplitz lemma this holds if
This is bounded by
277
E lek(00) - gk1[l dek/de011 ~ ÷ 0.
(kE(ek(00) -~k )2 EIldek/d~oiI2) ½
so it is enough to show
2 kE(ek(00) -£k ) ÷ 0
since the second term ~ tr(l(@o)). We show (14.15) at the end.
is shown that the first term in (14.14) obeys the required CLT.
Introduce the array
-~ , F ~ X = Z Z = n ~ des/dO O, F = • nS ~SCS ~ NS ns s
(14.15)
Now it
In view of the corollary to MGCLT Theorem II.5 we need only show
V 2 n Z 2 p+ d2_,l(@o ) = _ (% n Y~I ns
E(V2n ) d 2 , ÷ ~ _~(eo)
max Z 2 --P+ O. l<s<n ns
The first two follow by arguments already given cf. Theorems (14.1), (14.2).
The last will follow if n -I llden/d@0112_ -P-P+ 0. But this holds since
n-i II den/d_00112 = n-I E1 II des/d_eoIl 2 _ (i - n -1) ( (n- I)-I ii-i II des/d20112)
tr(I(@0)) - tr(I(@0)) = 0
Finally, (14.15) must be established. Recall the discussion between (14.8)
and (14.10) then
ek(@O) - F~ k = ek(@ O) - ~k(80) = gk(@_o)(Y_l(eO)-Y_I ).
Thus
E(ek(00) -gk )2 <_ [l_gk(00) 1] 2 c
278
and
However, iigk(8O)II 2
c : EIIZ_I(G O) -Z_III 2 < ~J.
÷ 0 geometrically so (14.15) holds.
Notes. Consistency and CLT are discussed by Jennrich (1969), Hannah (1973).
Hannah and Nichols (1973), Caines (1976), Ljung (1976), Kohn (1978),
Rissannen and Caines (1979), Dunsmuir and Hannah (1976). Early results are
due to Walker (1965). The proofs given here are the author's, many of the
elements can be found in the above references. Also, Caines and Ljung
(1976) emphasize analysis when the stationary process Yk is not A~MA but
an ARMA model is fitted to it.
Extensions. The extension of these results to ARMAX models involves some
careful considerations concerning the exogenous quantities: see Hannan and
Nicholls (1976). It is also possible to deal with parameter estimation
subject to constraints. The theory can be developed in a manner similar
to that of Silvey (1959), Aitcheson et al. (1958); see Kohn (1979 a).
APPENDIX AI4: CONSISTENCY OF SAMPLE AUTOCOVARI~NCES
There are a number of ways to develop this. We suppose E(g~) = 2 K
and
2 ~k are independent; c k are strongly uniformly integrable.
Now suppose Yk are given by
k Yk = ~0 gk-s~s
this covers the ARMA case if we add a geometrically decaying initial
(AI4.1)
(AI4.2)
condition: it is easily dispensed with in the ensuing discussion hence
~-k its omission. We have to show n-12 yjyj+k ÷ ~ = lim E(YIYI+ k) a.s.
-i n 2 Actually we only need to show n lly j + R 0 a.s. To see this let _~ be
an arbitrary k-vector. We have our result if we show
279
-i n (~(L)y.)2 ~+ , R<~; R = [R i "]i< n E 1 3 . . . . 3 i,j<k"
But define yj = ~(L)yj so we are back to n -I Xl~nyj~2 ÷ R0"
It follows exactly as in the proof of Theorem 14.2 that
-i n 2 n n n-l,.n n E0Yj : E0 E0gsg£ ~max(s,£) tk-s k-£"
Now proceed as for the rest of that proof noting
c o
for each k n -I ~k~jaj_k -~ 0 a.s. if F. k~!:jsj_k/j < ~
which occurs by MGCT if Z~E(g~_k)/j
(A14.1) and Appendix B l l
2 < oo which it is. While in view of
-I n 2 2 n E0g k + 0 a.s.
we find then n-i ~0"nyj2 ÷ v °°~0 g2Js = RO"
LECTURE 15: ASYMPTOTICALLY EFFICIENT ESTIMATION FOR ARMA MODELS
15A. The One-Step Gauss Newton Scheme
While the present discussion gives an interesting asymptotic theory
it must be emphasized that the scheme may not work with finite data.
For AJ~ models with MA roots near the unit circle (or with cancellation
of unit root factors) it seems mandatory to use an exact likelihood
algorithm. This is firstly for numerical reasons and secondly because the
least squares estimates will be biased. The advantage of a least squares
approach though is that it gives a rapid view of the asymptotic theory.
Before it was easy to compute exact likelihood functions there was a
great interest in finding iterative linear methods for producing asymptoti-
cally efficient and consistent schemes. These are still worth studying at
least for the insight they provide about time series models. Further they
are still of relevance in providing starting values for an exact likelihood
calculation.
The usual method of finding starting values for ARMA parameters is as
follows
(i) AR parameters
Solve the Hankel equations H a = r .
(ii) MA parameters
Solve the (quadratic) equations for c(L) given by
2 12 L s c o Ii+ c(L) = ~q -qs
where R are the autocovariances of s
Ws = (l+a(L))y s.
There are basically two iterative methods for this; a linearly
281
convergent one and a quadratically convergent one (see Box and
Jenkins (1976, Appendix A6.2)).
With this in mind, consider now how we might solve the equation
dSn/dgn = 0 = n -I ~n~1 ekdek/d@l ~ t
N
A simple idea now is to use the Gauss Newton (GN) procedure. Suppose
~in is a consistent estimate of -@0 (the ARMA parameters) obtained e.g.,
as described in (i), (ii) above. We use -@in to initiate an iteration
as follows. Consider the Taylor series expansion for a new value ~2n
dS dS d2S n n + ___n ....
d02n dgln dg*de*'n n (-@2n -6in )
and ll@n*-91nl] % ll_@2n-@inl]. Since 9* is unknown it is natural to replace
2 it by ~ln" Fu r the r in the l e a s t squares s e t t i n g d Sn/dOlndeln takes the
form
d2S de k de k d2ek n 1 n 1 ,,n n E1 -- + --n ~lek(91n) - -
d ' deln 91n dgln dgln dglnd01n
Since 9in is consistent the second term will be negligible (compared to the
first) so the second term is dropped. The GN algorithm results on the
setting dSn/de2n = 0
-@2n = -@in - ~i (-@in)dSn/d@in
in(9 ) 1 n dek dek
= n El de dg'
Since 91n is consistent, ~n(Oln) should be of full rank (if n is "large
enough"). It would be interesting to prove that for fixed n this iteration
converges to the least squares estimate ~ . If n is "large enough" this -n
can be done (see Kohn (1979)). Here we settle for the interesting observa-
tion that 22 n is asymptotically efficient. This phenomenon is known in
other areas of statistical inference (see Cox and Hinkley (1974)) so its
occurrence in time series should not be surprising. The proof is now
sketched.
From the Taylor series
where
We find
282
dSn/dSln = dSn/d80 + In(8~)(81 n-80 )
I (8) = d2S /dgd@'. n
~2n = ~In - Z~ 1 (~In) [~n(SP (~ln- ~0 ) dSn/dSO]
:> _ = (@in)dSn/d@ 8
+ In I (~in)([n(@In ) - In(8~))(~in- ~0 )
:> ~ (82n-90) = (@In)/n dSn/d@ 0
+ In I (81n)(~n(@in) - In(Sn )) ~-nn (~in- i0 )"
by the argument used earlier the first term => N(0, I-i(80)). For Now
the second term since !IIn(Oln ) - In(8~)ll ÷ 0 (again by modifying a previous
argument) we will have the result provided say~nn (if n -90 ) converges in
distribution. In practice this is usually straightforward enough to show.
While this shows that one step of a GN iteration provides an asymptot-
ically efficient estimate it is recommended to continue to iterate to
(approximate) convergence,
There is a second way of looking at the GN scheme namely as a regression,
Consider the approximate Taylor series expansion
ek(8) = ek(8 I) + (8-81)' dek/d8 I.
Now viewing this as a regression of ek(91) on -dek/d@ 1 yields the ise
- 81 = I-ln (81)dSn/dSl
which is exactly the GN scheme.
However, in the time series setting there is one further interesting
283
feature. Because the parameters typically occur as coefficients on lagged
quantities there is yet a third way to look at the GN scheme (as a regression).
It will usually be possible to write the model naturally as a sort of
regression
ek(@l) = ek + (dek/d@l)'
where ek is a sequence depending on the model.
Now observe that
@i (15.1)
n dSn/d@ I = n -I E lek(@l)dek/d@ 1
= rn(@ I) + In(@l)@ 1
r(@) = n -I Elnek dek/d@l"
Thus the GN scheme becomes
92 = 91 - I-ln (@l)(rn(@l) + In(@l)91)
= -I-ln (@l)rn(el)
Thus the GN often reduces to the following prescription
regress ek on - dek/dO 1.
This idea is now pursued for two examples.
Example i. MA(q) model. We have
ek(6) = (l+c(L))-lyk ; 9_ = (Cl...Cq)'
=> dek/d@ = -(l+c(L)) -lek; e k = (ek_l...ek_ q)'
Now introduce ek = (l+c(L))-i ek and note that dek/d@ = -~k" Further
- (dek/d9) '@. e k
Now the fact that the sign here is different from that in (15.1) produces
the unusual formula
284
~2n = 2~in + !n I (~In)ln<!l)"
Actually, recalling the earlier comments about MA model fitting (=
factorization) as generalized square rooting makes the factor 2 a little
less mysterious.
Example 2. XAR model, Here
ek(e) = (i +c(L))Vk(e)
Vk(@) = Yk - b(L) Zk
9 = (c'; b')' = (Cl...c p : bl...b )' - m
= )' ~> dek/d~ = ~k (Vk-l'"Vk-p
dek/d~ = - ~k = -(Zk-l'''Zk-m )'
and
~k = (l+c(L))z k.
We can write ek(@) in two ways
ek(9 ) = v k + Zk£
ek(6) = Yk - ~k~
and
Yk = (i + c(L))y k.
So introduce
~bn(81 ) n-i n ~ = ZlYk~ k
N ~cn(@l) = n -I Z I Vk~ k
and observe that asymptotically we expect
n-l~n dek ___dek -1 nZkVk ÷ O. 1 dc db' n Z 1 _ _ _
So we obtain the two regressions
(15.3)
spectral
285
and
= _~-± b-2n -bn (-~i) rbn (-~in)
= _i -I
C2n -cn (21)rcn(-~In)
n~ ~
ibn(21 ) = n-i Z 1 ZkZ k
n
!cn(~l) = n -I Zllk[ k
This has the following interpretation. Using !In filter Yk' Zk to
produce Yk' Zk then do a regression of Yk on ~k to get ~2n" Using ~in
form Vk(01n) and regress it on Vk_l, ...,Vk_ p to obtain !2 n. This agrees
with the intuitive method of proceeding.
15B. Adjoint Variables for Gradient Calculation__ss
One motivation for the GN method is to avoid calculating terms
d2ek/d0d0' in the Hessian. However, in the time series setting there is
a computationally efficient method of generating the gradient dS /d@ and n
the Hessian d2S /d@d@' by avoiding such calculations. Again it is all n
because of the lag structure. The idea is easily seen for an ARMA model.
We have
i -i n 2 Sn = 7 n E le (0)
ek(~) = (l+c(L))-l(l+a(L))y k
Then
2 = (a'; _c')'.
dSn/da_ = n-I zlnekdek/da_
n dS /dc = n -I E lekdek/dc
Ii -
which suggests that to calculate dS /d0 we have to generate the p +q-vector n -
dek/d ~. Let us calculate these, we find
(l+c(L))dek/d ! = !k_l = (Yk_l..-Yk_p)'
286
(l+c(L))dek/d ~ = -!k-I = -(ek-l'''ek-q )'
The fact that the filtering is by 1 + c(L) in both cases enables the following
device to be used. Consider the backwards or adjoint sequence
%i (%n+l = 0 = ... = %n+q)
z = L -I. (l+c(z))~i = -el;
Then calculate as follows
dSn/da_ = n-i zlneidei/da_ (15.4)
= -n-i Zln (I +c(z))~idei/da
-i n + = -n ~1%i(i c(L))de./dal _ (write it out)
-I .n
= -n ~I li-Yi-l" (15.5)
Similarly we find
dS /dc = n -I ~n n - 'i %i~i-i"
These two expressions clearly cut down the computation enormously.
One scalar sequence %. replaces the p +q-vector sequence de./dg. This i i -
is especially useful if only a gradient scheme is being used to maximize
the sum of squares. However if the Hessian or d2Sn/dQd@ ' matrix is
needed then dek/d ~ must be generated anyway but (15.5) is still superior
to (15.4). Also the adjoint variables can be used to avoid calculating
d2ek/dgd8 '. This is left as an exercise.
The above discussion shows the role played by the adjoint sequence
%. in filtering. However there is a natural interpretation as Lagrange I
multipliers as follows.
Consider the problem
I -I ~k e2 i minimize ~ n 1 ei,a,_c
subject to
287
ei = Yi + a(L)Yi - c(L)ei"
We can solve this using Lagrange multipliers by forming
i n-i n 2 n -I E~%i(ei+ - a(L)Yi) Hn = 2 El ei + c(L)ei-Yi
and solving
~H /De. = O, SH /~a = O, SH /~c = 0. n 1 n - n -
We find
$H /~a = -n -I n n - E 1%iZi_ I = 0
SH /~c = n -I ~n n - Z 1%iei_ 1 = 0
SH /De. = e. + (l+c(z))%. = 0 n i l 1
The equivalences are now obvious.
Notes
(A) The GN idea was revealed in times series by Akaike (1973). The
unusual iteration (15.3) is due to Hannah (1970) who derived it
in a different way. Actually there are some differences since
Hannan uses spectral methods to produce the I (~) matrices: see -n
Kohn (1977). Some related work is in Nicholls (1976). Finally,
it should be said that many other ad hoc procedures for producing
asymptotically efficient schemes for particular models have turned
out to be Gauss Newton schemes.
(B) The use of adjoint variables is due to Goodwin (1968) and Kashyap
(1970).
LECTURE 16: HYPOTHESIS TESTS IN TIME SERIES ANALYSIS
16A. Lagrange Multiplier Tests
Some part of time series analysis is involved with tests of hypothesis.
This includes tests for order, order of differencing, the presence of auto-
correlation. It is always possible to construct tests by the likelihood
ratio principle and in small data problems this may be the best idea. Again
however with larger data sets other methods have advantages and insights
to offer.
There are three basic methods for constructing tests of multiparameter
hypotheses, the Wald (W) Test, the Likelihood Ratio (LR) Test and the
Lagrange Multiplier (LM) or score test. The LM test has the great advantage
that it only requires parameter estimation under the null hypothesis.
Recently, it has become clear that many previously suggested ad hoc pro-
cedures are actually LM tests.
Suppose the hypothesis to be tested is posed as a set of restrictions
on the parameter r-vector @, say
H 0 = h(~) = O; h is a p-vector,
If we denote -2 log likelihood by L(@) then to estimate ! under H 0 the LM
approach is to minimize the Lagrangian
L(@) + %' h(@)
where i is a p-vector of Lagrange multipliers. This yields a set of first
order equations
+ H), : 0 (16,1)
h(~) = 0 (16.2)
2 8 9
where D = ~L/$~, H = ~h'/$@. Also 2 is the restricted maximum likelihood
estimate (mle).
The W and LM statistics are based on two Taylor series expansions.
Let 8 be the unrestricted mle then since $L/$~ = 0
e(8) = L(8) + (6 - @) '3(@*) (8 - @)
3(6) = ~2L/~8~6'; I]8"-611 ! lJS-SJj. In the previous lecture we had I = J/n.
Now the LR test is based on t(@) - e(@). If J(6) is replaced by say
J(e) = E(J(@)) and @* by @ we obtain the W test
n(G- ~) 'Jd) (~ - ~)
On the other hand consider the other Taylor series
0 = dL/d8 = dL/d8 + J(@*)(@-@).
This leads to
_ ~ _- _j-l(~) ~LI~ = ]-iB
so that we obtain the score test
Applying (16.1) to this gives the LM test
LM=~'~'7-1~
Asymptotically the LR, W, LM tests can be expected to be equivalent, all
X$. However, in small sample situations there are some interesting being
inequalities between them (Breuseh (1979)).
A common form for the restrictions occurs when 8 is partitioned
8= ( :_
i.e.
and the test is
HO : 21 = 21o
Ii'l H 0 : h(_@) = (Im : O) - 210 = O.
2
(16.3)
(16.4)
290
On partitioning D, J appropriately we find
since D2
Often it occurs that J12
J21 J22
= 0 by (16.1). Then the LM statistic becomes
. . . . . . i- -i DI ~li DI" (16.5) LM = DI(JII-JI2 J22321 ) =
= J21 = 0 so that J is block diagonal, then
LM = D1 Jll] DJ (16.6)
Now there is an interesting way to compute the LM statistic using the
one step GN method mentioned earlier. Suppose we perform one step of
Fisher's scoring algorithm with starting value @ = 8. The step will yield
=> LM = (el - ~ ) ' J(@z- ~)"
This quantity is what would be obtained by doing a Wald test after one step.
We can also view this idea in GN terms when the likelihood is a least
I n ~ squares function: L(@) = ~ Z le (0)/d 2. Now consider, as earlier, an
approximate Taylor series
ek(Q) = gk + (dek/de)'(0 - ~)
where ek = ek(e) are the residuals after fitting the restricted model e.
So if we regress ek on -~ = dek/d~ we find
_ ~ = j-1 ~2 n
and
n
n = Jn (~) = Z1 dek/dSdek/d@' = ~'~
n ~2D = Z I ekdek/d0 = dL/d@ = X'e
291
where _e = (el' ..., ~n ),; ~2 = n-i Zln ek'~2
we find
~-2 Thus on approximating J by J
- - II
LM -- D' ]-i~ $-2 n
= (~- ~)' ~ (6- ~) ~-2 n
R 2
-dek/de.
(1)
(2)
(3)
2 The coefficient of determination is LM ~ Xr.
= ~,(~,~)-i ~/$2
= n R 2
is the coefficient of determination for the regression of $k
So to generate the LM in such a case is straightforward.
Obtain 0 under H 0.
Generate ek(0) = ek' dek/d~'
Regress ek on dek/d8
x n
on
An example is now given.
16B. Testing for Autocorrelation in a Regression
Consider the regression model
Yk = -~ ~ + Vk
b Vk = a(L)Vk + ~k; a(L) = E l a i L i .
So the disturbance is AR(p). We test the hypothesis:
= (a I, ..., ap). The model in prediction error form is
v
ek(8) = (l-a(L))(yk-~k p"
Then
dek/d ~ = Xk = -(Vk_ 1 ..... Vk_p)'
dek/d~ = -~k = -(l-a(L)) _x k.
Ho:a=_O,
We put @ = (_81:_0'2) = (a' : @')'; = 2~o 0
292
Now the restricted estimates are a 0 and ~ - = - = ~OLS (OLS = ordinary
least squares). Further ek = Vk = OLS residuals. Finally,
dek/d~ = -Xk; dek/d~ = ~k"
Thus to obtain the LM we regress e k = v k on (ZkXk). Thus
~-2 n . . . . , = E 1 Vk(Vk_ I ..... Vk_p_X k)
= n(Rl ..... Rp~') = (DI : ~) (as expected)
where R. are the autocorrelations of the OLS residuals. J
If ~k has no lagged Yk'S then J12 = 0 so
~, ~-i DI LM = D 1 Jll
= ~;(~p)-l~p.
This is equivalent to the intuitively reasonable procedure of fitting an
AR(p) to the residuals and testing that against the null hypothesis p = 0.
If, however, the _x k has lagged y's then J12 # 0 and the full regression
of Vk on lagged Vk'S and _Xk'S should be performed. In the simple case of
an AR(1) model the limit of n -I J12 can be evaluated in a straightforward
way. The h-test of Durbin (1970) results.
16C. Choice of Order and AIC
The classical approach to order determination involves testing a nested
sequence of hypotheses. The difficulty involves finding a stopping point.
Some time ago Akaike (1970, 1976) suggested a new idea for chosing model
order based on an information theoretic argument. The idea is to consider
AIC = -2 log (maximized likelihood) P
+ 2 # of free parameters (degrees of freedom).
The idea behind this is easily understood through the AR model.
293
Suppose we use data YI' "''' Yn to fit an AR(p) model (by regression)
^
producing a parameter estimate a . Now the fit will be evaluated on a new -p
data set -±Y~' ..., _v n. The model is
Yn = apYn-i + gn; Vn-i = (Yn-I ..... Yn-p )
The prediction error sequence is
n = Yn - apY_n_l
= g + (a-a)'Y n - -p -n-l"
An average measure of predictive performance on the new data set for this
particular fit is the expected mean squared error (emse)
~2 = E~(e~) = 0~(1 + (a-~p) ' Rp(a- ~p))
where 0 2 is the variance of the innovations from a pth order model; also P
R = [E(y 0 Yj_k)]l< ~ ,kip" -p
From the original fit we know
(a-~p) = N(O, 0 2 R -In-l). _ _ p-p
Notice that the distribution of emse is thus approximately
~2 e ~ o2(1+x2/n). P P
(16.7)
Thus, if we compute the average emse based on the original sample (this
giving an average measure of predictive performance of our procedure of
having fit a pth order autoregression) we find
Ey(e 2) = EyEy(e~) = ~(l+tr Ip/n)
= ~2(l+p/n).
P (16.8)
Calculations like these led Akaike to propose choosing AR model order by
minimizing an expression such as (16.8). In view of (16.7) it is not
surprising that this method does not produce a consistent estimate of p.
294
Notes. A number of authors have stressed LM tests recently. The presenta-
tion here draws from Breusch and Pagan (1980). See also Godfrey (1979),
Hosking (1980) and Silvey (1959). Some calculations related to the AIC
are given by Broomfield (1972). An alternative to AIC is Parzen's (1974)
CAT. S~derstr~m (1977) gives an interesting discussion of the relation
between AIC and hypothesis tests. Shibata (1976) studies the asymptotic
behavior of AIC. See also Hannan and Quinn (1979).
LECTURE 17: IDENTIFIABILITY OF CLOSED LOOP SYSTEMS I
17A. Introduction
A common situation in time series modelling is when the data are
collected on a system in which the output variables are used to produce
(feedback) the inputs or exogenous variables. In control engineering the
overall system is called a closed loop system (since there are dynamic
relations from z to y and back to z). These schemes can be usefully
represented by block diagrams which are basically pictures of z transforms.
A basic scheme is shown in Figure i.
Figure i.
reference t< aignal~
A Closed Leo E System
.• input z 1 system P
i or plant i
vfl forward loop
~controller C
feedback I
loop noise Iv b
The closed loop system is described then by two dynamic equations
Forward loop Yk = PZk + Vfk
Feedback loop z k = Cy k + Vbk
where P = P(L), C = C(L) are rational causal transfer functions of L.
It is convenient to write this in matrix form
(17.1)
(17.2)
296
i -P 1
-C i
=>
z k
Yk } =
z k
1 Vbk
[ 1 P Vfk S
C 1 Vbk
The quantity S = (I-PC) -I is called the return difference. If the loop
is broken anywhere then the transfer function from the rhs of the break
to the lhs is S. The basic problem for model fitting is the issue of
identifiability.
(17.3)
(17.4)
17B. Basic Issues in Closed Loop Identifiability
To reveal the ideas consider the simple ARX model
Yk = a(L)Yk + b(L)Zk + Ck
gk is a white noise, a(L) = E i ~ aiLi etc. Suppose we propose to estimate
e = (a', b') = (a I ..... ap bl, .... b )' by least squares. We know from -- _ -- p
Lecture 13C (and it is anyway intuitively clear) that @ is a-ldentifiable
provided a linear dependency of the form
~(L)y k + B(L)z k = 0
(deg ~(L) ! p, deg $(L) ! p) is not possible.
There are three ways to avoid this
(i) If we allow a linear controller
-I z k = -(I+F(L))
(ii)
R(L)Y k
we must have it of sufficient complexity i.e. deg F ~ p.
(17.5)
Alternatively, we can introduce a dither signal in the feedback
loop
297
z k = -(I+F(L)) -IQ(L)y k + ~bk"
Then provided {gbk } is not linearly dependent on {Vfk} (17.5)
cannot occur.
(iii) Allow a time varying or nonlinear controller.
In the ensuing discussion case (ii) will be investigated. To keep
matters simple only scalar y, z will be treated: this already reveals
most of the issues involved, One last definition will be useful. A
rational transfer function H(L) = B(L)/A(L) (where B, A are polynomials)
is called causal (or proper) if ]H(0) I < ~: this simply means H(L)z k
requires only Zk, Zk_l, ... for its calculation. H(L) is called strictly
causal (or strictly proper) if H(0) = 0: this means there is a delay in
H(L) i.e. H(L)z k requires only Zk_l, Zk_2, ... for its calculation. All
transfer functions in the ensuing discussion are assumed to be causal.
17C. a-Identifiability of the Forward Loop
To simplify the discussion we suppose infinite data is available. In
Equations (17.1, 17.2) take vf, v b to be stationary. Then v b can be
interpreted as a dither signal or a noise in the feedback loop or a sum
of the two.
Intuitively in identifying the forward loop we can allow vf, v b to
be correlated. Let Vfk have an innovations representation or Wold decom-
position as (subscript zero is a true value)
Vfk = N0(L)~fk;
while Vbk is given by
and E(~fk gbj) = 0 Vk, j.
E(gfkSfj) = o2fSkj
Vbk = R0(L)Vfk + M0(L)abk
Clearly, it is necessary that
298
Ro(L) is causal.
The closed loop structure becomes
(17.6)
lyk) s01 1 P01 z k -C O I
No oll fkl RoNo MO gbk
(i7.7)
To proceed further, conditions will be imposed that ensure (Yk' Zk) is
stationary. It will be convenient to intoduce the forward loop and feedback
loop ARMAX descriptions (or irreducible matrix fraction descriptions)
(Po :N0) = Af~(L)(Bf0(L ) :Cfo(L))
(C O :M 0 : RoNo) = A;~(L)(Bbo(L ) : Cb0(L) : Db0(L)) where
deg Af0 = deg Bfo = deg Cfo = pf
deg ~0 = deg Bb0 = deg Cb0 = deg Db0 = Pb
and pf, Pb are assumed known. Then a necessary and sufficient condition
for (Yk' Zk) to be stationary is
(17.8a)
(17.8b)
[ z pf 0 det
~ 0 z pb
Afo(z_l) -Bf0(z-i )
-Bbo(Zb I) ,0(z-l) ]} =
has no roots in izl ~ i
This is obvious once we write the full model in A~MAX form as
(17.9a)
i i -Bbo ~0
Yk =
z k
I Cfo
Dbo
0 ( gfk
Cbo gbk
Alternatively, given Equations (17.8) observe from (17.7) that
1 AfoAbo
SO 1 - PoCo Afo~o - BfoBbo
and the numerator of S O cancels all the other denominators in (17.7). This
299
further makes it clear tbat the other way to express the iff condition for
(Yk' Zk) to be stationary is that
S O , SoM 0, SoPoM 0, SoNo(I-PoRo ), SoN0(R O-C O ) are stable.
Now let 2 denote the ARMAX parameters in a forward loop model P, N.
Introduce the prediction error sequence
ek(@) = N-l(yk-PZk)
= (N -1 : N-ip)
z
Clearly to produce a stationary error sequence it is necessary that
N -I, N-Ip are stable
or equivalently
N is minimum phase, N-Ip is stable
afortiori these entail
minimum phase, NoIp 0 N O
or alternatively, in view of (17.7)
if
stable
Pfcf0(z-i ) z = 0 has all roots < i.
The least squares function is L(e) = E(e~(@)) and @0
L(e ) = L (e o) => e = e o.
Now since ek(@0) = gfk we have to ensure
L(8 ) > 2f = E(2fk)
For this it is necessary that
re.
either
or
is a-identifiable
Po(hence P) has a delay i.e. is strictly causal
R 0 and C O each have a delay.
These statements entail
(17.9b)
(17.10a)
(17.10b)
(17.11a)
(17.11b)
(17.12)
(17.13a)
(17.13b)
300
Sn(O) = 1 i.e. there is a delay somewhere in the loop.
To see all this calculate
(Yk) ek(@) = N-I(1 l-P)[ Zk
= N_I( I :-P) SO ( 1 C o i RoN0 M0 gbk j
irN0 P000 0M0]i k I N-I(1 :-P) S O CON0+ R0N 0 M 0 J Ebk
_ : [~fk = N - I S O [(1-PC0)N 0 + (P0 P)RoN0 (P0- P)M0]
(~bk
: [ Cfk] = N-I(N0 + S0(P 0-P)(C 0+R0)N 0 S0(P 0-P)M0)
ebkJ
= Cfk + Tfe Efk + Tbe~bk
Tf0 = N -IN O - I + N -IS0(P 0-P)(C 0+R0)N 0
Tbe = N -IS0(P_P)M 0.
(17.13c)
= Note Tf0 0 0 = Tb@ 0. Thus, clearly (17.13) ensures Tfe is strictly causal:
then (17.12) follows.
The following result can now be proved. THEOREM 17,1. For the closed loop model (17.1), (17.2) suppose
(Yk' Zk) is stationary i.e. (17,9a) or (17,9b) (17.9)
N O is minimum phase
f No1P 0 i s s t ab le
R 0 is causal (17.6)
There is a delay in P0 or in C o and R 0 17.13)
Then P0' NO are a-identifiable.
17.11a)
301
-i Remark. Note that P0 need not be stable i.e. Afo need not be stable.
Proof. Consider that
ek(@ ) - ek(80) = ek(0 ) - Efk
We have to show E<e~<@)) = E<e~(@0) ) = ~ => @ = 80 . In view of the
discussion above the equality is equivalent to
)2 E(ek(8) -ek(@ O) = O.
Then N -I = NOI; N-Ip = NoIP 0 follows from a lemma.
LEMMA 17.1. If F(L), Fo(L ) are stable causal, rational transfer functions
and x k is a stationary process, then
E[(F(L) -Fo(L))Xk ]2 = 0 => F(L) = Fo(L).
Proof. Let F(L) = H-I(L)G(L), Fo(L ) = HoI(L)Go(L). Introduce S k
0 = F0(L)Xk. We can always take G(O) = i. Consider that S k
N(L)(S k-S~) = -(H(L) -H0(L))S ~ + (G(L) -G0(e))x k.
Now
E(Sk-SO)2 = 0 => E[H(L)(Sk-Sk0)] 2 = 0.
But calling 0 = (hi, ..., hp gl' "''' gp)'; p = deg (G(L)) we see
0 2 E[H(L)(Sk-Sk) ] = (_~-GO) J(_G-_@ O)
J = E(WkWk) ; w e = (Sk_ l ..... Sk_ p Xk_ I ..... Xk_ p)
and J is positive definite hence O = 00.
(17.14)
= F(L)Xk,
Notes. The present arguments extend straightforwardly to the multivariate
case. Theorem 17.1 is due to SSderstrom et al. (1976). Their method of
proof is a little different; perhaps the present one more clearly reveals
302
the origin of the conditions. ~iderson and Gevers (1981) have also proved
similar results by rather longer arguments. (It is possible to obtain their
theorem (5.2) also by modifying the argument given here: see Lecture 18.)
A discussion of case (i) is given in Ng et al. (1977). A general discussion
of the related topic of stochastic control is in Astr~m's (1970) book: see
also Kailath (1980).
LECTURE 18: IDENTIFIABILITY OF CLOSED LOOP SYSTEMS II
18A.
as the forward loop (17.1).
system is now
Clearly, it is necessary that R 0 = 0.
Identifiability of the Closed Loo~
Now consider the identifiability of the feedback loop (17.2) as well
The
Yk = P0Zk + NoCfk (18.1)
Zk = CoY k + M0gbk (18.2)
where (gfkgbk) is a white noise. The stationarity of (Yk' Zk) follows
iff (cf. 17.9b)
SO, 80Mo, SoPoM0, SON0, SoNoC0 are stable. (18.3)
By exactly the same argument as used in Theorem 17.1 the following holds.
THEOREM 18.1. In the system (18.1), (18.2) if (18.3) holds or equivalently
(Yk' Zk) is stationary
N O , M 0 are both minimum phase (18.4a)
NoIPo , MoICo are both stable (18.4b)
there is a delay in P0 or C O i.e. S0(0) = 1 (18.5)
E(~fkCbk) = O. (18.6)
Then, P0' NO' CO' MO are a-identifiable.
Remark. Actually, it is possible to remove conditions (18.4) as follows.
Recall from the ARMAX descriptions (17.8) that
(N~I :N~IP0) = Cfo(Af 0-I :Bf0) (18.7a)
304
("o 1:"o1~o ) ° ~ ( % o : ~o) (18.~b)
Now if we carry out identification with a minimum phase N, M and
stable N-Ip, M-Ic we will obtain e.g.,
( ~ i ~glP0 ) ~-1 : = Cf0(Afo :Bf0)
where Cf0~-I is stable. Thus we can recover PO = Af0Bfo-i (which may be unstable)
but only obtain Cf0 = CfoVf0 where Vf0 is an all pass transfer function that
removes the unstable parts of Cf0. The following result is then not un-
expected
THEOREM 18.2. In the system (18.1), (18.2) if
(Yk' Zk) is stationary
there is a delay somewhere in the loop i.e. S0(O) = 1 (18.5)
E(gfk Sbk ) = O. (18.6)
Then PO' CO are a-identifiable and NO, M 0 are a-identifiable up to multi-
plication by an all pass transfer function.
Proof. See Anderson and Gevers (1981). (They use the word paraunitary
rather than all pass.) Also their approach is different being based on
spectral decomposition. Recall the overall system is
lYk) = S0 I NO -PoM0
0o 0If I °I I ' Zk ~bk ~bk
Now they treat this as a bivariate model. Just how P, C, N, M are
recovered from W is discussed in Appendix AI8.
Remark. There is a problem with condition (18.6). In a phyiscal setting
it may be straightforward to determine whether (18.6) holds. But e.g.,
in econometric modelling it may not be clear. This issue may be resolved
by the following result. This theorem will be best understood after
305
studying the discussion of feedback free processes in Lecture 19.
A model P, N, C, M is called generic if certain special pole zero
cancellations among them are prohibited. These are set out in Appendix BI8.
THEOREM 18.3. Suppose (Yk' Zk) is stationary,
there is a delay in the loop i.e. S0(0) = 1
N0(O), M0(O) are both non zero
P0' NO' CoM0 is generic
P, N, C, M (obtained from the NMSF as in Appendix AI8) is generic
Then
from the NMSF is block diagonal.
El ~fk](~ Cbk)' is block diagonal Q= fk
and PO' NO' CO' MO are a- ident if iable and given by P, VNN, C, VNI~ respectively
where VN, V M are all pass transfer functions.
Proof. See Gevers and Anderson (1981).
Remark. The conditions of the theorem can also be expressed in terms of
conditions on W 0 e.g., the genericity requirement includes
1 deg W 0 = ~ deg ~(z).
1 Note that (Lecture 3) for the NI~F deg W = ~ deg ~(z): again see Gevers
and Anderson (1981).
Notes. The discussion surrounding Theorems 18.1, 18.2 is new. Though
most of the arguments are different this lecture draws heavily on the
series of papers by Anderson and Gevers (1981) and Gevers and Anderson
(1980, 1981). Also note that Theorem 18.2 is closely related to some
theorems of Ljung and Caines (see Caines and Chan (1976, Theorems 3.3, 3.4)).
306
APPENDIX AIS. Closed Loop Models from Spectral Factors
quoted in Lecture 2.
comparing
Starting from the autocovariance generating function recall the NMSF
From the NMSF W we can recover a set P, N, C, M by
W21 W22
=> SM = W22 , SPM = WI2
:~> ~ - --i = w12w22 (AI8. i)
SN = WII, SCN = W21
~> ~ - --i
= w21Wll •
Consider two expressions for det
. . . . ~-l- det W = SNM = WII(W22-W21 IIWI2 )
(W22 - --i- => = - W21WIIWI2)
det W = SNM = W22(WII-WI2W221W21 )
--> N = WII - WI2W22~]21.
(A18.2)
(A18.3)
(A18.4)
Remark. Once P, N, C, M have been computed we can deduce the closed loop
is stable i.e.
S, SM, S>M, SN, SNC are stable
by the stability and minimum phase property of W.
APPENDIX BI8. Nongeneric Pole Zero Cancellations
Introduce the notation N(P) = the poles of P; z(P) = the zeroes of P.
307
The special pole-zero cancellations are avoided by
p(N o) n {z(N o) UP(~'O)} =
p(M 0) I] {z(M 0) Up(C0)} = (~
where e.g., M; is the complex conjugate of M O.
LECTURE 19: LINEARLY FEEDBACK FREE PROCESSES
19A. Introduction
In Lecture 17 it was pointed out that much time series modelling must
be done on systems operating in closed loop. A common question in the
econometric situation is the presence or absence of the forward loop or
feedback loop. Intuitively, for example, if the feedback loop is absent
we feel z is exogenous to y and vice versa. It would be nice then also to
say that z causes y. However, here we enter the classic problem of the
relation between two random variables in observational studies. If it is
known that y (lung cancer) cannot cause z (smoking) then an observed corre-
lation between y, z cannot be interpreted causally (perhaps another variable
W causes them both). The usual procedure for yet attempting to draw causal
inferences involves adjustment for other variables, existence of the relation
in different settings and theoretical explanation. Rather than entering into
these issues we concentrate on establishing the time series equivalent of
"lung cancer does not cause smoking" i.e. exogeneity. The process (y, z)
will be called linearly feedback free (or we say z is linearly exogenous to
y; or y does not cause z) if the feedback loop is missing.
19B. Linear Exogeneity
Here some general definitions and properties of linear exogeneity are
b described. The symbol Ya will denote the collection of observations
Ya' Ya+l' "''' Yb: similarly for z b. For simplicity we suppose an infinite a
data set is available. Note that the following definition and equivalences
do not rely on stationarity.
(i) Weak linear exogeneity
The following three definitions are shown to be equivalent.
309
DEFINITION a. z t is weakly linearly exogenous (wle) to Yt if
E(Ytlzt )_ = E(YtlZ~, Zt+l). (a)
(The wide sense conditional expectation or projection E is discussed in
Appendix A3.) This says that the one-sided (causal) and two sided (smoothing)
filters for the estimation of y from z are identical. To put it another
t way, the (residuals from the) regression of Yt on z_~ is (are) the same as
t (the residuals from) the regression of Yt on z_~, zt+ I.
DEFINITION b. zt is wle to Yt if
~(zi(y t ~(ytl t - z_~)) = 0 ~i > t. (b)
(This automatically holds for i J t by the definition of E.) This says
that future z's carry no (linear) information on present y's once the
effect of past z's has been extracted.
DEFINITION c. z t is wle to Yt if
) = t t E(zilz_~ , y_<) i > t. (c)
The interpretation here is that past y's carry no information on future
z's not present in past z's. Alternatively, z t is wle to Yt if the
(residuals from the) regression of z on its own past (are) is the same
as (the residuals from) the regression of z t on the past of z and of the
past of y.
Proofs of Equivalence. First observe that
~(ytl t ~ ~(ytlz t ~(yt I~~ z_oo, zt+ I) = _~) + zt+ I)
where
Thus
zt+ 1 = zt+ 1 - E(Zt+llzt_co )"
(a) <=~> E yt(zi- E(zilz~ )) = 0 i > t
<=~> E(y t- E(Ytizto) zi = 0 i > t
310
which is (b).
Next, since
where
Then (c) =~
which is (b).
then (b) =>
which is (c).
~(zilzt , t ) z t ~t _ Y_~ = E(zil _~) + E(zilY_ )
~t t ~, t ,z t ) Y_~ = Y_~ - ~Y_~L _~ "
E zi(y t- E(YtlZ~ )) = 0
On the other hand, introducp
wt = Yt - E(Yt Izt~)-
i > t
E(ziwt) = 0 Vi, t
,-> ~(zilwt ~) = 0
~ t ~(Zi ztoo, W t => E(zilz_~) = I -~)
The last equality follows since for any x
t-l) E(xlzL ' Ytoo ) = E(xl zt-co' Yt' Y_oo
= E(xl t t-l) z_oo~ w t' Y-oo
(see Appendix A3)
t-i = E(x]zt~ I' Y_oo ' z t'wt)"
Now iterating the argument gives the above equality.
(ii) Strong line@r exogenei ~
The following three definitions are equivalent.
DEFINITION a'. z is strongly linearly exogenous (sle) to Yt if t
Iz_~ ) = E(Ytl _~ , zt). E(Yt t-i z t-I ~ (a')
The difference between a' and a clearly is that here instantaneous corre-
lation between y and z is not allowed.
311
DEFINITION b'. zt is sle to Yt if
~ ~ t-i E(zi(Yt-E(YtlZ_o ° )) = 0
(again this automatically holds for i < t).
correlation that is excluded.
DEFINITION c'. zt is sle to Yt if
~(zilzt21) ~(zi I t-1 t _ = z_~ , y~)
Proofs of equivalence are similar to before.
Vi >_t
Again it is instantaneous
i >_t.
(b')
(c')
19C. Linear Feedback Free Processes
Consider two processes y, z and suppose the joint dynamic relation
between them is described by the closed loop system
Yk = P0Zk + Nogfk
z k = CoY k + Mogbk
El ~fk) ~bj) Q . gbk (~fj = -~kj
Suppose that (Yk' Zk) is stationary. As pointed out in Lectures 17, 18
this is guaranteed if
1 - PoCo = S O , SoM O, SoPoM 0, S0N O, SoNoC 0 are stable.
Actually, if (Yk' Zk) is stationary we can always construct such a
closed loop model from any spectral factor of the spectrum of (Yk' Zk)
(cf. Appendix AI8). However, the identifiability question is exactly the
fact that this may not give PO' NO' CO' MO of the true system (if there
is one).
Identifiability will require there be a delay in the loop i.e.
(19.1)
(19.2)
(19.3)
(18.3)
s0(0) = l .
312
Then we can solve (19.1), (19.2) to find
f Yk] 1 PO NO Cfk] = S O •
[ z k C O I MO ~bk J
The identifiability question leads to the following defintions.
We say the structure (or system) P0' NO' CO' M0 is
(i) weakly linearly feedback free (wlff) if
C O = 0
(ii) strongly linearly feedback free (slff) if
C O = 0 and Q is diagonal.
If (i) fails we say the structure PO' NO' CO' MO is weakly linearly
feedback connected.
We also formulate the following definitions. The ordered pair of
processes (y,z) is
(i) wlff if
(ii) slff if
~=0
= 0, Q is diagonal
where P, N, C, M, Q are obtained from the NMSF as in Appendix AI8 and
Lecture 2.
The connections between the structural definitions and the process
definitions are discussed in Lecture 20. For the moment we relate the
process definitions to weak and strong linear exogeneity of the previous
section.
THEOREM 19.1. If (Yk' Zk) is stationary then the process (y, z) is wlff
iff z k is wle to Yk"
THEOREM 19.2. If (Yk' Zk) is stationary then the process (y, z) is slff
(19.4)
(19.5a)
(19.6a)
(19.5b)
(19.6b)
313
iff z k is sle to Yk"
Proof of Theorem 19.1. We show the equivalence of (b) and (19.5b).
suppose (b) holds. Now by iterating the projection calculation
First
~ - t-2 E(ztlz_tool ) = E(ztlz_o ° , zt_ I)
z t-2
= E(ztlCt_ I,, z_oo)
= + (z Iz- Z 2)
where c~ : z t E<z tlzt~ I) - - : we can produce an innovations representation
(IR) or Wold decomposition for z as t
co z = Z0 D c zt i t - i
z Z ~
Di = E(zt a t - i )Z- lz 2z = E(S~S t ); D O = I.
Stationarity ensures D. does not depend on t. Now define i
t Yt ~(ytl t ~(ytl zt w = - = - s °° ) . z_oo) Y t
Then iff (b) holds w t, z i are orthogonal. Let
= w co w
wt st + ZIAiE t-i
= st-i) Ewl; ~-w = Ct ); A0 I Ai E(wt w E(g t w' =
be an IR of w . Then we have the joint IR representation t
Yt = A(L) B(L) c t
z t 0 D(L) c
(19.5c)
i z 7j-i A(L) = Z 0AiL etc.; B i = E(y tst_i) z "
Now introduce B(L) = B(L) - A(L)B 0 and rewrite this as
z t 0 D(L) Ebt J Sbt
with
314
[nil I 0]i I Cbt j 0 I E~
So gft' gbt are instantaneously correlated; B(0) = O. The result is
thus established since W(O) = I.
Now suppose (19.5b) holds i.e. we have an IR of the form (19.5c)
but we do not know yet that c~, ~$ are innovation sequences so call them
~w ~z gt' gt" Then we have
z t = D(L)~.
z ~z But this immediately ~*> g = ~ .
t t
Further we then deduce
~ zt-l) Yt - A(L)g~ = Yt - E(Ytl ~_oo
= w say. t
t Now also w t B(L)~ . Hence
]
--> E zt(y j
Vt,j '~> E(z t w.) = 0 Vt,j ]
- E(yjlzj-l)) = 0 Vt, j
which is (b).
Proof of Theorem 19.2. Similar.
Notes. Definition (a) is the one used by Pierce and Haugh (1977) and
introduced by Sims (1972). Definition (c) is due to Granger (1969).
Some of the discussion here is adapted from Caines and Chan (1976) and
Caines (1976). Some related work is Geweke (1978). Caines and Chan
define feedback free processes in terms of a spectral factorization of
(y, z) so a closed loop model is not needed. The present discussion
raises the issue of identifiability discussed in the next lecture.
LECTURE 20: TESTS FOR ABSENCE OF LINEAR FEEDBACK
20A. Test for Linear Feedback
Now it is time to consider how the various equivalences expressed
in Theorems 19.1, 19.2 can be used to generate inferential procedures
(tests of hypothesis at least) for determining the presence or absence
of linear feedback. Here we discuss tests of feedback for processes.
To connect this to feedback testing of a structure (or closed loop system)
we have to discuss identifiability. This is done in the next section.
The equivalences in Theorems 19.1, 19.2 suggest several ways to test
for linear feedback of processes.
(a) Weak linear feedback
(i) Fit a closed loop (ARMA) model and test whether the transfer
function C = 0.
(ii) Fit a univariate time series model to each of y, z and then
replaced them with their (estimated) innovations sequences,
(lii)
say Syt' Szt" Now compare the two sided regression of gy t
t ~ with the one sided regression of ~yt on gz,_~, gz,t+l on
t g (cf. Definition (a)).
Replace z t by its innovations sequence ~zt" Regress Yt on
t z_~ to p~oduce residual gyzt" Now test czt, gyzt for inde-
pendence (cf. Definition (b)).
(b) Strong linear feedback
(i)' Fit a closed loop n odel and test whether C = 0, Q is (block)
diagonal.
t-i (ii)' As for (ii) but co~are the regression of ~ on ~ with
yt z,-~
t-i E~ (cf. Definition (a')). that of gyt on ~z,-m' z , t t-i
(Jii)' As for (iii) but regress Yt on z_~ to produce Eyzt. Now
316
test s and g for independence (cf. Definition (b')). zt yzt
20B. Identifiabilit X and Weak Linear Feedback
It has been mentioned that there is a problem with condition (18.6):
we may not know whether it is true. The nature of the problem is revealed
in the following example.
EXAMPLE 20.1. Consider the simple system (structure)
i + 2L Yk i + .4L Sfk
i + 4L Zk i + .6L Sbk"
So P0 = CO = 0. Suppose E(gfk Cbj) = 6kjl. Thus the structure is wlff.
Suppose we use method (ii) to test for wle of z t with respect to Yt
(i.e. whether the process (y, z) is wlff). We will only be able to fit a
minimum phase model namely
1 1 i+ ~L ~ i+ 7L ~
Yk 1 + .4L Sfk' Zk I + .6L ~bk
where then
~ 1 + 2L ~ 1 + 4L 1 gfk' ebk = 1 Sfk' gfk i + ~L 1 + %L
Now ~f, Sb are white noises but they are cross serially correlated because
1 + 2L 1 + 4L the all pass filters - - , 1 smear the instantaneous correlation
1 + ½L 1 + ~ L
of ef, Sb over the whole lag axis. Thus we will deduce the process (y, z)
is weakly linearly feedback connected.
On the other hand, if E(efkebj) = 0 Vk, j then also E(~fkSbj) = 0
Vk, j so if the structure is slff then when we use method (ii) (or any other)
to test the process (y, z) we will deduce that the process (y, z) ia slff.
This reveals an identifiabiiity problem with detecting wle of the structure:
we cannot do it unless we know NO, M 0 are minimum phase. If they are not
then we can still detect sle of the structure.
To put it carefully, if the structure is slff then so is the process.
317
But if the process is slff can we be sure the system (or structure) we
modelled exhibits sle? The answer is yes provided certain special pole
zero cancellations are prohibited. This is brought out in the following
example.
EXAMPLE 20.2. Consider the system
1 1 + .6L PO 1 + .5L ' NO = 1 + .8L
1 + .5L C O = 0 ' MO - 1 + .75L
E(Cfk Sbj) = 0 Vk, j.
So the system shows sle or is sill. We can construct the overall structure
in a number of ways. It is convenient here to construct the irreducible
M~D's or AR~X forms
(P0 :No) = A~I(Bf :Cf)
= [(I +.5L)(I+ .8L)]-I[(I+.8L)L : (I+.6L)(I+.5L)]
(C O :M 0) = <I(B b :C b)
= (I+.75L)-I[0 :i + .5L].
Then
W 0 = ~-B b Abl k 0 CbJ
= [ (I+.5L)(I+.8L)0 -(I+,8L)LI + .75L ]-i [ (I+'6L)(I+'5L)0 1 +0 ]..5L
This factorization is not coprime. A coprime factorization is
WO = [ 5(I+.8L)L ] [ 5(I+.6L) 6L
0 1 + .75L 0 1 + .5L
318
Note that W(0) = I. det W has no poles or zeroes in [z I ~ 1 so it is stable,
minimum phase so W 0 = W the NMSF. Also, Q : I, W21 = 0 => the process
(y, z) is slff (this follows any way since the structure is slff).
Now consider the following spectral factor
[ Ill = 5(I+.8L) L 5 + 13.913L 4.034L
0 1 + .75L j -I.085L 1 + .882L
.197 .813
This defines a structure P, N, C, M. Note that W(0) = I (so P(0) = 0)
but W21 # 0 and Q is not (block) diagonal. Thus the process (y, z) is slff
but the system or structure from which it could have come is not. The
problem is then the special pole-zero cancellation.
The situation can be summed up in the following result.
THEOREM 20.2. If the structure P0' NO' CO' M0 is generic (see Lecture 18)
Q0 is block diagonal, the process (y, z) is stationary,
P0(O) = 0 = Co(0)
N0(0), M0(0) are both non zero.
Then if the process (y, z) is slff so is the structure P0' NO' CO' M0
i.e.
= 0, Q (block) diagonal
'=> C 0 = 0, Q0 (block) diagonal.
Proof. See Gevers and Anderson (1981).
Summary. Suppose then we have a system described by a closed loop model.
If we know the noise transfer functions are minimum phase we can test wle
and sle. If this is unknown then provided the system is generic we can
test for sle. If the system is actually wle (but N_0, M_0 are not minimum
phase) we can never know that; if the process is wle we cannot be sure the
system is (unless N_0, M_0 are minimum phase). These points, first raised
by Gevers and Anderson (1981), seem of some importance for testing linear
exogeneity.
Notes. A discussion of tests of linear exogeneity is given in Hsiao (1979)
who uses method (i) by fitting autoregressions. Pierce and Haugh (1977)
use method (ii). Also Caines and Chan (1976) use method (i) but fit ARMA
models. The material in this lecture draws heavily on Gevers and Anderson
(1981).
REFERENCES
Aasnes, H.B. and Kailath, T.: An Innovations Approach to Least Squares Estimation - Part VII: Some Applications of Vector ARMA Models. IEEE Trans. Autom. Contr. AC-18, p.601-607, 1973
Aasnes, H.B. and Kailath, T.: Initial-Condition Robustness of Linear Least Squares Filtering Algorithms. IEEE Trans. Autom. Contr., p.393-397, 1974
Abraham, B. and Box, G.E.P.: Deterministic and Forecast-adaptive Time-dependent Models. J.R.S.S. (C) 27, p.120, 1978
Aitchison, J. and Silvey, S.D.: Maximum Likelihood Estimation of Parameters Subject to Restraints. Ann. Math. Stat. 29, p.813- 828, 1958
Akaike, H.: Autoregressive Model Fitting for Control. Ann. Inst. Stat. Math. 23, p.163, 1971
Akaike, H.: Maximum Likelihood Identification of Gaussian ARMA Models. Biometrika 60, p.225, 1973
Akaike, H.: Stochastic Theory of Minimal Realization. IEEE Trans. Autom. Contr. AC-19, p.667, 1974a
Akaike, H.: Markovian Representation of Stochastic Processes and its Application to the Analysis of ARMA processes. Ann. Inst. Stat. Math., p.363, 1974b
Akaike, H.: Markovian Representation of Stochastic Processes by Canonical Variables. SIAM J. Control 13, p.162, 1975
Akaike, H.: Canonical Correlation Analysis of Time Series and the Use of an Information Criterion. In System Identification, Advances and Case Studies, Mehra and Lainiotis, eds., Academic Press, 1976
Anderson, B.D.O.: Covariance Factorization via Newton-Raphson Iteration. IEEE Trans. Inform. Theory IT-24, p.187, 1978
Anderson, B.D.O., Hitz, K.L. and Diem, N.D.: Recursive Algorithm for Spectral Factorization. IEEE Trans. C.A.S. 21, p.742, 1974
Anderson, B.D.O. and Moore, J.B.: Optimal Filtering. Prentice Hall, 1979
Anderson, B.D.O. and Gevers, M.R.: Identifiability of Linear Stochastic Systems Operating under Feedback. Automatica 1981
Anderson, T.W.: The Statistical Analysis of Time Series. John
Wiley, 1971
Anderson, T.W. and Taylor, J.B.: Strong Consistency of Least Squares Estimates in Normal Linear Regression. Ann. Stat. 4, p.788, 1976
Åström, K.J.: Introduction to Stochastic Control Theory. Academic Press, 1970
Billingsley, P.: Probability and Measure. John Wiley, 1979
Bloomfield, P.: On the Error of Prediction of a Time Series. Biometrika 59, p.501-507, 1972
Breusch, T.S.: Conflict among Criteria for Testing Hypotheses: Extension and Comment. Econometrica 47, 203-208, 1979.
Breusch, T.S. and Pagan, A.R.: The Lagrange Multiplier Test and its Applications to Model Specification in Econometrics. Rev. Econ. Stud. 47, p.239, 1980
Bowden, R.: The Theory of Parametric Identification. Econometrica 41, p.1069-1074, 1973
Box, G.E.P. and Jenkins, G.M.: Time Series Analysis Forecasting and Control. Holden-Day, Revised Ed., 1976
Box, G.E.P. and Tiao, G.C.: A Canonical Analysis of Multiple Time Series. Biometrika, p.355-365, 1977
Caines, P.E.: Weak and Strong Feedback Free Processes. IEEE Trans. Autom. Contr., p.737, 1976
Caines, P.E.: Prediction Error Identification Methods for Stationary Stochastic Processes. IEEE Trans. Autom. Contr., p.500, 1976
Caines, P.E. and Ljung, L.: Asymptotic Normality and Accuracy of Prediction Error Estimators. Tech. Report #7602, Dept. Elec. Eng., Univ. Toronto, 1976
Caines, P.E. and Chan, C.W.: Estimation Identification and Feedback. In System Identification Advances and Case Studies, Mehra and Lainiotis, eds., Academic Press, 1976
Chan, S.W., Goodwin, G.C. and Sin, K.S.: Convergence and Properties of the Solutions of the Riccati Difference Equation. Tech. Report #8201, Dept. Elec. Cmptr. Eng., Univ. Newcastle, NSW, Australia, 1982
Cox, D.R. and Hinkley, D.V.: Theoretical Statistics. Chapman and Hall 1974
Deistler, M.: Z-Transform and Identification of Linear Econometric Models with Autocorrelated Errors. Metrika 22, p.13, 1975
Deistler, M.: The Identifiability of Linear Econometric Models with Autocorrelated Errors. Intl. Econ. Rev. 17, p.26-46, 1976
Deistler, M.: The Structural Identifiability of Linear Models with Autocorrelated Errors in the Case of Cross Equation Restrictions. Jl. Econometrics 8, p.23, 1978
Dickey, D.A. and Fuller, W.: Distribution of the Estimators for Autoregressive Time Series with a Unit Root. J.A.S.A. 74, p.427, 1979
Dickey, D.A. and Fuller, W.: Distribution of Likelihood Ratio Test Statistics for Nonstationary Time Series. J.A.S.A. 1981
Doob, J.L.: Stochastic Processes. John Wiley 1953
Duncan, D.B. and Horn, S.D.: Linear Dynamic Recursive Estimation from the Viewpoint of Regression Analysis. J.A.S.A. 67, p.815 1972
Dunsmuir, W. and Hannan, E.J.: Vector Linear Time Series Models. Adv. Appl. Prob. 8, p.339, 1976
Durbin, J.: Testing for Serial Correlation in Least Squares Regression When Some of the Regressors are Lagged Dependent Variables. Econometrica 38, p.410, 1970
Eicker, F.: Asymptotic Normality and Consistency of the Least Squares Estimators for Families of Linear Regressions. Ann. Math. Stat. 34, p.447, 1963
Findley, D.F.: Geometrical and Lattice Versions of Levinson's General Algorithm. In Applied Time Series Analysis II, Findley, ed., Academic Press, 1981
Fuller, W.: Introduction to Statistical Time Series. John Wiley 1976
Fuller, W.A., Hasza, D.P. and Goebel, J.J.: Estimation of the Parameters of Stochastic Difference Equations. Ann. Stat. 9, p.531-543, 1981
Gardner, G.A.C., Harvey, A.C. and Phillips, G.D.A.: An Algorithm for Exact Maximum Likelihood Estimation of ARMA Models by Means of Kalman Filtering. J.R.S.S. (C) 29, p.311, 1980
Gevers, M.R. and Anderson, B.D.O.: On Jointly Stationary Feedback Free Stochastic Processes. Manuscript, 1980
Gevers, M.R. and Anderson, B.D.O.: Representations of Jointly Stationary Stochastic Feedback Processes. Int. Jl. Control., 1981
Geweke, J.: Testing the Exogeneity Specification in the Complete Dynamic Simultaneous Equation Model. Jl. Econometrics 7, p.163, 1978
Godfrey, L.G.: Testing the Adequacy of a Time Series Model. Biometrika 66, p.67, 1979
Goodwin, G.C.: A Simplified Method for the Determination of the Curvature of an Error Index with Respect to Parameters and Initial States. Int. Jl. Control. 8, p.253, 1968
Goodwin, G.C. and Chan, S.W.: Restricted Complexity Predictors for Time Series with Deterministic and Nondeterministic Components. Unpublished manuscript 1982
Granger, C.W.J.: Investigating Causal Relations by Econometric Models and Cross-Spectral Methods. Econometrica 37, 1969
Glover, K. and Willems, J.C.: Parameterizations of Linear Dynamical Systems: Canonical Forms and Identifiability. IEEE Trans. Autom. Contr. AC-19, p.640, 1974
Grewal, M.S. and Glover, K.: Identifiability of Linear and Nonlinear Dynamical Systems. IEEE Trans. Autom. Contr., p.833, 1976
Hall, P. and Heyde, C.C.: Martingale Limit Theory and its Application. Academic Press, 1980
Hannan, E.J.: The Identification of Vector Mixed ARMA Systems. Biometrika 56, p.223, 1969
Hannan, E.J.: Multiple Time Series. John Wiley, 1970
Hannan, E.J.: The Identification Problem for Multiple Equation Systems with Moving Average Errors. Econometrica 39, p.751, 1971
Hannan, E.J.: The Asymptotic Theory of Linear Time Series Models. J. Appl. Prob. 10, p.130, 1973
Hannan, E.J.: The Identification and Parameterization of ARMAX and State Space Forms. Econometrica 44, p.713, 1976
Hannan, E.J. and Nicholls, D.F.: The Estimation of Mixed Regression, Autoregression Moving Average and Distributed Lag Models. Econometrica 40, p.529, 1972
Harvey, A.C.: Finite Sample Prediction and Overdifferencing. Jl. Time Series 2, p.221, 1982
Heyde, C.C.: Martingales: A Case for a Place in the Statistician's Repertoire. Aust. Jl. Stat. 14, p.1, 1972
Hosking, J.R.M.: Lagrange Multiplier Tests of Time Series. J.R.S.S. (B) 42, p.170, 1980
Hsiao, C.: Causality Tests in Econometrics. Jl. Econ. Dyn. Contr. 1, p.321, 1979
Hsiao, C.: Identification. Tech. Report #311, Economic Series, Inst. Math. Stud. Soc. Sci., Stanford University, 1980
Hwang, S.Y.: Solution of Complex Integrals Using the Laurent Expansion. IEEE Trans. ASSP 26, p.263-265, 1978
Intriligator, M.D.: Econometric Models, Techniques and Applications. Prentice Hall, 1978
Jazwinski, A.H.: Stochastic Processes and Filtering Theory. Academic Press, 1980
Jennrich, R.I.: Asymptotic Properties of Non-linear Least Squares Estimators. A.M.S. 40, p.633-643, 1969
Justice, J.H.: An Algorithm for Inverting Positive Definite Toeplitz Matrices. SIAM J. Appl. Math. 23, p.289-291, 1972
Kailath, T.: A View of Three Decades of Filtering Theory. IEEE Trans. Inform. Theory. IT-20, 2, p.146, 1974
Kailath, T.: Linear Systems. Prentice Hall, 1980
Kalman, R.E.: A New Approach to Linear Filtering and Prediction Problems. J. Basic Eng., Trans. ASME, Series D, 82, p.35-45, 1960
Kashyap, R.L.: Maximum Likelihood Identification of Stochastic Linear Systems. IEEE Trans. Autom. Contr. AC-15, p.25, 1970
Kawashima, H.: Parameter Estimation of Autoregressive Integrated Processes by Least Squares. Ann. Stat. 8, p.423-435, 1980
Kimeldorf, G. and Wahba, G.: A Correspondence Between Bayesian Estimation on Stochastic Processes and Smoothing by Splines. Ann. Math. Stat. 41, p.495, 1970
Kohn, R.: Note Concerning the Akaike and Hannan Estimation Procedures for an ARMA Process. Biometrika 64, p.622, 1977
Kohn, R.: Asymptotic Properties of Time Domain Gaussian Estimators. Adv. Appl. Prob. 10, p.339-359, 1978a
Kohn, R.: Local and Global Identification and Strong Consistency in Time Series Models. Jl. Econometrics 8, p.269-293, 1978b
Kohn, R.: Asymptotic Estimation and Hypothesis Testing Results for Vector Linear Time Series Models. Econometrica 47, p.1005-1030, 1979a
Kohn, R.: Identification Results for ARMAX Structures. Econometrica 47, p.1295, 1979b
Koussiouris, T.G. and Kafiris, G.P.: Controllability Indices, Observability Indices, and the Hankel Matrix. Int. Jl. Control 33, p.723, 1981
Kshirsagar, A.M.: Multivariate Analysis. Marcel Dekker, 1974.
Lai, T.L., Robbins, H. and Wei, C.Z.: Strong Consistency of Least Squares Estimates in Multiple Regression II. J. Mult. Anal. 9, p.343-361, 1979
Lai, T.L. and Wei, C.Z.: Least Squares Estimates in Stochastic Regression Models with Applications to Identification and Control of Dynamic Systems. Ann. Stat. 10, p.154-166, 1982a
Lai, T.L. and Wei, C.Z.: Asymptotic Properties of Projections with Applications to Stochastic Regression Problems. J. Mult. Anal. To appear, 1982b
Lai, T.L. and Wei, C.Z.: Asymptotic Properties of General Autoregressive Models and Strong Consistency of Least Squares Estimates of Their Parameters. J. Mult. Anal. To appear, 1982c
Lindquist, A.: A New Algorithm for Optimal Filtering of Discrete-Time Stationary Processes. SIAM J. Control 4, p.736-747, 1974
Ljung, L.: On the Consistency of Prediction Error Identification Methods. In System Identification, Advances and Case Studies, Mehra, R.K. and Lainiotis, D.G. (eds), Academic Press, 1976
Ljung, L. and Caines, P.E.: Asymptotic Normality of Prediction Error Estimators for Approximate System Models. Manuscript, 1976
Ljung, L. and Rissanen, J.: On Canonical Forms, Parameter Identifiability and the Concept of Complexity. 4th IFAC Symp. on Identification and Parameter Estimation, Tbilisi, U.S.S.R., 1976
McLeish, D.L.: Dependent Central Limit Theorems and Invariance Principles. Ann. Prob. 2, p.620, 1974
Makhoul, J.: A Class of All-Zero Lattice Digital Filters: Properties and Applications. IEEE Trans. ASSP 2, p.304-314, 1978
Neveu, J.: Discrete Parameter Martingales. North-Holland, 1975, p.34
Newton, H.J.: Using Periodic Autoregressions for Multiple Spectral Estimation. Technometrics 24, 2, p.109, 1982
Ng, T.S., Goodwin, G.C., Anderson, B.D.O.: Identifiability of MIMO Linear Dynamic Systems Operating in Closed Loop. Automatica 13, p.477-485, 1977
Nicholls, D.: The Efficient Estimation of Vector Linear Time Series Models. Biometrika 63, p.381, 1976
Pagano, M.: On Periodic and Multiple Autoregressions. Ann. Stat. 6, p.1310, 1978
Paige, C.C. and Saunders, M.A.: Least Squares Estimation of Discrete Linear Dynamic Systems Using Orthogonal Transformations. SIAM J. Numer. Anal. 14, p.180, 1977
Parzen, E.: Some Recent Advances in Time Series Modeling. IEEE Trans. Autom. Contr. AC-19, p.723, 1974
Parzen, E.: Statistical Inference on Time Series by Hilbert Space Methods. In E. Parzen, Time Series Analysis Papers, Holden-Day, 1967
Parzen, E.: Time Series Model Identification and Prediction Variance Horizon. In Applied Time Series Analysis II, Findley, D.F. (ed), Academic Press, 1981
Pierce, D.A. and Haugh, L.D.: Causality in Temporal Systems: Characterizations and a Survey. Jl. Econometrics 5, p.263, 1977
Rauch, H.E., Tung, F. and Striebel, C.T.: Maximum Likelihood Estimates of Linear Dynamic Systems. AIAA Jl., p.1445-1450, 1965
Richmond, J.: Identifiability in Linear Models. Econometrica 42, p.731-736, 1974
Richmond, J.: Aggregation and Identification. Int. Ec. Rev. 17 p.47-56, 1976
Reiersøl, O.: Identifiability, Estimability, Pheno-restricting Specifications, and Zero Lagrange Multipliers in the Analysis of Variance. Skand. Aktuar. 46, p.131-142, 1963
Rissanen, J.L. and Barbosa, D.: Properties of Infinite Variance Covariance Matrices and Stability of Optimum Predictors. Inform. Sci. 1, p.221-236, 1969
Rissanen, J.L. and Ljung, L.: Estimation of Optimum Structures and Parameters for Linear Systems. Proc. CNR-CISM Symp. on Algebraic System Theory, Udine, 1975
Rissanen, J.L. and Caines, P.E.: The Strong Consistency of Maximum Likelihood Estimators for ARMA Processes. Ann. Stat. 7, p.297, 1979
Robinson, E.A.: Random Wavelets and Cybernetic Systems. Charles Griffin, London, 1962
Robinson, E.A.: Multichannel Time Series Analysis with Digital Computer Programs. Holden-Day, 1967
Robinson, E.A. and Treitel, S.: Geophysical Signal Analysis. Prentice-Hall, 1980
Rosenbrock, H.H.: State-Space and Multivariable Theory. John Wiley, 1970
Rothenberg, T.J.: Identification in Parametric Models. Econometrica 34, p.577-591, 1971
Sage, A.P.: Optimum Systems Control. Prentice-Hall 1968
Scheffe, H.: The Analysis of Variance. John Wiley, 1959
Scott, D.J.: Central Limit Theorems for Martingales and for Processes with Stationary Increments Using a Skorokhod Representation Approach. Adv. Appl. Prob 5, p. 119, 1973.
Shibata, R.: Selection of the Order of an Autoregressive Model by Akaike's Information Criterion. Biometrika 63, p.117-126, 1976
Sidhu, G.S. and Kailath, T.: Development of New Estimation Algorithms by Innovations Analysis and Shift Invariance Properties. IEEE Trans. I.T., p.759-762, 1974
Silverman, L.M.: Discrete Riccati Equations: Alternative Algorithms, Asymptotic Properties and System Theory Interpretations. In Control and Dynamic Systems, Advances in Theory and Applications, Leondes, C.T. (ed), Vol. 12, 1976
Silvey, S.D.: The Lagrangian Multiplier Test. A.M.S. 30, p.389-407, 1959
Sims, C.A.: Money, Income and Causality. Am. Ec. Rev. 62, p.540, 1972
Söderström, T., Ljung, L. and Gustavsson, I.: Identifiability Conditions for Linear Multivariable Systems Operating Under Feedback. IEEE Trans. Autom. Contr., p.837-840, 1976
Söderström, T.: On Model Structure Testing in System Identification. Int. Jl. Control 26, p.1-18, 1977
Solo, V.: Time Series Recursions and Stochastic Approximation. Unpublished Ph.D. Thesis, Aust. Nat. Univ., 1978
Solo, V.: Strong Consistency of Least Squares Estimates in Regression with Correlated Disturbances. Ann. Stat., 1981
Solo, V.: Consistency of Least Squares Estimates in ARX models. In preparation 1982
Solo, V.: ARMA models without MA parameters. In Preparation, 1982
Son, L.H. and Anderson, B.D.O.: Design of Kalman Filters Using Signal Model Output Statistics. Proc. IEE, Vol. 120, p.312-318, 1973
Sternby, J.: On Consistency for the Method of Least Squares Using Martingale Theory. IEEE Trans. Autom. Contr. AC-22, p. 346, 1977
Stewart, G.W.: Introduction to Matrix Computations. Academic Press, 1973
Stigum, B.P.: Least Squares and Stochastic Difference Equations. Jl. Econometrics 4, p.349-370, 1976
Strang, G.: Linear Algebra and its Applications. Academic Press 1976
Theil, H.: Principles of Econometrics, John Wiley, 1971
Tiao, G.C. and Box, G.E.P.: An Introduction to Applied Multiple Time Series Analysis. Tech. Report #582, Dept. Statistics, Univ. Wisconsin, 1979
Tiao, G.C. and Tsay, R.S.: Identification of Nonstationary and Stationary ARMA Models. Tech. Report #647, Dept. Statistics, Univ. Wisconsin, 1981
Tse, E. and Weinert, H.L.: Structure Determination and Parameter Identification for Multivariable Stochastic Linear Systems. IEEE Trans. Autom. Contr. AC-20, p.603, 1975
Vieira, A. and Kailath, T.: On Another Approach to the Schur-Cohn Criterion. IEEE Trans. C.A.S., p.218-220, 1977
Walker, A.M.: Asymptotic Properties of Least Squares Estimates of Parameters of the Spectrum of a Stationary Non-Deterministic Time Series. J. Aust. Math. Soc. 4, p.363, 1964
Wertz, V., Gevers, M. and Hannan, E.J.: The Determination of Optimum Structures for the State Space Representation of Multivariate Stochastic Processes. Unpublished manuscript, 1981
Whittle, P.: The Analysis of Multiple Time Series. J.R.S.S. (B) 15, p.125-139, 1953
Wilson, G.T.: The Factorization of Matrical Spectral Densities. SIAM Jl. Appl. Math. 24, p.420-426, 1972
Wilson, G.T.: Some Efficient Computational Procedures for High Order ARMA Models. J. Stat. Comp. Simul. 8, p.301-309, 1979
Wolovich, W.A.: Linear Multivariable Systems. Springer-Verlag Applied Math. Sci., Vol. 11, 1974
Wu, C.F.: Asymptotic Theory of Nonlinear Least Squares Estimation. Ann. Stat. 1981
Rissanen, J.: Algorithms for Triangular Decomposition of Block Hankel and Toeplitz Matrices with Application to Factoring Positive Matrix Polynomials. Math. Comp. 27, p.147-154, 1973a
Rissanen, J.: A Fast Algorithm for Optimum Linear Prediction. IEEE Trans. Autom. Contr. AC-18, p.555, 1973b
Hannan, E.J. and Quinn, B.G.: The Determination of the Order of an Autoregression. J.R.S.S. (B) 41, p.190-195, 1979