Gerzensee, Econometrics Week 3, March 2020
Linear Models with Serially Correlated Data
Review from Bo's Week 2…
Background: Asymptotics for Serially Correlated Processes – LLN and CLT
(1) LLN and Ergodicity. A process {z_t} is ergodic if its elements are asymptotically independent – that is, if random variables that are far apart in the sequence are essentially statistically independent of one another (see Hayashi, page 101, and Hamilton, pages 46-47). Ergodicity is important because, together with stationarity, it leads to a SLLN:

Ergodic Theorem: Suppose {z_t} is stationary and ergodic with E(z_t) = µ. Then

T^{-1} ∑_{t=1}^{T} z_t →_{a.s.} µ.

This is a generalization of the SLLN. (For a proof of the theorem and a more detailed discussion see Karlin and Taylor (1975).)

If {z_t} is stationary and ergodic, then so is x_t = f(z_t) for an arbitrary function f.
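The following is a minimal simulation sketch (not part of the original notes) illustrating the ergodic theorem for a stationary AR(1) process: the sample mean converges to the population mean even though the observations are serially dependent. The parameter values are illustrative.

```python
import numpy as np

# Minimal illustration of the ergodic theorem: the sample mean of a
# stationary, ergodic AR(1) process converges to its population mean.
rng = np.random.default_rng(0)

mu, phi, sigma = 2.0, 0.8, 1.0   # illustrative parameter values

def ar1_sample_mean(T):
    # Simulate z_t - mu = phi*(z_{t-1} - mu) + e_t, starting from the
    # stationary distribution, and return the sample mean.
    z = np.empty(T)
    z[0] = mu + rng.normal(scale=sigma / np.sqrt(1 - phi**2))
    for t in range(1, T):
        z[t] = mu + phi * (z[t - 1] - mu) + rng.normal(scale=sigma)
    return z.mean()

for T in (100, 1_000, 10_000, 100_000):
    print(T, ar1_sample_mean(T))   # approaches mu = 2.0 as T grows
```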
(2) CLT for martingale difference sequences (mds): Let {g_t} be a (possibly vector-valued) mds that is stationary and ergodic with E(g_t g_t′) = Σ_gg. Then

√T ḡ = (1/√T) ∑_{t=1}^{T} g_t →_d N(0, Σ_gg).

See Hayashi (p. 106). Notes: While {g_t} is serially uncorrelated, it may be serially dependent (through higher-order moments). E(g_t g_t′) = Σ_gg concerns the "unconditional" variance. The conditional variance may be non-constant.
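As a concrete (hypothetical) example of the note above, the sketch below simulates an ARCH-type mds: g_t has conditional variance depending on g_{t−1}, so it is serially uncorrelated but not independent, and √T ḡ is still approximately normal with variance equal to the unconditional variance Σ_gg. The recursion and parameters are my own illustration.

```python
import numpy as np

# A martingale difference sequence that is serially dependent through its
# second moments (ARCH-type conditional variance), illustrating that "mds"
# does not mean "independent".  Parameters are illustrative.
rng = np.random.default_rng(1)
T, reps = 500, 2000
omega, alpha = 1.0, 0.5          # conditional variance: omega + alpha*g_{t-1}^2

stats = []
for _ in range(reps):
    g = np.empty(T)
    g_prev = 0.0
    for t in range(T):
        sigma_t = np.sqrt(omega + alpha * g_prev**2)
        g[t] = sigma_t * rng.standard_normal()
        g_prev = g[t]
    stats.append(np.sqrt(T) * g.mean())

stats = np.array(stats)
# Unconditional variance Sigma_gg = omega/(1 - alpha) for this recursion.
print("simulated var of sqrt(T)*gbar:", stats.var())
print("Sigma_gg = omega/(1-alpha)   :", omega / (1 - alpha))
```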
Review: Linear Model with Serially Correlated Regressors
These notes will work through the linear regression model and present conditions under which the results in the i.i.d. setting continue to hold when the data are serially correlated. I will not work through the linear IV model, but under analogous conditions the i.i.d. results will also continue to hold. Chapter 3 of Hayashi provides the details. Replace the i.i.d. assumptions with the following:
(i) {y_t, x_t} is a stationary and ergodic process.
(ii) E(ε_t x_t) = 0, or, letting g_t = ε_t x_t, E(g_t) = 0.
(iii) E(x_t x_t′) = Σ_xx, which is non-singular.
(iv) {g_t} is a mds with E(g_t g_t′) = Σ_gg.

Notes: E(ε_t x_t) = 0 is weaker than E(ε_t | X) = 0 and E(ε_t | x_t) = 0. It is sometimes (as in Hamilton) called the assumption of predetermined regressors. We will return to the difference between these assumptions when we discuss GLS in the context of serially correlated errors. E(g_t g_t′) = Σ_gg allows the errors to be heteroskedastic conditional on the regressors – that is, it does not require that Σ_gg = σ_ε² Σ_xx. "{g_t} is a mds" is a "high-level" assumption concerning the cross product of ε_t and x_t. At a more primitive level it is implied by

E(ε_t | {ε_i, x_i}_{i=1}^{t−1}, x_t) = 0.
Properties of β̂

Consistency: β̂ →_p β.

Proof:

β̂ − β = ((1/T) ∑ x_t x_t′)^{-1} ((1/T) ∑ g_t),

and {x_t x_t′} is stationary and ergodic, with E(x_t x_t′) = Σ_xx, so that

(1/T) ∑ x_t x_t′ →_p Σ_xx,

which is non-singular. Also {g_t} is stationary and ergodic with E(g_t) = 0. Thus

(1/T) ∑ g_t →_p 0,

and the result follows from Slutsky's theorem.
Asymptotic Normality: √T(β̂ − β) →_d N(0, V_β̂), where V_β̂ = Σ_xx^{-1} Σ_gg Σ_xx^{-1}.

Proof:

From earlier:

√T(β̂ − β) = ((1/T) ∑ x_t x_t′)^{-1} ((1/√T) ∑ g_t).

Also

(1/T) ∑ x_t x_t′ →_p Σ_xx (nonsingular)

and

(1/√T) ∑ g_t →_d N(0, Σ_gg),

which follows by the CLT for a mds. The result then follows by Slutsky's theorem.
Feasible inference:

(1) Let Σ̂_gg be a consistent estimator of Σ_gg. Then V̂_β = S_xx^{-1} Σ̂_gg S_xx^{-1} →_p V_β̂, where S_xx = (1/T) ∑ x_t x_t′.

Proof: (You should fill in – use Slutsky's theorem.)

(2) t_j = √T(β̂_j − β_j) / √((V̂_β)_{jj}) →_d N(0, 1).

Proof: (You should fill in.)

(3) Let the Wald statistic be written as

ξ_W = T(Rβ̂ − Rβ)′ (R V̂_β R′)^{-1} (Rβ̂ − Rβ),

then

ξ_W →_d χ²_m and ξ_W / m →_d F_{m,∞},

where rank(R) = m.

Proof: (You should fill in – use Slutsky's theorem and the continuous mapping theorem.)
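A minimal sketch (my own illustration, not from the notes) of feasible inference under the mds assumption: estimate β by OLS, form S_xx, the estimator Σ̂_gg = (1/T)∑ ĝ_t ĝ_t′ discussed in (4) below, the sandwich V̂_β, and the t-statistic from result (2). The data-generating process is made up for illustration.

```python
import numpy as np

# Sketch: OLS with the "sandwich" variance Vhat = Sxx^{-1} Sgg_hat Sxx^{-1},
# where Sgg_hat = (1/T) * sum of (ehat_t * x_t)(ehat_t * x_t)'.
# The DGP below (AR(1) regressor, heteroskedastic error) is illustrative.
rng = np.random.default_rng(2)
T = 2000
beta = np.array([1.0, 0.5])

# regressors: a constant and a stationary AR(1)
z = np.zeros(T)
for t in range(1, T):
    z[t] = 0.6 * z[t - 1] + rng.standard_normal()
X = np.column_stack([np.ones(T), z])
eps = rng.standard_normal(T) * np.sqrt(0.5 + 0.5 * z**2)   # conditional heteroskedasticity
y = X @ beta + eps

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
ehat = y - X @ beta_hat
Sxx = X.T @ X / T
g = X * ehat[:, None]                 # ghat_t = ehat_t * x_t  (T x k)
Sgg = g.T @ g / T
V_hat = np.linalg.inv(Sxx) @ Sgg @ np.linalg.inv(Sxx)

se = np.sqrt(np.diag(V_hat) / T)      # standard errors of beta_hat
t_stats = (beta_hat - beta) / se      # approximately N(0,1) under this DGP
print("beta_hat:", beta_hat, "se:", se, "t:", t_stats)
```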
(4) Suppose that E[(x_{t,i} x_{t,j})²] is finite for all i and j. Let ε̂_t = y_t − x_t′β̂, ĝ_t = ε̂_t x_t, and Σ̂_gg = (1/T) ∑ ĝ_t ĝ_t′. Then Σ̂_gg →_p Σ_gg.

Proof (in the model with k = 1, for simplicity):

ε̂_t = y_t − x_tβ − x_t(β̂ − β) = ε_t − x_t(β̂ − β),

so that

Σ̂_gg = (1/T) ∑ ε̂_t² x_t² = (1/T) ∑ ε_t² x_t² + (β̂ − β)² (1/T) ∑ x_t⁴ − 2(β̂ − β) (1/T) ∑ ε_t x_t³.

Now

(1/T) ∑ ε_t² x_t² = (1/T) ∑ g_t² →_p Σ_gg

by the ergodic theorem. Since (β̂ − β) →_p 0 and (1/T) ∑ x_t⁴ →_p E(x_t⁴), we have

(β̂ − β)² (1/T) ∑ x_t⁴ →_p 0.

Note that E(ε_t x_t³) is finite since

|E(ε_t x_t³)| ≤ [E(ε_t² x_t²) E(x_t⁴)]^{1/2}  (Cauchy–Schwarz),

and thus

(1/T) ∑ ε_t x_t³ →_p E(ε_t x_t³),

so that

(β̂ − β) (1/T) ∑ ε_t x_t³ →_p 0,

and the result follows.
Application to the AR(p) model

Some preliminaries:

(a) Suppose y_t follows the MA process

y_t = θ(L)ε_t = ∑_{i=0}^{∞} θ_i ε_{t−i},

where ε_t ~ iid(0, σ²). Let λ_i = E(y_t y_{t+i}) denote the i-th autocovariance of {y_t}. If

∑_{i=0}^{∞} |θ_i| < ∞,

then

∑_{i=0}^{∞} |λ_i| < ∞,

and the process is stationary and ergodic.

Proof: Hamilton, pages 69-70, and Dhrymes, page 370.
(b) Suppose φ(L)y_t = ε_t, where ε_t ~ iid(0, σ²) and φ(L) = 1 − φ_1 L − … − φ_p L^p has roots outside the unit circle. The AR model can be inverted to yield y_t = θ(L)ε_t with ∑_{i=0}^{∞} |θ_i| < ∞.

This can be verified using the results that we worked out for AR models earlier in the week.
Consider using OLS to estimate the coefficients of the AR model. Maintain the assumptions that ε_t is iid(0, σ²) and that the roots of φ(z) are outside the unit circle. Write the model as

y_t = x_t′β + ε_t,

where x_t = (y_{t−1}, y_{t−2}, ..., y_{t−p})′ and β = (φ_1, φ_2, ..., φ_p)′. Then

β̂ →_p β

and

√T(β̂ − β) →_d N(0, V_β),

where

V_β = σ² Σ_xx^{-1},

with Σ_xx = E(x_t x_t′).

Proof: Key points:

{y_t, x_t} is stationary and ergodic, with [E(x_t x_t′)]_{ij} = E(y_{t−i} y_{t−j}) = λ_{|i−j|}.

g_t = ε_t x_t, where ε_t is independent of x_{t+i} for i ≤ 0, independent of ε_{t+i} for i < 0, and E(ε_t) = 0. Thus E(ε_t | {ε_i, x_i}_{i=1}^{t−1}, x_t) = 0, and so g_t is a mds.

E(g_t g_t′) = E(ε_t² x_t x_t′) = E[E(ε_t² x_t x_t′ | x_t)] = σ² Σ_xx.

The result then follows from the general results given above.
AR(1) Example

y_t = βx_t + ε_t with β = φ and x_t = y_{t−1}. Then

Σ_xx = var(y_{t−1}) = var(y_t) = σ²/(1 − φ²)

and

√T(φ̂ − φ) →_d N(0, V_φ)

with

V_φ = σ² Σ_xx^{-1} = 1 − φ²,

so that

φ̂ ~^a N(φ, (1 − φ²)/T),

and an approximate 95% confidence interval for φ is given by

φ̂ ± 1.96 √((1 − φ̂²)/T).
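Below is a minimal simulation sketch of this AR(1) confidence interval: estimate φ by regressing y_t on y_{t−1} and form φ̂ ± 1.96√((1 − φ̂²)/T). The true φ and sample size are illustrative.

```python
import numpy as np

# AR(1): estimate phi by OLS of y_t on y_{t-1} and form the approximate
# 95% confidence interval phi_hat +/- 1.96 * sqrt((1 - phi_hat^2)/T).
rng = np.random.default_rng(3)
phi_true, T = 0.7, 500                      # illustrative values

y = np.zeros(T + 1)
for t in range(1, T + 1):
    y[t] = phi_true * y[t - 1] + rng.standard_normal()

y_lag, y_cur = y[:-1], y[1:]
phi_hat = (y_lag @ y_cur) / (y_lag @ y_lag)  # OLS without intercept
se = np.sqrt((1 - phi_hat**2) / T)
print("phi_hat:", phi_hat)
print("95% CI :", (phi_hat - 1.96 * se, phi_hat + 1.96 * se))
```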
Dropping the Assumption that {g_t} is a mds

Reference: Hayashi, Chapter 6.

1. Implications for standard GMM inference

The asymptotic distributions of the OLS and GMM estimators are driven by the asymptotic normality of (1/√T) ∑_{t=1}^{T} g_t.

What happens when g_t is serially correlated (so that g_t is not a mds)?

It turns out that, as long as g_t is "weakly dependent," then

(1/√T) ∑_{t=1}^{T} g_t →_d N(0, Ω),

but Ω ≠ Σ_gg.
Before stating the result, it is useful to work out an expression for Ω. Suppose g_t is stationary with autocovariances λ_j. Then

var(T^{-1/2} ∑_{t=1}^{T} g_t) = (1/T){Tλ_0 + (T−1)(λ_1 + λ_{−1}) + (T−2)(λ_2 + λ_{−2}) + ... + 1·(λ_{T−1} + λ_{1−T})}
= ∑_{j=−(T−1)}^{T−1} λ_j − (1/T) ∑_{j=1}^{T−1} j(λ_j + λ_{−j}).

If the autocovariances satisfy ∑_{j=1}^{∞} j|λ_j| < ∞ (jargon: they are "1-summable"), then

var(T^{-1/2} ∑_{t=1}^{T} g_t) → ∑_{j=−∞}^{∞} λ_j.

This will be the expression for Ω. Recall that we defined the autocovariance generating function (ACGF) as

λ(z) = ∑_{j=−∞}^{∞} λ_j z^j.

Looking at this, and the expression for Ω, you can see that Ω can be computed from the ACGF by setting z = 1. That is, Ω = λ(1).
Now, a few results:

CLT for the MA(∞) model

Let

y_t = µ + ∑_{j=0}^{∞} θ_j ε_{t−j},

where ε_t ~ iid(0, σ²) and ∑ |θ_j| < ∞. Then

√T(ȳ − µ) →_d N(0, ∑_{j=−∞}^{∞} λ_j).

Proof: Anderson (1971, page 429) or Hamilton (page 195).
Example: MA(1)

Suppose y_t = µ + ε_t − θε_{t−1}. Then

√T(ȳ − µ) = (1/√T) ∑_{t=1}^{T} ε_t − θ(1/√T) ∑_{t=0}^{T−1} ε_t = (1 − θ)(1/√T) ∑_{t=1}^{T} ε_t + θ(1/√T)(ε_T − ε_0).

Notice

(1 − θ)(1/√T) ∑_{t=1}^{T} ε_t →_d N(0, σ²(1 − θ)²)

and

θ(1/√T)(ε_T − ε_0) →_p 0.

Finally, to reconcile with earlier notation: for the MA model, λ(z) = σ²θ(z)θ(z^{-1}), so that

λ(1) = ∑_{j=−∞}^{∞} λ_j = σ²θ(1)² = σ²(1 − θ)²,

and this is the result stated in the theorem.
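A small simulation sketch (illustrative, not from the notes) checking that for this MA(1) the variance of √T(ȳ − µ) is close to the long-run variance σ²(1 − θ)², rather than to the unconditional variance σ²(1 + θ²).

```python
import numpy as np

# MA(1): y_t = eps_t - theta*eps_{t-1}.  The variance of sqrt(T)*ybar is
# approximately sigma^2*(1-theta)^2, not the unconditional variance
# sigma^2*(1+theta^2).  Parameter values are illustrative.
rng = np.random.default_rng(4)
theta, sigma, T, reps = 0.5, 1.0, 1000, 5000

stats = np.empty(reps)
for r in range(reps):
    e = rng.normal(scale=sigma, size=T + 1)
    y = e[1:] - theta * e[:-1]
    stats[r] = np.sqrt(T) * y.mean()

print("var of sqrt(T)*ybar (simulated)      :", stats.var())
print("long-run var sigma^2*(1-theta)^2     :", sigma**2 * (1 - theta) ** 2)
print("unconditional var sigma^2*(1+theta^2):", sigma**2 * (1 + theta**2))
```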
CLT for ergodic and stationary processes

Suppose that {y_t} is stationary and ergodic with finite variance. Then, under a set of "dependence" assumptions analogous to absolutely summable MA coefficients,

√T(ȳ − µ) →_d N(0, ∑_{j=−∞}^{∞} λ_j).

The additional assumptions are given in White (1984), Theorem 5.15.

Properties of OLS and GMM when T^{-1/2} ∑_{t=1}^{T} g_t →_d N(0, Ω)

These are the same as the results presented above, but with Ω replacing Σ_gg.
2. Digression and a Review of GLS

Consider the regression model Y = Xβ + u, where Y is n×1, X is n×k, and so forth. Suppose that E(u | X) = 0 and Var(u | X) = Λ. If Λ = σ²I, then the OLS estimator of β, say β̂_OLS, is the best linear unbiased estimator of β conditional on X (the Gauss-Markov theorem). Moreover, if the errors have a conditional Gaussian distribution, β̂_OLS is the MLE and achieves the CR lower bound, and therefore is the minimum variance unbiased estimator conditional on X.

When Λ ≠ σ²I, β̂_OLS does not (in general) have these efficiency properties. Another estimator, the "generalized least squares" estimator, does. This estimator is

β̂_GLS = (X′Λ^{-1}X)^{-1} X′Λ^{-1}Y.

To motivate this estimator, write Λ = Λ^{1/2}Λ^{1/2}′, so that Λ^{-1} = Λ^{-1/2}′Λ^{-1/2} and Λ^{-1/2}ΛΛ^{-1/2}′ = I. Multiplying the regression relation by Λ^{-1/2} yields

Λ^{-1/2}Y = Λ^{-1/2}Xβ + Λ^{-1/2}u, or Ỹ = X̃β + ũ,

where Ỹ = Λ^{-1/2}Y, X̃ = Λ^{-1/2}X, and ũ = Λ^{-1/2}u. Note that E(ũ | X̃) = 0 and Var(ũ | X̃) = I. Because Ỹ and X̃ are nonsingular transformations of Y and X, the best linear unbiased estimator of β conditional on X is (from the Gauss-Markov theorem):

β̂_GLS = (X̃′X̃)^{-1} X̃′Ỹ = (X′Λ^{-1}X)^{-1} X′Λ^{-1}Y,

where the final equality follows from the definitions of Ỹ and X̃.
An alternative way to derive the estimator is to write the Gaussian conditional likelihood function: Y | X ~ N(Xβ, Λ), so that

f(Y | X) = (2π)^{-n/2} |Λ|^{-1/2} exp{ −(1/2)(Y − Xβ)′Λ^{-1}(Y − Xβ) },

so that the MLE solves

min_b (Y − Xb)′Λ^{-1}(Y − Xb).

Carrying out this minimization yields β̂_GLS.
Examples:

(1) Weighted least squares:

Suppose that Λ = diag(σ_i²). Then Λ^{-1/2} = diag(1/σ_i), and ỹ_i = y_i/σ_i, x̃_i = x_i/σ_i.

The GLS estimator can be constructed as the OLS regression of ỹ_i onto x̃_i. Notice that this estimator is OLS after re-weighting the observations, where the weight applied to the i-th observation is 1/σ_i (so that observations corresponding to u_i with a low variance receive more weight).
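A minimal sketch of weighted least squares as described above: divide each observation by σ_i and run OLS on the transformed data. The σ_i are treated as known and, like the rest of the DGP, are made up for illustration.

```python
import numpy as np

# Weighted least squares: with Lambda = diag(sigma_i^2) known, GLS is OLS of
# y_i/sigma_i on x_i/sigma_i.  The DGP is illustrative.
rng = np.random.default_rng(5)
n = 500
x = np.column_stack([np.ones(n), rng.standard_normal(n)])
sigma = 0.5 + 2.0 * rng.random(n)            # known heteroskedastic std devs
beta = np.array([1.0, 2.0])
y = x @ beta + sigma * rng.standard_normal(n)

y_tilde = y / sigma
x_tilde = x / sigma[:, None]
beta_gls = np.linalg.solve(x_tilde.T @ x_tilde, x_tilde.T @ y_tilde)
beta_ols = np.linalg.solve(x.T @ x, x.T @ y)
print("beta_GLS:", beta_gls)
print("beta_OLS:", beta_ols)
```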
(2) GLS in time series models:

Suppose that u_t = c(L)ε_t, where ε_t ~ iid(0, σ²). Then (ignoring effects associated with initial conditions) ũ_t = c(L)^{-1}u_t = ε_t, so that the GLS estimator can be constructed by regressing c(L)^{-1}y_t onto c(L)^{-1}x_t via OLS.

As an example: suppose that (1 − ρL)u_t = ε_t, so that c(L) = (1 − ρL)^{-1}. Then the GLS estimator is formed by regressing (1 − ρL)y_t = y_t − ρy_{t−1} onto (1 − ρL)x_t = x_t − ρx_{t−1}. Notice that one observation is "lost" using this transformation. A calculation shows that the initial observation should be constructed as ỹ_1 = (1 − ρ²)^{1/2}y_1 and x̃_1 = (1 − ρ²)^{1/2}x_1. Because the first observation is asymptotically negligible relative to the information in the other T−1 observations, the first observation is often dropped from the analysis.
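A minimal sketch of the AR(1) quasi-differencing transformation just described, with ρ assumed known and the first observation rescaled by √(1 − ρ²) as in the notes (a Prais-Winsten-type treatment, my label); the DGP is illustrative.

```python
import numpy as np

# GLS with AR(1) errors and known rho: regress y_t - rho*y_{t-1} on
# x_t - rho*x_{t-1}, with the first observation scaled by sqrt(1-rho^2).
rng = np.random.default_rng(6)
T, rho, beta = 1000, 0.8, np.array([1.0, 2.0])

x = np.column_stack([np.ones(T), rng.standard_normal(T)])
u = np.zeros(T)
for t in range(1, T):
    u[t] = rho * u[t - 1] + rng.standard_normal()
y = x @ beta + u

# quasi-differenced data
y_qd = np.empty(T)
x_qd = np.empty_like(x)
y_qd[0] = np.sqrt(1 - rho**2) * y[0]
x_qd[0] = np.sqrt(1 - rho**2) * x[0]
y_qd[1:] = y[1:] - rho * y[:-1]
x_qd[1:] = x[1:] - rho * x[:-1]

beta_gls = np.linalg.solve(x_qd.T @ x_qd, x_qd.T @ y_qd)
print("beta_GLS:", beta_gls)
```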
Feasible GLS:

To construct the GLS estimator, you need to know Λ. Suppose that it is unknown, but depends on a small number of parameters, say θ, so that Λ = Λ(θ). This suggests that Λ can be estimated as Λ̂ = Λ(θ̂). The "feasible" GLS estimator is

β̂_FGLS = (X′Λ̂^{-1}X)^{-1} X′Λ̂^{-1}Y.

In many models it is possible to show that √n(β̂_FGLS − β̂_GLS) →_p 0, so that, in large samples, β̂_FGLS shares the same efficiency properties as β̂_GLS.
When should you use OLS and HAC standard errors instead of GLS?

Consider the regression model

y_t = x_t′β + u_t,

where u_t is the regression error. Suppose that E(u_t | x_t) = 0, so that E(u_t x_t) = 0. Given the other assumptions discussed above, the OLS estimator of β is consistent and asymptotically normal.

Now suppose that u_t = φu_{t−1} + ε_t, where ε_t ~ iid(0, σ²). This suggests that the GLS estimator might be used. The GLS estimator can be formed as the OLS estimator applied to the transformed model:

ỹ_t = x̃_t′β + ε_t,

where ỹ_t = y_t − φy_{t−1} and similarly for x̃_t. As discussed above, under certain assumptions the GLS estimator is BLUE, and hence is more efficient than the OLS estimator. The key assumption underlying the consistency of the GLS estimator is that E(ε_t x̃_t) = 0. Alternatively, noting that ε_t = u_t − φu_{t−1}, this can be written as

E[(u_t − φu_{t−1})(x_t − φx_{t−1})] = E(u_t x_t) + φ²E(u_{t−1}x_{t−1}) − φE(u_t x_{t−1}) − φE(u_{t−1}x_t) = 0.

For this to be true for all values of φ it must be the case that

E(u_t x_t) = 0       (Term 1)
E(u_{t−1} x_{t−1}) = 0   (Term 2)
E(u_t x_{t−1}) = 0     (Term 3)
E(u_{t−1} x_t) = 0     (Term 4)

The first two of these are implied by E(u_t | x_t) = 0, the same assumption used for the consistency of OLS. The last two restrictions are not.
These two restrictions are implied by stronger assumptions. Term 3 is implied by E(u_t | x_t, x_{t−1}, ...) = 0. When this holds, the regressors are said to be predetermined (or sometimes the term exogenous is used). Term 4 is implied by the stronger assumption E(u_t | ..., x_{t+1}, x_t, x_{t−1}, ...) = 0. When this holds, the regressors are said to be strictly exogenous. Evidently, GLS requires the assumption of strict exogeneity.

Examples to be discussed:
(a) Multi-period forecast efficiency
(b) Orange juice prices and the weather
HAC and HAR inference

(Same setup as above.) y_t = x_t′β + ε_t, where (y_t, x_t) is stationary and ergodic, and so forth.

g_t = ε_t x_t, and

√T S_Xε = √T ḡ = (1/√T) ∑_{t=1}^{T} g_t →_d N(0, Ω),

with Ω = ∑_{j=−∞}^{∞} λ_j and λ_j = E(g_t g_{t+j}′).

√T(β̂ − β) →_d N(0, V_β) with V_β = Σ_XX^{-1} Ω Σ_XX^{-1}.

Let V̂ = S_XX^{-1} Ω̂ S_XX^{-1} and ξ_W = T(β̂ − β)′ V̂^{-1}(β̂ − β).

If Ω̂ →_p Ω, then V̂ →_p V_β, and ξ_W →_d χ²_k.
Focus on estimators of Ω and how this affects inference.

Simple case: x_t = 1 and β = 0. Thus the g_t (= y_t) are the data, with E(g_t) = 0,

β̂ − β = ḡ = T^{-1} ∑_{t=1}^{T} g_t,

√T ḡ →_d N(0, Ω),

and

ξ_W(Ω) = T ḡ Ω^{-1} ḡ →_d χ²_1;

and let ξ_W(Ω̂) = T ḡ Ω̂^{-1} ḡ.
Estimators of Ω:

HAC: The goal is to estimate

Ω = ∑_{j=−∞}^{∞} λ_j.

With a finite sample of data it is impossible to consistently estimate Ω for all possible sequences {λ_j}. But for special sequences, consistent estimation is possible. Two examples:
Example 1: Suppose λ_j = 0 for |j| > 1 (so g_t follows an MA(1) process). Then one need only estimate the variance and first autocovariance of the process. These can be estimated consistently. Thus

Ω̂ = ∑_{j=−1}^{1} λ̂_j

is consistent.
Example 2: Suppose g_t ~ AR(1). In this case λ_j = σ²φ^{|j|}/(1 − φ²), and (from the formula for the ACGF)

Ω = ∑_{j=−∞}^{∞} λ_j = σ²/(1 − φ)².

This can be consistently estimated by estimating the two parameters characterizing the AR(1) process, σ² and φ, which yields

Ω̂ = σ̂²/(1 − φ̂)².
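A sketch of Example 2 in code (my own illustration): fit an AR(1) to a scalar g_t by OLS and form Ω̂ = σ̂²/(1 − φ̂)².

```python
import numpy as np

# Parametric HAC estimator for a scalar g_t modeled as an AR(1):
# fit g_t = phi*g_{t-1} + e_t by OLS and set Omega_hat = sigma2_hat/(1-phi_hat)^2.
def ar1_lrv(g):
    g = np.asarray(g, dtype=float)
    g_lag, g_cur = g[:-1], g[1:]
    phi_hat = (g_lag @ g_cur) / (g_lag @ g_lag)
    resid = g_cur - phi_hat * g_lag
    sigma2_hat = resid.var()
    return sigma2_hat / (1.0 - phi_hat) ** 2

# illustrative check against the true value sigma^2/(1-phi)^2
rng = np.random.default_rng(7)
phi, T = 0.6, 5000
g = np.zeros(T)
for t in range(1, T):
    g[t] = phi * g[t - 1] + rng.standard_normal()
print("Omega_hat:", ar1_lrv(g), " true:", 1.0 / (1 - phi) ** 2)
```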
These two examples are easily generalized. The logic of Example 1 accommodates λ_j = 0 for |j| > k, where k is finite. The logic of Example 2 accommodates any (vector) finite-order ARMA model. And even these can be generalized, if they hold "approximately." There is a large literature on this.
Truncated estimators:

Ω̂ = ∑_{j=−k}^{k} λ̂_j, with λ̂_j = T^{-1} ∑_{t=1}^{T−j} g_t g_{t+j}′.

Truncated estimators are not guaranteed to be PSD. (They can generate values of Ω̂ that are not PSD.)
Weighted truncated estimators:

Ω̂(w) = ∑_{j=−k}^{k} w_j λ̂_j,

where the w_j are weights. Carefully chosen weights ensure PSD estimators.
The most widely used estimator is the "Newey-West" estimator:

Ω̂_NW = ∑_{j=−k}^{k} w_j λ̂_j, where w_j = (k + 1 − |j|)/(k + 1) (called "Bartlett" weights).

k is chosen as a function of T. (The Stock-Watson UG textbook suggests k = 0.75T^{1/3}.)

Ω̂_NW is PSD.
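A minimal sketch of the Newey-West estimator for a scalar series g_t, with Bartlett weights w_j = (k + 1 − |j|)/(k + 1) and the textbook bandwidth rule k = 0.75T^{1/3}; this is my own illustration rather than a library call.

```python
import numpy as np

def newey_west_lrv(g, k=None):
    """Newey-West (Bartlett-weighted) estimate of the long-run variance of a
    scalar series g_t.  Autocovariances use the 1/T convention."""
    g = np.asarray(g, dtype=float)
    g = g - g.mean()
    T = g.size
    if k is None:
        k = int(np.floor(0.75 * T ** (1 / 3)))   # Stock-Watson textbook rule
    omega = g @ g / T                            # lambda_0
    for j in range(1, k + 1):
        w = (k + 1 - j) / (k + 1)                # Bartlett weight
        lam_j = g[:-j] @ g[j:] / T               # lambda_j (= lambda_{-j})
        omega += 2 * w * lam_j
    return omega

# illustrative use on an AR(1) series with phi = 0.5 (Omega = 1/(1-0.5)^2 = 4);
# the Bartlett downweighting biases the estimate somewhat toward zero.
rng = np.random.default_rng(8)
T, phi = 5000, 0.5
g = np.zeros(T)
for t in range(1, T):
    g[t] = phi * g[t - 1] + rng.standard_normal()
print("NW estimate:", newey_west_lrv(g), " true Omega:", 1 / (1 - phi) ** 2)
```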
With Bartlett (i.e., Newey-West) weights, Andrews (1991) shows:

Ω̂_NW →_p Ω if k = k(T) with k(T) → ∞ and k(T)/T → 0.

MSE(Ω̂_NW) is minimized with k(T) = O(T^{1/3}).
With g_t ~ AR(1) with coefficient φ, applying Andrews' formula yields:

k* = 1.1447 × 4^{1/3} × (φ²)^{1/3} × T^{1/3} = 1.82 × φ^{2/3} × T^{1/3}

φ      k*             T = 100   T = 400   T = 1000
0.00   0              0         0         0
0.25   0.72×T^{1/3}   3         5         7
0.50   1.15×T^{1/3}   5         8         11
0.75   1.50×T^{1/3}   6         11        15
0.90   1.70×T^{1/3}   7         12        17
0.95   1.76×T^{1/3}   8         12        17
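A tiny sketch evaluating the slide's bandwidth rule k* = 1.82 φ^{2/3} T^{1/3}; it approximately reproduces the table above, though individual entries can differ by one depending on how the rounding is done.

```python
import numpy as np

# Bandwidth rule from the slide: k* = 1.82 * phi^(2/3) * T^(1/3),
# rounded down to an integer (table entries may differ by one).
def k_star(phi, T):
    return int(np.floor(1.82 * phi ** (2 / 3) * T ** (1 / 3)))

for phi in (0.0, 0.25, 0.5, 0.75, 0.9, 0.95):
    print(phi, [k_star(phi, T) for T in (100, 400, 1000)])
```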
Using rules such as these,

ξ_W(Ω̂_NW) − ξ_W(Ω) →_p 0,

so

ξ_W(Ω̂_NW) →_d χ².
This is all fine, 'asymptotically,' but it turns out that size distortions can be large using χ² critical values for ξ_W(Ω̂_NW).

Example: g_t is AR(1) with coefficient φ. Size of 10% tests of µ_g = 0 when T = 250:

φ      Size
0.00   0.10
0.25   0.13
0.50   0.15
0.75   0.21
0.90   0.33
0.95   0.46
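A Monte Carlo sketch in the spirit of this example (my own, with fewer replications than one would use in practice): the rejection rate of a nominal 10% test of µ_g = 0 based on the Newey-West variance and χ²_1 critical values, when g_t is AR(1) with φ = 0.75 and T = 250. It re-implements the Bartlett/Newey-West variance from the earlier sketch so the block is self-contained.

```python
import numpy as np
from scipy.stats import chi2

# Size of a nominal 10% test of mu_g = 0 using the Newey-West variance and
# chi-square(1) critical values, when g_t is AR(1).  Illustrative settings.
rng = np.random.default_rng(9)
T, phi, reps = 250, 0.75, 2000
crit = chi2.ppf(0.90, df=1)

def newey_west_lrv(g, k):
    g = g - g.mean()
    n = g.size
    omega = g @ g / n
    for j in range(1, k + 1):
        omega += 2 * ((k + 1 - j) / (k + 1)) * (g[:-j] @ g[j:] / n)
    return omega

k = int(np.floor(0.75 * T ** (1 / 3)))
rej = 0
for _ in range(reps):
    g = np.zeros(T)
    for t in range(1, T):
        g[t] = phi * g[t - 1] + rng.standard_normal()
    gbar = g.mean()
    xi = T * gbar**2 / newey_west_lrv(g, k)
    rej += xi > crit
print("rejection rate:", rej / reps)   # well above the nominal 0.10 for persistent g_t
```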
Can we go beyond 'first-order asymptotics'? (A calculation in Phillips and Sun (2008).)

Since ξ(Ω̂) = (Ω/Ω̂) ξ(Ω),

P(ξ(Ω̂) > c) = P(ξ(Ω) > (Ω̂/Ω)c).

Let

F(c, Ω̂) = P(ξ(Ω) > (Ω̂/Ω)c | Ω̂).

Then

P(ξ(Ω̂) > c) = ∫ P(ξ(Ω) > (Ω̂/Ω)c | Ω̂) f(Ω̂) dΩ̂ = ∫ F(c, Ω̂) f(Ω̂) dΩ̂ = E_Ω̂[F(c, Ω̂)].

Write

F(c, Ω̂) = F(c, Ω) + (Ω̂ − Ω)F′(c, Ω) + (1/2)(Ω̂ − Ω)²F″(c, Ω) + ...,

then

P(ξ(Ω̂) > c) ≈ F(c, Ω) + E(Ω̂ − Ω)F′(c, Ω) + (1/2)E(Ω̂ − Ω)² F″(c, Ω),

or

P(ξ(Ω̂) > c) ≈ F(c, Ω) + Bias(Ω̂)F′(c, Ω) + (1/2)MSE(Ω̂)F″(c, Ω).

If √T ḡ is independent of Ω̂ (as it would be, for example, when the data are Gaussian), F(c, Ω) = P(ξ(Ω) > c), which can be computed (exactly) from the χ² distribution when the data are Gaussian. This is the first-order asymptotic term. The Bias and MSE are higher-order terms. Evidently, bias is important (above and beyond its role in MSE). This suggests that k should be larger than the value chosen to minimize MSE.
Lazarus, Lewis, and Stock (2019) analyze the tradeoff between size distortion and power loss. They suggest k = 1.3T^{1/2} (together with modified critical values). (For T = 400: SW textbook k = 0.75T^{1/3} ≈ 6 … LLS k = 1.3T^{1/2} ≈ 26.)

There are alternative approaches; the most promising follows from Müller (2004). Lazarus, Lewis, Stock, and Watson (2018) compare inference using various HAR estimators.
… Think about how serially correlated g_t is likely to be in practice. Examples: