
A Sample Lecture Notes for Advanced Graduate Econometrics

by

Tian Xie

Queen’s University, Kingston, Ontario, Canada

December 1, 2012

© Tian Xie 2012

Preface

These are sample lecture notes for advanced econometrics. Students are assumed to have finished an introductory econometrics course and an intermediate econometrics course, or the equivalent. The course aims to help students establish a solid background in both theoretical and empirical econometrics. The lecture notes contain enough material for a two-semester course, or they can be condensed to fit a one-semester course, depending on the course design. The notes cover various topics that start at a moderate level of difficulty, which gradually increases as the course proceeds.

The lecture notes were developed while I was completing my PhD in the economics department at Queen's University. They are based on the following textbooks:

• “Principles of Econometrics” (3rd Edition) by R. Carter Hill, William Griffiths and Guay C. Lim

• “Introduction to Econometrics” (3rd Edition) by James H. Stock and Mark W. Watson

• “Econometric Analysis” (6th Edition) by William H. Greene

• “Econometric Theory and Methods” by Russell Davidson and James G. MacKinnon

I used these lecture notes when I was working as a lecturer/TA/private tutor for both undergraduate and graduate students. These lecture notes have received positive feedback and great reviews from my students, as demonstrated in the enclosed “2012 Winter Term Evaluation Form”.


Contents

1 The Generalized Method of Moments
1.1 GMM Estimators for Linear Regression Models
1.2 HAC Covariance Matrix Estimation
1.3 Tests Based on the GMM Criterion Function
1.3.1 Tests of Overidentifying Restrictions
1.3.2 Tests of Linear Restrictions
1.4 GMM Estimators for Nonlinear Models

2 The Method of Maximum Likelihood
2.1 Basic Concepts
2.1.1 Regression Models with Normal Errors
2.1.2 Computing ML Estimates
2.2 Asymptotic Properties of ML Estimators
2.2.1 Consistency of MLE
2.2.2 Dependent Observations
2.2.3 The Gradient, the Information Matrix and the Hessian
2.2.4 Asymptotic Normality of the MLE
2.3 ∗The Covariance Matrix of the ML Estimator
2.3.1 ∗Example: the Classical Normal Linear Model
2.4 ∗Hypothesis Testing
2.5 ∗Transformations of the Dependent Variable

3 Discrete and Limited Dependent Variables
3.1 Binary Response Models: Estimation
3.2 Binary Response Models: Inference
3.3 Models for Multiple Discrete Responses
3.4 Models for Count Data
3.5 Models for Censored and Truncated Data
3.6 Sample Selectivity
3.7 Duration Models

4 Multivariate Models
4.1 Seemingly Unrelated Linear Regressions (SUR)
4.1.1 MM Estimation
4.1.2 Maximum Likelihood Estimation
4.2 Linear Simultaneous Equations Models
4.2.1 GMM Estimation
4.2.2 Structural and Reduced Forms
4.2.3 Maximum Likelihood Estimation

5 Methods for Stationary Time-Series Data
5.1 AR Process
5.2 MA Process and ARMA Process
5.3 Single-Equation Dynamic Models
5.4 Seasonality
5.5 ARCH, GARCH

6 Unit Roots and Cointegration
6.1 Unit Root
6.2 Cointegration
6.2.1 VAR Representation
6.2.2 Testing for Cointegration

7 Testing the Specification of Econometric Models
7.1 Specification Tests Based on Artificial Regressions
7.1.1 RESET Test
7.1.2 Conditional Moment Tests
7.1.3 Information Matrix Tests
7.2 Nonnested Hypothesis Tests
7.2.1 J Tests
7.2.2 P Tests
7.2.3 Cox Tests
7.3 Model Selection Based on Information Criteria
7.4 Nonparametric Estimation
7.4.1 Kernel Estimation
7.4.2 Nonparametric Regression

A Sample Assignment and Solution

Chapter 1

The Generalized Method of Moments

1.1 GMM Estimators for Linear Regression Models

We can consider GMM as a combination of IV and GLS, in which endogenous regressors and heteroskedasticity coexist. The linear regression model is given as

y = Xβ + u,  E(uu⊤) = Ω,  (1.1)

where there are n observations. We assume that there exists an n × l matrix of predetermined instrumental variables W satisfying the condition

E(ut | Wt) = 0 for t = 1, ..., n.

As we have seen with IV, we first “break” the correlation between X and u by premultiplying by a linear combination of W⊤, and then “purify” the error terms so that the covariance matrix of the purified error terms is proportional to I. The sample moment conditions of (1.1) then become

J⊤W⊤(y − Xβ) = 0,

where J is l × k. The associated estimator is β̂ = (J⊤W⊤X)⁻¹J⊤W⊤y. We choose J so as to minimize the covariance matrix of the plim of n^(1/2)(β̂ − β0), which gives

J = (W⊤Ω0W)⁻¹W⊤X,

where Ω0 is the true value of Ω. Then, the efficient (infeasible) GMM estimator is

β̂GMM = (X⊤W(W⊤Ω0W)⁻¹W⊤X)⁻¹ X⊤W(W⊤Ω0W)⁻¹W⊤y.  (1.2)

Of course, the above efficient GMM estimator can be obtained by minimizing the (efficient) GMM criterion function

Q(β, y) ≡ (y − Xβ)⊤W(W⊤Ω0W)⁻¹W⊤(y − Xβ).  (1.3)

However, the above criterion is based on the efficient J. Without knowing J in advance, there is no way to form a criterion function like (1.3); all we can form is

(y − Xβ)⊤WΛW⊤(y − Xβ),

which leads to the inefficient GMM estimator

β̂ = (X⊤WΛW⊤X)⁻¹X⊤WΛW⊤y.
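As a rough numerical illustration (not part of the notes' derivation), the following Python sketch computes the linear GMM estimator for simulated data with one endogenous regressor; the function name, the simulated design, and the identity default for the weighting matrix Λ are all illustrative assumptions.

```python
import numpy as np

def linear_gmm(y, X, W, Lambda=None):
    """GMM estimator (X'W L W'X)^{-1} X'W L W'y for a linear model.

    Lambda is the l x l weighting matrix; the identity gives an (in general
    inefficient) GMM estimator, while (W' Omega W)^{-1} gives the efficient one
    when Omega is known.
    """
    if Lambda is None:
        Lambda = np.eye(W.shape[1])
    A = X.T @ W @ Lambda @ W.T @ X
    return np.linalg.solve(A, X.T @ W @ Lambda @ W.T @ y)

# toy example: one endogenous regressor, two instruments
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=(n, 2))                      # instruments
v = rng.normal(size=n)
x = z @ np.array([1.0, -0.5]) + v                # endogenous regressor
u = 0.8 * v + rng.normal(size=n)                 # error correlated with x
y = 1.0 + 2.0 * x + u
X = np.column_stack([np.ones(n), x])
W = np.column_stack([np.ones(n), z])
print(linear_gmm(y, X, W))                       # roughly [1, 2]
```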


1.2 HAC Covariance Matrix Estimation

We notice that both (1.2) and (1.3) involve an unknown matrix Ω0. If we replace Ω0 with an estimate Ω̂, we obtain the feasible efficient GMM estimator

β̂FGMM = (X⊤W(W⊤Ω̂W)⁻¹W⊤X)⁻¹ X⊤W(W⊤Ω̂W)⁻¹W⊤y.  (1.4)

The performance of the feasible estimator (1.4) depends crucially on the estimate of Ω, or equivalently of W⊤ΩW. In practice, we should compute (1/n)W⊤Ω̂W instead, for computational convenience.

If there is only heteroskedasticity, we can compute (1/n)W⊤Ω̂W in a fashion similar to the HCCMEs introduced in Chapter 5. A typical element of the matrix (1/n)W⊤Ω̂W is

(1/n) Σ_{t=1}^n ût² wti wtj,

where ût ≡ yt − Xtβ̂, with β̂ a preliminary estimate; for example, the generalized IV estimator.

The HCCMEs are not appropriate when there is also autocorrelation. In that case, we need a heteroskedasticity and autocorrelation consistent (HAC) estimator. We define a set of autocovariance matrices to capture the serial correlation:

Γ(j) ≡ (1/n) Σ_{t=j+1}^n E(ut ut−j Wt⊤ Wt−j)     for j ≥ 0,
Γ(j) ≡ (1/n) Σ_{t=−j+1}^n E(ut+j ut Wt+j⊤ Wt)    for j < 0.

It is easy to show that Γ(j) = Γ⊤(−j). Then, the sample HAC estimator of (1/n)W⊤ΩW can be written as

Σ̂HW = Γ̂(0) + Σ_{j=1}^p (Γ̂(j) + Γ̂⊤(j)).  (1.5)

This estimator is usually referred to as the Hansen-White estimator. The threshold parameter p, which is also called the lag truncation parameter, is somewhat arbitrary. For the purposes of asymptotic theory, we need p to go to infinity at some suitable rate as the sample size goes to infinity, for example n^(1/4). In practice, when the sample size is finite, we need to choose a specific value of p.

Another issue with the HW estimator (1.5) is that it is not always positive definite in finite samples. To solve this critical deficiency, Newey and West (1987) proposed a modified HW estimator, which later became known as the Newey-West estimator:

Σ̂NW = Γ̂(0) + Σ_{j=1}^p (1 − j/(p + 1)) (Γ̂(j) + Γ̂⊤(j)).  (1.6)

The NW estimator is a biased estimator of Σ, since it shrinks the autocovariance matrices toward zero. It is still consistent if p increases with n, and the appropriate rate is n^(1/3). Note that Newey and West (1994) proposed a procedure to select p automatically.
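A minimal sketch of the Newey-West estimate of (1/n)W⊤ΩW in Python, assuming the residuals û and the instrument matrix W are already available; the weights follow (1.6).

```python
import numpy as np

def hac_newey_west(u, W, p):
    """Newey-West estimate of (1/n) W' Omega W.

    u : (n,) residuals, W : (n, l) instrument matrix, p : lag truncation.
    Returns Sigma_NW = Gamma(0) + sum_j (1 - j/(p+1)) (Gamma(j) + Gamma(j)').
    """
    n, l = W.shape
    uW = u[:, None] * W                    # row t is u_t * W_t
    sigma = uW.T @ uW / n                  # Gamma(0)
    for j in range(1, p + 1):
        gamma_j = uW[j:].T @ uW[:-j] / n   # Gamma(j)
        weight = 1.0 - j / (p + 1)         # Bartlett-type weight from (1.6)
        sigma += weight * (gamma_j + gamma_j.T)
    return sigma
```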


With an estimated Σ (HW, NW, or another estimator), the feasible efficient GMM estimator becomes

β̂FGMM = (X⊤WΣ̂⁻¹W⊤X)⁻¹ X⊤WΣ̂⁻¹W⊤y.

The associated covariance matrix is then

Var(β̂FGMM) = n (X⊤WΣ̂⁻¹W⊤X)⁻¹.

1.3 Tests Based on the GMM Criterion Function

We first rewrite the GMM criterion function:

Q(β, y) ≡ (y − Xβ)⊤W(W⊤ΩW)⁻¹W⊤(y − Xβ).

Define Ω⁻¹ = ΨΨ⊤, which implies Ω = (Ψ⊤)⁻¹Ψ⁻¹. Then

Q(β, y) = (y − Xβ)⊤Ψ[Ψ⁻¹W(W⊤(Ψ⊤)⁻¹Ψ⁻¹W)⁻¹W⊤(Ψ⊤)⁻¹]Ψ⊤(y − Xβ)
        = (y − Xβ)⊤Ψ PA Ψ⊤(y − Xβ)
        = u*⊤ PA u*,

where A = Ψ⁻¹W, PA is the orthogonal projection onto the columns of A, and u* ≡ Ψ⊤u ∼ iid(0, I). Therefore,

Q(β0, y) ∼ᵃ χ²(Rank(A)) = χ²(l),  (1.7)

where β0 is the true parameter vector. All tests in this section are based on result (1.7).

1.3.1 Tests of Overidentifying Restrictions

Whenever l > k, a model estimated by GMM involves l − k overidentifying restrictions. Our goal is to find the distribution of the test statistic Q(β̂GMM, y):

Q(β̂GMM, y) = (y − Xβ̂GMM)⊤Ψ PA Ψ⊤(y − Xβ̂GMM).

If the model is correctly specified, it can be shown that

PA Ψ⊤(y − Xβ̂GMM) = PA(I − P_{PAΨ⊤X})Ψ⊤y
                  = (PA − P_{PAΨ⊤X})Ψ⊤(Xβ0 + u)
                  = (PA − P_{PAΨ⊤X})Ψ⊤u
                  = (PA − P_{PAΨ⊤X})u*.

Therefore,

Q(β̂GMM, y) ∼ᵃ χ²(Rank(A) − Rank(PAΨ⊤X)) = χ²(l − k).

This test statistic is often called Hansen’s overidentification statistic.
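A sketch of the overidentification statistic for the linear case, assuming an estimate Σ̂ of (1/n)W⊤ΩW is available (for example the Newey-West estimate above); the function name hansen_j_test is purely illustrative.

```python
import numpy as np
from scipy import stats

def hansen_j_test(y, X, W, Sigma_hat):
    """Hansen's overidentification statistic for linear GMM.

    Sigma_hat estimates (1/n) W' Omega W.  Returns the criterion evaluated at
    the efficient GMM estimate and its chi2(l - k) p-value.
    """
    n, k = X.shape
    l = W.shape[1]
    S_inv = np.linalg.inv(Sigma_hat)
    A = X.T @ W @ S_inv @ W.T @ X
    beta = np.linalg.solve(A, X.T @ W @ S_inv @ W.T @ y)
    m = W.T @ (y - X @ beta) / n          # sample moment conditions
    J = n * m @ S_inv @ m                 # Q(beta_GMM, y)
    return J, 1 - stats.chi2.cdf(J, df=l - k)
```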


1.3.2 Tests of Linear Restrictions

It is even easier to construct a test statistic for linear restrictions based on the GMM criterion. Consider the model

y = X1β1 + X2β2 + u,  E(uu⊤) = Ω.

Suppose we want to test the restrictions β2 = 0, where β1 is k1 × 1 and β2 is k2 × 1. We first obtain the constrained estimate β̃FGMM = [β̃1 ⋮ 0]. Then, the constrained criterion function is

Q(β̃FGMM, y) = (y − X1β̃1)⊤W(W⊤ΩW)⁻¹W⊤(y − X1β̃1)
            = u⊤Ψ(PA − P_{PAΨ⊤X1})Ψ⊤u
            ∼ᵃ χ²(l − k1).

If we subtract Q(β̂GMM, y) from Q(β̃FGMM, y), the difference between the constrained and unconstrained criterion functions is

Q(β̃FGMM, y) − Q(β̂GMM, y) = u⊤Ψ(P_{PAΨ⊤X} − P_{PAΨ⊤X1})Ψ⊤u ∼ᵃ χ²(k − k1) = χ²(k2).

Note that the above results hold only for efficient GMM estimation. They are not true for nonoptimal criterion functions like

(y − Xβ)⊤WΛW⊤(y − Xβ).

The reason is simple: we can (almost) never construct a projection matrix of any form from Λ, so the asymptotic distribution of such a criterion is rarely χ².

1.4 GMM Estimators for Nonlinear Models

When dealing with nonlinear models, we can adopt an approach that employs the concept of an elementary zero function. We denote by f(θ, y) a set of such functions, where θ is a k-vector of parameters. The n × 1 vector f(θ, y) corresponds to the error terms; in the linear regression case,

f(θ, y) ≡ y − Xθ = u.

The n × k matrix F(θ), which has typical element

Fti(θ) ≡ ∂ft(θ)/∂θi,

corresponds to the regressors X in linear models. We can simply duplicate the results from the sections above by replacing u with f(θ, y) and X with F(θ). For example, the covariance matrix of θ̂ is

Var(θ̂) = n (F⊤W Σ̂⁻¹ W⊤F)⁻¹,

and the nonlinear GMM criterion function is

Q(θ, y) = f⊤(θ, y) Ψ PA Ψ⊤ f(θ, y).

Of course, the real story is more complicated than what we have just shown. The assumptions and regularity conditions are different for nonlinear models, and in practice the F(θ) matrix is often painful to compute.

Chapter 2

The Method of Maximum Likelihood

2.1 Basic Concepts

Models that are estimated by maximum likelihood must be fully specified parametric models. We denote the dependent variable by the n-vector y. For a given k-vector θ of parameters, let the joint PDF of y be written as f(y, θ). This joint PDF f(y, θ) is referred to as the likelihood function of the model for the given data set, and the parameter vector θ̂ that maximizes f(y, θ) is called a maximum likelihood estimate, or MLE, of the parameters.

If the observations are assumed to be independent, the joint density of the entire sample y is

f(y, θ) = ∏_{t=1}^n f(yt, θ).

Because the above product is often a very large or very small number, it is customary to work instead with the loglikelihood function

l(y, θ) ≡ log f(y, θ) = Σ_{t=1}^n lt(yt, θ),

where lt(yt, θ) is equal to log ft(yt, θ). For example, if yt is generated by the density

f(yt, θ) = θ exp(−θyt),  yt > 0, θ > 0,

we take the logarithm of the density and obtain

lt(yt, θ) = log θ − θyt.

Therefore,

l(y, θ) = Σ_{t=1}^n (log θ − θyt) = n log θ − θ Σ_{t=1}^n yt.

Taking the FOC with respect to θ,

n/θ − Σ_{t=1}^n yt = 0,

which can be solved to yield θ̂ = n / Σ_{t=1}^n yt.
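A quick numerical check of this derivation (an illustrative sketch, not part of the original notes): maximizing the exponential loglikelihood numerically should reproduce the closed-form estimate θ̂ = n/Σ yt.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Check the exponential MLE: l(theta) = n log(theta) - theta * sum(y).
rng = np.random.default_rng(0)
y = rng.exponential(scale=1 / 2.5, size=1000)     # true theta = 2.5

def neg_loglik(theta):
    return -(len(y) * np.log(theta) - theta * y.sum())

res = minimize_scalar(neg_loglik, bounds=(1e-6, 100), method="bounded")
print(res.x, len(y) / y.sum())                    # the two values agree
```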


2.1.1 Regression Models with Normal Errors

For the classical normal linear model

y = Xβ + u,  u ∼ N(0, σ²I),

it is very easy to show that the regressand yt is distributed, conditionally on X, as N(Xtβ, σ²). Thus the PDF of yt is

ft(yt, β, σ) = (1/(σ√(2π))) exp(−(yt − Xtβ)² / (2σ²)).

Take the log to obtain lt(yt, β, σ):

lt(yt, β, σ) = −(1/2) log 2π − (1/2) log σ² − (1/(2σ²))(yt − Xtβ)².

Thus the sum of all the lt(yt, β, σ) is

l(y, β, σ) = −(n/2) log 2π − (n/2) log σ² − (1/(2σ²))(y − Xβ)⊤(y − Xβ).  (2.1)

Taking the FOC with respect to σ gives us

∂l(y, β, σ)/∂σ = −n/σ + (1/σ³)(y − Xβ)⊤(y − Xβ) = 0,

which yields

σ²(β) = (1/n)(y − Xβ)⊤(y − Xβ).

Substituting σ²(β) into (2.1) yields the concentrated loglikelihood function

lc(y, β) = −(n/2) log 2π − (n/2) log((1/n)(y − Xβ)⊤(y − Xβ)) − n/2.  (2.2)

Maximizing (2.2) is equivalent to minimizing (y − Xβ)⊤(y − Xβ), which yields

β̂ = (X⊤X)⁻¹X⊤y.

Then the ML estimate is σ̂² = SSR(β̂)/n, which is biased downward.

2.1.2 Computing ML Estimates

We define the gradient vector g(y, θ), which has typical element

gi(y, θ) ≡ ∂l(y, θ)/∂θi = Σ_{t=1}^n ∂lt(yt, θ)/∂θi.

The gradient vector is the vector of first derivatives of the loglikelihood function. Let H(θ) be the Hessian matrix of l(θ), with typical element

Hij(θ) = ∂²l(θ)/(∂θi ∂θj).

The Hessian is the matrix of second derivatives of the loglikelihood function.

Let θ(j) denote the value of the vector of estimates at step j of the algorithm, and let g(j) and H(j) denote the gradient and Hessian evaluated at θ(j). We perform a second-order Taylor expansion of l(y, θ) around θ(0) (the initial value of θ) in order to obtain an approximation l*(θ) to l(θ):

l*(θ) = l(θ(0)) + g⊤(0)(θ − θ(0)) + (1/2)(θ − θ(0))⊤H(0)(θ − θ(0)).

The FOC for a maximum of l*(θ) with respect to θ can be written as

g(0) + H(0)(θ − θ(0)) = 0,

which gives θ(1) in the next step:

θ(1) = θ(0) − H⁻¹(0) g(0).

We keep iterating until the estimates converge. The whole process can be summarized by the equation

θ(j+1) = θ(j) − H⁻¹(j) g(j),

which is the fundamental equation of Newton's Method. In practice, however, Newton's Method does not always work, because the Hessian may fail to be negative definite. In such cases, one popular and more appropriate way to obtain the MLE is the so-called quasi-Newton method, in which the fundamental equation is replaced by

θ(j+1) = θ(j) + α(j) D⁻¹(j) g(j),

where α(j) is a scalar determined at each step, and D(j) is a positive definite matrix that approximates −H(j).
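A minimal sketch of the Newton iteration applied to the exponential loglikelihood used earlier; the starting value and the convergence tolerance are arbitrary illustrative choices.

```python
import numpy as np

# Newton's Method for l(theta) = n log(theta) - theta * sum(y):
# g = n/theta - sum(y), H = -n/theta^2, update theta <- theta - g/H.
rng = np.random.default_rng(1)
y = rng.exponential(scale=0.4, size=500)           # true theta = 2.5

theta = 1.0                                        # starting value
for _ in range(25):
    g = len(y) / theta - y.sum()                   # gradient
    H = -len(y) / theta**2                         # Hessian
    step = -g / H                                  # Newton step
    theta += step
    if abs(step) < 1e-10:                          # convergence check
        break
print(theta, len(y) / y.sum())                     # matches the closed form
```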

2.2 Asymptotic Properties of ML Estimators

2.2.1 Consistency of MLE

The loglikelihood function l(θ) is assumed to be concave; otherwise, we are not able to find its maximum. Let L(θ) = exp(l(θ)) denote the likelihood function. By Jensen's Inequality, we have

E0 log(L(θ*)/L(θ0)) < log E0(L(θ*)/L(θ0)) = log ∫ (L(θ*)/L(θ0)) L(θ0) dy = 0,

which implies

E0 l(θ*) − E0 l(θ0) < 0.

By the LLN, we can assert that

plim_{n→∞} (1/n) l(θ*) ≤ plim_{n→∞} (1/n) l(θ0)  ∀ θ* ≠ θ0.  (2.3)

Since the MLE θ̂ maximizes l(θ), it must be the case that

plim_{n→∞} (1/n) l(θ̂) ≥ plim_{n→∞} (1/n) l(θ0).  (2.4)

The only way that (2.3) and (2.4) can both be true is if

plim_{n→∞} (1/n) l(θ̂) = plim_{n→∞} (1/n) l(θ0).  (2.5)

Note that result (2.5) alone does not prove that θ̂ is consistent, because the inequality condition (2.3) allows many values θ* to satisfy

plim_{n→∞} (1/n) l(θ*) = plim_{n→∞} (1/n) l(θ0).

Therefore we must also assume that

plim_{n→∞} (1/n) l(θ*) ≠ plim_{n→∞} (1/n) l(θ0)  ∀ θ* ≠ θ0,

which is known as an asymptotic identification condition.

2.2.2 Dependent Observations

For a sample of size n, the joint density of n dependent observations can be factored as

f(yⁿ) = ∏_{t=1}^n f(yt | y^{t−1}),

where the vector y^t is a t-vector with components y1, y2, ..., yt. If the model is to be estimated by MLE, the density f(yⁿ) depends on a k-vector of parameters θ,

f(yⁿ, θ) = ∏_{t=1}^n f(yt | y^{t−1}; θ).

The corresponding loglikelihood function is then

l(y, θ) = Σ_{t=1}^n lt(y^t, θ).


2.2.3 The Gradient, the Information Matrix and the Hessian

We define the n × k matrix of contributions to the gradient, G(y, θ), so as to have typical element

Gti(y^t, θ) ≡ ∂lt(y^t, θ)/∂θi.

We should be able to tell the difference between G(y, θ) and g(y, θ), which has typical element

gi(y, θ) ≡ ∂l(y, θ)/∂θi = Σ_{t=1}^n ∂lt(yt, θ)/∂θi.

The covariance matrix of the elements of the tth row Gt(y^t, θ) of G(y, θ) is the k × k matrix It(θ), which is normally positive definite. The information matrix is the sum of the It(θ):

I(θ) ≡ Σ_{t=1}^n It(θ) = Σ_{t=1}^n Eθ(Gt⊤(y^t, θ) Gt(y^t, θ)).

We have already defined the Hessian matrix H(y, θ), which has typical element

Hij(y, θ) = ∂²l(y, θ)/(∂θi ∂θj).

For asymptotic analysis, we are generally more interested in the asymptotic matrices

ℐ(θ) ≡ plim_{n→∞} (1/n) I(θ),   ℋ(θ) ≡ plim_{n→∞} (1/n) H(θ),

where ℐ(θ) = −ℋ(θ).

2.2.4 Asymptotic Normality of the MLE

Since the gradient is zero at the MLE θ̂, a first-order Taylor expansion of g(θ̂) around θ0 gives

g(θ̂) = g(θ0) + H(θ̄)(θ̂ − θ0) = 0,

where θ̄ lies between θ̂ and θ0. This implies

n^(1/2)(θ̂ − θ0) = −(n⁻¹H(θ̄))⁻¹ (n^(−1/2) g(θ0)) =ᵃ −ℋ⁻¹(θ0) n^(−1/2) g(θ0) =ᵃ ℐ⁻¹(θ0) n^(−1/2) g(θ0).

Since

gi(y, θ) = Σ_{t=1}^n Gti(y^t, θ)  and  E(Gti(y^t, θ)) = 0,

the CLT gives us

n^(−1/2) g(θ0) ∼ᵃ N(0, ℐ(θ0)).

Finally, we obtain

n^(1/2)(θ̂ − θ0) ∼ᵃ N(0, ℐ⁻¹(θ0)) ∼ᵃ N(0, −ℋ⁻¹(θ0)) ∼ᵃ N(0, ℋ⁻¹(θ0) ℐ(θ0) ℋ⁻¹(θ0)).

2.3 ∗The Covariance Matrix of the ML Estimator

)2.3 ∗The Covariance Matrix of the ML Estimator

Based on the asymptotic result we just derived above, four covariance matrix estimators canbe formulated:

(i) The empirical Hessian estimator:

VarH(θ) = −H−1(θ).

(ii) The information matrix estimator:

VarIM(θ) = I−1(θ).

(iii) The outer-product-of-the-gradient estimator:

VarOPG(θ) =(G>(θ)G(θ)

)−1

,

which is based on the following theoretical result:

I(θ0) = E(G>(θ0)G(θ0)

).

(iv) The sandwich estimator:

VarS(θ) = H−1(θ)G>(θ)G(θ)H−1(θ).
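A small numerical sketch of the four estimators for the exponential model of Section 2.1, where the gradient contributions and the Hessian are available in closed form; all four values should be close to θ̂²/n. The data and seed are illustrative.

```python
import numpy as np

# For l_t = log(theta) - theta*y_t:  G_t = 1/theta - y_t,  H = -n/theta^2.
rng = np.random.default_rng(2)
y = rng.exponential(scale=0.5, size=2000)        # true theta = 2
n = len(y)
theta = n / y.sum()                              # MLE

G = 1.0 / theta - y                              # contributions to the gradient
H = -n / theta**2                                # Hessian (a scalar here)
I = n / theta**2                                 # information matrix

var_hessian = -1.0 / H                           # (i)   empirical Hessian
var_im = 1.0 / I                                 # (ii)  information matrix
var_opg = 1.0 / (G @ G)                          # (iii) outer product of gradient
var_sandwich = (G @ G) / H**2                    # (iv)  sandwich H^{-1} G'G H^{-1}
print(var_hessian, var_im, var_opg, var_sandwich)
```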

2.3.1 ∗Example: the Classical Normal Linear Model

For the classical normal linear model

y = Xβ + u,  u ∼ N(0, σ²I),

where X is n × k, the contribution to the loglikelihood function made by the tth observation is

lt(yt, β, σ) = −(1/2) log 2π − (1/2) log σ² − (1/(2σ²))(yt − Xtβ)².

A typical element of any of the first k columns of the matrix G is

Gti(β, σ) = ∂lt/∂βi = (1/σ²)(yt − Xtβ) xti,  i = 1, ..., k.

A typical element of the last column of G is

Gt,k+1(β, σ) = ∂lt/∂σ = −1/σ + (1/σ³)(yt − Xtβ)².

This implies the following (the simplifications use E(yt − Xtβ)² = σ², E(yt − Xtβ)³ = 0, and E(yt − Xtβ)⁴ = 3σ⁴):

(i) For i, j = 1, ..., k, the ijth element of G⊤G is

Σ_{t=1}^n (1/σ⁴)(yt − Xtβ)² xti xtj = Σ_{t=1}^n (1/σ²) xti xtj.

(ii) The (k+1), (k+1)th element of G⊤G is

Σ_{t=1}^n (−1/σ + (1/σ³)(yt − Xtβ)²)² = n/σ² − Σ_{t=1}^n (2/σ⁴)(yt − Xtβ)² + Σ_{t=1}^n (1/σ⁶)(yt − Xtβ)⁴
                                      = n/σ² − 2n/σ² + 3n/σ² = 2n/σ².

(iii) The (i, k+1)th element of G⊤G is

Σ_{t=1}^n (−1/σ + (1/σ³)(yt − Xtβ)²)((1/σ²)(yt − Xtβ) xti)
  = −Σ_{t=1}^n (1/σ³)(yt − Xtβ) xti + Σ_{t=1}^n (1/σ⁵)(yt − Xtβ)³ xti = 0.

Therefore,

ℐ(β, σ) = plim_{n→∞} (1/n) G⊤G = plim_{n→∞} [ n⁻¹X⊤X/σ²   0 ;  0⊤   2/σ² ].

If we want to estimate the covariance matrix using the IM estimator, we have

VarIM(β̂, σ̂) = [ σ̂²(X⊤X)⁻¹   0 ;  0⊤   σ̂²/(2n) ].

Thus the standard error of σ̂ is estimated as σ̂/√(2n), and the confidence interval can be constructed as

[ σ̂ − c_{α/2} σ̂/√(2n),  σ̂ + c_{α/2} σ̂/√(2n) ],

given the fact that u ∼ N(0, σ²I).


2.4 ∗Hypothesis Testing

Assume that the null hypothesis imposes r restrictions on θ and we can write these as

r(θ) = 0.

Let θ̃ and θ̂ denote, respectively, the restricted and unrestricted maximum likelihood estimates of θ. There are three classical tests:

(i) The likelihood ratio test:

LR = 2(l(θ̂) − l(θ̃)) ∼ᵃ χ²(r).

(ii) The Wald test:

W = r⊤(θ̂)(R(θ̂) Var(θ̂) R⊤(θ̂))⁻¹ r(θ̂) ∼ᵃ χ²(r),

where R(θ) = ∂r(θ)/∂θ is an r × k matrix.

(iii) The Lagrange multiplier test:

LM = g⊤(θ̃) I⁻¹(θ̃) g(θ̃) = λ̃⊤ R(θ̃) I⁻¹(θ̃) R⊤(θ̃) λ̃ ∼ᵃ χ²(r),

where λ̃ is estimated from the constrained maximization problem

max_θ  l(θ) − r⊤(θ)λ.
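A sketch of the LR test for a zero restriction in the classical normal linear model, using the concentrated loglikelihood (2.2), so that LR = n log(SSR_restricted/SSR_unrestricted); the simulated data are purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)    # last coefficient is truly 0

def ssr(y, X):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return resid @ resid

lr = n * np.log(ssr(y, X[:, :2]) / ssr(y, X))              # r = 1 restriction
print(lr, 1 - stats.chi2.cdf(lr, df=1))                    # should not reject
```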

2.5 ∗Transformations of the Dependent Variable

We consider the following model:

log yt = Xtβ + ut,  ut ∼ NID(0, σ²).

The loglikelihood for log yt is

−(n/2) log 2π − (n/2) log σ² − (1/(2σ²))(log y − Xβ)⊤(log y − Xβ).

Using the Jacobian of the transformation, we establish the link between f(yt) and f(log yt):

f(yt) = f(log yt) |d log yt / dyt| = f(log yt) / yt,

which implies

lt(yt) = lt(log yt) − log yt.

Therefore, the loglikelihood function of yt that we are seeking is

−(n/2) log 2π − (n/2) log σ² − (1/(2σ²))(log y − Xβ)⊤(log y − Xβ) − Σ_{t=1}^n log yt,

and the loglikelihood function concentrated with respect to σ is

−(n/2) log 2π − n/2 − (n/2) log((1/n) Σ_{t=1}^n (log yt − Xtβ)²) − Σ_{t=1}^n log yt.

The associated G⊤G is a (k+1) × (k+1) matrix with three distinct kinds of elements, similar to those in the example of Section 2.3.1:

(i) For i, j = 1, ..., k, the ijth element of G⊤G is

Σ_{t=1}^n (1/σ⁴)(log yt − Xtβ)² xti xtj.

(ii) The (k+1), (k+1)th element of G⊤G is

n/σ² − Σ_{t=1}^n (2/σ⁴)(log yt − Xtβ)² + Σ_{t=1}^n (1/σ⁶)(log yt − Xtβ)⁴.

(iii) The (i, k+1)th element of G⊤G is

−Σ_{t=1}^n (1/σ³)(log yt − Xtβ) xti + Σ_{t=1}^n (1/σ⁵)(log yt − Xtβ)³ xti.

The Hessian matrix can be obtained in a similar fashion. We need to consider three parts of −H: the upper-left k × k block that corresponds to β, the lower-right scalar that corresponds to σ, and the k × 1 vector that corresponds to β and σ. Of course, all the parameters in −H must be replaced by sample estimates (that is, they must be “hatted”).

Chapter 3

Discrete and Limited Dependent Variables

This chapter explores two “typical” limits of traditional regression models: when the dependent variable is discrete, or is continuous but limited in range.

3.1 Binary Response Models: Estimation

A binary dependent variable yt can take on only two values, 0 and 1. We encounter these variables a lot in survey data, in which the economic agent usually chooses between two alternatives. A binary response model explains the probability that the agent chooses alternative 1 as a function of some observed explanatory variables.

Let Pt denote the probability that yt = 1 conditional on the information set Ωt:

Pt ≡ Pr(yt = 1 | Ωt) = E(yt | Ωt) = F(Xtβ) ∈ [0, 1],

where Xtβ is an index function and the CDF F(x) is also called a transformation function. Two popular choices for F(x) are

F(x) = Φ(x) ≡ (1/√(2π)) ∫_{−∞}^x exp(−X²/2) dX   (probit model),
F(x) = Λ(x) ≡ e^x / (1 + e^x)                     (logit model).

Since the probability that yt = 1 is F(Xtβ), the loglikelihood contribution of observation t when yt = 1 is log F(Xtβ). Therefore the loglikelihood function for y is simply

l(y, β) = Σ_{t=1}^n ( yt log F(Xtβ) + (1 − yt) log(1 − F(Xtβ)) ),

which can easily be estimated by MLE.

Binary response models can deal with models and datasets that are not generally suitable for traditional regression models. For example, suppose there is an unobserved, or latent, variable yt* with

yt* = Xtβ + ut,  ut ∼ NID(0, 1).

We observe only the sign of yt*, which determines the value of the observed binary variable yt according to the relationship

yt = 1 if yt* > 0,
yt = 0 if yt* ≤ 0.

However, binary response models do have drawbacks compared to traditional regression models. For example, they are likely to encounter the perfect classifier problem when the sample size is small and almost all of the yt are equal to 0 or 1. In that case, it is quite possible that we can sort the dependent variable by some linear combination of the independent variables, say Xtβ•, such that

yt = 0 whenever Xtβ• < 0,
yt = 1 whenever Xtβ• > 0.

Unfortunately, no finite ML estimator exists in that case.
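A minimal sketch of estimating a probit model by maximizing the binary-response loglikelihood above; the data-generating process and function names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def probit_loglik(beta, y, X):
    """Binary response loglikelihood with F = standard normal CDF (probit)."""
    p = np.clip(norm.cdf(X @ beta), 1e-12, 1 - 1e-12)   # guard the logs
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(4)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y_star = X @ np.array([0.5, 1.0]) + rng.normal(size=n)   # latent variable
y = (y_star > 0).astype(float)                            # observed sign only

res = minimize(lambda b: -probit_loglik(b, y, X), x0=np.zeros(2), method="BFGS")
print(res.x)                                              # close to [0.5, 1.0]
```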

3.2 Binary Response Models: Inference

Following the standard results for ML estimation, we can show that

Var(plim_{n→∞} n^(1/2)(β̂ − β0)) = plim_{n→∞} ((1/n) X⊤Υ(β0)X)⁻¹,

where

Υt(β) ≡ f²(Xtβ) / ( F(Xtβ)(1 − F(Xtβ)) ).

In practice, the covariance matrix estimator is

Var(β̂) = (X⊤Υ(β̂)X)⁻¹.

We can also test restrictions on binary response models simply by using LR tests. As usual, the LR test statistic is asymptotically distributed as χ²(r), where r is the number of restrictions.

Another, somewhat easier, way to perform hypothesis testing is to construct the binary response model regression (BRMR)

Vt^(−1/2)(β)(yt − F(Xtβ)) = Vt^(−1/2)(β) f(Xtβ) Xt b + resid,

where Vt(β) ≡ F(Xtβ)(1 − F(Xtβ)) and f(x) = ∂F(x)/∂x. The BRMR is a modified version of the GNR.

We partition X as [X1, X2] and β as [β1 ⋮ β2], where β2 is an r-vector. Suppose we want to test β2 = 0; we can do this by running the BRMR

V̄t^(−1/2)(yt − F̄t) = V̄t^(−1/2) f̄t Xt1 b1 + V̄t^(−1/2) f̄t Xt2 b2 + resid,

where V̄t, F̄t, and f̄t are evaluated at the restricted estimates β̃ = [β̃1 ⋮ 0]. Therefore, testing β2 = 0 is equivalent to testing b2 = 0. The best test statistic to use is probably the ESS from the above regression, which is asymptotically distributed as χ²(r).


3.3 Models for Multiple Discrete Responses

It is quite common that discrete dependent variables can take on more than just two values. Models that deal with multiple discrete responses are referred to as qualitative response models or discrete choice models. We usually divide these models into two groups: those designed to deal with ordered responses (for example, rating your professor on the course evaluation form), and those designed to deal with unordered responses (for example, your choice of what to do during reading week).

The most widely used model for ordered response data is the ordered probit model. The model for the latent variable is

yt* = Xtβ + ut,  ut ∼ NID(0, 1).

Consider, as an example, the case of 3 responses. We assume the relation between the observed yt and the latent yt* is given by

yt = 0 if yt* < γ1,
yt = 1 if γ1 ≤ yt* ≤ γ2,
yt = 2 if yt* ≥ γ2,

where γ1 and γ2 are threshold parameters that must be estimated. The probability that yt = 0 is

Pr(yt = 0) = Pr(yt* < γ1) = Pr(ut < γ1 − Xtβ) = Φ(γ1 − Xtβ).

Similarly, we have

Pr(yt = 1) = Φ(γ2 − Xtβ) − Φ(γ1 − Xtβ),
Pr(yt = 2) = Φ(Xtβ − γ2).

Therefore, the loglikelihood function for the ordered probit model is

l(β, γ1, γ2) = Σ_{yt=0} log Φ(γ1 − Xtβ) + Σ_{yt=1} log(Φ(γ2 − Xtβ) − Φ(γ1 − Xtβ)) + Σ_{yt=2} log Φ(Xtβ − γ2).

The most popular model for unordered responses is the multinomial logit model, or multiple logit model. This model is designed to handle J + 1 responses, for J ≥ 1. The probability that response j is observed is

Pr(yt = j) = exp(Wtj βj) / Σ_{l=0}^J exp(Wtl βl),   for j = 0, ..., J,

where Wtj contains the explanatory variables for response j and βj is usually different for each j = 0, ..., J. The loglikelihood function is simply

Σ_{t=1}^n ( Σ_{j=0}^J I(yt = j) Wtj βj − log( Σ_{j=0}^J exp(Wtj βj) ) ),

where I(·) is the indicator function.

In practice, however, we often face a situation in which the explanatory variables Wtj are the same for each j. In that case, we can use the familiar Xt as the explanatory variables for all responses, and the probabilities for j = 1, ..., J are simply

Pr(yt = j) = exp(Xt βj) / (1 + Σ_{j=1}^J exp(Xt βj)),

and for outcome 0 the probability is

Pr(yt = 0) = 1 / (1 + Σ_{j=1}^J exp(Xt βj)).

The unknown parameters in each vector βj are typically jointly estimated by maximum a posteriori (MAP) estimation, which is an extension of MLE.

An important property of the general multinomial logit model is that

Pr(yt = l) / Pr(yt = j) = exp(Wtl βl) / exp(Wtj βj)

for any two responses l and j. This property is called the independence of irrelevant alternatives, or IIA. Unfortunately, the IIA property is often incompatible with the data. A logit model that does not possess the IIA property is the nested logit model. We can derive the probabilities of a nested logit model from the multinomial logit model:

• We partition the set of outcomes 0, 1, ..., J into m disjoint subsets Ai, i = 1, ..., m.

• Suppose that the choice among the members of Ai is governed by a standard multinomial logit model,

Pr(yt = j | yt ∈ Ai) = exp(Wtj βj/θi) / Σ_{l∈Ai} exp(Wtl βl/θi),

where θi is a scale parameter for the parameter vectors βj, j ∈ Ai.

• Specifically, we assume that

Pr(yt ∈ Ai) = exp(θi hti) / Σ_{k=1}^m exp(θk htk),

where

hti = log( Σ_{j∈Ai} exp(Wtj βj/θi) ).

• Finally, by Bayes' rule, we have

Pr(yt = j) = Pr(yt = j | yt ∈ Ai(j)) Pr(yt ∈ Ai(j))
           = [ exp(Wtj βj/θi) / Σ_{l∈Ai} exp(Wtl βl/θi) ] × [ exp(θi hti) / Σ_{k=1}^m exp(θk htk) ].
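A sketch of the multinomial logit loglikelihood with common regressors Xt and outcome 0 as the base category, estimated by MLE on simulated data; the names and the simulated design are assumptions made for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

def mnl_neg_loglik(b_flat, y, X, J):
    """Multinomial logit loglikelihood, outcome 0 as base (coefficients fixed at 0)."""
    n, k = X.shape
    B = np.column_stack([np.zeros(k), b_flat.reshape(k, J)])   # k x (J+1)
    scores = X @ B                                             # n x (J+1)
    scores -= scores.max(axis=1, keepdims=True)                # numerical stability
    log_p = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_p[np.arange(n), y].sum()

rng = np.random.default_rng(5)
n, k, J = 2000, 2, 2                                           # 3 outcomes in total
X = np.column_stack([np.ones(n), rng.normal(size=n)])
B_true = np.array([[0.0, 0.3, -0.5], [0.0, 1.0, 0.5]])         # k x (J+1)
P = np.exp(X @ B_true)
P /= P.sum(axis=1, keepdims=True)
y = np.array([rng.choice(3, p=p) for p in P])

res = minimize(mnl_neg_loglik, np.zeros(k * J), args=(y, X, J), method="BFGS")
print(res.x.reshape(k, J))                                     # roughly B_true[:, 1:]
```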


3.4 Models for Count Data

Many economic data are nonnegative integers, for example, the number of people in a population. Data of this type are called event count data or count data. In probability theory and statistics, the Poisson distribution (or Poisson law of small numbers) is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space, if these events occur with a known average rate and independently of the time since the last event. Therefore, the Poisson distribution model is one of the most popular models for count data:

Pr(Yt = y) = exp(−λt(β)) (λt(β))^y / y!,   y = 0, 1, 2, ...,

where

λt(β) ≡ exp(Xtβ).

The loglikelihood function is simply

l(y, β) = Σ_{t=1}^n ( −exp(Xtβ) + yt Xtβ − log yt! )

and the associated Hessian matrix is

H(β) = −Σ_{t=1}^n exp(Xtβ) Xt⊤Xt = −X⊤Υ(β)X,

where Υ(β) is an n × n diagonal matrix with typical diagonal element Υt(β) ≡ exp(Xtβ).

In practice, however, the Poisson regression model tends to underpredict the variance of the actual data. Such a failure is also called overdispersion. If the variance of yt is indeed equal to exp(Xtβ), the quantity

zt(β) ≡ (yt − exp(Xtβ))² − yt

has expectation 0. To test for overdispersion in the Poisson regression model, we make use of the artificial OPG regression

ι = Gb + cz + resid,

where G has typical row (yt − exp(Xtβ̂))Xt and z has typical element (yt − exp(Xtβ̂))² − yt. We test c = 0 using a t statistic, which follows the asymptotic distribution N(0, 1); alternatively, we can examine the ESS, which is asymptotically χ²(1).
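A minimal sketch of Poisson regression by MLE, with λt(β) = exp(Xtβ) as above; gammaln(y + 1) supplies log y!. The simulated data are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def poisson_neg_loglik(beta, y, X):
    """Poisson regression loglikelihood with lambda_t = exp(X_t beta)."""
    eta = X @ beta
    return -np.sum(-np.exp(eta) + y * eta - gammaln(y + 1))   # log y! = gammaln(y+1)

rng = np.random.default_rng(6)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
lam = np.exp(X @ np.array([0.5, 0.8]))
y = rng.poisson(lam)

res = minimize(poisson_neg_loglik, np.zeros(2), args=(y, X), method="BFGS")
print(res.x)                                                  # close to [0.5, 0.8]
```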

3.5 Models for Censored and Truncated Data

A data sample is said to be truncated if some observations have been systematically excluded from the sample. A data sample has been censored if some of the information contained in it has been suppressed.

Any dependent variable that has been either censored or truncated is said to be a limited dependent variable. Consider the regression model

yt* = β1 + β2xt + ut,  ut ∼ NID(0, σ²).

What we actually observe is yt, which differs from yt* because it is either truncated or censored.

Suppose yt* is truncated such that all the negative values are omitted. The probability that yt* is included in the sample is

Pr(yt* ≥ 0) = Pr(Xtβ + ut ≥ 0) = 1 − Pr(ut/σ < −Xtβ/σ) = Φ(Xtβ/σ).

The density of yt is then

σ⁻¹ φ((yt − Xtβ)/σ) / Φ(Xtβ/σ).

This implies that the loglikelihood function of yt, conditional on yt ≥ 0, is

l(y, β, σ) = −(n/2) log(2π) − n log(σ) − (1/(2σ²)) Σ_{t=1}^n (yt − Xtβ)² − Σ_{t=1}^n log Φ(Xtβ/σ),

which can be estimated by MLE.

The most popular model for censored data is the tobit model. Assume we impose the following restriction:

yt = yt* if yt* > 0,
yt = 0 otherwise.

The loglikelihood function for yt is simply

l(y, β, σ) = Σ_{yt>0} log( (1/σ) φ((yt − Xtβ)/σ) ) + Σ_{yt=0} log Φ(−Xtβ/σ),

which is the sum of the log densities for the uncensored observations and the log probabilities for the censored observations. Fortunately, we can still use MLE, even though the distribution of the dependent variable in a tobit model is a mixture of discrete and continuous random variables.
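A sketch of tobit estimation by MLE, with censoring at zero as above; parameterizing log σ is an illustrative way to keep the scale positive during the optimization.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_neg_loglik(params, y, X):
    """Tobit loglikelihood, censored at zero; params = (beta, log sigma)."""
    beta, sigma = params[:-1], np.exp(params[-1])
    xb = X @ beta
    pos = y > 0
    ll = np.sum(norm.logpdf((y[pos] - xb[pos]) / sigma) - np.log(sigma))   # uncensored
    ll += np.sum(norm.logcdf(-xb[~pos] / sigma))                           # censored at 0
    return -ll

rng = np.random.default_rng(7)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y_star = X @ np.array([0.5, 1.0]) + rng.normal(size=n)
y = np.maximum(y_star, 0.0)                             # censoring at zero

res = minimize(tobit_neg_loglik, np.zeros(3), args=(y, X), method="BFGS")
print(res.x[:2], np.exp(res.x[2]))                      # roughly [0.5, 1.0] and 1.0
```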

3.6 Sample Selectivity

Many samples are truncated on the basis of another variable that is correlated with the dependent variable. For example, whether a student shows up to a tutorial depends on the student's own desire for knowledge and on the TA's ability; when these two factors do not meet a certain standard, the student does not show up. If we then evaluate the performance of students in the tutorial, our evaluation will be biased upward, because the sample includes only those students who actually attend. The consequences of this type of sample selection are often said to be due to sample selectivity. Therefore, to obtain a fair evaluation of students' performance, we need to account for the student's desire and the TA's ability in our model.

Suppose that yt* and zt* are two latent variables, generated by the bivariate process

(yt*, zt*)⊤ = (Xtβ, Wtγ)⊤ + (ut, vt)⊤,   (ut, vt)⊤ ∼ NID(0, [ σ²  ρσ ;  ρσ  1 ]).

The data are determined according to

yt = yt* if zt* > 0, and yt is unobserved otherwise;
zt = 1 if zt* > 0, and zt = 0 otherwise.

The loglikelihood function is

Σ_{zt=0} log Pr(zt = 0) + Σ_{zt=1} log( Pr(zt = 1) f(yt | zt = 1) ),

which can be estimated by ML.

Another, simpler technique to compute consistent estimates is Heckman's two-step method, which is based on the model

yt = Xtβ + ρσ vt + et,

where the error term ut is divided into two parts: one perfectly correlated with vt and one independent of vt. The idea is to replace vt with its conditional mean

E(vt | zt = 1, Wt) = E(vt | vt > −Wtγ, Wt) = φ(Wtγ) / Φ(Wtγ).

Therefore, we have

yt = Xtβ + ρσ φ(Wtγ)/Φ(Wtγ) + et.

The first step is to obtain consistent estimates γ̂ of γ from

zt* = Wtγ + vt,

using an ordinary probit model for zt. Then we replace γ with the estimate,

yt = Xtβ + ρσ φ(Wtγ̂)/Φ(Wtγ̂) + et,

and obtain consistent estimates of β using OLS. Of course, the above regression can also be used to test whether selectivity is a problem, by examining whether ρ = 0 with an ordinary t statistic.
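A minimal sketch of the two-step method: a probit first stage for γ̂, then OLS of yt on Xt and the inverse Mills ratio φ(Wtγ̂)/Φ(Wtγ̂) over the selected sample. The simulated design (identical Xt and Wt, so identification comes only from the nonlinearity of the Mills ratio) is purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(8)
n = 5000
W = np.column_stack([np.ones(n), rng.normal(size=n)])
X = W.copy()                                         # same regressors, for simplicity
v = rng.normal(size=n)
u = 0.5 * v + rng.normal(scale=np.sqrt(1 - 0.25), size=n)   # rho*sigma = 0.5
z = (W @ np.array([0.2, 1.0]) + v > 0).astype(float)
y = X @ np.array([1.0, 2.0]) + u                     # observed only when z = 1

# Step 1: probit of z on W gives gamma_hat.
def probit_nll(g):
    p = np.clip(norm.cdf(W @ g), 1e-12, 1 - 1e-12)
    return -np.sum(z * np.log(p) + (1 - z) * np.log(1 - p))
gamma_hat = minimize(probit_nll, np.zeros(2), method="BFGS").x

# Step 2: OLS of y on X and the inverse Mills ratio, selected sample only.
mills = norm.pdf(W @ gamma_hat) / norm.cdf(W @ gamma_hat)
sel = z == 1
Xa = np.column_stack([X[sel], mills[sel]])
coef = np.linalg.lstsq(Xa, y[sel], rcond=None)[0]
print(coef)     # first two entries near [1.0, 2.0]; last entry estimates rho*sigma
```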

3.7 Duration Models

Economists are sometimes interested in how much time elapses before some event occurs. From now on, we use i to index observations and ti to measure duration. Suppose that how long a state endures is measured by T, a nonnegative, continuous random variable with PDF f(t) and CDF F(t). The survivor function is defined as

S(t) ≡ 1 − F(t).

The probability that a state ends in the period from time t to time t + ∆t is

Pr(t < T ≤ t + ∆t) = F(t + ∆t) − F(t).

Therefore, the conditional probability that the state ends in that period, given that it has survived to time t, is

Pr(t < T ≤ t + ∆t | T ≥ t) = (F(t + ∆t) − F(t)) / S(t).

We divide the right-hand side by ∆t:

[ (F(t + ∆t) − F(t)) / ∆t ] / S(t).

As ∆t → 0, the numerator converges to the PDF f(t). The resulting function is called the hazard function,

h(t) ≡ f(t)/S(t) = f(t)/(1 − F(t)).

The function F(t) may take various forms, depending on the assumptions we make. For example, F(t) can follow the Weibull distribution

F(t, θ, α) = 1 − exp(−(θt)^α).

The associated PDF, survivor function, and hazard function are simply

f(t) = αθ^α t^(α−1) exp(−(θt)^α),
S(t) = exp(−(θt)^α),
h(t) = αθ^α t^(α−1).

The loglikelihood function for t, the vector of observations with typical element ti, is just

l(t, β, α) = Σ_{i=1}^n log f(ti | Xi, β, α)
           = Σ_{i=1}^n log h(ti | Xi, β, α) + Σ_{i=1}^n log S(ti | Xi, β, α)
           = n log α + Σ_{i=1}^n Xiβ + (α − 1) Σ_{i=1}^n log ti − Σ_{i=1}^n ti^α exp(Xiβ),

where the covariates enter through θi^α = exp(Xiβ).

We then obtain the estimates by MLE in the usual way.
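A sketch of Weibull duration estimation by MLE with θi^α = exp(Xiβ), matching the loglikelihood above; parameterizing log α to keep α positive is an illustrative choice.

```python
import numpy as np
from scipy.optimize import minimize

def weibull_neg_loglik(params, t, X):
    """Weibull duration loglikelihood with theta_i^alpha = exp(X_i beta)."""
    beta, alpha = params[:-1], np.exp(params[-1])
    xb = X @ beta
    ll = (len(t) * np.log(alpha) + xb.sum()
          + (alpha - 1) * np.log(t).sum()
          - np.sum(t**alpha * np.exp(xb)))
    return -ll

rng = np.random.default_rng(9)
n = 3000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
alpha_true, beta_true = 1.5, np.array([0.2, 0.5])
theta = np.exp(X @ beta_true / alpha_true)
t = rng.weibull(alpha_true, size=n) / theta         # S(t) = exp(-(theta t)^alpha)

res = minimize(weibull_neg_loglik, np.zeros(3), args=(t, X), method="BFGS")
print(res.x[:2], np.exp(res.x[2]))                  # roughly beta_true and alpha_true
```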

Chapter 4

Multivariate Models

In this chapter, we discuss models that jointly determine the values of two or more dependent variables using two or more equations.

4.1 Seemingly Unrelated Linear Regressions (SUR)

We suppose that there are g dependent variables indexed by i. Each dependent variable is associated with n observations. The ith equation of a multivariate linear regression model can be written as

yi = Xiβi + ui,  E(uiui⊤) = σii In.

To allow correlation between error terms in the same period, we further impose the assumption

E(uti utj) = σij ∀ t,   E(uti usj) = 0 ∀ t ≠ s,

where σij is the ijth element of the g × g PSD matrix Σ. We construct an n × g matrix U that contains the error terms ui from each equation, a typical row of which is the 1 × g vector Ut. Then

E(Ut⊤Ut) = (1/n) E(U⊤U) = Σ.

4.1.1 MM Estimation

We can write the entire SUR system as

y• = X•β• + u•,

where

y• = [y1; y2; ...; yg],   β• = [β1; β2; ...; βg],   u• = [u1; u2; ...; ug],

and X• is the block-diagonal matrix

X• = [ X1  O  ⋯  O ;  O  X2  ⋯  O ;  ⋮  ⋮  ⋱  ⋮ ;  O  O  ⋯  Xg ].

The OLS estimator for the entire system is simply

β̂•OLS = (X•⊤X•)⁻¹X•⊤y•,

if we assume homoskedasticity.

The covariance matrix of the vector u• is

E(u•u•⊤) = [ E(u1u1⊤)  ⋯  E(u1ug⊤) ;  ⋮  ⋱  ⋮ ;  E(ugu1⊤)  ⋯  E(ugug⊤) ] = [ σ11In  ⋯  σ1gIn ;  ⋮  ⋱  ⋮ ;  σg1In  ⋯  σggIn ] ≡ Σ•.

The matrix Σ• is a symmetric gn × gn matrix, which can be written more compactly as

Σ• ≡ Σ ⊗ In.

For the Kronecker product Σ ⊗ In, each element of Σ multiplies the matrix In, and the resulting blocks are arranged into Σ•.

If there is heteroskedasticity, the feasible GLS estimator is

β̂•FGLS = (X•⊤(Σ̂⁻¹ ⊗ In)X•)⁻¹ X•⊤(Σ̂⁻¹ ⊗ In)y•,

where

Σ̂ ≡ (1/n) Û⊤Û  with  Û = [û1, ..., ûg].

The covariance matrix is then

Var(β̂•FGLS) = (X•⊤(Σ̂⁻¹ ⊗ In)X•)⁻¹.
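A minimal sketch of the feasible GLS (SUR) estimator above: equation-by-equation OLS residuals give Σ̂, and the Kronecker product Σ̂⁻¹ ⊗ In forms the GLS weighting; the helper name sur_fgls and the direct dense Kronecker product are illustrative simplifications.

```python
import numpy as np

def sur_fgls(ys, Xs):
    """Feasible GLS for a SUR system.

    ys : list of g response vectors (each of length n)
    Xs : list of g regressor matrices (each n x k_i)
    """
    g, n = len(ys), len(ys[0])
    y_dot = np.concatenate(ys)
    X_dot = np.zeros((g * n, sum(X.shape[1] for X in Xs)))
    col = 0
    for i, X in enumerate(Xs):                       # block-diagonal X_dot
        X_dot[i * n:(i + 1) * n, col:col + X.shape[1]] = X
        col += X.shape[1]

    # equation-by-equation OLS residuals -> Sigma_hat
    U = np.column_stack([y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
                         for y, X in zip(ys, Xs)])
    Sigma = U.T @ U / n

    Omega_inv = np.kron(np.linalg.inv(Sigma), np.eye(n))   # Sigma^{-1} ⊗ I_n
    A = X_dot.T @ Omega_inv @ X_dot
    beta = np.linalg.solve(A, X_dot.T @ Omega_inv @ y_dot)
    cov = np.linalg.inv(A)                                  # Var(beta_FGLS)
    return beta, cov
```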

4.1.2 Maximum Likelihood Estimation

Let z be a random m-vector with known density fz(z), and let x be another random m-vector such that z = h(x). Then

fx(x) = fz(h(x)) |det ∂h(x)/∂x|.

We rewrite the SUR system as

y• = X•β• + u•,  u• ∼ N(0, Σ ⊗ In).

Since u• = y• − X•β• and the pdf of u• is

f(u•) = (2π)^(−gn/2) |Σ ⊗ In|^(−1/2) exp(−(1/2)(y• − X•β•)⊤(Σ⁻¹ ⊗ In)(y• − X•β•)),

we have

f(y•) = (2π)^(−gn/2) |Σ ⊗ In|^(−1/2) exp(−(1/2)(y• − X•β•)⊤(Σ⁻¹ ⊗ In)(y• − X•β•)) |det ∂h(y)/∂y|.

Note that h(y) = y• − X•β•, so the determinant above equals 1 if there are no lagged dependent variables in the matrix X•.

Taking the log of the above equation, we find that the loglikelihood function l(Σ, β•) equals

−(gn/2) log 2π − (n/2) log |Σ| − (1/2)(y• − X•β•)⊤(Σ⁻¹ ⊗ In)(y• − X•β•).

You probably will not be able to obtain a result if you use only the above loglikelihood function for maximum likelihood estimation. In fact, many modern software packages take advantage of the following relation, with or without telling the user: letting σij be the ijth element of Σ,

σij = (1/n)(yi − Xiβi)⊤(yj − Xjβj).

If we define the n × g matrix U(β•) to have ith column yi − Xiβi, then

Σ = (1/n) U⊤(β•) U(β•).

We can replace Σ in the loglikelihood function with (1/n)U⊤(β•)U(β•) and then take the FOC with respect to β•. Once we have obtained β̂•ML, we can use it to compute Σ̂ML. Consequently, we can estimate the covariance matrix of β̂•ML as

Var(β̂•ML) = (X•⊤(Σ̂ML⁻¹ ⊗ In)X•)⁻¹.

4.2 Linear Simultaneous Equations Models

The linear simultaneous equations models in this section are really multivariate IV models. The ith equation of a linear simultaneous system can be written as

yi = Xiβi + ui = Ziβ1i + Yiβ2i + ui,

where the regressors Xi can be partitioned into two parts: the predetermined Zi (n × k1i) and the endogenous Yi (n × k2i). We can still use a single equation to represent the whole system,

y• = X•β• + u•,  E(u•u•⊤) = Σ ⊗ In,

but we need to keep in mind that X• is correlated with u•.

4.2.1 GMM Estimation

We can adopt GMM estimation to solve the above model. The theoretical moment conditions that lead to the efficient GMM estimator are

E(X⊤Ω⁻¹(y − Xβ)) = 0.  (4.1)

Let W denote an n × l matrix of exogenous and predetermined variables. This instrument matrix contains all the Zi plus instruments for each Yi. As in IV estimation, we premultiply the regressors Xi by PW and define X̂i = PW Xi. We construct a block-diagonal matrix X̂•, with diagonal blocks X̂i, to replace X in (4.1). This allows us to write the estimating equations for efficient GMM estimation:

X̂•⊤(Σ⁻¹ ⊗ In)(y• − X•β•) = 0.  (4.2)

The above equation looks like a set of GMM estimating equations, but it is really more like a group of estimating equations for IV estimation. This becomes clear if we rewrite it in the form

[ σ^11 X1⊤PW  ⋯  σ^1g X1⊤PW ;  ⋮  ⋱  ⋮ ;  σ^g1 Xg⊤PW  ⋯  σ^gg Xg⊤PW ] [ y1 − X1β1 ; ⋮ ; yg − Xgβg ] = 0,

where σ^ij denotes the ijth element of Σ⁻¹. For a particular row i, we have

Σ_{j=1}^g σ^ij Xi⊤PW(yj − Xjβj) = 0,

or

Σ_{j=1}^g Xi⊤(σ^ij PW)(yj − Xjβj) = 0.

Therefore, we can rewrite the estimating equations (4.2) as

X•⊤(Σ⁻¹ ⊗ PW)(y• − X•β•) = 0.

The above equation is not feasible unless Σ is known. We first need to obtain a consistent estimate of Σ, and then compute the estimate of β•. The procedure can be carried out in the following three steps:

(1) We first estimate the individual equations of the system by 2SLS.

(2) Then, we use the 2SLS residuals to compute the matrix Σ̂2SLS,

Σ̂2SLS = (1/n) Û⊤Û,

where Û is an n × g matrix with ith column ûi2SLS.

(3) Finally, we compute β̂•,

β̂•3SLS = (X•⊤(Σ̂2SLS⁻¹ ⊗ PW)X•)⁻¹ X•⊤(Σ̂2SLS⁻¹ ⊗ PW)y•.

This estimator is called the three-stage least squares, or 3SLS, estimator. We can estimate the covariance matrix of the classical 3SLS estimator by

Var(β̂•3SLS) = (X•⊤(Σ̂2SLS⁻¹ ⊗ PW)X•)⁻¹.


4.2.2 Structural and Reduced Forms

When the system of equations is written in the form

yi = Xiβi + ui = Ziβ1i + Yiβ2i + ui,  (4.3)

it is normally the case that each equation has a direct economic interpretation. It is for this reason that these are called structural equations. The full system of equations constitutes what is called the structural form of the model.

Instead of stacking the equations vertically as we did before, we can also stack them horizontally. Define the n × g matrix Y as [y1, y2, ..., yg]. Similarly, define an n × g matrix U, each column of which corresponds to a vector ui of error terms. In this notation, the structural form can be represented as

Y Γ = WB + U.  (4.4)

According to the textbook, (4.3) is embodied in (4.4) through the following equation:

[yi  Yi] [1; −β2i] = Ziβ1i + ui.

The above equation is a bit tricky. Let me explain it in detail:

(a) First, the endogenous variables Yi in equation i are the dependent variables of some other equations, say equation j, so that Yi = yj. Therefore, the ith column of the LHS of (4.4) is

Y Γi = [y1 ... yi ... yj ... yg] [0; ...; 1; ...; −β2i; ...; 0] = [yi  Yi] [1; −β2i].

(b) For the ith column of the RHS of (4.4), it is more intuitive if we rewrite W as

W = [Z1 ... Zi ... Zg  C],

where the instrument matrix W contains all the exogenous variables Zi and some additional instrumental variables C. Therefore, the ith column of the RHS of (4.4) is

W Bi + ui = [Z1 ... Zi ... Zg  C] [0; ...; β1i; ...; 0; 0] + ui = Ziβ1i + ui.

We can postmultiply both sides of equation (4.4) by Γ⁻¹ to obtain

Y = WBΓ⁻¹ + V,

where V ≡ UΓ⁻¹. This representation is called the restricted reduced form, or RRF. It is in contrast to the unrestricted reduced form, or URF,

Y = WΠ + V,

where we simply treat Π as the RHS coefficients without imposing the restriction Π = BΓ⁻¹.

4.2.3 Maximum Likelihood Estimation

For simplicity, we assume the error terms are normally distributed

y• = X•β• + u•, u• ∼ N(0,Σ⊗ In).

The maximum likelihood estimator of a linear simultaneous system is called the full information maximum likelihood, or FIML, estimator. The loglikelihood function is

−(gn/2) log 2π − (n/2) log |Σ| + n log |det Γ| − (1/2)(y• − X•β•)⊤(Σ⁻¹ ⊗ In)(y• − X•β•).

As usual, we need the following equation to set up the link between Σ and β•, or, equivalently, B and Γ:

Σ = (1/n)(YΓ − WB)⊤(YΓ − WB).

There are many numerical methods for obtaining FIML estimates. One of them is to make use of the artificial regression

(Φ⊤ ⊗ In)(y• − X•β•) = (Φ⊤ ⊗ In) X•(B, Γ) b + resid,

where ΦΦ⊤ = Σ⁻¹. We start from initial consistent estimates and then use the above equation to update the estimates of B and Γ. Another approach is to concentrate the loglikelihood function with respect to Σ. The concentrated loglikelihood function can be written as

−(gn/2)(log 2π + 1) + n log |det Γ| − (n/2) log |(1/n)(YΓ − WB)⊤(YΓ − WB)|.

We can obtain B̂ML and Γ̂ML directly by maximizing the above expression with respect to B and Γ.

It is also important to test any overidentifying restrictions, for which it is natural to use an LR test. The restricted value of the loglikelihood function is the maximized value

LRr = −(gn/2)(log 2π + 1) + n log |det Γ̂| − (n/2) log |(1/n)(YΓ̂ − WB̂)⊤(YΓ̂ − WB̂)|.

The unrestricted value is

LRu = −(gn/2)(log 2π + 1) − (n/2) log |(1/n)(Y − WΠ̂)⊤(Y − WΠ̂)|,

where Π̂ denotes the matrix of OLS estimates of the parameters of the URF. The test is simply

2(LRu − LRr) ∼ᵃ χ²(gl − k).

When a system of equations consists of just one structural equation, we can write it as

y = Xβ + u = Zβ1 + Yβ2 + u,

where β1 is k1 × 1, β2 is k2 × 1, and k = k1 + k2. Since the above equation includes the endogenous variables Y, we can compose a complete simultaneous system by adding the URF equations:

Y = WΠ + V = ZΠ1 + W1Π2 + V.

Maximum likelihood estimation of this system is called limited-information maximum likelihood, or LIML. We can treat LIML as FIML applied to a system in which only one equation is overidentified. In fact, LIML is a single-equation estimation method. Although its name includes the term maximum likelihood, the most popular way to calculate the coefficients looks more like least squares estimation. Such a calculation was investigated by Anderson and Rubin as early as 1949.

Anderson and Rubin proposed calculating the coefficients β2 by minimizing the ratio

κ = (y − Yβ2)⊤MZ(y − Yβ2) / (y − Yβ2)⊤MW(y − Yβ2).

This is why the LIML estimator is also referred to as the least variance ratio estimator. Once we obtain the estimate κ̂, we compute the LIML coefficients β̂LIML following

β̂LIML = (X⊤(I − κ̂MW)X)⁻¹ X⊤(I − κ̂MW)y.

A suitable estimate of the covariance matrix of the LIML estimator is

Var(β̂LIML) = σ̂²(X⊤(I − κ̂MW)X)⁻¹,

where

σ̂² = (1/n)(y − Xβ̂LIML)⊤(y − Xβ̂LIML).

One significant virtue of LIML (and also FIML) is its invariance to the normalization of the equation. This is a very useful and important property, since simultaneous equations systems can be parameterized in many different ways. Consider the following demand-supply model:

qt = γd pt + Xdtβd + udt,
qt = γs pt + Xstβs + ust.

We can certainly reparameterize the whole system as

pt = γ′d qt + Xdtβ′d + resid,
pt = γ′s qt + Xstβ′s + resid,

where

γ′d = 1/γd,  β′d = −βd/γd,  γ′s = 1/γs,  β′s = −βs/γs.  (4.5)

The invariance property implies that the FIML or LIML estimates bear precisely the same relationships as the true parameters in (4.5). This nice property is not shared by 2SLS, 3SLS, or other GMM estimators.

Chapter 5

Methods for Stationary Time-Series Data

5.1 AR Process

The pth order autoregressive, or AR(p), process can be written as

yt = γ + ρ1yt−1 + ρ2yt−2 + ... + ρpyt−p + εt,  εt ∼ IID(0, σε²),

where the process for εt is often referred to as white noise. If we define ut ≡ yt − E(yt), with E(yt) = γ/(1 − Σ_{i=1}^p ρi), the above equation can then be rewritten as

ut = Σ_{i=1}^p ρi ut−i + εt,

or, with lag operator notation,

ut = ρ(L)ut + εt,  or  (1 − ρ(L))ut = εt.

For ut an AR(1) process, the autocovariance matrix is

Ω(ρ) = (σε²/(1 − ρ²)) [ 1  ρ  ⋯  ρ^(n−1) ;  ρ  1  ⋯  ρ^(n−2) ;  ⋮  ⋮  ⋱  ⋮ ;  ρ^(n−1)  ρ^(n−2)  ⋯  1 ].

Applications that involve autoregressions of order greater than two are relatively unusual. Nonetheless, higher-order models can be handled in the same fashion. Let vi denote the covariance of ut and ut−i, for i = 0, 1, ..., p. The elements of the autocovariance matrix, the autocovariances, obey the Yule-Walker equations

v0 = ρ1v1 + ρ2v2 + ... + ρpvp + σε²,
v1 = ρ1v0 + ρ2v1 + ... + ρpvp−1,
⋮
vp = ρ1vp−1 + ρ2vp−2 + ... + ρpv0.

We solve the above system of linear equations for vi, i = 0, ..., p. We compute the remaining vk, with k > p, from

vk = ρ1vk−1 + ρ2vk−2 + ... + ρpvk−p.

If there are n observations, the autocovariance matrix is simply

Ω = [ v0  v1  v2  ⋯  vn−1 ;  v1  v0  v1  ⋯  vn−2 ;  v2  v1  v0  ⋯  vn−3 ;  ⋮ ;  vn−1  vn−2  vn−3  ⋯  v0 ].

If we normalize the above matrix by a scalar chosen to make the diagonal elements equal to unity, the result is the autocorrelation matrix, and we call its elements the autocorrelations.

There are three popular methods to estimate the AR(p) model:

(a) Drop the first p observations and estimate the nonlinear regression model

yt = Xtβ + Σ_{i=1}^p ρi(yt−i − Xt−iβ) + εt

by NLS.

(b) Estimate by feasible GLS, possibly iterated.

(c) Estimate by the GNR that corresponds to the nonlinear regression, with an extra artificial observation corresponding to the first observation.
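A sketch of the Yule-Walker computation above: solve the (p+1) linear equations for v0, ..., vp, then use the recursion for vk with k > p; the n × n autocovariance matrix is the Toeplitz matrix built from these values. The function name is illustrative.

```python
import numpy as np

def ar_autocovariances(rho, sigma2_eps, n):
    """Autocovariances v_0, ..., v_{n-1} of an AR(p) process from the
    Yule-Walker equations."""
    p = len(rho)
    A = np.zeros((p + 1, p + 1))
    b = np.zeros(p + 1)
    b[0] = sigma2_eps
    for j in range(p + 1):                      # equation for v_j
        A[j, j] += 1.0
        for i, r in enumerate(rho, start=1):
            A[j, abs(j - i)] -= r
    v = list(np.linalg.solve(A, b))             # v_0 ... v_p
    for k in range(p + 1, n):                   # recursion for k > p
        v.append(sum(r * v[k - i] for i, r in enumerate(rho, start=1)))
    return np.array(v[:n])
    # the n x n autocovariance matrix is then scipy.linalg.toeplitz(v)

# AR(1) check: v_0 = sigma2 / (1 - rho^2) and v_j = rho^j * v_0
v = ar_autocovariances([0.7], 1.0, 5)
print(v, (1.0 / (1 - 0.49)) * 0.7 ** np.arange(5))
```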

5.2 MA Process and ARMA Process

A qth order moving-average, or MA(q), process can be written as

yt = µ+ α0εt + α1εt−1 + ...+ αqεt−q,

where E(yt) = µ. By defining ut ≡ yt − µ, we can write

ut = Σ_{j=0}^q αj εt−j = (1 + α(L))εt.

The autocovariances of an MA process are much easier to calculate. Since the εt are white noise, and hence uncorrelated, the variance of ut is

Var(ut) = E(ut²) = σε²(1 + Σ_{j=1}^q αj²).

Similarly, the jth order autocovariance is, for j > 0,

E(ut ut−j) = σε²(αj + Σ_{i=1}^{q−j} αj+i αi)   for j < q,
E(ut ut−j) = σε² αj                            for j = q, and
E(ut ut−j) = 0                                 for j > q.

The autoregressive moving-average process, or ARMA process, is the combination of an MA process with an AR process. In general, we can write an ARMA(p, q) process with nonzero mean as

(1 − ρ(L))yt = γ + (1 + α(L))εt.

For any known stationary ARMA process, the autocorrelation between ut and ut−j can easily be calculated. For an ARMA process of possibly unknown order, we define the autocorrelation function, or ACF, which expresses the autocorrelation as a function of the lag j, for j = 1, 2, .... If we have a sample yt, t = 1, ..., n, the jth order autocorrelation ρ(j) can be estimated using the formula

ρ̂(j) = Ĉov(yt, yt−j) / V̂ar(yt).

The autocorrelation function ACF(j) gives the gross correlation between yt and yt−j. But in this setting we observe, for example, that a correlation between yt and yt−2 could arise primarily because both variables are correlated with yt−1. Therefore, we might ask what is the correlation between yt and yt−2 net of the intervening effect of yt−1. We use the partial autocorrelation function, or PACF, to characterize such a relationship. The partial autocorrelation coefficient of order j is defined as the plim of the least squares estimator of the coefficient ρj^(j) in the linear regression

yt = γ^(j) + ρ1^(j) yt−1 + ... + ρj^(j) yt−j + εt

for j = 1, ..., J. We can calculate the empirical PACF up to order J by running the above regression and retaining only the estimate ρ̂j^(j) for each j.

In practice, it is often necessary to allow yt to depend on exogenous explanatory variables. Such models are sometimes referred to as ARMAX models. An ARMAX(p, q) model takes the form

yt = Xtβ + ut,  ut ∼ ARMA(p, q),  E(ut) = 0.

Let Ω denote the autocovariance matrix of the vector y. Note that Ω is composed of the autocovariances vi, which depend on the parameters ρi and αj of the ARMA(p, q) process. The log of the joint density of the observed sample is

−(n/2) log 2π − (1/2) log |Ω| − (1/2)(y − Xβ)⊤Ω⁻¹(y − Xβ).

We can certainly build our likelihood function from the above equation and maximize it. However, this seemingly simple procedure can be computationally difficult, because Ω is an n × n matrix and each of its elements is a nonlinear function of the unknown ARMA parameters.
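A minimal sketch of the empirical ACF and PACF as defined above, checked on a simulated AR(1) series; for an AR(1) with ρ = 0.6 the ACF should decay geometrically while the PACF cuts off after lag 1. The function names are illustrative.

```python
import numpy as np

def acf(y, nlags):
    """Sample autocorrelations rho_hat(j) = Cov(y_t, y_{t-j}) / Var(y_t)."""
    y = y - y.mean()
    denom = y @ y
    return np.array([(y[j:] @ y[:len(y) - j]) / denom for j in range(1, nlags + 1)])

def pacf(y, nlags):
    """Empirical PACF: regress y_t on y_{t-1},...,y_{t-j}, keep the last coefficient."""
    out = []
    for j in range(1, nlags + 1):
        Y = y[j:]
        X = np.column_stack([np.ones(len(Y))]
                            + [y[j - i:len(y) - i] for i in range(1, j + 1)])
        out.append(np.linalg.lstsq(X, Y, rcond=None)[0][-1])
    return np.array(out)

rng = np.random.default_rng(10)
e = rng.normal(size=2000)
y = np.zeros(2000)
for t in range(1, 2000):                       # AR(1) with rho = 0.6
    y[t] = 0.6 * y[t - 1] + e[t]
print(acf(y, 3))                               # roughly 0.6, 0.36, 0.216
print(pacf(y, 3))                              # roughly 0.6, 0, 0
```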


5.3 Single-Equation Dynamic Models

When a dependent variable depends on current and lagged values of xt, but not on lagged values of itself, we have what is called a distributed lag model:

yt = δ + Σ_{j=0}^q βj xt−j + ut,  ut ∼ IID(0, σ²).

The OLS estimates of the βj may be quite imprecise. However, this is not a problem if we are merely interested in the long-run impact of changes in the independent variable,

γ ≡ Σ_{j=0}^q βj = Σ_{j=0}^q ∂yt/∂xt−j.

One popular alternative to distributed lag models is the partial adjustment model. Suppose that the desired level of an economic variable yt is yt*. This desired level is assumed to depend on a vector of exogenous variables Xt according to

yt* = Xtβ* + et,  et ∼ IID(0, σe²).

The actual yt is assumed to adjust toward yt* according to the equation

yt − yt−1 = (1 − δ)(yt* − yt−1) + vt,  vt ∼ IID(0, σv²),

where δ is an adjustment parameter that is assumed to be positive and strictly less than 1. Then, we have

yt = yt−1 − (1 − δ)yt−1 + (1 − δ)Xtβ* + (1 − δ)et + vt
   = Xtβ + δyt−1 + ut,

where β ≡ (1 − δ)β* and ut ≡ (1 − δ)et + vt. Under the assumption that |δ| < 1, we find that

yt = Σ_{j=0}^∞ δ^j Xt−jβ + Σ_{j=0}^∞ δ^j ut−j,

which implies a particular form of distributed lag.

An autoregressive distributed lag, or ADL, model can be written as

yt = β0 + Σ_{i=1}^p βi yt−i + Σ_{j=0}^q γj xt−j + ut,  ut ∼ IID(0, σ²).

This is sometimes called an ADL(p, q) model. A widely encountered case is the ADL(1, 1) model

yt = β0 + β1yt−1 + γ0xt + γ1xt−1 + ut.

It is straightforward to check that the ADL(1, 1) model can be rewritten as

∆yt = β0 + (β1 − 1)(yt−1 − λxt−1) + γ0∆xt + ut,

where λ = (γ0 + γ1)/(1 − β1). The above equation is called an error-correction model. It expresses the ADL(1, 1) model in terms of an error-correction mechanism. Because the ECM is nonlinear in the parameters, we need NLS to estimate it. The ECM is not very popular for stationary time series, but it is very popular for estimating models with unit roots or cointegration.


5.4 Seasonality

Many economic time series display a regular pattern of seasonal variation over the course of every year. There are two different ways to deal with seasonality in economic data:

(a) Try to model it explicitly, for example with seasonal ARMA processes (AR(4), ARMA(12, 4), ...) or seasonal ADL models (ADL(4, q), ...).

(b) Use seasonally adjusted data, which have been filtered to remove the seasonal variation. Of course, there are severe consequences to doing that, so it is best avoided.

5.5 ARCH, GARCH

The concept of autoregressive conditional heteroskedasticity, or ARCH, was introduced by Engle (1982). The basic idea of ARCH models is that the variance of the error term at time t depends on the realized values of the squared error terms in previous time periods. Let u_t denote the error term and Ω_{t−1} denote an information set that consists of data observed through period t − 1. An ARCH(q) process can be written as

u_t = σ_t ε_t;    σ_t² ≡ E(u_t² | Ω_{t−1}) = α_0 + Σ_{i=1}^{q} α_i u_{t−i}²,

where α_i > 0 for i = 0, 1, ..., q and ε_t is white noise with variance 1. The conditional variance is clearly autoregressive. Since it depends on t, the model is also heteroskedastic. Moreover, the variance of u_t is a function of u_{t−1} through σ_t, which means that the variance of u_t is conditional on the past of the process; that is where the term conditional comes from. The error terms u_t and u_{t−1} are clearly dependent. They are, however, uncorrelated. Thus, an ARCH process involves only heteroskedasticity, not serial correlation. The original ARCH process has not proven to be very satisfactory in applied work.

In fact, the ARCH model became famous because of its descendant, the generalized ARCH model, which was proposed by Bollerslev (1986). We may write a GARCH(p, q) process as

u_t = σ_t ε_t;    σ_t² ≡ E(u_t² | Ω_{t−1}) = α_0 + Σ_{i=1}^{q} α_i u_{t−i}² + Σ_{j=1}^{p} δ_j σ_{t−j}².

The conditional variance here can be written more compactly as

σ_t² = α_0 + α(L) u_t² + δ(L) σ_t².

The simplest and by far the most popular GARCH model is the GARCH(1, 1) process, for which the conditional variance can be written as

σ_t² = α_0 + α_1 u_{t−1}² + δ_1 σ_{t−1}².

Unlike the original ARCH model, the GARCH(1, 1) process generally seems to work quite well in practice. More precisely, in many cases GARCH(1, 1) cannot be rejected against any more general GARCH(p, q) process.


There are two possible methods to estimate ARCH and GARCH models:

(1) Feasible GLS: since ARCH and GARCH processes induce heteroskedasticity, it might seem natural to use feasible GLS. However, this approach is very rarely used, because it is not asymptotically efficient. In the case of a GARCH(1, 1), σ_t² depends on u_{t−1}², which in turn depends on the estimates of the regression function. Because of this, estimating the two equations

y_t = X_t β + u_t
σ_t² = α_0 + α_1 u_{t−1}² + δ_1 σ_{t−1}²

jointly yields more efficient estimates.

(2) MLE: the most popular way to estimate GARCH models is to assume that the error terms are normally distributed and use the ML method. To do that, we first write a linear regression model with GARCH errors, defined in terms of a normal innovation process, as

(y_t − X_t β)/σ_t(β, θ) = ε_t,    ε_t ∼ N(0, 1).

The density of y_t conditional on Ω_{t−1} is then

(1/σ_t(β, θ)) φ((y_t − X_t β)/σ_t(β, θ)),

where φ(·) denotes the standard normal density. Therefore, the contribution to the loglikelihood function made by the t-th observation is

ℓ_t(β, θ) = −(1/2) log 2π − (1/2) log(σ_t²(β, θ)) − (1/2)(y_t − X_t β)²/σ_t²(β, θ).

This function is not easy to calculate because of the skedastic function σ_t²(β, θ). It is defined implicitly by the recursion

σ_t² = α_0 + α_1 u_{t−1}² + δ_1 σ_{t−1}²,

and there are no good starting values for σ_{t−1}². An ARCH(q) model does not have the lagged σ_t² term and therefore does not have this problem: we can simply use the first q observations to compute the squared residuals needed to form the skedastic function σ_t²(β, θ). For the starting values of the lagged σ_t², there are several popular ad hoc procedures:

(a) Set all unknown pre-sample values of u_t² and σ_t² to zero.

(b) Replace them by an estimate of their common unconditional expectation: an appropriate function of the θ parameters, or the SSR/n from OLS estimation.

(c) Treat the unknown starting values as extra parameters.

Different procedures can produce very different results. With STATA or any other black-box program, users should know what the package is actually doing.
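As an illustration of the recursion and of pre-sample procedure (b), here is a minimal numpy sketch of the GARCH(1, 1) loglikelihood for a linear regression model; the function name and parameter ordering are purely illustrative. One would maximize it numerically, for example by passing its negative to scipy.optimize.minimize.

import numpy as np

def garch11_loglik(params, y, X):
    # params = (beta_1, ..., beta_k, alpha0, alpha1, delta1)
    k = X.shape[1]
    beta, (a0, a1, d1) = params[:k], params[k:]
    u = y - X @ beta
    n = len(u)
    s0 = np.mean(u ** 2)          # pre-sample u^2 and sigma^2: procedure (b)
    sig2 = np.empty(n)
    for t in range(n):
        u2_lag = u[t - 1] ** 2 if t > 0 else s0
        s2_lag = sig2[t - 1] if t > 0 else s0
        sig2[t] = a0 + a1 * u2_lag + d1 * s2_lag
    return -0.5 * np.sum(np.log(2.0 * np.pi) + np.log(sig2) + u ** 2 / sig2)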

Chapter 6

Unit Roots and Cointegration

Before we get started, there are several definitions we need to be familiar with:

(1) I(0): as t → ∞, the first and second moments tend to fixed stationary values, and the covariances of the elements y_t and y_s tend to stationary values that depend only on |t − s|. Such a series is said to be integrated to order zero, or I(0). This is a necessary, but not sufficient, condition for a stationary process. Therefore, all stationary processes are I(0), but not all I(0) processes are stationary.

(2) I(1) or unit root: a nonstationary time series is said to be integrated to order one, or I(1), if the series of its first differences

∆y_t ≡ y_t − y_{t−1} = (1 − L)y_t

is I(0). We also say that an I(1) series has a unit root.

(3) I(d): a series is said to be integrated to order d, or I(d), if it must be differenced d times to obtain an I(0) series. Note that if y_t = ε_t ∼ I(0), then ∆y_t = ε_t − ε_{t−1} ∼ I(−1). This is what is called the overdifferencing problem.

(4) Cointegration: if two or more series are individually integrated, but some linear combination of them has a lower order of integration, then the series are said to be cointegrated.

(5) Brownian motion: also referred to as the standardized Wiener process. This process, denoted W(r) for 0 ≤ r ≤ 1, can be interpreted as the limit of the standardized random walk w_t as the length of each interval becomes infinitesimally small. It is defined as

W(r) ≡ plim_{n→∞} n^{−1/2} w_{[rn]} = plim_{n→∞} n^{−1/2} Σ_{t=1}^{[rn]} ε_t,    ε_t ∼ IID(0, 1),

where [rn] means the integer part of the quantity rn.



6.1 Unit Root

A random walk process is defined as

yt = yt−1 + et, y0 = 0, et ∼ IID(0, σ2).

This process is obviously nonstationary. An obvious generalization is to add a constant term, which gives us the random walk with drift model

y_t = γ_1 + y_{t−1} + e_t,    y_0 = 0,    e_t ∼ IID(0, σ²).

The first difference of y_t is

∆y_t = γ_1 + e_t,

which is I(0). Therefore, y_t is I(1), or in other words has a unit root.

There are many methods to test for unit roots. The simplest and most widely-used tests

are variants of the Dickey-Fuller tests, or DF tests. Consider the model

yt = βyt−1 + et, et ∼ IID(0, σ2).

When β = 1, this model has a unit root. If we subtract yt−1 from both sides, we obtain

∆yt = (β − 1)yt−1 + et,

The obvious way to test the unit root hypothesis is to use the t statistic for the hypothesis β − 1 = 0 against the alternative that this quantity is negative. This statistic is usually referred to as a τ statistic. Another possible test statistic is n times the OLS estimate of β − 1; this statistic is called a z statistic. If we wish to test for a unit root in a model where the random walk has a drift, the appropriate test regression is

∆y_t = γ_0 + γ_1 t + (β − 1)y_{t−1} + e_t,

and if we wish to test for a unit root when the random walk has both a drift and a trend, the appropriate test regression is

∆y_t = γ_0 + γ_1 t + γ_2 t² + (β − 1)y_{t−1} + e_t.
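The τ statistic is just an ordinary t statistic computed from the relevant test regression; only its reference distribution is nonstandard. A minimal sketch is given below; the trend argument labels are my own and simply select which deterministic terms enter the regression. The returned value must be compared with simulated Dickey-Fuller critical values, not with the N(0, 1) distribution.

import numpy as np

def df_tau(y, trend="c"):
    # trend: "c" constant, "ct" constant + trend, "ctt" constant + trend + trend^2
    y = np.asarray(y, dtype=float)
    dy, ylag = np.diff(y), y[:-1]
    n = len(dy)
    cols = [ylag, np.ones(n)]
    if trend in ("ct", "ctt"):
        cols.append(np.arange(1, n + 1, dtype=float))
    if trend == "ctt":
        cols.append(np.arange(1, n + 1, dtype=float) ** 2)
    X = np.column_stack(cols)
    coef = np.linalg.lstsq(X, dy, rcond=None)[0]
    resid = dy - X @ coef
    s2 = resid @ resid / (n - X.shape[1])
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[0, 0])
    return coef[0] / se               # t statistic on (beta - 1)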

The asymptotic distributions of the Dickey-Fuller test statistics are referred to as nonstandard distributions, or as Dickey-Fuller distributions. We use one of the simplest random walk models as an example. For the model

w_t = w_{t−1} + σε_t,    w_0 = 0,    ε_t ∼ IID(0, 1),

the associated z statistic is

z_nc = n (Σ_{t=2}^{n} w_{t−1} ε_t) / (Σ_{t=1}^{n−1} w_t²) = (n^{−1} Σ_{t=2}^{n} w_{t−1} ε_t) / (n^{−2} Σ_{t=1}^{n−1} w_t²).    (6.1)


(a) For the numerator of (6.1): since

Σ_{t=1}^{n} w_t² = Σ_{t=0}^{n−1} (w_t + (w_{t+1} − w_t))² = Σ_{t=0}^{n−1} w_t² + 2 Σ_{t=0}^{n−1} w_t ε_{t+1} + Σ_{t=0}^{n−1} ε_{t+1}²,

it follows that

Σ_{t=0}^{n−1} w_t ε_{t+1} = (1/2)(w_n² − Σ_{t=0}^{n−1} ε_{t+1}²).

Because n^{−1} w_n² converges to W²(1) and n^{−1} Σ_{t=0}^{n−1} ε_{t+1}² converges to 1 by a law of large numbers, we obtain

plim_{n→∞} (1/n) Σ_{t=2}^{n} w_{t−1} ε_t = plim_{n→∞} (1/n) Σ_{t=0}^{n−1} w_t ε_{t+1} = (1/2)(W²(1) − 1).

(b) For the denominator of (6.1): since

n^{−2} Σ_{t=1}^{n−1} w_t² is asymptotically equal to (1/n) Σ_{t=1}^{n−1} W²(t/n),

and

∫_0^1 f(x) dx ≡ lim_{n→∞} (1/n) Σ_{t=1}^{n} f(t/n),

we have

plim_{n→∞} n^{−2} Σ_{t=1}^{n−1} w_t² = ∫_0^1 W²(r) dr.

These results imply that

plim_{n→∞} z_nc = (1/2)(W²(1) − 1) / ∫_0^1 W²(r) dr.

In practice, we need to simulate this distribution, since there is no simple analytical expression for it.
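A minimal Monte Carlo sketch of that simulation is given below: it draws many driftless random walks, computes z_nc exactly as in (6.1) for each, and the empirical quantiles of the result approximate the Dickey-Fuller critical values. The function name and default settings are purely illustrative.

import numpy as np

def simulate_znc(n=250, reps=20000, seed=0):
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((reps, n))
    w = np.cumsum(eps, axis=1)                    # w_t, t = 1, ..., n
    num = np.sum(w[:, :-1] * eps[:, 1:], axis=1)  # sum_{t=2}^{n} w_{t-1} eps_t
    den = np.sum(w[:, :-1] ** 2, axis=1)          # sum_{t=1}^{n-1} w_t^2
    return n * num / den

# For example, np.quantile(simulate_znc(), 0.05) approximates the 5% critical value.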

The models we have seen so far do not include any economic variables beyond y_{t−1}. If the error terms are serially correlated, the DF tests are no longer asymptotically valid. The most popular approach is then to use the so-called augmented Dickey-Fuller, or ADF, tests. For the unit root test regression

∆y_t = X_t γ + (β − 1)y_{t−1} + u_t,

we assume that the error term u_t follows an AR(1) process

u_t = ρ_1 u_{t−1} + e_t,    e_t ∼ white noise.

Then the test regression becomes

∆y_t = X_t γ − ρ_1 X_{t−1} γ + (ρ_1 + β − 1)y_{t−1} − βρ_1 y_{t−2} + e_t
     = X_t γ + (β − 1)(1 − ρ_1)y_{t−1} + δ_1 ∆y_{t−1} + e_t
     = X_t γ + π y_{t−1} + δ_1 ∆y_{t−1} + e_t,

where δ_1 = βρ_1. Here, we let π = (β − 1)(1 − ρ_1). The τ statistic is simply the t statistic for π = 0. The z statistic is a little trickier. Since ρ_1 ∈ (−1, 1), to test for a unit root we only need to consider whether β − 1 = π/(1 − ρ_1) = 0. Therefore, a valid ADF z statistic is nπ̂/(1 − ρ̂_1).


6.2 Cointegration

We begin by considering the simplest case, namely, a VAR(1) model with just two variables. The model can be written as

y_{t1} = φ_{11} y_{t−1,1} + φ_{12} y_{t−1,2} + u_{t1},
y_{t2} = φ_{21} y_{t−1,1} + φ_{22} y_{t−1,2} + u_{t2},    [u_{t1}  u_{t2}]^⊤ ∼ IID(0, Ω).

Let z_t and u_t be 2-vectors and let Φ be a 2 × 2 matrix. With proper definitions of z_t, u_t, and Φ, the above equations can be represented by

z_t = Φ z_{t−1} + u_t,    u_t ∼ IID(0, Ω).

Both y_{t1} and y_{t2} are I(1) if the coefficients on the lagged dependent variables are equal to unity or, equivalently, if at least one of the eigenvalues of the matrix Φ is equal to 1. The series y_{t1} and y_{t2} are cointegrated if there exists a 2-vector η with elements η_1 and η_2 such that

η^⊤ z_t = η_1 y_{t1} − η_2 y_{t2}

is I(0). The vector η is called a cointegrating vector. The whole system is called CI(1, 1), which is short for cointegration in which each individual process is I(1).

In practice, we usually allow the relationship between y_{t1} and y_{t2} to change over time by adding a constant term and trend terms,

η^⊤ z_t = X_t γ + v_t,    (6.2)

where X_t denotes a deterministic row vector that may or may not have any elements.

As we can see, η is not unique. Usually we normalize η by setting its first element to 1. In that case, equation (6.2) can be written as

η^⊤ z_t = X_t γ + v_t,
[1  −η_2] [y_{t1}  y_{t2}]^⊤ = X_t γ + v_t,
y_{t1} = X_t γ + η_2 y_{t2} + v_t.

The OLS estimator η̂_2^L is known as the levels estimator. It might seem odd to use OLS here, since y_{t2} is correlated with v_t and v_t is probably serially correlated. In fact, we can show not only that OLS is valid, but also that the estimator η̂_2^L is super-consistent. Normally, an OLS estimator is root-n consistent, in the sense that β̂_OLS − β_0 goes to zero like n^{−1/2} as n → ∞. The estimator η̂_2^L is n-consistent, or super-consistent: η̂_2^L − η_2 goes to zero like n^{−1}, a much faster rate.

Assume that the first eigenvalue of Φ is λ_1 = 1 and that the second eigenvalue satisfies |λ_2| < 1. Then the whole system can also be represented by an error-correction model (ECM),

∆y_{t1} = X_t γ + η_2 ∆y_{t2} + (λ_2 − 1)(y_{t−1,1} − η_2 y_{t−1,2}) + resid.    (6.3)

(Of course, if we have a strong belief that the second eigenvalue λ_2 = 1, it is no longer appropriate to use the ECM representation.)


Equation (6.3) is restrictive, in that the parameter η_2 appears twice. For the purposes of estimation and testing, we usually use the following unrestricted equation:

∆y_{t1} = X_t γ + α ∆y_{t2} + δ_1 y_{t−1,1} + δ_2 y_{t−1,2} + resid.

The ECM estimator is η̂_2^E ≡ −δ̂_2/δ̂_1, where δ̂_1 and δ̂_2 are the OLS estimates of δ_1 and δ_2.
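Both single-equation estimators can be computed with nothing more than OLS. A minimal sketch, assuming that X_t contains only a constant and using purely illustrative names:

import numpy as np

def cointegration_estimates(y1, y2):
    y1, y2 = np.asarray(y1, float), np.asarray(y2, float)
    n = len(y1)
    # Levels estimator: OLS of y_{t1} on a constant and y_{t2}
    eta_levels = np.linalg.lstsq(np.column_stack([np.ones(n), y2]),
                                 y1, rcond=None)[0][1]
    # Unrestricted ECM: Delta y_{t1} on const, Delta y_{t2}, y_{t-1,1}, y_{t-1,2}
    dy1, dy2 = np.diff(y1), np.diff(y2)
    Z = np.column_stack([np.ones(n - 1), dy2, y1[:-1], y2[:-1]])
    b = np.linalg.lstsq(Z, dy1, rcond=None)[0]
    eta_ecm = -b[3] / b[2]            # eta_hat_2^E = -delta2_hat / delta1_hat
    return eta_levels, eta_ecm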

6.2.1 VAR Representation

The methods discussed so far for estimating cointegrating vectors are all single-equation methods. When there is more than one cointegrating relation among a set of more than two variables, we need to adopt a VAR representation. Consider the VAR

Y_t = X_t B + Σ_{i=1}^{p+1} Y_{t−i} Φ_i + U_t,

where Y_t is a 1 × g vector of observations on the levels of a set of variables, each of which is assumed to be I(1), X_t is a row vector of deterministics, B is a matrix of coefficients on those deterministics, U_t is a 1 × g vector of error terms, and the Φ_i are g × g matrices of coefficients. We can reparameterize the above equation as

∆Y_t = X_t B + Y_{t−1} Π + Σ_{i=1}^{p} ∆Y_{t−i} Γ_i + U_t,    (6.4)

where

Γ_p = −Φ_{p+1},    Γ_i = Γ_{i+1} − Φ_{i+1},    Π = Σ_{i=1}^{p+1} Φ_i − I_g.

Suppose that the matrix Π has rank r, with 0 ≤ r < g. In this case, we can always write

Π = η α^⊤,

where η and α are both g × r matrices. Therefore, (6.4) can be rewritten as

∆Y_t = X_t B + Y_{t−1} η α^⊤ + Σ_{i=1}^{p} ∆Y_{t−i} Γ_i + U_t.

We can easily tell that not all the elements of η and α are identified. One convenient normalization is to set the first element of η to 1. To estimate the above nonlinear regression, we can use the GNR, or ML under the assumption that the error terms are normal.

6.2.2 Testing for Cointegration

There are three popular tests we can use to test for cointegration. To make our lives easier, we consider a CI(1, 1) model.


(a) Engle-Granger tests: the idea is to estimate the cointegrating regression

y_{t1} = X_t γ + y_{t2} η_2 + v_t,    (6.5)

where Y_t ≡ [y_{t1}  y_{t2}] and η = [1  −η_2], and then test the resulting estimates of v_t with a Dickey-Fuller test. If there is no cointegration, v_t ∼ I(1); if there is cointegration, v_t is I(0).

The augmented EG test is performed in almost exactly the same way as an augmented DF test, by running the regression

∆v̂_t = X_t γ + β v̂_{t−1} + Σ_{j=1}^{p} δ_j ∆v̂_{t−j} + e_t,

where p is chosen to remove any evidence of serial correlation in the residuals. The series v̂_t are the estimated residuals from the cointegrating regression (6.5).

(b) ECM tests: a second way to test for cointegration involves the estimation of an error-correction model. For the same CI(1, 1) model, we have the restricted ECM

∆y_{t1} = X_t γ + η_2 ∆y_{t2} + (λ_2 − 1)(y_{t−1,1} − η_2 y_{t−1,2}) + resid.

If there is no cointegration, the second eigenvalue λ_2 must be 1 (there is cointegration if |λ_2| < 1). This means that, in the unrestricted version

∆y_{t1} = X_t γ + α ∆y_{t2} + δ_1 y_{t−1,1} + δ_2 y_{t−1,2} + resid,

the coefficients δ_1 and δ_2 should both be zero. In practice, however, we do not need to test both restrictions at once; we can simply test whether δ_1 = 0 with a t statistic. The distribution of this statistic is certainly non-normal because of the unit root. If Y_t is a 1 × g vector, Ericsson and MacKinnon (2002) call it the κ_d(g) distribution; see the paper for further details.

(c) VAR test: as its name indicates, this test is based on a vector autoregression representation. The idea is to estimate the VAR model

∆Y_t = X_t B + Y_{t−1} Π + Σ_{i=1}^{p} ∆Y_{t−i} Γ_i + U_t,

subject to the constraint Π = η α^⊤, for various values of the rank r of the impact matrix Π, using ML estimation. Thanks to this design, we can test a null hypothesis of any number of cointegrating relations from 0 to g − 1 against alternatives with a greater number of relations, up to a maximum of g. The most convenient test statistics are LR statistics.

In practice, we can perform the VAR LR test according to the following steps:


(1) Regress both ∆Y_t and Y_{t−1} on the deterministic variables X_t and the lags ∆Y_{t−1}, ∆Y_{t−2}, ..., ∆Y_{t−p}, which yields two sets of residuals,

V̂_{t1} = ∆Y_t − ∆Ŷ_t,
V̂_{t2} = Y_{t−1} − Ŷ_{t−1},

where V̂_{t1} and V̂_{t2} are both 1 × g vectors, and ∆Ŷ_t and Ŷ_{t−1} denote the fitted values from the regressions.

(2) Compute the g × g sample covariance matrices

Σ̂_{jl} = (1/n) Σ_{t=1}^{n} V̂_{tj}^⊤ V̂_{tl},    j = 1, 2,  l = 1, 2.

(3) Then we find the solutions λ_i and z_i, for i = 1, ..., g, to the equations

(λ_i Σ̂_{22} − Σ̂_{21} Σ̂_{11}^{−1} Σ̂_{12}) z_i = 0,

where λ_i is a scalar and z_i is a g × 1 vector. This is similar to solving for eigenvalues and eigenvectors. We first compute Ψ_{22}, where

Ψ_{22} Ψ_{22}^⊤ = Σ̂_{22}^{−1}.

We then find the eigenvalues of the PSD matrix A, where

A ≡ Ψ_{22}^⊤ Σ̂_{21} Σ̂_{11}^{−1} Σ̂_{12} Ψ_{22}.

The eigenvalues of A are the same λ_i we are looking for. We sort the eigenvalues λ_i from largest to smallest.

(4) It can be shown that the maximized loglikelihood function for the restricted model is

−(gn/2)(log 2π + 1) − (n/2) Σ_{i=1}^{r} log(1 − λ_i).

(5) Therefore, to test the null hypothesis that r = r_1 against the alternative that r = r_2, for r_1 < r_2 ≤ g, we compute the LR statistic

−n Σ_{i=r_1+1}^{r_2} log(1 − λ_i),

which follows a non-standard distribution, as usual.
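A bare-bones numpy sketch of steps (1)-(5) follows, assuming that the only deterministic term is a constant and that p lagged differences are included; all names are illustrative, and serious applications should rely on a dedicated cointegration package.

import numpy as np

def var_lr_eigenvalues(Y, p=1):
    # Y: n x g matrix of levels, each series assumed I(1)
    Y = np.asarray(Y, dtype=float)
    dY = np.diff(Y, axis=0)
    dYt, Ylag = dY[p:], Y[p:-1]                       # Delta Y_t and Y_{t-1}
    Z = np.hstack([np.ones((len(dYt), 1))] +
                  [dY[p - i:-i] for i in range(1, p + 1)])
    resid = lambda A: A - Z @ np.linalg.lstsq(Z, A, rcond=None)[0]
    V1, V2 = resid(dYt), resid(Ylag)                  # step (1)
    n = len(V1)
    S11, S22, S12 = V1.T @ V1 / n, V2.T @ V2 / n, V1.T @ V2 / n   # step (2)
    Psi22 = np.linalg.cholesky(np.linalg.inv(S22))    # Psi22 Psi22' = S22^{-1}
    A = Psi22.T @ S12.T @ np.linalg.solve(S11, S12) @ Psi22       # step (3)
    return np.sort(np.linalg.eigvalsh(A))[::-1], n

def var_lr_statistic(Y, r1, r2, p=1):
    lam, n = var_lr_eigenvalues(Y, p)
    return -n * np.sum(np.log(1.0 - lam[r1:r2]))      # step (5)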

Chapter 7

Testing the Specification of Econometric Models

Estimating a misspecified regression model generally yields biased and inconsistent parameter estimates. In this chapter, we discuss a number of procedures that are designed for testing the specification of econometric models.

7.1 Specification Tests Based on Artificial Regressions

Let M denote a model, let µ denote a DGP that belongs to that model, and let plim_µ denote a probability limit taken under the DGP µ.

7.1.1 RESET Test

One of the oldest specification tests for linear regression models, but one that is still widely used, is the regression specification error test, or RESET test. The idea is to test the null hypothesis that

y_t = X_t β + u_t,    u_t ∼ IID(0, σ²),

where the explanatory variables X_t are predetermined with respect to the error terms u_t, against the alternative that E(y_t | X_t) is a nonlinear function of the elements of X_t. The simplest version of RESET involves regressing y_t on X_t to obtain fitted values X_t β̂ and then running the regression

y_t = X_t β + γ(X_t β̂)² + u_t.

The test statistic is the ordinary t statistic for γ = 0.
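A minimal sketch of this simplest version of RESET follows; the names are illustrative, and the returned statistic is compared with the standard normal or t distribution.

import numpy as np

def reset_test(y, X):
    y, X = np.asarray(y, float), np.asarray(X, float)
    n, k = X.shape
    fitted = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    Xa = np.column_stack([X, fitted ** 2])          # add (X_t beta_hat)^2
    coef = np.linalg.lstsq(Xa, y, rcond=None)[0]
    resid = y - Xa @ coef
    s2 = resid @ resid / (n - k - 1)
    se = np.sqrt(s2 * np.linalg.inv(Xa.T @ Xa)[-1, -1])
    return coef[-1] / se                            # t statistic for gamma = 0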

7.1.2 Conditional Moment Tests

If a model M is correctly specified, many random quantities that are functions of the dependent variables should have expectation zero. Often, these expectations are taken conditional on some information set. Let m_t(y_t, θ) be a moment function, where θ is the parameter vector and y_t is the dependent variable. Then m_t(y_t, θ̂) is referred to as an empirical moment. The idea of the conditional moment test is to test whether the quantity

m̄(y, θ̂) ≡ (1/n) Σ_{t=1}^{n} m_t(y_t, θ̂)

is significantly different from zero.

7.1.3 Information Matrix Tests

For a model estimated by ML with parameter vector θ, the asymptotic information matrix is equal to minus the asymptotic Hessian. Therefore, we should expect that, in general, this information matrix equality does not hold when the model we are estimating is misspecified. The null hypothesis for the IM test is that

plim_{n→∞} (1/n) Σ_{t=1}^{n} ( ∂²ℓ_t(θ)/∂θ_i∂θ_j + (∂ℓ_t(θ)/∂θ_i)(∂ℓ_t(θ)/∂θ_j) ) = 0,

for i = 1, ..., k and j = 1, ..., i. We can calculate IM test statistics by means of the OPG regression

ι = G(θ̂)b + M(θ̂)c + resid.

The matrix M(θ̂) is constructed as an n × (1/2)k(k + 1) matrix with typical element

∂²ℓ_t(θ̂)/∂θ_i∂θ_j + (∂ℓ_t(θ̂)/∂θ_i)(∂ℓ_t(θ̂)/∂θ_j).

The test statistic is then the ESS from the above regression. If the matrix [G(θ̂), M(θ̂)] has full rank, this test statistic is asymptotically distributed as χ²((1/2)k(k + 1)). If it does not have full rank, one or more columns of M(θ̂) have to be dropped, and the number of degrees of freedom reduced accordingly.

7.2 Nonnested Hypothesis Tests

Two models are said to be nonnested if neither model can be written as a special case of the other without imposing restrictions on both models. We can write the two models as

H_1 : y = Xβ + u_1,
H_2 : y = Zγ + u_2.

The simplest and most widely-used nonnested hypothesis tests start from the artificial comprehensive model

y = (1 − α)Xβ + αZγ + u,    (7.1)

where α is a scalar parameter. Ideally, we would test whether α = 1 or α = 0 to determine whether H_1 or H_2 represents the true DGP. However, this is not possible, since at least one of the parameters cannot be identified.


7.2.1 J Tests

One popular way to make (7.1) identified is to replace the unknown vector γ by its estimates. This idea was first suggested by Davidson and MacKinnon (1981). Thus, equation (7.1) becomes

y = Xβ + αZγ̂ + u
  = Xβ + αP_Z y + u.

The ordinary t statistic for α = 0 is called the J statistic, which is asymptotically distributed as N(0, 1) under the null hypothesis H_1.

The J test tends to overreject, often quite severely, when at least one of the following conditions holds:

• The sample size is small;

• The model under test does not fit very well;

• The number of regressors in H2 that do not appear in H1 is large.

There are several things we can do to improve its finite-sample performance. We can use a fully parametric or semiparametric bootstrap method, or we can adopt the JA test of Fisher and McAleer (1981). The test regression is

y = Xβ + αZγ̃ + u
  = Xβ + αP_Z P_X y + u,

where γ̃ denotes the estimate of γ obtained by regressing P_X y (rather than y) on Z, and the JA statistic is the t statistic for α = 0.
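A minimal sketch of the J statistic (with its JA variant noted in a comment) follows; names are illustrative, and under H_1 the statistic is compared with the N(0, 1) distribution.

import numpy as np

def j_test(y, X, Z):
    y, X, Z = (np.asarray(a, float) for a in (y, X, Z))
    n, k = X.shape
    fit = lambda W, v: W @ np.linalg.lstsq(W, v, rcond=None)[0]
    pz_y = fit(Z, y)                       # P_Z y; use fit(Z, fit(X, y)) for the JA test
    Xa = np.column_stack([X, pz_y])
    coef = np.linalg.lstsq(Xa, y, rcond=None)[0]
    resid = y - Xa @ coef
    s2 = resid @ resid / (n - k - 1)
    se = np.sqrt(s2 * np.linalg.inv(Xa.T @ Xa)[-1, -1])
    return coef[-1] / se                   # t statistic for alpha = 0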

7.2.2 P Tests

The J test can be extended to nonlinear regression models. Suppose the two models are

H1 : y = x(β) + u1,

H2 : y = z(γ) + u2.

The J statistic is the t statistic for α = 0 in the nonlinear regression

y = (1 − α)x(β) + αẑ + resid,

where ẑ ≡ z(γ̂), with γ̂ the vector of NLS estimates from model H_2. We can, as usual, run the GNR that corresponds to the above equation, evaluated at α = 0 and β = β̂. This GNR is

y − x̂ = X̂b + a(ẑ − x̂) + resid,

where x̂ = x(β̂), and X̂ = X(β̂) is the matrix of derivatives of x(β) with respect to β, evaluated at the NLS estimates β̂. The ordinary t statistic for a = 0 is called the P statistic.


7.2.3 Cox Tests

If the two nonnested models are estimated by ML, and their loglikelihood functions are

ℓ_1(θ_1) = Σ_{t=1}^{n} ℓ_{1t}(θ_1)  and  ℓ_2(θ_2) = Σ_{t=1}^{n} ℓ_{2t}(θ_2)

for models H_1 and H_2, respectively, we can use another approach, called Cox tests. Cox's idea was to extend the idea of a likelihood ratio test. Cox's test statistic is

T_1 ≡ 2n^{−1/2}(ℓ_2(θ̂_2) − ℓ_1(θ̂_1)) − 2n^{−1/2} E_{θ̂_1}(ℓ_2(θ̂_2) − ℓ_1(θ̂_1)),

which follows a normal distribution asymptotically.

7.3 Model Selection Based on Information Criteria

The true model is rarely available to economists, due to the complexity of economic processes. Thus, economists formulate approximation models to capture the effects or factors supported by the empirical data. However, using different approximation models usually leads to different empirical results, which gives rise to so-called model uncertainty. One of the two popular ways to deal with model uncertainty is model selection.

Model selection is a procedure for selecting the best model from a set of approximation models. Such a procedure generally involves calculating a criterion function for all approximation models and ranking them accordingly. One of the most widely used criterion functions is the Akaike information criterion (AIC) proposed by Akaike (1973). The simplest version of AIC is

AIC_i = ℓ_i(β̂_i) − k_i.

We then choose the model that maximizes AIC_i.

A popular alternative to AIC is the Bayesian information criterion (BIC) of Schwarz (1978), which takes the form

BIC_i = ℓ_i(β̂_i) − (1/2) k_i log n.

BIC has a stronger penalty for complexity. There are other methods based on various criteria, including the Mallows criterion (Mallows' Cp) of Mallows (1973), the prediction criterion of Amemiya (1980), the focused information criterion (FIC) of Claeskens and Hjort (2003), etc.
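As a small sketch of this ranking procedure, assuming Gaussian linear regressions and using illustrative names, one can compute the maximized loglikelihood of each candidate model and pick the one with the largest criterion value:

import numpy as np

def gaussian_loglik_ols(y, X):
    # Maximized loglikelihood of a linear regression, concentrated over sigma^2
    n = len(y)
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    s2 = resid @ resid / n
    return -0.5 * n * (np.log(2.0 * np.pi) + np.log(s2) + 1.0)

def select_model(y, candidates, criterion="aic"):
    # candidates: list of regressor matrices; returns index of the best model
    n = len(y)
    scores = []
    for X in candidates:
        l, k = gaussian_loglik_ols(y, X), X.shape[1]
        penalty = k if criterion == "aic" else 0.5 * k * np.log(n)
        scores.append(l - penalty)
    return int(np.argmax(scores)), scores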

7.4 Nonparametric Estimation

(For this part, you can also read "Nonparametric Econometrics: Theory and Practice" by Qi Li and Jeffrey Racine.)

Nonparametric regression is a form of regression analysis in which the predictor does not take a predetermined form but is constructed according to information derived from the data. Nonparametric regression requires larger sample sizes than regression based on parametric models, because the data must supply the model structure as well as the model estimates.


7.4.1 Kernel Estimation

One traditional way of estimating a PDF is to form a histogram. Given a sample x_t, t = 1, ..., n, of independent realizations of a random variable X, for any arbitrary argument x, the empirical distribution function (EDF) is

F̂(x) = (1/n) Σ_{t=1}^{n} I(x_t ≤ x).

The indicator function I is clearly discontinuous, which makes the above EDF discontinuous. In both practice and theory, we usually prefer a smooth EDF for various reasons, for example differentiability. For these reasons, we replace I with a continuous CDF, K(z), with mean 0. This function is called a cumulative kernel. It is convenient to be able to control the degree of smoothness of the estimate. Accordingly, we introduce the bandwidth parameter h as a scaling parameter for the actual smoothing distribution. This gives the kernel CDF estimator

F̂_h(x) = (1/n) Σ_{t=1}^{n} K((x − x_t)/h).    (7.2)

There are many kernels we could choose, and a popular one is the standard normal distribution, the so-called Gaussian kernel. If we differentiate equation (7.2) with respect to x, we obtain the kernel density estimator

f̂_h(x) = (1/(nh)) Σ_{t=1}^{n} k((x − x_t)/h).

This estimator is very sensitive to the value of the bandwidth h. Two popular choices for h are h = 1.059 s n^{−1/5} and h = 0.785(q̂_{.75} − q̂_{.25}) n^{−1/5}, where s is the standard deviation of the x_t and q̂_{.75} − q̂_{.25} is the difference between the estimated .75 and .25 quantiles of the data.
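A minimal sketch of the kernel density estimator with a Gaussian kernel and the first rule-of-thumb bandwidth (illustrative names):

import numpy as np

def kde_gaussian(x_grid, data, h=None):
    data = np.asarray(data, float)
    n = len(data)
    if h is None:
        h = 1.059 * data.std(ddof=1) * n ** (-0.2)   # rule-of-thumb bandwidth
    z = (np.asarray(x_grid, float)[:, None] - data[None, :]) / h
    k = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi) # Gaussian kernel
    return k.sum(axis=1) / (n * h)                   # f_hat_h at each grid point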

7.4.2 Nonparametric Regression

Nonparametric regression estimates E(y_t | x_t) directly, without making any assumptions about functional form. We suppose that two random variables Y and X are jointly distributed, and we wish to estimate the conditional expectation µ(x) ≡ E(Y | x) as a function of x, using a sample of paired observations (y_t, x_t) for t = 1, ..., n. For given x, we define

g(x) ≡ ∫_{−∞}^{∞} y f(y, x) dy = f(x) ∫_{−∞}^{∞} y f(y | x) dy = f(x) E(Y | x),

where f(x) is the marginal density of X, and f(y | x) is the density of Y conditional on X = x. Then

µ(x) = g(x)/f(x) = ∫_{−∞}^{∞} y f(x, y)/f(x) dy.

We use kernel density estimation for the joint density f(x, y) and the marginal density f(x), with a kernel k:

f̂(x, y) = (1/(n h h_y)) Σ_{i=1}^{n} k((x − x_i)/h) k((y − y_i)/h_y),    f̂(x) = (1/(nh)) Σ_{i=1}^{n} k((x − x_i)/h).

Therefore,

∫_{−∞}^{∞} y f̂(x, y) dy = (1/(n h h_y)) Σ_{i=1}^{n} k((x − x_i)/h) ∫_{−∞}^{∞} y k((y − y_i)/h_y) dy
                        = (1/(n h h_y)) Σ_{i=1}^{n} k((x − x_i)/h) ∫_{−∞}^{∞} (y_i + h_y v) k(v) h_y dv
                        = (1/(nh)) Σ_{i=1}^{n} k((x − x_i)/h) y_i.

Finally, we obtain the so-called Nadaraya-Watson estimator,

µ̂(x) = (Σ_{t=1}^{n} y_t k_t)/(Σ_{t=1}^{n} k_t),    k_t ≡ k((x − x_t)/h).
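A minimal sketch of the Nadaraya-Watson estimator with a Gaussian kernel (illustrative names; in practice h is chosen by a rule of thumb or cross-validation):

import numpy as np

def nadaraya_watson(x_grid, x, y, h):
    x, y = np.asarray(x, float), np.asarray(y, float)
    z = (np.asarray(x_grid, float)[:, None] - x[None, :]) / h
    k = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)   # k_t for each grid point
    return (k * y[None, :]).sum(axis=1) / k.sum(axis=1)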

Appendix A

Sample Assignment and Solution

1. Show that the difference between the matrix

(J′W′X)^{−1} J′W′ΩWJ (X′WJ)^{−1}

and the matrix

(X′W(W′ΩW)^{−1}W′X)^{−1}

is a positive semidefinite matrix.

Solution: Our goal is to show that

(J^⊤W^⊤X)^{−1} J^⊤W^⊤ΩWJ (X^⊤WJ)^{−1} − (X^⊤W(W^⊤ΩW)^{−1}W^⊤X)^{−1}  is psd.    (A.1)

Note that

(J^⊤W^⊤X)^{−1} J^⊤W^⊤ΩWJ (X^⊤WJ)^{−1} = [(X^⊤WJ)(J^⊤W^⊤ΩWJ)^{−1}(J^⊤W^⊤X)]^{−1}.

Hence, by ETM Exercise 3.8, proving that (A.1) is psd is equivalent to proving that

X^⊤W(W^⊤ΩW)^{−1}W^⊤X − (X^⊤WJ)(J^⊤W^⊤ΩWJ)^{−1}(J^⊤W^⊤X)  is psd.    (A.2)

Define P_{Ω^{1/2}W} = Ω^{1/2}W(W^⊤ΩW)^{−1}W^⊤(Ω^{1/2})^⊤. Then

X^⊤W(W^⊤ΩW)^{−1}W^⊤X = X^⊤Ω^{−1/2} P_{Ω^{1/2}W} (Ω^{−1/2})^⊤ X.

Similarly, define P_{Ω^{1/2}WJ} = Ω^{1/2}WJ(J^⊤W^⊤ΩWJ)^{−1}J^⊤W^⊤(Ω^{1/2})^⊤. Then

(X^⊤WJ)(J^⊤W^⊤ΩWJ)^{−1}(J^⊤W^⊤X) = X^⊤Ω^{−1/2} P_{Ω^{1/2}WJ} (Ω^{−1/2})^⊤ X.

Hence, the difference in (A.2) can be written as

(X^⊤Ω^{−1/2}) (P_{Ω^{1/2}W} − P_{Ω^{1/2}WJ}) (X^⊤Ω^{−1/2})^⊤.    (A.3)

Note that S(Ω^{1/2}W) ⊇ S(Ω^{1/2}WJ); hence P_{Ω^{1/2}W} P_{Ω^{1/2}WJ} = P_{Ω^{1/2}WJ}, which implies

(P_{Ω^{1/2}W} − P_{Ω^{1/2}WJ})^⊤ (P_{Ω^{1/2}W} − P_{Ω^{1/2}WJ}) = P_{Ω^{1/2}W} − P_{Ω^{1/2}WJ}.

Finally,

(A.3) = (X^⊤Ω^{−1/2}) (P_{Ω^{1/2}W} − P_{Ω^{1/2}WJ})^⊤ (P_{Ω^{1/2}W} − P_{Ω^{1/2}WJ}) (X^⊤Ω^{−1/2})^⊤ = B^⊤B,

where B ≡ (P_{Ω^{1/2}W} − P_{Ω^{1/2}WJ})(X^⊤Ω^{−1/2})^⊤, and any matrix of the form B^⊤B is psd.

Hence, the difference in (A.2), which equals (A.3), is psd, which implies that (A.1) is also psd.


2. Suppose ut follows an AR(1) process

u_t = ρu_{t−1} + ε_t,    ε_t ∼ iid(0, 1),    |ρ| < 1.

(a) Compute Eu_t², Eu_t u_{t−1}, and Eu_t u_{t−2}. Confirm that they do not depend on t.

(b) Define γ(h) = Eu_t u_{t+h}. Find Σ_{h=−∞}^{∞} γ(h).

Solution:

(a) We have

u_t = ρu_{t−1} + ε_t = ε_t + ρε_{t−1} + ρ²ε_{t−2} + ⋯

Since ε_t ∼ iid(0, 1), the ε_t are mutually uncorrelated. Hence

Eu_t² = E(ε_t² + ρ²ε_{t−1}² + ρ⁴ε_{t−2}² + ⋯) + 0 = 1 + ρ² + ρ⁴ + ⋯ = 1/(1 − ρ²).

Similarly,

Eu_t u_{t−1} = E(ρε_{t−1}² + ρ³ε_{t−2}² + ⋯) + 0 = ρ + ρ³ + ρ⁵ + ⋯ = ρ/(1 − ρ²),

and

Eu_t u_{t−2} = E(ρ²ε_{t−2}² + ρ⁴ε_{t−3}² + ⋯) + 0 = ρ² + ρ⁴ + ρ⁶ + ⋯ = ρ²/(1 − ρ²).

Hence, we conclude that they do not depend on t.

(b) Note that Eu_t u_{t+1} = Eu_{t+1} u_t, which means the value of γ(h) is symmetric around h = 0, where we know γ(0) = Eu_t² = 1/(1 − ρ²). Hence

Σ_{h=−∞}^{∞} γ(h) = 2 Σ_{h=1}^{∞} γ(h) + γ(0).

We only need to consider the case where h > 0. Then

u_{t+h} = ε_{t+h} + ρε_{t+h−1} + ⋯ + ρ^h ε_t + ⋯,

so that

Eu_{t+h} u_t = E(ρ^h ε_t² + ρ^{h+2} ε_{t−1}² + ⋯) = ρ^h + ρ^{h+2} + ⋯ = ρ^h/(1 − ρ²) = γ(h).

Hence,

Σ_{h=1}^{∞} γ(h) = (1/(1 − ρ²))(ρ + ρ² + ρ³ + ⋯) = (1/(1 − ρ²)) · ρ/(1 − ρ) = ρ/((1 − ρ²)(1 − ρ)).

Finally,

Σ_{h=−∞}^{∞} γ(h) = 2ρ/((1 − ρ²)(1 − ρ)) + 1/(1 − ρ²) = (1 + ρ)/((1 − ρ²)(1 − ρ)) = 1/(1 − ρ)².

3. Consider the linear regression model

y_t = βx_t + u_t,    t = 1, ..., n,

where u_t is generated by u_t = ρu_{t−1} + ε_t, ε_t ∼ iid(0, 1), |ρ| < 1, β is a scalar, and x_t = 1 for all t. Let β̂_GMM be the efficient GMM estimator using the instrument W = (1, 1, ..., 1)′. Show that β̂_GMM is equal to β̂, the OLS estimator of β.

Solution: We know that the efficient GMM estimator is

β̂_GMM = (X^⊤W(W^⊤Ω_0 W)^{−1}W^⊤X)^{−1} X^⊤W(W^⊤Ω_0 W)^{−1}W^⊤y.

Since W = [1, 1, ..., 1]^⊤, the quantity

W^⊤Ω_0 W = Σ_{i=1}^{n} Σ_{j=1}^{n} (Ω_0)_{ij}    (A.4)

is a scalar; call it a. Moreover, because x_t = 1 for all t, we have X = W, so that X^⊤W = n and

X^⊤W(W^⊤Ω_0 W)^{−1}W^⊤ = n a^{−1} W^⊤ = n a^{−1} X^⊤.    (A.5)

Substituting (A.5) into the expression for β̂_GMM gives

β̂_GMM = (n a^{−1} X^⊤X)^{−1} n a^{−1} X^⊤y = (X^⊤X)^{−1}X^⊤y = β̂_OLS.


4. Show that

E(∂² log f(y_t, θ)/∂θ∂θ′) = −E((∂ log f(y_t, θ)/∂θ) · (∂ log f(y_t, θ)/∂θ′)).

Solution:

∂ log f(y_t, θ)/∂θ = (1/f(y_t, θ)) · ∂f(y_t, θ)/∂θ.

Then we have

∂² log f(y_t, θ)/∂θ∂θ′ = (1/f(y_t, θ)) · ∂²f(y_t, θ)/∂θ∂θ′ − (1/f²(y_t, θ)) · (∂f(y_t, θ)/∂θ)(∂f(y_t, θ)/∂θ′),

so that

∂² log f(y_t, θ)/∂θ∂θ′ + (1/f²(y_t, θ)) · (∂f(y_t, θ)/∂θ)(∂f(y_t, θ)/∂θ′) = (1/f(y_t, θ)) · ∂²f(y_t, θ)/∂θ∂θ′,

that is,

∂² log f(y_t, θ)/∂θ∂θ′ + (∂ log f(y_t, θ)/∂θ)(∂ log f(y_t, θ)/∂θ′) = (1/f(y_t, θ)) · ∂²f(y_t, θ)/∂θ∂θ′.

Hence,

E(∂² log f(y_t, θ)/∂θ∂θ′ + (∂ log f(y_t, θ)/∂θ)(∂ log f(y_t, θ)/∂θ′))
    = E((1/f(y_t, θ)) · ∂²f(y_t, θ)/∂θ∂θ′)
    = ∫ ((1/f(y_t, θ)) · ∂²f(y_t, θ)/∂θ∂θ′) f(y_t, θ) dy_t
    = (∂²/∂θ∂θ′) ∫ f(y_t, θ) dy_t
    = ∂²(1)/∂θ∂θ′ = 0.

Finally,

E(∂² log f(y_t, θ)/∂θ∂θ′ + (∂ log f(y_t, θ)/∂θ)(∂ log f(y_t, θ)/∂θ′)) = 0,

so that

E(∂² log f(y_t, θ)/∂θ∂θ′) = −E((∂ log f(y_t, θ)/∂θ) · (∂ log f(y_t, θ)/∂θ′)).

5. Show that

Var(W(X)) ≥ (d/dθ E W(X))² / E[(∂/∂θ) log f(x|θ)]².

Solution: First,

E((∂/∂θ) log f(x|θ)) = ∫ (1/f(x|θ)) (∂f(x|θ)/∂θ) f(x|θ) dx = (d/dθ) ∫ f(x|θ) dx = 0.

Hence,

E((∂/∂θ) log f(x|θ))² = Var((∂/∂θ) log f(x|θ)).

By the Cauchy–Schwarz inequality,

Var(W(X)) · Var((∂/∂θ) log f(x|θ)) ≥ Cov(W(X), (∂/∂θ) log f(x|θ))².

Note that

Cov(W(X), (∂/∂θ) log f(x|θ)) = ∫ (W(X) − EW(X)) · ((∂/∂θ) log f(x|θ)) · f(x|θ) dx
    = ∫ (W(X) − EW(X)) · (∂/∂θ) f(x|θ) dx
    = ∫ W(X) (∂/∂θ) f(x|θ) dx − EW(X) ∫ (∂/∂θ) f(x|θ) dx
    = (d/dθ) ∫ W(X) f(x|θ) dx − 0
    = (d/dθ) E W(X).

Hence,

Var(W(X)) ≥ (d/dθ E W(X))² / Var((∂/∂θ) log f(x|θ)) = (d/dθ E W(X))² / E[(∂/∂θ) log f(x|θ)]².

6. Consider the linear regression models

y = Xβ + resid

y = Xβ + Zγ + resid

y = Xβ +MXZγ + resid

where y and u are n × 1, X is n × k_1, Z is n × k_2, β is k_1 × 1, and γ is k_2 × 1. Let β̂^{(0)} be the OLS estimate of β from the first regression. Let β̂^{(1)} and β̂^{(2)} be the OLS estimates of β from the second and the third regression, respectively. Let γ̂^{(1)} and γ̂^{(2)} be the OLS estimates of γ from the second and third regression, respectively.

(a) Show that β̂^{(0)} = β̂^{(2)} and γ̂^{(1)} = γ̂^{(2)}.

(b) Show that P_{[X, M_X Z]} y = P_X y + P_{M_X Z} y.

(c) Show that S([X, Z]) = S([X, M_X Z]), and conclude that P_{[X, Z]} y = P_X y + P_{M_X Z} y for any y ∈ R^n.

Solution:


(a) Define M_X Z = H. Then the third regression is

y = Xβ^{(2)} + Hγ^{(2)} + resid.

Using the FWL theorem, we obtain

β̂^{(2)} = (X′M_H X)^{−1}X′M_H y = (X′X − X′P_H X)^{−1}(X′y − X′P_H y).

Note that, from M_X Z = H, we have X′H = X′M_X Z = 0, and hence X′P_H = 0. Then

β̂^{(2)} = (X′X)^{−1}X′y = β̂^{(0)}.

To prove that γ̂^{(1)} = γ̂^{(2)}, we apply the FWL theorem to the second regression and obtain

γ̂^{(1)} = (Z′M_X Z)^{−1}Z′M_X y.

Premultiplying the third regression by M_X, and using M_X M_X Z = M_X Z, we obtain

M_X y = M_X Zγ^{(2)} + resid.

Hence,

γ̂^{(2)} = (Z′M_X Z)^{−1}Z′M_X y = γ̂^{(1)}.

(b) To prove that P_{[X, M_X Z]} y = P_X y + P_{M_X Z} y, it is sufficient to prove that P_{[X, M_X Z]} = P_X + P_{M_X Z}. That means we need to show that

i. S([X, M_X Z]) is invariant under the action of P_X + P_{M_X Z};

ii. S⊥([X, M_X Z]) is annihilated by P_X + P_{M_X Z}.

To prove (i), note that any vector in S([X, M_X Z]) can be expressed as [X, M_X Z]γ for some (k_1 + k_2)-vector γ. Then

(P_X + P_{M_X Z})[X, M_X Z]γ = [(P_X + P_{M_X Z})X, (P_X + P_{M_X Z})M_X Z]γ = [X + 0, 0 + M_X Z]γ = [X, M_X Z]γ.

Note that P_{M_X Z} X = 0, since X and M_X Z are orthogonal, and that P_X M_X Z = 0. Hence, S([X, M_X Z]) is invariant under the action of P_X + P_{M_X Z}.

To prove (ii), let w be any element of S⊥([X, M_X Z]). Then

[X, M_X Z]′w = 0,

which implies X′w = 0 and Z′M_X w = 0. To finish the proof, we premultiply w by P_X + P_{M_X Z} and obtain

(P_X + P_{M_X Z})w = X(X′X)^{−1}X′w + M_X Z(Z′M_X Z)^{−1}Z′M_X w = X(X′X)^{−1} · 0 + M_X Z(Z′M_X Z)^{−1} · 0 = 0.

Hence, S⊥([X, M_X Z]) is annihilated by P_X + P_{M_X Z}.

Finally, we obtain P_{[X, M_X Z]} = P_X + P_{M_X Z}, and therefore

P_{[X, M_X Z]} y = P_X y + P_{M_X Z} y.


(c) To prove that P_{[X, Z]} y = P_X y + P_{M_X Z} y, we can use the result from part (b). We only need to show that S([X, Z]) is the same subspace as S([X, M_X Z]).

Let a be any element of S([X, Z]). Then a can always be expressed as

a = [X, Z]γ = Xγ_1 + Zγ_2

for some k_1 × 1 vector γ_1 and some k_2 × 1 vector γ_2. We rewrite Zγ_2 as M_X Zγ_2 + P_X Zγ_2. Note that S(P_X Z) ⊆ S(X); hence the element P_X Zγ_2 of S(P_X Z) can always be expressed as Xγ̃ for some k_1 × 1 vector γ̃. Then we obtain

a = Xγ_1 + M_X Zγ_2 + P_X Zγ_2 = Xγ_1 + M_X Zγ_2 + Xγ̃ = X(γ_1 + γ̃) + M_X Zγ_2 = [X, M_X Z] [γ_1 + γ̃; γ_2].

Hence a ∈ S([X, M_X Z]), which implies S([X, Z]) ⊆ S([X, M_X Z]).

Let b be any element of S([X, M_X Z]). Then b can always be expressed as

b = [X, M_X Z]δ = Xδ_1 + M_X Zδ_2

for some k_1 × 1 vector δ_1 and some k_2 × 1 vector δ_2. We rewrite M_X Zδ_2 as Zδ_2 − P_X Zδ_2. Note that S(P_X Z) ⊆ S(X); hence the element P_X Zδ_2 of S(P_X Z) can always be expressed as Xδ̃ for some k_1 × 1 vector δ̃. Then we obtain

b = Xδ_1 + Zδ_2 − P_X Zδ_2 = Xδ_1 + Zδ_2 − Xδ̃ = X(δ_1 − δ̃) + Zδ_2 = [X, Z] [δ_1 − δ̃; δ_2].

Hence b ∈ S([X, Z]), which implies S([X, M_X Z]) ⊆ S([X, Z]). Together with S([X, Z]) ⊆ S([X, M_X Z]), this yields

S([X, Z]) = S([X, M_X Z]).

Recall that in part (b) we already proved that

i. S([X, M_X Z]) is invariant under the action of P_X + P_{M_X Z};

ii. S⊥([X, M_X Z]) is annihilated by P_X + P_{M_X Z}.

Since S([X, Z]) = S([X, M_X Z]), it follows that

i. S([X, Z]) is invariant under the action of P_X + P_{M_X Z};

ii. S⊥([X, Z]) is annihilated by P_X + P_{M_X Z}.

This implies P_{[X, Z]} = P_X + P_{M_X Z}, and hence P_{[X, Z]} y = P_X y + P_{M_X Z} y.


7. The file tbrate.data contains data for 1950:1 to 1996:4 on three series: r_t, the interest rate on 90-day treasury bills, π_t, the rate of inflation, and y_t, the logarithm of real GDP. For the period 1950:4 to 1996:4, run the regression

∆r_t = β_1 + β_2 π_{t−1} + β_3 ∆y_{t−1} + β_4 ∆r_{t−1} + β_5 ∆r_{t−2} + u_t,

where ∆ is the first-difference operator, defined so that ∆r_t = r_t − r_{t−1}. Plot the residuals and fitted values against time. Then regress the residuals on the fitted values and on a constant. What do you learn from this second regression? Now regress the fitted values on the residuals and on a constant. What do you learn from this third regression?

