
Master in Financial Engineering

Econometrics

Prof. Loriano Mancini

Swiss Finance Institute at EPFL

First semester

Slides version: September 2011

Information about the course

• Material: slides, exercises, data, etc., at

http://sfi.epfl.ch/mfe → “Courses online” → “Mancini E”

Username: StudMFE

Password: Fall2011

Also, register at http://is-academia.epfl.ch

• Book: “Econometric Analysis”, sixth edition, W. Greene, Prentice Hall, 2008

• Assignments: each week, due the following Monday, groups of at most 3 persons

• Exams: written, closed book, closed notes; one A4 page of hand-written notes allowed

• Grade: 30% homework, 30% midterm, 40% final exam

• Assistants: Benjamin Junge (E-mail: benjamin.junge@epfl.ch)

Emmanuel Leclercq (E-mail: emmanuel.leclercq@epfl.ch)

1

Information about the course

• Exercise sessions start on October 3rd

i.e. no exercise session on September 26th

• Prerequisites: W. Greene “Econometric Analysis” book

Appendix A on matrix algebra

Appendix B on probability and distributions

Appendix D on Laws of Large Numbers and Central Limit Theorems

2

Agenda of the course

• Linear regression model

• Generalized regression model

• Panel data model

• Instrumental variables

• Generalized method of moments

• Maximum likelihood estimation

• Hypothesis testing

3

Chapter 2: Econometric model

Econometrics: intersection of Economics and Statistics

Econometric model = association between yi and xi

E.g.: stock return yi (IBM) and market return xi (S&P 500 index)

Econometric model provides “approximate” description of the association

The relation will be stochastic and not deterministic

Econometric model provides probabilistic description of the association

Model: yi = f(xi) + εi

4

Linear regression model

yi = f(xi1, . . . , xiK) + εi = xi1 β1 + · · · + xiK βK + εi

yi: dependent or explained variable

xi: regressors or covariates or explanatory variables

εi: error term or random disturbance

Each observation in a sample yi, xi1, . . . , xiK , i = 1, . . . , n, comes from

yi = xi1 β1 + · · · + xiK βK + εi

where xi1 β1 + · · · + xiK βK is the "deterministic" part and εi is the random part

Goal: estimate β1, . . . , βK

5

Assumptions of the linear regression model

Assumptions on the data generating process

1. Linearity: linear relationship between yi and xi1, . . . , xiK

2. Full rank: X = [x1, . . . , xK] is an n×K matrix with rank K

3. Exogeneity of the independent variables: E[εi|xj1, . . . , xjK] = 0, ∀i, j

4. Homoscedasticity and nonautocorrelation: Var[εi|X] = σ2, i = 1, . . . , n, and

Cov[εi, εj|X] = 0, ∀i ≠ j

5. Data generation: X can include constants and random variables

6. Normal distribution: ε|X ∼ N(0, σ2I)

Assumptions 4 and 6 simplify life but are too restrictive and will be relaxed

6

Linearity of the regression model

The same linear model holds for all n observations (yi, xi1, . . . , xiK), i = 1, . . . , n

y = x1β1 + · · ·+ xKβK + ε = Xβ + ε

Notation: y is an n× 1 vector; X = [x1, . . . , xK] is an n×K matrix;

ε is an n× 1 vector; β is a K × 1 vector

In the design matrix X: columns are variables, rows are observations

E.g. for the i-th observation: yi = x′i β + εi

Remark: we are modeling E[y|X] = Xβ, as E[ε|X] = 0 by assumption

Linearity refers to β and ε, not X

E.g. g(yi) = β h(xi) + εi is a linear model for any functions g and h

7

Error term ε

By assumption E[ε|X] = 0 =⇒ E[ε] = 0

Note: εi does not depend on any xj, neither past nor future xs

Let X̄ = E[X]. By the "tower property" or "law of iterated expectations"

Cov[ε, X] = E[ε(X − X̄)] = EX[ E[ε(X − X̄)|X] ] = EX[ E[ε|X] (X − X̄) ] = 0, since E[ε|X] = 0

E[ε|X] = 0 implies E[y|X] = Xβ, i.e. Xβ is the conditional mean of y|X

Our analysis is conditional on design matrix X which can be stochastic

8

Spherical error term ε

Assumptions:

Homoscedasticity Var[εi|X] = σ2, i = 1, . . . , n

Nonautocorrelation Cov[εi, εj|X] = 0, ∀i 6= j

In short: E[ε ε′|X] = σ2I

9

Data generating process for the regressors

X may include constants and random variables

“Golden rule”: include a column of 1s in X

Crucial assumption: ε ⊥ X

10

Chapter 3: Least squares

Regression model: yi = x′i β + εi

Goal: statistical inference on β, e.g. estimate β

Population quantities, not observed: E[yi|xi] = x′i β, β, εi

Sample quantities, estimated from sample data: ŷi = x′i b, b, ei

11

Least squares estimator

Let b be the least squares estimator:

b = arg min_{b0} S(b0), where

S(b0) := ∑_{i=1}^n (yi − x′i b0)² = ∑_{i=1}^n e²i0 = e′0 e0 = (y − X b0)′(y − X b0)

= y′y − 2 y′X b0 + b′0 X′X b0

12

Least squares estimator: normal equations

Necessary condition for a minimum:

∂S(b0)/∂b0 = ∂(y′y − 2 y′X b0 + b′0 X′X b0)/∂b0 = −2X′y + 2X′X b0 = 0

Let b be the solution, normal equations:

X ′Xb = X ′y

By assumption X has full column rank,

b = (X ′X)−1X ′y

Since X has full column rank, the following matrix is positive definite

∂²S(b0)/∂b0 ∂b′0 = 2X′X

13

Example: regression with simulated data

DGP: yi = x′i β + εi, with x′i = [1 x2i], x2i ∼ U[0, 1], εi ∼ N(0, 1)

i = 1, . . . , 100, β = [1 2]′; in this sample b = [1.01 2.07]′

[Figure: scatter of (x2i, yi) with the true and the estimated regression lines; x2i = Uniform[0,1] on the horizontal axis, yi = x′i β + εi on the vertical axis]
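A minimal Python/numpy sketch of this simulation (parameter values as on the slide; all variable names are illustrative, not part of the course material): draw the sample, build the design matrix, and solve the normal equations for b. Later sketches in these notes reuse the arrays defined here.

import numpy as np

rng = np.random.default_rng(0)             # seed only for reproducibility
n = 100
beta = np.array([1.0, 2.0])
x2 = rng.uniform(0.0, 1.0, size=n)         # x2i ~ U[0,1]
X = np.column_stack([np.ones(n), x2])      # column of 1s plus x2
y = X @ beta + rng.normal(0.0, 1.0, size=n)    # eps_i ~ N(0,1)

b = np.linalg.solve(X.T @ X, X.T @ y)      # b = (X'X)^(-1) X'y from the normal equations
print(b)                                   # typically close to [1, 2]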

14

Algebraic aspects of the least squares solution

“Golden rule”: include a column of 1s in X

Normal equations: 0 = X ′y −X ′Xb = X ′(y −Xb) = X ′e

First column of X, x1 = 1s, then the first normal equation is

0 = x′1 e = [1 · · · 1] e = ∑_{i=1}^n ei = ∑_{i=1}^n (yi − x′i b)

Implications:

1. Least squares residuals have zero mean, ē = 0

2. The estimated regression line passes through the means of the data, ȳ = x̄′b

3. Mean of fitted values equals mean of actual data: the average of ŷi equals ȳ

None of these implications holds if X does not include a column of 1s

15

Projection

Estimated residuals e = y −Xb, LS estimator b = (X ′X)−1X ′y

e = y −Xb = y −X(X ′X)−1X ′y = (I −X(X ′X)−1X ′)y = My

M is called the residual maker matrix, as My = e

As MX = 0, e′ŷ = y′M′Xb = 0: LS partitions y into two orthogonal parts

y = Xb + e = ŷ + e

P is called the projection matrix

ŷ = y − e = (I − M)y = X(X′X)−1X′y = Py

16

Properties of M and P matrices

M and P are symmetric, idempotent and orthogonal (PM = MP = 0)

Orthogonal decomposition of y

y = Xb + e = Py + My = projection + residual

Pythagorean theorem:

y′y = y′P′Py + y′M′My = ŷ′ŷ + e′e

17

Partitioned regression

E.g. regression model: income = β0 + β1 age + β2 education + error

Goal: study income–education association; age is a control variable

Model: y = Xβ + ε = X1β1 + X2β2 + ε, solve normal equations for b2

b2 = (X′2 M1 X2)−1 X′2 M1 y

= (X′2 M′1 M1 X2)−1 X′2 M′1 M1 y

= (X∗2′ X∗2)−1 X∗2′ y∗

M1 is the residual maker matrix based on columns of X1

b2 is obtained regressing y∗ on X∗2

y∗ (resp. X∗2) are residuals from regression of y (resp. X2) on X1

18

Partial correlation

When education ↑ ⇒ income ↑, but education and age both ↑ in time

What is the net effect of education on income?

Partial correlation, r∗yz:

y∗ = residuals in a regression of income on a constant and age

z∗ = residuals in a regression of education on a constant and age

r∗yz = simple correlation between y∗ and z∗

19

Goodness of fit

Goal: measure how much variation in y is explained by variation in x

Suppose ȳ = x̄ = 0. Recall ŷi ⊥ ei, for each observation i

yi = ŷi + ei

∑_{i=1}^n y²i = ∑_{i=1}^n ŷ²i + ∑_{i=1}^n e²i

SST = SSR + SSE

Good regression model: SST ≈ SSR, hence SSE ≈ 0

20

Goodness of fit (when means ≠ 0)

When ȳ, x̄ ≠ 0, consider deviations from the means

yi − ȳ = ŷi − ȳ + ei = (x′i − x̄′) b + ei

∑_{i=1}^n (yi − ȳ)² = ∑_{i=1}^n ((x′i − x̄′) b)² + ∑_{i=1}^n e²i

Define M0 = [I − ii′/n], n × n, symmetric, idempotent; i′ = [1 · · · 1]

M0 transforms observations into deviations from sample means, M0 y = y − i ȳ

y′M0′M0 y = b′X′M0′M0 X b + e′e

y′M0 y = b′X′M0 X b + e′e

SST = SSR + SSE

21

Coefficient of determination, R2

R² = SSR/SST = (b′X′M0Xb) / (y′M0y) = 1 − e′e/(y′M0y)

Properties:

R2 measures the linear association between X and y

0 ≤ R2 ≤ 1, as 0 ≤ SSR ≤ SST

R2 ↑ when a regressor is added, from X = [x1 · · ·xK] to X = [x1 · · ·xK+1]

Adjusted R² = 1 − [e′e/(n−K)] / [y′M0y/(n−1)]

Remark: X should include a column of 1s ⇒ M0e = e and e ⊥ X
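A short numpy sketch (reusing X, y, b from the simulation sketch above; names are illustrative) computing R² and the adjusted R² exactly as defined on this slide.

e = y - X @ b                                   # least squares residuals
sse = e @ e                                     # e'e
sst = ((y - y.mean()) ** 2).sum()               # y'M0y
r2 = 1.0 - sse / sst
n_obs, K = X.shape
adj_r2 = 1.0 - (sse / (n_obs - K)) / (sst / (n_obs - 1))
print(r2, adj_r2)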

22

Chapter 4: Statistical properties of LS estimators

LS estimator enjoys various good statistical properties:

1. Easy to compute

2. Explicit use of model assumptions

3. Optimal linear predictor

4. Most efficient, under certain conditions

23

Orthogonality conditions

Assumptions: X stochastic or not, linear model, E[εi|X] = 0 =⇒

E[xi εi] = Ex[E[xi εi|X]] = Ex[xiE[εi|X]] = 0 = Ex[xiE[(yi − x′iβ)|X]]

which implies the population orthogonality conditions:

EX[E[xi yi|X]] = EX[E[xi x′i β|X]]

E[xi yi] = E[xi x′i] β

LS normal equations are sample counterpart of orthogonality conditions:

X ′y = X ′X b

(1/n) ∑_{i=1}^n xi yi = (1/n) ∑_{i=1}^n xi x′i b

24

Optimal linear predictor

Goal: find linear function of xi, x′iγ, that minimizes MSE

MSE = E[(yi − x′iγ)2]

= E[(yi − E[yi|X] + E[yi|X]− x′iγ)2]

= E[(yi − E[yi|X])2] + E[(E[yi|X]− x′iγ)2]

min_γ MSE = min_γ E[(E[yi|X] − x′iγ)²]

0 = −2E[xi(E[yi|X]− x′iγ)]

E[xi yi] = E[xi x′i]γ

which are the LS normal equations

Implicit assumption: all these expectations exist, i.e. E[·] <∞

25

Unbiased estimation

LS estimator is unbiased in every sample:

b = (X ′X)−1X ′y = (X ′X)−1X ′(Xβ + ε) = β + (X ′X)−1X ′ε

Using law of iterated expectations, and assumption E[ε|X] = 0

E[b] = EX[E[β + (X ′X)−1X ′ε|X]]

= β + EX[(X ′X)−1X ′E[ε|X]]

= β

26

Monte Carlo simulation: b2 slope estimates

DGP: yi = x′i β + εi, with x′i = [1 x2i], x2i ∼ U[0, 1], εi ∼ N(0, 1)

i = 1, . . . , 100, β = [1 2]′; repeat the simulation and estimation 1,000 times

[Figure: histogram (frequency) of the 1,000 simulated slope estimates b2]
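A self-contained numpy sketch of this Monte Carlo exercise (assumed DGP as on the slide, 1,000 replications; names are illustrative), collecting the slope estimates b2 whose histogram the figure shows.

import numpy as np

rng = np.random.default_rng(1)
n_obs, n_rep = 100, 1000
beta = np.array([1.0, 2.0])
b2_draws = np.empty(n_rep)

for r in range(n_rep):
    x2_r = rng.uniform(0.0, 1.0, size=n_obs)
    X_r = np.column_stack([np.ones(n_obs), x2_r])
    y_r = X_r @ beta + rng.normal(0.0, 1.0, size=n_obs)
    b_r = np.linalg.solve(X_r.T @ X_r, X_r.T @ y_r)
    b2_draws[r] = b_r[1]

print(b2_draws.mean(), b2_draws.std())     # mean close to 2, illustrating unbiasedness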

27

Variance of LS estimator

LS estimator is linear in ε: b = β + (X ′X)−1X ′ε

Easy to derive variance of linear estimator:

Var[b|X] = E[(b− β)(b− β)′|X]

= E[(X ′X)−1X ′ε ε′X(X ′X)−1|X]

= (X ′X)−1X ′E[ε ε′|X]X(X ′X)−1

= (X ′X)−1X ′(σ2I)X(X ′X)−1

= σ2(X ′X)−1

Note: assumption of spherical errors, Var[ε|X] = σ2I, is crucial

28

Gauss–Markov theorem

Any linear unbiased estimator b0 = Cy, where C is a K × n matrix

Unbiasedness: E[Cy|X] = E[CXβ + Cε|X] = β ⇒ CX = I

Define C = D + (X′X)−1X′ ⇒ CX = I = DX + (X′X)−1X′X = DX + I ⇒ DX = 0

Var[b0|X] = CVar[y|X]C ′ = CVar[ε|X]C ′ = σ2CC ′

= σ2(D + (X ′X)−1X ′)(D + (X ′X)−1X ′)′

= σ2(D + (X ′X)−1X ′)(D′ + X(X ′X)−1)

= σ2DD′ + σ2(X ′X)−1 = σ2DD′ + Var[b|X]

= Var[b|X] + nonnegative definite matrix

LS estimator is BLUE (when X is constant and/or stochastic)

29

Estimating the variance of LS estimator

Estimate σ2 in Var[b|X] = σ2(X ′X)−1 ⇒ use ei sample analog of εi

But ei = yi − x′i b = εi − x′i (b − β) is an "imperfect estimate" of εi

Sample residual: e = My = M(Xβ + ε) = Mε, as MX = 0

E[e′e|X] = E[ε′Mε|X] = E[tr(ε′Mε)|X] = E[tr(Mεε′)|X]

= tr(ME[εε′|X]) = tr(M)σ2

= tr(I −X(X ′X)−1X ′)σ2 = (tr(I)− tr(X ′X(X ′X)−1))σ2

= (tr(In)− tr(IK))σ2 = (n−K)σ2

Unbiased estimator of σ2 (conditionally on X and unconditionally):

s² = e′e/(n−K) = ∑_{i=1}^n e²i / (n−K)
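A numpy sketch (reusing X, y, b from the first simulation sketch; illustrative names) of the unbiased variance estimator s² and the estimated covariance matrix s²(X′X)−1 with its standard errors.

e = y - X @ b
n_obs, K = X.shape
s2 = (e @ e) / (n_obs - K)                  # unbiased estimator of sigma^2
cov_b = s2 * np.linalg.inv(X.T @ X)         # estimated Var[b|X]
std_err = np.sqrt(np.diag(cov_b))           # standard errors of b
print(s2, std_err)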

30

Normality of LS estimator

Assumption ε|X ∼ N(0, σ2I), and linearity b = β + (X ′X)−1X ′ε⇒

joint normality of b (multivariate normal distribution)

b|X ∼ N(β, σ2(X ′X)−1)

and each slope, bk, is normally distributed

bk|X ∼ N(βk, σ²[(X′X)−1]kk)

Note: exact distribution in finite samples

31

Distribution of b2 slope estimates: simulation

DGP: yi = [1 x2i] [1 2]′ + εi, with εi ∼ N(0, 1), 1,000 estimations

Comparison between simulated and true normal density of b2

[Figure: simulated density of b2 from the 1,000 estimations compared with the true normal density]

32

Hypothesis testing on a coefficient

As b|X ∼ N(β, σ2(X ′X)−1)

(bk − βk) / √(σ²[(X′X)−1]kk) ∼ N(0, 1)

Unfortunately σ2 is not known but estimated via s2. Useful statistic:

[(bk − βk)/√(σ²[(X′X)−1]kk)] / √([e′e/σ²]/(n−K)) ∼ N(0,1) / √(χ²(n−K)/(n−K)) ∼ t-Student(n−K)

Note: σ2 is unknown but cancels in the ratio above

Need to show: e′e/σ² ∼ χ²(n−K) and e′e independent of bk

33

χ2 distribution of e′e

Recall: M is residual maker matrix, e = My = Mε as MX = 0

As ε|X ∼ N(0, σ2I)⇒ ε/σ|X ∼ N(0, I)

e′e/σ² = (ε′/σ) M (ε/σ)

which is an idempotent quadratic form in ε/σ, and by Appendix B.11.4

(ε′/σ) M (ε/σ) ∼ χ²(rank(M))

where rank(M) = tr(M) = n − K

34

Independence of b and e′e

To show independence between

(b − β)/σ = (X′X)−1X′ (ε/σ) = L (ε/σ) ∼ N(0, LL′)

and

e′e/σ² = (ε′/σ) M′M (ε/σ)

it suffices to show that LM = 0, because this implies, conditional on X,

Cov( L ε/σ, M ε/σ ) = E[ L (ε/σ)(M ε/σ)′ ] = E[ L (ε/σ)(ε′/σ) M′ ] = L (σ²I/σ²) M′ = LM

= (X′X)−1X′ (I − X(X′X)−1X′) = 0

which implies independence as ε|X ∼ N

35

Significance of a coefficient: t-statistic

Common test H0 : βk = 0

tk = t-statistic = (bk − 0) / √([(X′X)−1]kk [e′e]/(n−K)) = bk / √(s²[(X′X)−1]kk) ∼ t-Student(n−K)
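A sketch of the t-statistic for H0: βk = 0 and a 95% confidence interval (reusing b, std_err, n_obs, K from the previous sketch; scipy is assumed available for the t quantiles).

from scipy import stats

t_stats = b / std_err                               # t-statistic for H0: beta_k = 0
p_vals = 2 * stats.t.sf(np.abs(t_stats), df=n_obs - K)
t_crit = stats.t.ppf(0.975, df=n_obs - K)           # two-sided 5% critical value
print(t_stats, p_vals)
print(b - t_crit * std_err, b + t_crit * std_err)   # 95% confidence intervals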

36

Example: Significance of a coefficient

True β = [1 2]′, estimate b = [1.01 2.07]′, n = 100, K = 2

Is b2 statistically different from zero?

[Figure: left panel, scatter of (x2i, yi) with true and estimated regression lines; right panel, density of b/√(s²(X′X)−1) compared with the t-Student and normal distributions]

37

Confidence intervals for parameters

Point estimates are useless without confidence intervals or standard errors

Use t-Student distribution

[(bk − βk)/√(σ²[(X′X)−1]kk)] / √([e′e/σ²]/(n−K)) ∼ t-Student(n−K)

to set confidence intervals:

Pr( bk − tα/2 s_bk ≤ βk ≤ bk + tα/2 s_bk ) = 1 − α

where tα/2 is the t-Student quantile, e.g. α = 0.05, and s_bk = √(s²[(X′X)−1]kk)

38

Significance of the regression

Common test H0 : β2 = · · · = βK = 0 (except intercept)

or equivalently H0 : R2 = 0

F -test statistic:

F[K − 1, n−K] = [R²/(K−1)] / [(1−R²)/(n−K)] ∼ [χ²(K−1)/(K−1)] / [χ²(n−K)/(n−K)]

R2 ≈ 1 ⇒ large F ⇒ reject H0

39

Marginal distribution of test statistics

Under H0 : βk = β0k, and conditionally on X, bk|X ∼ N(β0k, σ²[(X′X)−1]kk)

Unconditionally, bk ∼ ?, hard to find, depends on distribution of X

Key property: t-statistic

tk = t|X = (bk − β0k) / √(s²[(X′X)−1]kk) ∼ t-Student(n−K)

but t-Student(n−K) does not depend on X ⇒ unconditionally

tk ∼ t-Student(n−K)

40

Multicollinearity

Multicollinearity = variables in X are linearly dependent ⇒ X ′X is singular

In practice, variables in X are often close to being linearly dependent

“Symptoms” of multicollinearity:

• Small changes in data produce large changes in b

• Var[b|X] very large (⇒ t-statistic close to zero) but R2 is high

• Coefficient estimates with the "wrong" sign or implausible magnitudes

41

Multicollinearity: analysis

Demeaned variables, X = [X(k) xk], where xk (n× 1) is the k-th variable

Use Appendix A.5.3 on inverse of partitioned matrix:

Var[bk|X] = σ²[(X′X)−1]kk

= σ² ( x′k xk − x′k X(k)(X′(k)X(k))−1 X′(k) xk )−1

= σ² ( x′k xk [ 1 − x′k X(k)(X′(k)X(k))−1 X′(k) xk / (x′k xk) ] )−1

= σ² ( x′k xk [ 1 − x′k P(k) xk / (x′k xk) ] )−1

= σ² ( x′k xk [ 1 − x̂′k x̂k / (x′k xk) ] )−1

= σ² ( x′k xk [ 1 − R²k. ] )−1

= σ² / ( x′k xk [ 1 − R²k. ] )

42

Multicollinearity: interpretation

Hence, as column variables in X are demeaned,

Var[bk|X] = σ² / ( x′k xk [1 − R²k.] ) = σ² / ( ∑_{i=1}^n (xik − x̄k)² [1 − R²k.] )

where R2k. is R2 from regression of xk on X(k) (i.e. X \ xk)

Var[bk|X] ↑ when

• R2k. → 1, i.e. multicollinearity

• ∑_{i=1}^n (xik − x̄k)² → 0

• σ2 ↑, i.e. ↑ dispersion of yi around regression line

43

Large sample properties of LS estimator

ε|X ∼ N is a strong assumption and can be relaxed, but now

Assumption 5a (DGP of X):

• (xi, εi) i = 1, . . . , n, sequence of independent observations

• plimX ′X/n = Q positive definite matrix

Notation: plim = probability limit, i.e. convergence in probability

plim Zn = Z stands for

lim_{n→∞} Pr(|Zn − Z| > ε) = 0, ∀ε > 0

where Z can be either random or constant

44

Consistency of LS estimator

Consistency means plim b = β

Highly desirable property of any estimator

Recall: distribution of ε|X is unknown

b = β + (X′X/n)−1 (X′ε/n)

plim b = β + plim (X′X/n)−1 plim (X′ε/n) = β + Q−1 plim (X′ε/n)

If plim X′ε/n = 0, then b is consistent

45

Random term X ′ε/n

E[X′ε/n] = EX[ E[X′ε/n | X] ] = (1/n) ∑_{i=1}^n EX[ xi E[εi|X] ] = 0, since E[εi|X] = 0

Var[X′ε/n] = E[ Var[X′ε/n | X] ] + Var[ E[X′ε/n | X] ] = E[ (1/n²) X′E[εε′|X]X ] = (σ²/n) E[X′X/n] = (σ²/n) Q

(the second term is zero since E[X′ε/n | X] = 0)

As E[X′ε/n] = 0 and lim_{n→∞} Var[X′ε/n] = 0,

X′ε/n →m.s. 0 ⇒ plim X′ε/n = 0

Remark: Var[X ′ε/n] decays as 1/n

46

Example: convergence of X ′ε/n

xi ∼ U[−0.5, 0.5], σ² = 2, hence Var[X′ε/n] = (σ²/n) E[∑_{i=1}^n x²i / n]

[Figure: simulated densities of X′ε/n for n = 10, 50, 100, concentrating around zero as n grows]

47

Asymptotic distribution of LS estimator

Key idea: stabilize the distribution of X ′ε/n

Recall: Var[X ′ε/n] decays as 1/n

Var[√n X′ε/n] = n Var[X′ε/n] ∈ O(1)

√n (b − β) = (X′X/n)−1 √n X′ε/n −→ Q−1 × asymptotic distribution of √n X′ε/n

48

Random term √n X′ε/n

Recall: E[√n X′ε/n] = 0; (xi, εi) independent; regressors well behaved

Var[√n X′ε/n] = (1/n) Var[X′ε] = (1/n) Var[∑_{i=1}^n xi εi]

= (1/n) ∑_{i=1}^n Var[xi εi] = (1/n) ∑_{i=1}^n σ² E[xi x′i] = σ² Q

By the Central Limit Theorem: √n X′ε/n →d N(0, σ²Q)

√n (b − β) →d Q−1 × N(0, σ²Q) =d N(0, σ²Q−1)

b ∼a N(β, (σ²/n) Q−1)

49

Asymptotic normality of LS estimator

If regressors well behaved and observations independent, then

asymptotic normality of LS estimator follows from CLT, not ε|X ∼ N

In practice, in

b ∼a N(β, (σ²/n) Q−1)

Q is estimated by X ′X/n

σ2 is estimated by s2 = e′e/(n−K) (as plim s2 = σ2)

If ε|X ∼ N(0, σ2I), then b ∼ N(β, σ2(X ′X)−1) for every sample size n

50

Asymptotic dist. of nonlinear function: Delta method

f(b): J possibly nonlinear C1 functions

∂f(b)/∂b′ =: C(b)   (J × K)

Goal: find asymptotic distribution of f(b)

Slutsky theorem: plim f(b) = f(plim b) = f(β), and plimC(b) = C(β)

First order Taylor expansion (remainder negligible if plim b = β)

f(b) = f(β) + C(β)× (b− β) + remainder

f(b) ∼a N( f(β), C(β) (σ²/n) Q−1 C(β)′ )

51

t-Statistic: remark

To test H0 : βk = 0, t-statistic tk = bk / √(s²[(X′X)−1]kk)

If in finite sample ε|X ∼ N , then tk ∼ t-Student(n−K)

If only asymptotically ε|X ∼ N (not in finite sample), then tk ∼ N(0, 1)

[Figure: density of the t-statistic with n = 10, compared with the normal and t-Student distributions]

52

Missing observations

Common issue in applied work

• Missing at random: least serious case, just discard those observations, sample

size reduced

• Not missing at random: most difficult case, selection bias, mechanism should

be studied

Read Chapter 4.8.2

53

Chapter 5: Inference

Goal: test implications of economic theory

Example: unrestricted model of investment, It,

ln It = β1 + β2 it + β3∆pt + β4 lnYt + β5 t + εt

where it nominal interest rate, ∆pt inflation rate, Yt real output

H0 : “investors care only about real interest rate, (it −∆pt)”

⇒ restricted (or nested) model of investment:

ln It = β1 + β2(it −∆pt) + β4 lnYt + β5 t + εt

⇒ β3 = −β2 ⇒ β2 + β3 = 0, in the unrestricted model

54

Linear restrictions

In the linear regression model, y = Xβ + ε, consider J linear restrictions

Rβ = q

R is J ×K and usually J ≪ K

Example: β = (β1 β2 β3 β4)′

1. H0 : β2 = 0 tested with R = (0 1 0 0) and q = 0

2. H0 : β2 = β3 = β4 = 0 tested with

R = [ 0 1 0 0
      0 0 1 0
      0 0 0 1 ]   and q = (0 0 0)′

55

Two approaches to testing hypothesis

1. Fit unrestricted model and check whether estimates satisfy restrictions

2. Fit restricted model and check loss of fit (in terms of R2)

The two approaches are equivalent in the linear regression model

Working assumption: ε|X ∼ N(0, σ2I) (to be relaxed)

56

Approach 1: discrepancy vector

Null hypothesis: J linear restrictions, R is J ×K

H0 : Rβ − q = 0

Alternative hypothesis:

H1 : Rβ − q ≠ 0

Discrepancy vector, m = Rb− q, will not be exactly zero (most likely)

Decide whether m is not exactly zero because of

(a) sampling variability (do not reject H0)

(b) or restrictions are not satisfied by the data (reject H0)

57

Wald criterion

Under H0 : Rβ − q = 0, discrepancy vector m = Rb− q

E[m|X] = RE[b|X]− q = Rβ − q = 0

Var[m|X] = Var[Rb− q|X] = RVar[b|X]R′ = σ2R(X ′X)−1R′

Recall, as ε|X ∼ N(0, σ2I) by assumption, b|X ∼ N(β, σ2(X ′X)−1)

=⇒ m|X ∼ N(0, σ2R(X ′X)−1R′)

Wald statistic:

W = m′ (Var[m|X])−1 m = (Rb − q)′ (σ²R(X′X)−1R′)−1 (Rb − q) ∼ χ²(J)

χ2 distribution ⇐ Full Rank Gaussian Quadratic form, Appendix B.11.6

58

Wald statistic feasible and F -statistic

In the Wald statistic, need to get rid of unknown σ2

F = [ (Rb − q)′ (σ²R(X′X)−1R′)−1 (Rb − q)/J ] / [ (e′e/σ²)/(n−K) ] ∼ [χ²(J)/J] / [χ²(n−K)/(n−K)] ∼ F(J, n−K)

• Numerator: under H0, (Rb− q)/σ = R(b− β)/σ = R(X ′X)−1X ′ε/σ

i.e. standardized Gaussian quadratic form in R(X ′X)−1X ′ε/σ ⇒ χ2(J)

• Denominator: standardized Gaussian quadratic form in Mε/σ ⇒ χ2(n−K)

As MX = 0, Cov(R(X′X)−1X′ε/σ, Mε/σ) = 0 ⇒ numerator and denominator are independent

59

Hypothesis testing on a single coefficient

H0 : βk = β0 can be tested with “t-statistic”

t := [ (bk − β0) / √(σ²[(X′X)−1]kk) ] / √([e′e/σ²]/(n−K)) ∼ N(0,1)/√(χ²(n−K)/(n−K)) ∼ t-Student(n−K)

or with linear restriction R = (0 · · · 0 1 0 · · · 0) and q = β0

F = [ (bk − β0) (σ²[(X′X)−1]kk)−1 (bk − β0) ] / [ (e′e/σ²)/(n−K) ] ∼ χ²(1) / [χ²(n−K)/(n−K)] ∼ F(1, n−K)

As t2 = F the two tests are equivalent

60

Approach 2: restricted least squares

Fit of restricted model cannot be better than unrestricted model

Restricted LS:

b∗ = arg min_{b0} (y − Xb0)′(y − Xb0)  subject to  Rb0 = q

= b − (X′X)−1R′ [R(X′X)−1R′]−1 (Rb − q)

(b∗ − b) = −(X′X)−1R′ [R(X′X)−1R′]−1 (Rb − q)

e∗ residuals from restricted model. Loss of fit due to constraints:

e∗ = y −Xb∗ = y −Xb−X(b∗ − b) = e−X(b∗ − b)

e′∗e∗ = e′e + (b∗ − b)′X ′ X(b∗ − b) ≥ e′e

e′∗e∗ − e′e = (b∗ − b)′X ′ X(b∗ − b)

= (Rb− q)′[R(X ′X)−1R′]−1(Rb− q)

61

Loss of fit and F -statistic

F -statistic for H0 : Rβ = q

F = [ (Rb − q)′ [R(X′X)−1R′]−1 (Rb − q)/J ] / [ e′e/(n−K) ]

= [ (e′∗e∗ − e′e)/J ] / [ e′e/(n−K) ]

= [ (R² − R²∗)/J ] / [ (1 − R²)/(n−K) ] ∼ F(J, n−K)

Recall R² = 1 − e′e/(y′M0y) and y′M0y does not depend on b. Similarly for R²∗.

Special case: overall significance of the regression

β2 = . . . = βK = 0 (except intercept) ⇒ R2∗ = 0 with J = K − 1
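A numpy sketch of this loss-of-fit F test for the overall significance of the regression (reusing X, y, b, n_obs, K and the scipy import from earlier sketches; illustrative names), comparing restricted and unrestricted sums of squared residuals.

e_u = y - X @ b                                # unrestricted residuals
sse_u = e_u @ e_u
sse_r = ((y - y.mean()) ** 2).sum()            # restricted model: slopes = 0, intercept only
J = K - 1                                      # number of restrictions
F = ((sse_r - sse_u) / J) / (sse_u / (n_obs - K))
print(F, stats.f.sf(F, J, n_obs - K))          # large F / small p-value: reject H0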

62

Nonnormal disturbances and large sample tests

Drop assumption ε|X ∼ N which implies b|X ∼ N(β, σ2(X ′X)−1)

All previous tests hold asymptotically, when n→∞

Key ingredient: asymptotic distribution of b

b ∼a N(β, (σ²/n) Q−1), where Q = plim (X′X/n)

Recall √n (b − β) →d N(0, σ²Q−1), from the CLT

plim s² = σ², where s² = e′e/(n−K)

63

Example: limiting distribution of Wald statistic

If √n (b − β) →d N(0, σ²Q−1) and H0 : Rβ − q = 0, then

√n (Rb − q) = √n R(b − β) →d N(0, σ²RQ−1R′)

which implies

√n (Rb − q)′ (σ²RQ−1R′)−1 √n (Rb − q) →d χ²(J)

which has the same limiting distribution as W

W = (Rb − q)′ (s²R(X′X)−1R′)−1 (Rb − q) →d χ²(J)

when plim s²(X′X/n)−1 = σ²Q−1. Note: in W all n's cancel

Remark: W is only approximately distributed as χ²(J) in finite samples; in practice n does not actually go to ∞

64

Testing nonlinear restrictions

Test H0 : c(β) = q, where c is a J × 1 vector of nonlinear functions

Apply delta method: first order Taylor expansion of c

c(β̂) ≈ c(β) + [∂c(β)/∂β′] (β̂ − β)

Var[c(β̂)] ≈ [∂c(β)/∂β′] Var[β̂] [∂c(β)′/∂β]

In ∂c(β)/∂β′ replace β by β̂

Wald statistic

W = (c(β̂) − q)′ ( V̂ar[c(β̂)] )−1 (c(β̂) − q) →d χ²(J)

65

Prediction

Prominent use of regression model

y0, x0 not in our sample, not observed. Predict y0 using

ŷ0 = E[y0|x0, X] = x0′b

as y0 = x0′β + ε0, and assuming that x0 is known

Forecast error: e0 = y0 − ŷ0 = (β − b)′x0 + ε0

Prediction variance: Var[e0|x0, X] = σ2 + x0′[σ2(X ′X)−1]x0 > 0

Prediction interval at (1− λ) confidence level:

ŷ0 ± zλ/2 √(Var[e0|x0, X])

where zλ/2 is the λ/2-quantile of N(0, 1), e.g. λ = 0.05

66

Prediction of y0 and x0

If x0 is known, prediction of y0

E[y0|x0, X] = x0′b

with Var[e0|x0, X]

If x0 is not known and needs to be predicted too, prediction of y0

Ex0E[y0|x0, X] = Ex0[x0′b|X]

depends on distribution of x0, usually unknown and computed by simulation,

with Var[e0|X] > Var[e0|x0, X]

67

Measure of predictive accuracy

Notation: yi realized values, ŷi predicted values, n0 number of predictions

• Not scale invariant:

– Root mean square error (RMSE) = √( ∑_i (yi − ŷi)²/n0 )

– Mean absolute error (MAE) = ∑_i |yi − ŷi| / n0

• Scale invariant:

– Theil U statistic

U = √( ∑_i (yi − ŷi)²/n0 ) / √( ∑_i y²i/n0 )

68

Chapter 6: Functional form

Very general functional form of regression model:

L independent variables: zi = [z1i · · · zLi]

K linearly independent functions of zi: f1i(zi) · · · fKi(zi)

g(yi) observable function of yi

usual assumptions on εi

The following model is still linear and can be estimated by LS:

g(yi) = β1 f1i(zi) + . . . + βK fKi(zi) + εi

= β1 x1i + . . . + βK xKi + εi

yi = x′i β + εi

69

Nonlinearity in variables

A linear model, e.g., yi = β1 + xi β2 + wi β3 + εi is typically enriched with

• dummy variables

• nonlinear functions of regressors (e.g. quadratic function)

• interaction terms (i.e. cross products)

yi = β1 + xi β2 + wi β3 + β4 di + β5 x²i + β6 xiwi + εi

= x′i β + εi

where x′i = [1 xi wi di x²i xiwi] and the dummy variable

di = 1 if i ∈ D, 0 otherwise

70

Dummy variable

Easy to use: one dummy variable is one more column in X

To study various effects (treatment, grouping, seasonality, thresholds, etc.)

yi = β1 + x2i β2 + di β3 + εi

= (β1 + di β3) + x2i β2 + εi

= x′i β + εi

where

di = 1 if i ∈ D, 0 otherwise

In this model the dummy variable “shifts” the intercept: β1 ←→ (β1 + β3)

71

Example: regression with dummy variable

yi = x′i β + εi, x′i = [1 x2i di], x2i ∼ U[0, 1], di = 1{x2i > 0.5}, εi ∼ N(0, 1)

i = 1, . . . , 100, β = [1 2 2]′; in this sample b = [0.99 2.13 1.96]′

[Figure: scatter of (x2i, yi) with the true and estimated regression lines for yi = β1 + x2i β2 + β3 di + εi; both lines shift upward at x2i = 0.5]

72

Structural break

Previous graph shows a structural break in the model

yi = β1 + x2i β2 + εi,             if x2i ≤ 0.5

yi = (β1 + β3) + x2i β2 + εi,      if x2i > 0.5

Structural change can be tested with F -test

Note: the break point is supposed to be known a priori

73

Testing for a structural break

Split the sample in two parts, according to potential structural break

nb observations on yb and Xb (nb × k) before potential structural break

na observations on ya and Xa (na × k) after potential structural break

• Unrestricted model allows for a potential structural break, βb ≠ βa:

[ yb ]   [ Xb  0  ] [ βb ]   [ εb ]
[ ya ] = [ 0   Xa ] [ βa ] + [ εa ]

• Restricted model, no structural break, β′ = [β′b β′a]:

βb = βa  ⇔  βb − βa = 0  ⇔  [Ik  −Ik] β = R β = 0

74

F -test for a structural break

H0 : R β = q, with q = 0, R = [Ik...−Ik], dim(R) = k×2k, dim(β) = 2k×1

F = [ (Rb − q)′ [R(X′X)−1R′]−1 (Rb − q)/J ] / [ e′e/(n−K) ] ∼ F(J, n−K)

where

J = k = number of restrictions = number of rows in R

n−K = (nb + na)− 2k = total number of observations minus dim(β)

Alternative ways exist to test for structural break (e.g., Wald statistic)

Typical issue: limited sample sizes before, nb, and/or after, na, the break
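A self-contained numpy/scipy sketch of this structural-break F test, using the dummy-variable DGP from the earlier simulated example (break at x2i = 0.5, β = [1 2 2]′; all names and values are illustrative).

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_obs, k = 100, 2                              # k coefficients per regime
x2 = rng.uniform(0.0, 1.0, size=n_obs)
d = (x2 > 0.5).astype(float)                   # break indicator
y = 1.0 + 2.0 * x2 + 2.0 * d + rng.normal(0.0, 1.0, size=n_obs)

# Unrestricted: separate intercept and slope before/after the break
Xu = np.column_stack([1 - d, (1 - d) * x2, d, d * x2])
sse_u = ((y - Xu @ np.linalg.solve(Xu.T @ Xu, Xu.T @ y)) ** 2).sum()
# Restricted: common coefficients, no break
Xr = np.column_stack([np.ones(n_obs), x2])
sse_r = ((y - Xr @ np.linalg.solve(Xr.T @ Xr, Xr.T @ y)) ** 2).sum()

F = ((sse_r - sse_u) / k) / (sse_u / (n_obs - 2 * k))
print(F, stats.f.sf(F, k, n_obs - 2 * k))      # small p-value: reject "no break"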

75

Chapter 7: Specification analysis

Implicit assumption: the model y = Xβ + ε is correct

Common model misspecification:

• Omission of relevant variables

• Inclusion of superfluous variables

76

Omitted relevant variables

True regression model: y = X1 β1 + X2 β2 + ε

Use wrong regression model: “y = X1 β1 + ε”

Regress y on X1 only:

b1 = (X′1X1)−1X′1 y = (X′1X1)−1X′1 (X1 β1 + X2 β2 + ε)

= β1 + (X′1X1)−1X′1X2 β2 + (X′1X1)−1X′1 ε

E[b1|X] = β1 + (X′1X1)−1X′1X2 β2

Unless X ′1X2 = 0 or β2 = 0,

E[b1|X] 6= β1, i.e. b1 is biased

plim (b1) 6= β1, i.e. b1 is inconsistent

Inference procedures (t-test, F -test, etc.) are invalid

77

Inclusion of superfluous variables

True regression model: y = X1 β1 + ε

Use “wrong” regression model: y = X1 β1 + X2 β2 + ε

Rewrite y = X1 β1 + X2 β2 + ε = Xβ + ε

where X = [X1 X2] and β′ = [β′1 β′2] = [β′1 0′]

The model used is not wrong per se; simply β2 = 0

Regress y on X: the LS estimator is an unbiased estimator of β

E[b|X] = β = [β1; β2] = [β1; 0]

Price to pay for not using information β2 = 0: reduced precision of estimates

“Var[b|X] ≥ Var[b1|X]”

78

Model building

• Simple-to-general:

not a good strategy, omitted variables induce biased and inconsistent

estimates

• General-to-simple:

better strategy, computing power is cheap, but variable selection is a difficult

task

79

Choosing between nonnested models

F -test of H0 : Rβ = q is only for nested models

R represents (linear) restrictions on the model y = Xβ + ε

Various nonnested hypotheses can be of interest

e.g., choosing between linear or loglinear functional forms:

yi = β1 + xi β2 + εi or log(yi) = β1 + log(xi)β2 + εi

Typically, these tests are based on likelihood function

80

Likelihood function: digression

Probability theory : given population model, what is the probability of

observing that sample?

Inference procedure : given that sample, what is the population model?

Likelihood function = probability of observing that sample as a function of

model parameters

81

Likelihood function: simple example

Fair coin, H, T, Pr(toss = T ) = p0 = 0.5

Goal: estimate p0 (unknown to us)

Observed sample: n = 60 tossing, total T = k = 28

L(p) = (n choose k) p^k (1 − p)^(n−k) = (60 choose 28) p^28 (1 − p)^32

[Figure: likelihood L(p) as a function of p, peaking at p = k/n = 28/60]

82

Choosing between nonnested models: Vuong’s test

Goal: choose between two nonnested models

No model is favored, as in classical hypothesis testing

Models can be both wrong: choose the least misspecified

Assumption: observations are independent (conditionally on regressors)

True model: yi ∼ h(yi), density with parameter α

Model 0: yi ∼ f(yi), density with parameter θ

Model 1: yi ∼ g(yi), density with parameter γ

KLIC0 = E [(lnh(yi)− ln f(yi))| h is true ] ≥ 0

KLIC0 = distance between model h (true) and f in terms of log-likelihood

83

Vuong’s statistic

Decision criteria: model 1 is better than model 0 if KLIC1 < KLIC0

KLIC1 − KLIC0 = E[(ln f(yi) − ln g(yi)) | h is true]

≈ (1/n) ∑_{i=1}^n (ln f(yi) − ln g(yi)) = ∑_{i=1}^n mi / n

Vuong's statistic:

V = √n ( ∑_{i=1}^n mi/n ) / √( ∑_{i=1}^n (mi − m̄)²/n )

• V →d N(0, 1) when models 0 and 1 are "equivalent"

• V →a.s. +∞ when model 0, f(yi), is "better"

• V →a.s. −∞ when model 1, g(yi), is "better"

84

Vuong’s test: application to linear models

Assume ε ∼ N(0, σ2)

Model 0: yi ∼ f(yi), with yi = x′iθ + ε0i

Model 1: yi ∼ g(yi), with yi = x′iγ + ε1i

f(yi) = (1/√(2πσ²)) exp( −0.5 (yi − x′iθ)²/σ² )

ln f(yi) = −(1/2) ln(2πσ²) − (1/2)(yi − x′iθ)²/σ² = −(1/2)[ ln 2π + ln(σ²) + ε²0i/σ² ]

ln f(yi) − ln g(yi) = [ −(1/2)( ln(e′0e0/n) + e²0i/(e′0e0/n) ) ] − [ −(1/2)( ln(e′1e1/n) + e²1i/(e′1e1/n) ) ]

85

Model selection criteria

Various criteria have been proposed

Adjusted R²:

R̄² = 1 − [e′e/(n−K)] / [∑_{i=1}^n (yi − ȳ)²/(n−1)]

Akaike Information Criterion:

ln AIC(K) = ln(e′e/n) + 2K/n

Bayesian Information Criterion:

ln BIC(K) = ln(e′e/n) + K ln n / n

86

Chapter 8: Generalized regression model

Spherical error E[ε ε′|X] = σ2I is a restrictive assumption

Allow for heteroscedasticity, σ²i ≠ σ²j, and autocorrelation, σij ≠ 0, ∀i, j:

E[ε ε′|X] = σ²Ω = Σ =

[ σ²1  σ12  · · ·  σ1n
  σ12  σ²2  · · ·  σ2n
   ⋮              ⋱
  σ1n  σ2n  · · ·  σ²n ]

Total number of parameters in Σ = n + (n² − n)/2 = n(n + 1)/2 ≫ n

E.g., n = 100 ⇒ n(n + 1)/2 = 5,050 too many!

Need to impose structure on Σ

87

Heteroscedasticity: asset returns and stochastic volatility

S&P 500 daily returns, 1999–2003, and asymmetric GARCH volatility

[Figure: top panel, S&P 500 daily returns (%) 1999–2003; bottom panel, asymmetric GARCH volatility estimate]

88

Least square estimator

When Var[ε|X] = σ2Ω

LS estimator, b = β + (X ′X)−1X ′ε, has still good properties:

unbiased, consistent, and asymptotically normal

E[b|X] = β

Var[b|X] = (X′X)−1 X′ Var[ε|X] X (X′X)−1 = (σ²/n) (X′X/n)−1 (X′ΩX/n) (X′X/n)−1

If plim (X′X/n) and plim (X′ΩX/n) are positive definite, plim b = β

√n (b − β) = (X′X/n)−1 √n X′ε/n →d Q−1 × N(0, σ² plim X′ΩX/n)

89

Generalized least square estimator

Var[ε|X] = σ2Ω, assume Ω is known; decompose Ω = CΛC ′

Ω−1 = CΛ−1/2 Λ−1/2C ′ = P ′ P , where Λ = diag(λ1, . . . , λn), C ′C = I

Transformed model : Py = PXβ + Pε ⇒ Var[Pε|X] = σ2PΩP ′ = σ2I

β̂ = (X′P′PX)−1 X′P′Py = (X′Ω−1X)−1 X′Ω−1 y = arg min_{β0} (y − Xβ0)′ Ω−1 (y − Xβ0)

Heteroscedasticity case: Ω = diag(w1, . . . , wn)

β̂ = arg min_{β0} ∑_{i=1}^n (yi − x′i β0)²/wi

Recall: OLS case Ω = I

90

GLS efficient estimator

In the classical model, y = Xβ + ε, where Var[ε|X] = σ2I:

OLS is minimum variance, BLUE, estimator

In the transformed model, Py = PXβ + Pε, where Var[Pε|X] = σ2I:

GLS estimator = OLS in the transformed model

⇒ GLS estimator is efficient (not OLS)

91

Feasible generalized least square estimator (FGLS)

Var[ε|X] contains n(n + 1)/2 parameters: impossible to estimate all

Var[ε|X] = σ²Ω parameterized with a few unknown parameters θ

E.g. Time series: Ωij = θ^|i−j|, where |θ| < 1

E.g. Heteroscedasticity: Ωii = exp(z′i θ)

FGLS estimator relies on Ω̂ = Ω(θ̂)

β̂(Ω̂) = (X′Ω̂−1X)−1 X′Ω̂−1 y

Key result: when n → ∞, β̂(Ω̂) behaves like β̂(Ω),

92

Heteroscedasticity

Var[ε|X] = σ2Ω = σ2diag(w1, . . . , wn)

Scaling: tr(σ²Ω) = ∑_{i=1}^n σ²i = σ² ∑_{i=1}^n wi = σ² n  ⇒  σ² = ∑_{i=1}^n σ²i / n

Interpretation: wi positive weight

When form of heteroscedasticity is

• known: parameterize and estimate Ω, then FGLS

• unknown: OLS can still be applied, but Var[b|X]?

93

Estimating Var[b|X] under unknown heteroscedasticity

White’s heteroscedasticity consistent estimator:

Var[b|X] = (σ²/n) (X′X/n)−1 (X′ΩX/n) (X′X/n)−1

= (1/n) (X′X/n)−1 [ (1/n) ∑_{i=1}^n σ²i xi x′i ] (X′X/n)−1

≈ (1/n) (X′X/n)−1 [ (1/n) ∑_{i=1}^n e²i xi x′i ] (X′X/n)−1

Proof sketch: As σ²i xi x′i = E[ε²i xi x′i | xi],

plim (1/n) ∑_{i=1}^n σ²i xi x′i = plim (1/n) ∑_{i=1}^n ε²i xi x′i = plim (1/n) ∑_{i=1}^n e²i xi x′i

Remark: the equalities above are in plim; X′ΩX/n itself is never estimated
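A numpy sketch of White's heteroscedasticity-consistent covariance estimator, built exactly as the sandwich above with e²i in the centre matrix (reusing X, y, b from the first simulation sketch; illustrative names).

e = y - X @ b
XtX_inv = np.linalg.inv(X.T @ X)
meat = (X * (e ** 2)[:, None]).T @ X         # sum_i e_i^2 x_i x_i'
cov_white = XtX_inv @ meat @ XtX_inv         # White's robust estimate of Var[b|X]
print(np.sqrt(np.diag(cov_white)))           # heteroscedasticity-robust standard errors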

94

Test for heteroscedasticity: Breusch–Pagan test

Form of heteroscedasticity: σ2i = σ2f(α0 + α′zi)

Note: functional form f does not need to be specified

H0 : α = 0, i.e. homoscedasticity

Under H0, E[ε2i/(σ2f(α0))− 1] = 0 and does not depend on zi

Regress gi := e²i/(e′e/n) − 1 on Z′i := [1 z′i] (1 × k), i = 1, . . . , n

calculate b̂ = (Z′Z)−1Z′g and ĝ = Zb̂

Under H0, test statistic:

(1/2) ĝ′ĝ = (1/2) g′Z(Z′Z)−1Z′g →d χ²(k−1)
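A self-contained numpy/scipy sketch of this Breusch–Pagan statistic under an assumed heteroscedastic DGP (zi is taken to be a single regressor x2i; all names and values are illustrative).

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_obs = 200
x2 = rng.uniform(0.0, 1.0, size=n_obs)
X = np.column_stack([np.ones(n_obs), x2])
y = X @ np.array([1.0, 2.0]) + np.sqrt(np.exp(1.0 + 2.0 * x2)) * rng.normal(size=n_obs)

e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)    # OLS residuals
g = e ** 2 / (e @ e / n_obs) - 1.0               # g_i = e_i^2/(e'e/n) - 1
Z = np.column_stack([np.ones(n_obs), x2])        # Z_i' = [1, z_i']
g_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ g)    # fitted values from regressing g on Z
bp = 0.5 * (g_hat @ g_hat)                       # (1/2) ghat'ghat ~ chi2(k-1) under H0
print(bp, stats.chi2.sf(bp, df=Z.shape[1] - 1))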

95

Multiplicative heteroscedasticity: example

Goal: explain firms profit, yi, i = 1, . . . , n

Model: yi = x′i β + εi, where

Var[εi|X] = σ2 exp(z′i α) (Harvey’s model)

Step 1: regress yi on xi using OLS and compute ei

Step 2: regress log(e2i ) on [1 z′i] using OLS to estimate σ2 (biased) and α

Step 3: regress yi on xi using FGLS with Ωii = exp(z′i α) to estimate β

LS applied twice to model yi = x′i β + εi: two-stage least squares

Remark: LS estimate of σ2 biased (but not important for FGLS) because

E[log ε²i] < log E[ε²i] = log σ²i = log σ² + z′iα

E[log ε²i] = −c + log σ²i, where c > 0

log e²i = −c + log σ² + z′iα + νi, where νi is an error term
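A self-contained numpy sketch of the three-step procedure above under Harvey's multiplicative form, with zi = x2i for illustration (all names and parameter values are assumptions of this sketch, not part of the slides).

import numpy as np

rng = np.random.default_rng(6)
n_obs = 200
x2 = rng.uniform(0.0, 1.0, size=n_obs)
X = np.column_stack([np.ones(n_obs), x2])
y = X @ np.array([1.0, 2.0]) + np.sqrt(np.exp(1.0 + 2.0 * x2)) * rng.normal(size=n_obs)

# Step 1: OLS and residuals
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b_ols
# Step 2: regress log(e^2) on [1, z]; the intercept absorbs the bias term -c
Z = np.column_stack([np.ones(n_obs), x2])
a_hat = np.linalg.solve(Z.T @ Z, Z.T @ np.log(e ** 2))
omega = np.exp(x2 * a_hat[1])                # Omega_ii up to scale (scale irrelevant for GLS)
# Step 3: FGLS = weighted least squares with weights 1/Omega_ii
w = 1.0 / omega
b_fgls = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))
print(b_ols, b_fgls)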

96

Chapter 9: Panel data models

Time series: yit, t = 1, . . . , T

Cross sectional: yit, i = 1, . . . , n

Panel or longitudinal: yit, i = 1, . . . , n, t = 1, . . . , T, with n ≫ T

y1t  y2t  y3t  · · ·  ynt
 ⋮    ⋮    ⋮           ⋮
y1T  y2T  y3T  · · ·  ynT

97

Why panel data model

Rich panel databases are available, e.g. labor market, industrial sectors

Certain phenomena can be studied only in panel data models

E.g. Analysis of production function:

technological change (over time) and

economies of scale (across firms of different sizes)

98

General framework for panel data model

Typically n≫ T

yit = x′it β + z′i α + εit

= x′it β + ci + εit

xit: K × 1, without constant term

zi: individual specific variables, observed or unobserved, with constant term

ci: individual effect, often unobserved and stochastic, e.g. “health”, “ability”

Goal: estimate partial effects β = ∂E[yit|xit]/∂xit and E[ci|xi1, xi2, . . .]

Note: if zi observed ∀i ⇒ linear model estimated by LS

99

Modeling frameworks

Panel data model: yit = x′it β + ci + εit

1. Pooled model: ci = α constant term. Use OLS to estimate α, β

2. Fixed effects: ci unobserved and correlated with xit: E[ci|Xi] = αi

yit = x′it β + αi + εit + (ci − αi)

Regress yit on xit omits variables: LS biased, inconsistent estimate of β

3. Random effects: ci unobserved and uncorrelated with xit: E[ci|Xi] = α

yit = x′it β + α + εit + (ci − α)

Regress yit on xit and constant: OLS consistent, inefficient estimate of α, β

100

Pooled model

Assumption: ci = α constant term

yit = x′it β + ci + εit

= x′it β + α + εit

E[εit|Xi] = 0

Var[εit|Xi] = σ2ε

Cov[εit, εjs|Xi, Xj] = 0, if i ≠ j or t ≠ s

If assumptions of linear regression model are met: OLS unbiased and efficient

But this is hardly the case

101

LS estimation of pooled model

Pooled model: yit = x′it β + ci + εit = x′it β + α + εit

If FE true model, Cov[ci, xit] 6= 0: LS is inconsistent (omitted variables)

If RE true model, Cov[ci, xit] = 0: LS consistent but inefficient

In RE model:

yit = x′it β + ci + εit

= x′it β + E[ci|Xi] + (ci − E[ci|Xi]) + εit

= x′it β + α + ui + εit

= x′it β + α + wit

Autocorrelation (within group i): Cov[wit, wis] = σ²u ≠ 0, t ≠ s

102

Pooled regression with random effects

RE model: yit = x′it β + α + ui + εit. Stack Ti observations for individual i:

yi = [ii  xi] [α; β] + (εi + ii ui) = Xi β + wi

Shocks, wi, are heteroscedastic (across individuals) and autocorrelated:

Var[wi] = Var[εi + ii ui] = σ²ε I_Ti + σ²u ii i′i = σ²ε I_Ti + Σi = Ωi

Recall: i = 1, . . . , n, and goal is to estimate β

103

LS pooled regression with random effects

Stack all observations for all individuals, (T1 + . . . + Tn):

b = (X′X)−1X′y = β + [ (1/n) ∑_{i=1}^n X′iXi ]−1 (1/n) ∑_{i=1}^n X′i wi  →p  β

Asy.Var[b] = (1/n) plim[ (1/n) ∑_{i=1}^n X′iXi ]−1 plim[ (1/n) ∑_{i=1}^n X′i wi w′i Xi ] plim[ (1/n) ∑_{i=1}^n X′iXi ]−1

LS consistent; Asy.Var[b] is called the robust covariance matrix

If data are well behaved,

plim[ (1/n) ∑_{i=1}^n X′iXi ]  and  plim[ (1/n) ∑_{i=1}^n X′i wi w′i Xi ]

are positive definite, but the second matrix needs to be "estimated"

104

“Estimating” center matrix in Asy.Var[b]

Use White’s approach (not White’s heterosc. estimator):

plim[ (1/n) ∑_{i=1}^n X′i wi w′i Xi ] = plim[ (1/n) ∑_{i=1}^n X′i Ωi Xi ]

= plim (1/n) ∑_{i=1}^n ( ∑_{t=1}^{Ti} xit wit ) ( ∑_{t=1}^{Ti} xit wit )′

≠ plim (1/n) ∑_{i=1}^n ∑_{t=1}^{Ti} w²it xit x′it

Correlations across observations (not heteroscedasticity) contribute most to Asy.Var[b]

105

Pooled regression: group means estimator

To estimate β use n group means, e.g. for yit, t = 1, . . . , Ti:

(1/Ti) ∑_{t=1}^{Ti} yit = (1/Ti) i′i yi = ȳi.

Averaging eliminates the time series dimension of the panel data (≈ cross section)

yi = Xi β + wi

(1/Ti) i′i yi = (1/Ti) i′i Xi β + (1/Ti) i′i wi

ȳi. = x̄′i. β + w̄i.

In the Pooled model w̄i. = ε̄i.; in the RE model w̄i. = ε̄i. + ui, heteroscedastic

Sample data (ȳi., x̄i.), i = 1, . . . , n

Estimation: LS for β and White’s heterosc. estimator for Asy.Var[b]

106

Pooled regression: first difference estimator

General panel data model: yi,t = x′i,t β + ci + εi,t, where

ci correlated (fixed effects) or uncorrelated (random effects) with xi,t

yi,t − yi,t−1 = (x′i,t − x′i,t−1) β + εi,t − εi,t−1

∆yi,t = (∆x′i,t) β + ui,t

Advantage: first difference removes all individual specific heterogeneity ci

Disadvantage: first difference removes all time-invariant variables too

ui,t: moving average (MA), covariance matrix tridiagonal, two-stage GLS

107

Fixed effects model

Assumption: unobservable individual effect, ci, correlated with xit

yit = x′it β + ci + εit

= x′it β + E[ci|Xi] + (ci − E[ci|Xi]) + εit

= x′it β + h(Xi) + νi + εit

= x′it β + αi + εit

Further assumption: Var[ci|Xi] = Var[νi|Xi] is constant

In general: Cov[εit, εis|Xi] = E[(νi + εit)(νi + εis)|Xi] = E[ν²i|Xi] ≠ 0

Assumption: Var[εi|Xi] = σ²ε I_Ti ⇒ classical regression model

Parameters to estimate (K + n): [β1 · · ·βK]′ and αi, i = 1, . . . , n

108

Fixed effects model: drawback

Time invariant variables in xit are absorbed in αi

x′it = [1x′it  2x′i]: time-variant and time-invariant variables

yit = x′it β + αi + εit

= 1x′it β1 + 2x′i β2 + αi + εit

= 1x′it β1 + αi + εit

β2 cannot be estimated (not identified)

109

Fixed effects model: Least Squares Dummy Variable

Recall i = (T × 1) column of ones. Stack T observations for individual i:

yi = Xi β + i αi + εi

Stack all regression models for n individuals, LSDV model:

[ y1 ]   [ X1 ]       [ i  0  · · ·  0 ] [ α1 ]   [ ε1 ]
[  ⋮ ] = [  ⋮ ] β  +  [ ⋮       ⋱     ⋮ ] [  ⋮ ] + [  ⋮ ]
[ yn ]   [ Xn ]       [ 0  0  · · ·  i ] [ αn ]   [ εn ]

y = [X  d1 · · · dn] [β; α] + ε

= X β + D α + ε

110

Fixed effects model: least squares estimation

Model for nT observations: y = X β + D α + ε, interest on β

Partitioned regression, MD y on MD X, reduces size of computation

b = [X ′MDX]−1X ′MDy

Asy.Var[b] = s2 [X ′MDX]−1

Individual effect, αi, estimated using only T observations on individual i:

ai = ȳi. − x̄′i. b = (1/T) ∑_{t=1}^T (αi + x′it β + εit) − x̄′i. b

ai − αi = (1/T) ∑_{t=1}^T εit + (1/T) ∑_{t=1}^T x′it (β − b) = ε̄i. + x̄′i. (β − b)

Asy.Var[ai] = σ²ε/T + x̄′i. Asy.Var[b] x̄i.  ↛ 0, when n → ∞
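A self-contained numpy sketch of the fixed-effects estimator via the within (group-demeaning) transformation, which gives the same slope as LSDV; the small simulated panel and all names are illustrative.

import numpy as np

rng = np.random.default_rng(2)
n_ind, T = 50, 5
alpha = rng.normal(0.0, 1.0, size=n_ind)                      # individual effects
x = rng.normal(0.0, 1.0, size=(n_ind, T)) + alpha[:, None]    # x correlated with alpha
y = 2.0 * x + alpha[:, None] + rng.normal(0.0, 1.0, size=(n_ind, T))

# Within transformation: deviations from group means (equivalent to LSDV for the slope)
x_dm = x - x.mean(axis=1, keepdims=True)
y_dm = y - y.mean(axis=1, keepdims=True)
b_fe = (x_dm * y_dm).sum() / (x_dm ** 2).sum()                # single-regressor case
a_i = y.mean(axis=1) - b_fe * x.mean(axis=1)                  # a_i = ybar_i - xbar_i' b
print(b_fe)                                                   # close to the true slope 2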

111

Testing differences across groups

Null hypothesis H0 : α1 = · · · = αn

α1 − α2 = 0
α2 − α3 = 0
 ⋮
αn−1 − αn = 0

i.e.

[ 1  −1   0  · · ·   0 ]   [ α1 ]
[ 0   1  −1  · · ·   0 ]   [ α2 ]  = R α = 0
[ ⋮              ⋱      ]  [  ⋮ ]
[ 0  · · ·   0   1  −1 ]   [ αn ]

that is, J = n − 1 restrictions on α.

F-statistic: compare the unrestricted R² vs. the restricted R²

F[n − 1, nT − K − n] = [ (R²_LSDV − R²_Pooled)/(n − 1) ] / [ (1 − R²_LSDV)/(nT − K − n) ]

112

Random effects model

Assumption: unobservable individual effect, ci, uncorrelated with xit

yit = x′it β + ci + εit

= x′it β + E[ci] + (ci − E[ci]) + εit

= x′it β + α + ui + εit

= x′it β + α + ηit

For T observations on individual i:

Var[ηi] = Var[εi + iT ui] = σ²ε IT + σ²u iT i′T = Σ

113

Random effects model: Generalized least squares

Observations i ⊥ j ⇒ nT × nT cov. matrix block diagonal, Ω = In ⊗ Σ

Remark: Σ does not depend on i

GLS:

β̂ = (X′Ω−1X)−1 X′Ω−1 y = ( ∑_{i=1}^n X′i Σ−1 Xi )−1 ( ∑_{i=1}^n X′i Σ−1 yi )

σ²ε and σ²u in Σ are usually unknown: estimate them and then use FGLS

114

FGLS of random effects model: estimating σ2ε

Taking deviations from group means removes the heterogeneity ui

yit = x′it β + α + ui + εit

ȳi. = x̄′i. β + α + ui + ε̄i.

yit − ȳi. = (xit − x̄i.)′ β + εit − ε̄i. = (xit − x̄i.)′ b + eit − ēi.

σ̂²ε = ∑_{i=1}^n ∑_{t=1}^T (eit − ēi.)² / (nT − n − K)  →p  σ²ε

Degrees of freedom: nT observations − n group means ȳi. − K slopes

Note ēi. = 0; note σ̂²ε = s²_LSDV as

eit = yit − ȳi. − (xit − x̄i.)′ b = yit − x′it b − (ȳi. − x̄′i. b) = yit − x′it b − ai = residual in the FE LSDV model

115

FGLS of random effects model: estimating σ2u

OLS is a consistent, unbiased, but not efficient estimator of α and β in

yit = x′it β + α + ui + εit = x′it β + α + ηit

Hence

plim s²_Pooled = plim e′e/(nT − K − 1) = Var[ηit] = σ²u + σ²ε

Consistent estimator of σ²u:

σ̂²u = s²_Pooled − s²_LSDV

If negative, change the degrees of freedom

116

Random Effects or Fixed Effects model?

FE: flexible, Cov[ci, xit] ≠ 0 allowed, but many parameters to estimate: α1, . . . , αn

RE: parsimonious, but the assumption Cov[ci, xit] = 0 might be violated

Hausman's specification test, H0 : RE model

• H0 : Cov[ci, xit] = 0 ⇒ OLS in LSDV and GLS in the RE model are both consistent, but OLS is inefficient

• H1 : Cov[ci, xit] ≠ 0 ⇒ only OLS in LSDV is consistent; GLS in the RE model is inconsistent

Under H0 : OLS in the LSDV model ≈ GLS in the RE model

117

Hausman’s specification test

b OLS in the LSDV model; β̂ GLS in the RE model. Under H0 : b − β̂ ≈ 0

Var[b − β̂] = Var[b] + Var[β̂] − Cov[b, β̂] − Cov[β̂, b]

Hausman's key result:

0 = Cov[efficient estimator, (efficient estimator − inefficient estimator)]

0 = Cov[β̂, (β̂ − b)] = Var[β̂] − Cov[β̂, b]

This implies, under H0,

Var[b − β̂] = Var[b] − Var[β̂]

Wald criterion, based on the K estimated slopes, excluding the intercept:

W = [b − β̂]′ (V̂ar[b] − V̂ar[β̂])−1 [b − β̂] ∼ χ²(K)

118

Mundlak’s approach

Fixed effects model: E[ci|Xi] = αi, one parameter for each individual i

Random effects model: E[ci|Xi] = α, one parameter for all individuals

Mundlak's approach: E[ci|Xi] = x̄′i.γ, the same parameters γ for all individuals

Model:

yit = x′it β + ci + εit

= x′it β + E[ci|Xi] + (ci − E[ci|Xi]) + εit

= x′it β + x̄′i. γ + ui + εit

Drawback: x̄′i. γ can only include time-varying variables

119

Dynamic panel data model

Model yit = x′it β + ci + εit describes static relation

Dynamic model yit = γ yi,t−1 + x′it β + ci + εit fits data much better

OLS and GLS inconsistent: ci correlated with yi,t−1

FE model, deviations from means, first difference: inconsistent estimates

Instrumental variable estimator: consistent estimates

Read about SUR and CAPM in Chapter 10

120

Chapter 12: Instrumental variables

Linear regression model: y = Xβ + ε

b = β + (X ′X)−1X ′ε

b unbiased when E[ε|X] = 0

b consistent when plimX ′ε/n = 0

In many situations (e.g., dynamic panel models, measurement error on X),

X and ε are correlated ⇒ OLS (and GLS) biased and inconsistent

Solution: Instrumental variables (IV), consistent estimates

121

Assumptions of the model

1. Linearity: E[y|X] linear in β

2. Full rank: X is an n×K matrix with rank K

3. Endogeneity of independent variables: E[εi|xi] ≠ 0

4. Homoscedasticity and nonautocorrelation of εi

5. Stochastic or nonstochastic X

6. Normal distribution: ε|X ∼ N(0, σ2I)

122

Instrumental variable: Definition

Instrumental variables Z = [z1 · · · zL] (n× L), L ≥ K, have two properties:

1. Exogeneity: Z uncorrelated with ε

2. Relevance: Z correlated with X

Further assumptions of the model:

• [xi, zi, εi], i = 1, . . . , n, i.i.d.

• E[εi|zi] = 0

• plimZ ′Z/n = Qzz, finite, positive definite matrix

• plimZ ′ε/n = 0 (Exogeneity)

• plimZ ′X/n = Qzx, finite, L×K matrix, rank K (Relevance)

123

Insight on IV estimation

When plimX ′ε/n = 0

y = X β + ε

X ′y/n = X ′X β/n + X ′ε/n

X ′y/n ≈ X ′X β/n

β ≈ (X ′X)−1X ′y

When plimX ′ε/n 6= 0, but plimZ ′ε/n = 0 (and L = K)

Z ′y/n = Z ′X β/n + Z ′ε/n

Z ′y/n ≈ Z ′X β/n

β ≈ (Z ′X)−1Z ′y

Remark: ≈ are = in plim

124

Instrumental variable estimator (L = K)

L instruments, observed variables, Z is n× L matrix, when L = K

bIV = (Z ′X)−1Z ′y

= β + (Z ′X)−1Z ′ε

plim bIV = β + (plim Z′X/n)−1 plim Z′ε/n = β

√n (bIV − β) = (Z′X/n)−1 √n Z′ε/n →d Q−1zx × N(0, σ²Qzz) =d N(0, σ² Q−1zx Qzz Q−1xz)

bIV ∼a N(β, σ² Q−1zx Qzz Q−1xz / n)

Exogeneity ⇒ consistency; Relevance ⇒ low variance

125

Instrumental variable estimator (L > K)

When L > K, Z ′X is L×K, not invertible matrix

X correlated with ε ⇒ inconsistency

Z uncorrelated with ε (Exogeneity)

Idea: project X on Z to get X̂, then regress y on X̂ to estimate β

X̂ = Z × slope of X on Z = Z(Z′Z)−1Z′X

Regressing y on X̂:

bIV = [X̂′X̂]−1 X̂′y

= [X′Z(Z′Z)−1Z′ Z(Z′Z)−1Z′X]−1 X′Z(Z′Z)−1Z′y

= [X′Z(Z′Z)−1Z′X]−1 X′Z(Z′Z)−1Z′y

Two-stage least squares (2SLS) estimator (only logically)
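A minimal numpy sketch of this 2SLS computation (arrays y, X, Z with L ≥ K valid instruments are assumed given; the function name is illustrative).

import numpy as np

def two_sls(y, X, Z):
    """2SLS: project X on Z, then regress y on the projection, as in the formula above."""
    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)     # X_hat = Z (Z'Z)^(-1) Z'X
    return np.linalg.solve(X_hat.T @ X, X_hat.T @ y)  # [X_hat'X]^(-1) X_hat'y

# Because the projection is idempotent, X_hat'X = X_hat'X_hat, so this equals
# b_IV = [X'Z(Z'Z)^(-1)Z'X]^(-1) X'Z(Z'Z)^(-1)Z'y.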

126

Which instruments?

Instrumental variables are generally difficult to find

Z can include variables in X uncorrelated with ε

In time series settings, lagged values of x and y are typical instruments

Relevance ⇒ high correlation between X and Z (otherwise Q−1xz large)

But then Z might be correlated with ε (as ε is correlated with X)

127

Example: Dynamic panel data model

Model: yit = γ yi,t−1 + x′it β + ci + εit

ci correlated or uncorrelated with xit

ci certainly correlated with yi,t−1 ⇒ LS inconsistent

Taking first difference, ∆yit = yit − yi,t−1

∆yit = γ∆yi,t−1 + ∆x′it β + ∆εit

Cov[∆yi,t−1, ∆εit] ≠ 0 ⇒ LS still inconsistent

To estimate γ and β, valid instruments, e.g., yi,t−2 and ∆yi,t−2

128

Measurement error

Measurement errors are very common in practice

E.g., variables of interest are not available but only approximated by others

E.g., GDP, consumption, capital, . . . , cannot be measured exactly

129

Regression model with measurement error

True, latent (unobserved), univariate model

y∗i = x∗i β + εi

Observed data: yi = y∗i + vi and xi = x∗i + ui

where vi ∼ (0, σ²v), vi ⊥ y∗i, x∗i, ui  and  ui ∼ (0, σ²u), ui ⊥ y∗i, x∗i, vi

Working model, derived from true model:

yi − vi = (xi − ui)β + εi

yi = xi β + (−ui β + εi + vi)

Measurement error on yi, i.e. vi, absorbed in the error term

Measurement error on xi, i.e. ui, makes LS inconsistent

130

LS estimation with measurement error

Set vi = 0 for simplicity. Working model:

yi = xi β + (−ui β + εi) = xi β + wi

LS estimation of β inconsistent because

Cov[xi, wi] = Cov[x∗i + ui,−ui β + εi] = −β σ2

u 6= 0

b =

(n∑

i=1

x2i/n

)−1 n∑

i=1

xi yi/n

plim b =

(

plim

n∑

i=1

(x∗i + ui)

2/n

)−1

plim

n∑

i=1

(x∗i + ui)(x

∗i β + εi)/n

=(Q∗ + σ2

u

)−1βQ∗ = β/(1 + σ2

u/Q∗)

→ 0 when σ2u →∞
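A self-contained numpy simulation of this attenuation effect under the stated assumptions (true β = 2, Q∗ = Var[x∗i] = 1, σ²u = 1, so plim b = β/2; names are illustrative).

import numpy as np

rng = np.random.default_rng(3)
n_obs, beta, sigma_u2 = 100_000, 2.0, 1.0
x_star = rng.normal(0.0, 1.0, size=n_obs)                    # latent regressor, Q* = 1
y = beta * x_star + rng.normal(0.0, 1.0, size=n_obs)
x = x_star + rng.normal(0.0, np.sqrt(sigma_u2), size=n_obs)  # observed with error

b = (x @ y) / (x @ x)                        # LS slope on the noisy regressor
print(b, beta / (1.0 + sigma_u2 / 1.0))      # both close to 1 = beta/2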

131

IV estimation with measurement error

Instrument zi has the two properties:

1. Exogeneity: Cov[zi, ui] = 0

2. Relevance: Cov[zi, x∗i] = Q∗zx ≠ 0

Recall the true model yi = x∗i β + εi and the observed regressor xi = x∗i + ui

bIV = ( ∑_{i=1}^n xi zi/n )−1 ∑_{i=1}^n zi yi/n

plim bIV = ( plim ∑_{i=1}^n (x∗i + ui) zi/n )−1 plim ∑_{i=1}^n zi (x∗i β + εi)/n

= (Q∗zx)−1 β Q∗zx = β

132

IV estimation of generalized regression model

In generalized regression model E[ε ε′|X] = σ2Ω

bIV = [X′Z(Z′Z)−1Z′X]−1 X′Z(Z′Z)−1Z′y = β + [X′Z(Z′Z)−1Z′X]−1 X′Z(Z′Z)−1Z′ε

plim bIV = β + Qxx.z × plim Z′ε/n = β

√n (bIV − β) →d Qxx.z × N(0, σ² plim (Z′ΩZ/n)) =d N(0, σ² Qxx.z plim (Z′ΩZ/n) Q′xx.z)

bIV ∼a N(β, σ² Qxx.z plim (Z′ΩZ/n) Q′xx.z / n)

Same derivation as when E[ε ε′|X] = σ2I

133

Chapter 15: Generalized method of moments (GMM)

General framework for estimation and hypothesis testing

LS, NLS, GLS, IV, etc. special cases of GMM

GMM relies on “weak” assumptions about first moments

(existence and convergence of first moments)

Strength (and limitation) of GMM:

No assumptions about distribution ⇒ Robust to misspecification of DGP

Widely used in Econometrics, Finance, . . .

134

Logic behind method of moments

Sample moments →p Population moments = function(parameters)

E.g., random sample yi, i = 1, . . . , n, with E[yi] = µ and Var[yi] = σ2

(1/n) ∑_{i=1}^n yi →p E[yi] = µ

(1/n) ∑_{i=1}^n y²i →p E[y²i] = σ² + µ²

Assumptions of Law of Large Numbers need to hold

135

Orthogonality conditions: Example

Parameters are implicitly defined by two orthogonality conditions:

E[yi − µ] = 0

E[y2i − σ2 − µ2] = 0

To estimate µ and σ2, replace E[·] by empirical distribution

and solve two moment equations:

(1/n) ∑_{i=1}^n (yi − µ) = 0

(1/n) ∑_{i=1}^n (y²i − σ² − µ²) = 0

Moment estimators: µ̂ = ∑_{i=1}^n yi/n and σ̂² = ∑_{i=1}^n (yi − µ̂)²/n

136

Example: Gamma distribution

Gamma distribution used to model positive r.v. yi, e.g. waiting time

f(y) = (λ^p / Γ[p]) e^{−λy} y^{p−1},   y ≥ 0, p > 0, λ > 0

(Some) orthogonality conditions:

E[ yi − p/λ ] = 0
E[ y²i − p(p + 1)/λ² ] = 0
E[ ln yi − d lnΓ[p]/dp + lnλ ] = 0
E[ 1/yi − λ/(p − 1) ] = 0

Orthogonality conditions are (general) nonlinear functions of sample data

More orthogonality conditions (four) than parameters (two)

Any two orthogonality conditions give (p, λ): need to reconcile all of them

137

Orthogonality conditions

K parameters to estimate, θ = (θ1, . . . , θK)′

L moment conditions (L ≥ K):

E[ mi1(yi, xi, θ), . . . , mil(yi, xi, θ), . . . , miL(yi, xi, θ) ]′ = E[mi(yi, xi, θ)] = 0

θ is implicitly defined by the equation above

and estimated via the empirical counterpart of E[·]

138

Exactly identified case

When L = K, i.e. # moment conditions = # parameters,

sample moment equations have a unique solution and are all exactly satisfied

• E.g., previous method of moments estimator of µ and σ2

• E.g., LS estimator: E[mi(yi, xi, θ)] = E[xi (yi − x′i θ)] = 0

Solving sample moment equations (or normal equations)

(1/n) ∑_{i=1}^n xi (yi − x′i θ) = 0

(1/n) ∑_{i=1}^n xi yi − (1/n) ∑_{i=1}^n xi x′i θ = 0

θ̂ = ( ∑_{i=1}^n xi x′i )−1 ( ∑_{i=1}^n xi yi )

139

Overidentified case

When L > K, i.e. # moment conditions > # parameters,

system of L equations in K unknown parameters

(1/n) ∑_{i=1}^n mil(yi, xi, θ) = 0,  l = 1, . . . , L

has no solution (the equations are functionally independent) in finite samples, although

plim (1/n) ∑_{i=1}^n mil(yi, xi, θ) = E[mil(yi, xi, θ)] = 0,  l = 1, . . . , L

E.g., previous estimation of parameters of Gamma distribution

E.g., IV estimation when # instruments L > # parameters K

140

Criterion function

When L > K, to reconcile different estimates, minimize criterion function

q = m̄(θ)′ Wn m̄(θ)

where m̄(θ) = ∑_{i=1}^n mi(yi, xi, θ)/n is the L × 1 vector of moment conditions

Wn: positive definite weighting matrix, with plim Wn = W

• When Wn = I

q = m̄(θ)′ m̄(θ) = ∑_{l=1}^L m̄l(θ)²

where m̄l(θ) = ∑_{i=1}^n mil(yi, xi, θ)/n, l = 1, . . . , L

• When Wn inversely proportional to variance of m(θ) ⇒ Efficiency gains

same logic that makes GLS more efficient than OLS

141

Optimal weighting matrix

L orthogonality conditions, possibly correlated

optimal weighting matrix:

W = Asy.Var[√n m̄(θ)]−1 = Φ−1

Recall, Var[m̄(θ)] = Var[∑_{i=1}^n mi(yi, xi, θ)/n] ∈ O(1/n)

Efficient GMM estimator based on Φ−1

• When L > K, W = I (or any W ≠ Φ−1) produces inefficient estimates of θ

• When L = K ⇒ moment equations are satisfied exactly, i.e. m̄(θ̂) = 0, ⇒ q = 0 and W is irrelevant

142

Assumptions of GMM estimation

θ0 true parameter vector, K × 1

L population orthogonality conditions: E[mi(θ0)] = 0, L ≥ K

L sample moments: m̄n(θ0) = ∑_{i=1}^n mi(θ0)/n

E.g., IV estimation: m̄n(θ0) = ∑_{i=1}^n zi (yi − x′i θ0)/n,

L ≥ K instruments, one orthogonality condition for each instrument

Assumption 1: Convergence of empirical moments

Data generating process satisfies assumptions of Law of Large Numbers

m̄n(θ0) = (1/n) ∑_{i=1}^n mi(θ0)  →p  E[mi(θ0)] = 0

143

Assumptions of GMM estimation

Empirical moment equations continuous and continuously differentiable

=⇒ L×K matrix of partial derivatives

Gn(θ0) = ∂m̄n(θ0)/∂θ′0 = (1/n) ∑_{i=1}^n ∂mi(θ0)/∂θ′0  →p  G(θ0)

Law of Large Numbers apply to moments and derivatives of moments

Assumption 2: Identification

For any n > K, if θ1 ≠ θ2, then m̄n(θ1) ≠ m̄n(θ2)

plim qn(θ) = plim ( m̄n(θ)′ Wn m̄n(θ) ) has a unique minimum (= zero) at θ0

Identification ⇒ L ≥ K and rank(Gn(θ0)) = K

144

Assumptions of GMM estimation

Assumptions 1 and 2 =⇒ θ can be estimated

Assumption 3: Asymptotic distribution of empirical moments

Empirical moments obey a Central Limit Theorem

√n m̄n(θ0) →d N(0, Φ)

145

Asymptotic properties of GMM

Under previous assumptions

θ̂GMM →p θ0

θ̂GMM ∼a N(θ0, [G(θ0)′ Φ−1 G(θ0)]−1/n)

146

Consistency of GMM estimator

Recall the criterion function qn(θ) = m̄n(θ)′ Wn m̄n(θ)

Assumption 1 and continuity of the moments ⇒ qn(θ) →p q0(θ)

Wn positive definite, for any finite n:

0 ≤ qn(θ̂GMM) ≤ qn(θ0)

When n → ∞, qn(θ0) →p 0 ⇒ qn(θ̂GMM) →p 0

W positive definite and the identification assumption ⇒ θ̂GMM →p θ0

147

Asymptotic normality of GMM estimator

First order condition for the GMM estimator:

∂qn(θ̂GMM)/∂θ̂GMM = 2 Gn(θ̂GMM)′ Wn m̄n(θ̂GMM) = 0

Assumption: moment equations are continuous and continuously differentiable

Mean Value Theorem and Taylor expansion at θ0 of the moment equations:

m̄n(θ̂GMM) = m̄n(θ0) + Gn(θ̄)(θ̂GMM − θ0)

where θ0 < θ̄ < θ̂GMM componentwise. The first order condition becomes:

2 Gn(θ̂GMM)′ Wn [ m̄n(θ0) + Gn(θ̄)(θ̂GMM − θ0) ] = 0

Solving for (θ̂GMM − θ0) and multiplying by √n gives:

148

Asymptotic normality of GMM estimator

√n (θ̂GMM − θ0) = −[Gn(θ̂GMM)′ Wn Gn(θ̄)]−1 Gn(θ̂GMM)′ Wn √n m̄n(θ0)

When n → ∞

• θ̂GMM →p θ0 and θ̄ →p θ0, as θ0 < θ̄ < θ̂GMM componentwise

• Gn(θ̂GMM) →p G(θ0) and Gn(θ̄) →p G(θ0)

• Wn →p W by construction of the weighting matrix

• √n m̄n(θ0) →d N(0, Φ) by Assumption 3

√n (θ̂GMM − θ0) →d −[G(θ0)′ W G(θ0)]−1 G(θ0)′ W × N(0, Φ)

=d N(0, [G(θ0)′ W G(θ0)]−1 G(θ0)′ W Φ W′ G(θ0) [G(θ0)′ W G(θ0)]−1)

=d N(0, [G(θ0)′ Φ−1 G(θ0)]−1)  using W = Φ−1

θ̂GMM ∼a N(θ0, [G(θ0)′ Φ−1 G(θ0)]−1/n)

149

Weighting matrix

Any W positive definite matrix produces consistent GMM estimates

W determines efficiency of GMM estimator:

Optimal W = Asy.Var[√n m̄(θ)]−1 depends on the unknown θ

Feasible two-step procedure:

Step 1. Use W = I to obtain a consistent estimator, θ̂(1), then estimate

Φ̂ = (1/n) ∑_{i=1}^n mi(yi, xi, θ̂(1)) mi(yi, xi, θ̂(1))′

(when mi(yi, xi, θ0), i = 1, . . . , n is an uncorrelated sequence)

Step 2. Use W = Φ̂−1 to compute the GMM estimator
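A numpy sketch of this two-step procedure for the linear IV moment conditions m̄(θ) = (1/n) ∑ zi(yi − x′iθ), where the quadratic criterion has a closed-form minimizer (arrays y, X, Z are assumed given; all names are illustrative).

import numpy as np

def linear_gmm(y, X, Z):
    """Two-step GMM with moments z_i (y_i - x_i' theta); closed form in the linear case."""
    n = len(y)
    def gmm_step(W):
        A = X.T @ Z @ W @ Z.T @ X
        c = X.T @ Z @ W @ Z.T @ y
        return np.linalg.solve(A, c)
    theta1 = gmm_step(np.eye(Z.shape[1]))           # Step 1: W = I
    m = Z * (y - X @ theta1)[:, None]               # m_i = z_i (y_i - x_i' theta1)
    Phi = m.T @ m / n                               # estimate of Phi
    return gmm_step(np.linalg.inv(Phi))             # Step 2: W = Phi^(-1)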

150

Testing hypothesis in GMM framework

Two sets of tests:

1. Testing restrictions induced by moment equations

2. GMM counterparts to Wald, LM, and LR tests

151

Specification test

In exactly identified case, L moment equations = K parameters:

θ exists such that m(θ) = 0

In overidentified case, L moment equations > K parameters:

L−K moment equations imply moment restrictions on θ

Intuition: • K moment equations “set to zero to compute” the K parameters

• L−K “free” moment equations

Test of overidentifying restrictions, using W = Asy.Var[√n m(θ)]^{-1}:

J-stat = nq = [√n m(θ)]′ W [√n m(θ)] →d χ2(L−K)

Note: no parametric restrictions on θ in the specification test

152

Testing parametric restrictions

To test J (linear or nonlinear) parametric restrictions on θ

Given L moment equations, now only K − J free parameters

nqR = [√n m(θR)]′ W [√n m(θR)] →d χ2(L−(K−J))

nqR − nq →d χ2(J)

as for degrees of freedom: (L − (K − J)) − (L − K) = J

Note: same optimal weighting matrix W in qR and q =⇒ qR ≥ q

153

Application of GMM: Asset pricing model estimation

Asset pricing model:

E[r^e_{j,t}] = δAER βAER,j + δHML βHML,j + δCLS βCLS,j = δ′β

Stochastic Discount Factor (SDF) representation, demeaned factors ft:

mt = 1 − bAER ~AERt − bHML ~HMLt − bCLS ~CLSt = 1 − b′ ft

Euler pricing equation:

E[mt r^e_{j,t}] = 0

⇒ N moment conditions, j = 1, . . . , N

Market price of risks, δ, and SDF loadings, b:

δ = E[ft ft′] b

154

GMM estimation results

Model      (1)              (2)              (3)

δAER      11.43 (7.26)      4.34 (11.36)
δHML      18.70 (13.08)     7.48 (35.93)
δCLS      13.16 (3.35)     27.13 (17.77)
bAER       0.13 (0.05)     −0.03 (0.06)
bHML       0.05 (0.04)     −0.16 (0.07)
bCLS       0.26 (0.07)      0.75 (0.24)

J-stat     0.0467           0.0444           0.0349
p-value    6.03%            7.71%            14.14%

Table 1: Parameter estimates (Newey–West standard errors).

155

Chapter 16: Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE): very important inference method

Maximum likelihood principle:

Given sample data generated from parametric model,

find parameters that maximize probability of observing that sample

Basic, strong assumption:

DGP has parametric, known (up to θ) distribution

Fundamental result:

MLE makes “best use” of this information

156

Likelihood function

Likelihood function = probability of observing that sample

Formally, joint density of n i.i.d. observations, y1, . . . , yn

f(y1, . . . , yn; θ) = ∏_{i=1}^n f(yi; θ) = L(θ; y)

L(θ; y) is the likelihood function, with θ unknown

Log-likelihood is usually easier to deal with

lnL(θ; y) = ∑_{i=1}^n ln f(yi; θ)

157

Identification

Identification means parameters are estimable. It depends on the model

Check identification before estimating or testing the model

Definition: θ is identified (or estimable) if

L(θ; y) ≠ L(θ∗; y)   ∀ θ∗ ≠ θ and some data y

E.g. Linear regression model not identified when rank [x1, . . . , xK] < K

E.g. Threshold model for yi > 0 or yi ≤ 0

Pr(yi > 0) = Pr(β1 + β2 xi + εi > 0) = Pr(εi/σ > −(β1 + β2 xi)/σ)

not identified, σ, β1, β2 not estimable (normalization required, e.g. σ = 1)

158

Maximum likelihood estimator

Maximum likelihood estimator, θ̂, solves

θ̂ = arg max_θ L(θ; y) = arg max_θ lnL(θ; y)

or equivalently the likelihood equation

∂ lnL(θ; y)/∂θ = 0

159

Maximum likelihood estimator: Example

i.i.d. normal random variables, yi ∼ N(µ, σ2), i = 1, . . . , n

lnL(µ, σ2; y) = −(n/2) ln(2π) − (n/2) lnσ2 − (1/2) ∑_{i=1}^n (yi − µ)2/σ2

∂ lnL/∂µ = ∑_{i=1}^n (yi − µ)/σ2 = 0

∂ lnL/∂σ2 = −n/(2σ2) + (1/(2σ4)) ∑_{i=1}^n (yi − µ)2 = 0

Solve likelihood equations:

µML = (1/n) ∑_{i=1}^n yi

σ2ML = (1/n) ∑_{i=1}^n (yi − µML)2

160
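
A quick numerical check of these closed-form solutions on simulated data (a sketch; the sample and starting values are arbitrary), maximizing the log-likelihood directly with scipy:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
y = rng.normal(loc=1.5, scale=2.0, size=1000)        # hypothetical i.i.d. sample
n = y.size

# Closed-form MLE from the likelihood equations
mu_ml = y.mean()
sig2_ml = ((y - mu_ml) ** 2).mean()                  # divides by n, not n - 1

# Numerical check: minimize the negative log-likelihood over (mu, sigma^2)
def neg_loglik(theta):
    mu, sig2 = theta
    return 0.5 * n * np.log(2 * np.pi) + 0.5 * n * np.log(sig2) \
        + 0.5 * np.sum((y - mu) ** 2) / sig2

res = minimize(neg_loglik, x0=[0.0, 1.0], bounds=[(None, None), (1e-8, None)])
print(mu_ml, sig2_ml)    # closed form
print(res.x)             # numerical maximizer, should agree closely
```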

Asymptotic efficiency

An estimator is asymptotically efficient if it is

• consistent,

• asymptotically normally distributed (CAN), and has

• asy. covariance matrix not larger than that of any other CAN estimator

Under some regularity conditions, MLE is asymptotically efficient

Finite sample properties usually not optimal

E.g., σ2ML = ∑_{i=1}^n (yi − ȳ)2/n is biased (no correction for degrees of freedom)

161

Properties of MLE

Under regularity conditions, the MLE θ̂ has the following properties:

M1 Consistency: plim θ̂ = θ0

M2 Asymptotic normality: θ̂ ∼a N(θ0, {−E0[∂2 lnL/∂θ0 ∂θ′0]}^{-1})

M3 Asymptotic efficiency: θ̂ reaches the Cramer–Rao lower bound in M2

M4 Invariance: MLE of γ0 = c(θ0) is γ̂ = c(θ̂) if c ∈ C1

162

Regularity conditions on f(yi; θ)

R1 First three derivatives of ln f(yi; θ) w.r.t. θ are continuous and finite ∀θ

R2 Conditions for E[∂ ln f(yi; θ)/∂θ] <∞, E[∂2 ln f(yi; θ)/∂θ ∂θ′] <∞ hold

R3 |∂3 ln f(yi; θ)/∂θj ∂θk ∂θl| < h, where E[h] <∞, ∀θ

Definition: Regular densities satisfy R1–R3

Goals: use Taylor approximation; interchange differentiation and expectation

Notation: gradient gi = ∂ ln f(yi; θ)/∂θ, Hessian Hi = ∂2 ln f(yi; θ)/∂θ ∂θ′

163

Properties of regular densities

Moments of derivatives of log-likelihood:

D1 ln f(yi; θ), gi, Hi, i = 1, . . . , n are random samples

D2 E0[gi(θ0)] = 0

D3 Var0[gi(θ0)] = −E0[Hi(θ0)]

D1 implied by assumption: yi, i = 1, . . . , n is random sample

To prove D2: by definition, 1 = ∫ f(yi; θ0) dyi

∂1/∂θ0 = (∂/∂θ0) ∫ f(yi; θ0) dyi

0 = ∫ ∂f(yi; θ0)/∂θ0 dyi = ∫ [∂ ln f(yi; θ0)/∂θ0] f(yi; θ0) dyi = E0[gi(θ0)]

164

Information matrix equality

To prove D3: differentiate previous integral once more w.r.t. θ0

∂0/∂θ′0 = (∂/∂θ′0) ∫ [∂ ln f(yi; θ0)/∂θ0] f(yi; θ0) dyi

0 = ∫ [ (∂2 ln f(yi; θ0)/∂θ0 ∂θ′0) f(yi; θ0) + (∂ ln f(yi; θ0)/∂θ0) (∂f(yi; θ0)/∂θ′0) ] dyi

  = ∫ [ (∂2 ln f(yi; θ0)/∂θ0 ∂θ′0) f(yi; θ0) + (∂ ln f(yi; θ0)/∂θ0) (∂ ln f(yi; θ0)/∂θ′0) f(yi; θ0) ] dyi

  = E0[Hi(θ0)] + Var0[gi(θ0)] =⇒ D3

D1 (random sample) ⇒ Var0[∑_{i=1}^n gi(θ0)] = ∑_{i=1}^n Var0[gi(θ0)]

Var0[∑_{i=1}^n gi(θ0)] = Var0[∂ lnL(θ0; y)/∂θ0] = −E0[∂2 lnL(θ0; y)/∂θ0 ∂θ′0] = −E0[∑_{i=1}^n Hi(θ0)]

(Information matrix equality)

165

Likelihood equation

Score vector at θ:

g = ∂ lnL(θ; y)/∂θ = ∑_{i=1}^n ∂ ln f(yi; θ)/∂θ = ∑_{i=1}^n gi

D1 (random sample) and D2 (E0[gi(θ0)] = 0) ⇒ Likelihood equation at θ0:

E0[∂ lnL(θ0; y)/∂θ0] = 0

166

Consistency of MLE

In any finite sample, lnL(θ̂) ≥ lnL(θ0) (and in general ∀ θ ≠ θ̂, not only θ0)

From Jensen’s inequality, if θ ≠ θ0 (and in general ∀ θ ≠ θ0, not only θ̂)

E0[ln(L(θ)/L(θ0))] < ln E0[L(θ)/L(θ0)] = ln ∫ (L(θ)/L(θ0)) L(θ0) dy = ln 1 = 0

E0[lnL(θ)/n] < E0[lnL(θ0)/n]   (♣)

Under previous assumptions, using the inequality in the very first row:

plim lnL(θ̂)/n ≥ plim lnL(θ0)/n

E0[lnL(θ̂)/n] ≥ E0[lnL(θ0)/n]

and combining with (♣): E0[lnL(θ0)/n] > E0[lnL(θ̂)/n] ≥ E0[lnL(θ0)/n]

⇒ plim lnL(θ̂)/n = E0[lnL(θ0)/n] and plim θ̂ = θ0

167

Asymptotic normality of MLE

MLE solves the sample likelihood equation: g(θ̂) = ∑_{i=1}^n gi(θ̂) = 0

First order Taylor expansion: g(θ̂) = g(θ0) + H(θ̄)(θ̂ − θ0) = 0

As θ̄ = w θ0 + (1 − w) θ̂, 0 < w < 1, plim θ̂ = θ0 ⇒ plim θ̄ = θ0

Hessian is continuous in θ. Rearranging, scaling by √n, taking limit n → ∞

√n (θ̂ − θ0) = −H(θ̄)^{-1} √n g(θ0) = [−(1/n) ∑_{i=1}^n Hi(θ̄)]^{-1} √n (1/n) ∑_{i=1}^n gi(θ0)

  →d  {−E0[(1/n) ∑_{i=1}^n Hi(θ0)]}^{-1} × N(0, −E0[(1/n) ∑_{i=1}^n Hi(θ0)])

  =d  N(0, {−E0[(1/n) ∑_{i=1}^n Hi(θ0)]}^{-1})

θ̂ ∼a N(θ0, {−E0[H(θ0)/n]}^{-1}/n) = N(θ0, I(θ0)^{-1})

168

Asymptotic efficiency

Cramer–Rao lower bound:

Assume that f(yi; θ0) satisfies regularity conditions R1–R3; then

the asymptotic variance of a consistent and asy. normally distributed

estimator of θ0 is at least as large as

I(θ0)^{-1} = {−E0[∂2 lnL(θ0)/∂θ0 ∂θ′0]}^{-1}

Asymptotic variance of MLE reaches the Cramer–Rao lower bound

169

Invariance

MLE of γ0 = c(θ0) is γ̂ = c(θ̂) if c ∈ C1

MLE invariant to one-to-one transformation

Useful application: lnL(θ0) can be “complicated” function of θ0

re-parameterize the model to simplify calculations using lnL(γ0)

E.g. Normal log-likelihood, precision parameter γ2 = 1/σ2

lnL(µ, γ2; y) = −(n/2) ln(2π) + (n/2) ln γ2 − (γ2/2) ∑_{i=1}^n (yi − µ)2

∂ lnL/∂γ2 = (n/2)(1/γ2) − (1/2) ∑_{i=1}^n (yi − µ)2 = 0

γ2ML = n / ∑_{i=1}^n (yi − µML)2 = 1/σ2ML

170

Estimating asymptotic covariance matrix of MLE

Asy.Var[θ̂] depends on θ0. Three estimators, asymptotically equivalent:

1. Calculate E0[H(θ0)] (very difficult) and evaluate it at θ̂ to estimate

I(θ0)^{-1} = {−E0[∂2 lnL(θ0)/∂θ0 ∂θ′0]}^{-1}

2. Calculate H(θ0) (still quite difficult) and evaluate it at θ̂ to get

Î(θ̂)^{-1} = [−∂2 lnL(θ̂)/∂θ ∂θ′]^{-1} = [−∑_{i=1}^n Hi(θ̂)]^{-1}

3. BHHH or OPG estimator (very easy): use D3, E0[−Hi(θ0)] = Var0[gi(θ0)], to estimate I(θ0)^{-1} by

[∑_{i=1}^n gi(θ̂) gi(θ̂)′]^{-1}

171
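
As an illustration (not part of the original slides), a sketch comparing estimators 2 and 3 on the normal example of slide 160, using the analytic per-observation gradients and Hessians; on a large simulated sample the two covariance estimates should be close:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(1.0, 2.0, size=2000)       # hypothetical sample
n = y.size

# MLE of theta = (mu, sigma^2)
mu = y.mean()
s2 = ((y - mu) ** 2).mean()
d = y - mu

# Per-observation gradients gi of ln f(yi; mu, sigma^2)
g = np.column_stack([d / s2, -0.5 / s2 + 0.5 * d**2 / s2**2])

# Summed Hessian of ln L, evaluated at the MLE
H_sum = np.array([[-n / s2,           -d.sum() / s2**2],
                  [-d.sum() / s2**2,   n / (2 * s2**2) - (d**2).sum() / s2**3]])

V_hessian = np.linalg.inv(-H_sum)     # estimator 2: inverse of the negative Hessian
V_opg = np.linalg.inv(g.T @ g)        # estimator 3: BHHH / outer product of gradients

print(V_hessian)
print(V_opg)
```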

Conditional likelihood

Econometric models involve exogenous variables xi ⇒ yi not i.i.d.

E.g. Model: yi = x′i β + εi, xi can be stochastic, correlated across i’s, etc.

Usually f(y; θ) is not of interest, and the joint density f(y, x) generating the data is not known

Way out: DGP of xi exogenous and well-behaved (LLN applies),

xi ∼ f(xi; δ), θ and δ have no common elements, no restrictions between θ and δ

f(yi, xi; θ, δ) = f(yi|xi; θ) f(xi; δ)

lnL(θ, δ; y, x) = ∑_{i=1}^n ln f(yi|xi; θ) + ∑_{i=1}^n ln f(xi; δ)

θML = arg max_θ ∑_{i=1}^n ln f(yi|xi; θ)

172

Maximizing log-likelihood

Log-likelihoods are typically highly nonlinear functions of parameters

E.g., GARCH-in-mean model for an asset return, yt = pt/pt−1 − 1 = Et−1[yt] + εt

with Et−1[yt] = γ0 + γ1 σ2t and Vart−1[yt] = σ2t = β0 + β1 ε2t−1 + β2 σ2t−1

lnL = −0.5 ∑_{t=1}^T [ln(2π) + lnσ2t + (yt − γ0 − γ1 σ2t)2/σ2t]

Maximizing log-likelihood is a numerical problem, various methods:

• “Brute force” (but using good routines, e.g. FMINSEARCH in Matlab)

• Newton’s method: θ(i+1) = θ(i) − H(i)^{-1} g(i), use actual Hessian

• Score method: θ(i+1) = θ(i) − E[H(i)]^{-1} g(i), use expected Hessian

173
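
To illustrate the Newton update θ(i+1) = θ(i) − H(i)^{-1} g(i), here is a sketch on a simpler stand-in likelihood than the GARCH example above (a Poisson regression with simulated data), where the score and Hessian have simple closed forms:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical Poisson regression: ln L = sum_i [ yi xi'b - exp(xi'b) - ln(yi!) ]
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, -0.8])
y = rng.poisson(np.exp(X @ beta_true))

def score_hessian(beta):
    lam = np.exp(X @ beta)
    g = X.T @ (y - lam)                      # score g(beta)
    H = -(X * lam[:, None]).T @ X            # actual Hessian H(beta)
    return g, H

beta = np.zeros(2)                            # starting value
for _ in range(25):
    g, H = score_hessian(beta)
    step = np.linalg.solve(H, g)
    beta = beta - step                        # Newton's method
    if np.max(np.abs(step)) < 1e-10:
        break

print(beta)    # should be close to beta_true
```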

Hypothesis testing

Test of hypothesis H0 : c(θ) = 0

Three tests, asymptotically equivalent (not in finite sample)

• Likelihood ratio: If c(θ) = 0, then lnLU − lnLR ≈ 0

Both unrestricted (ML) and restricted estimators are required

• Wald test: If c(θ) = 0, then c(θML) ≈ 0

Only unrestricted (ML) estimator is required

• Lagrange multiplier test: If c(θ) = 0, then ∂ lnL/∂θR ≈ 0

Only restricted estimator is required

174

Likelihood ratio test

LU = L(θU), where θU is MLE, unrestricted

LR = L(θR), where θR is restricted estimator

Likelihood ratio: LR/LU

0 ≤ LR/LU ≤ 1

Limiting distribution of likelihood ratio: 2 (lnLU − lnLR) ∼ χ2(df)

with df = # of restrictions

Remarks:

• LR test cannot be used to test two restricted models, θU must be MLE

• Likelihood function L must be the same in LU and LR

175

Wald test

Wald test based on full rank quadratic forms

Recall: If x ∼ N(µ, Σ), the quadratic form (x − µ)′ Σ^{-1} (x − µ) ∼ χ2(J)

If E[x] ≠ µ, (x − µ)′ Σ^{-1} (x − µ) ∼ noncentral χ2(J) (> χ2(J) on average)

If H0 : c(θ) = q is true, c(θML) − q ≈ 0 (not “= 0”, due to sampling variability)

If H0 : c(θ) = q is false, c(θML) − q ≪ 0 or ≫ 0

Wald test statistic:

W = [c(θML) − q]′ Asy.Var[c(θML) − q]^{-1} [c(θML) − q] ∼ χ2(df)

with df = # of restrictions

Drawbacks: H1 is not used ⇒ limited power; not invariant to how the restrictions are formulated

176

Lagrange multiplier test

Lagrange multiplier (or score) test based on restricted model

Restrictions H0 : c(θ) = q, Lagrangean: lnL(θ) + (c(θ)− q)′λ

First order conditions for restricted θ, i.e. θR:

∂ lnL(θ)/∂θ + [∂c(θ)′/∂θ] λ = 0

If restrictions not binding ⇒ λ = 0 (the first term is the score, zero at the MLE), and λ = 0 can be tested.

Simpler, equivalent approach: at restricted maximum

∂ lnL(θR)/∂θR + [∂c(θR)′/∂θR] λ = 0   =⇒   −[∂c(θR)′/∂θR] λ = ∂ lnL(θR)/∂θR = gR

Under H0 : λ = 0, gR = ∑_{i=1}^n gi(θR) = 0

Recall, Var0[∑_{i=1}^n gi(θ0)] = −E0[∂2 lnL/∂θ0 ∂θ′0] = I(θ0)

177

Lagrange multiplier test statistic

As in Wald test, LM statistic is a full rank quadratic form:

LM = [∂ lnL(θR)/∂θR]′ I(θR)^{-1} [∂ lnL(θR)/∂θR] ∼ χ2(df)

with df = # of restrictions

Alternative calculation of LM test: define G′R = [g1(θR), . . . , gn(θR)] (K × n),

regress a column of 1s, i, on GR ⇒ slope bi = (G′R GR)^{-1} G′R i

uncentered R2 = î′î/(i′i) = (GR bi)′(GR bi)/(i′i)

  = [i′GR (G′R GR)^{-1} G′R GR (G′R GR)^{-1} G′R i]/n

  = [i′GR (G′R GR)^{-1} G′R i]/n = LM/n

(using the OPG estimate G′R GR of I(θR))

178
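
A small numerical check of the two equivalent computations (a sketch, using the normal example of slide 160 with the hypothetical restriction µ = 0 and the OPG estimate G′R GR of I(θR)):

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(0.1, 1.0, size=500)     # hypothetical sample; H0: mu = 0
n = y.size

# Restricted MLE under H0: mu = 0
mu_r = 0.0
s2_r = np.mean((y - mu_r) ** 2)
d = y - mu_r

# Per-observation scores gi(theta_R) for theta = (mu, sigma^2), stacked as G_R (n x K)
G = np.column_stack([d / s2_r, -0.5 / s2_r + 0.5 * d**2 / s2_r**2])

# LM statistic with I(theta_R) estimated by the OPG, G'G
g_R = G.sum(axis=0)
LM = g_R @ np.linalg.solve(G.T @ G, g_R)

# Equivalent: n times the uncentered R^2 from regressing a column of ones on G
ones = np.ones(n)
b = np.linalg.lstsq(G, ones, rcond=None)[0]
R2_u = (G @ b) @ (G @ b) / (ones @ ones)

print(LM, n * R2_u)    # identical up to rounding; compare with a chi2(1) critical value
```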

Application of MLE: Linear regression model

Model: yi = x′i β + εi, and yi|xi ∼ N(x′i β, σ2)

Log-likelihood based on n conditionally independent observations:

lnL = −(n/2) ln(2π) − (n/2) lnσ2 − (1/2) ∑_{i=1}^n (yi − x′i β)2/σ2

    = −(n/2) ln(2π) − (n/2) lnσ2 − (y − Xβ)′(y − Xβ)/(2σ2)

Likelihood equations:

∂ lnL/∂β = X′(y − Xβ)/σ2 = 0

∂ lnL/∂σ2 = −n/(2σ2) + (y − Xβ)′(y − Xβ)/(2σ4) = 0

179

MLE of linear regression model

Solving likelihood equations:

βML = (X′X)^{-1} X′y   and   σ2ML = e′e/n

βML = b =⇒ OLS has all desirable asymptotic properties of MLE

σ2ML ≠ s2 = e′e/(n−K) =⇒ σ2ML biased in finite samples, but

E[σ2ML] = E[(n−K) s2/n] = (n−K) σ2/n −→ σ2,  n → ∞

Cramer–Rao lower bound for θML = (β′ML, σ2ML)′ can be computed explicitly:

I(θ)^{-1} = {−E[∂2 lnL(θ)/∂θ ∂θ′]}^{-1} = [ σ2 (X′X)^{-1}   0 ;  0′   2σ4/n ]

180
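
A minimal sketch of these formulas on simulated data (design matrix, coefficients, and sample size are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)

n, K = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

# MLE: beta_ML = OLS, sigma^2_ML divides by n (biased in finite samples)
beta_ml = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_ml
sig2_ml = e @ e / n
s2 = e @ e / (n - K)                       # degrees-of-freedom corrected variant

# Cramer-Rao lower bound (block diagonal)
V_beta = sig2_ml * np.linalg.inv(X.T @ X)
V_sig2 = 2 * sig2_ml**2 / n

print(beta_ml, sig2_ml, s2)
print(np.sqrt(np.diag(V_beta)), np.sqrt(V_sig2))
```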

MLE and Wald test

Testing J (possibly nonlinear) restrictions, H0 : c(β) = 0 vs. H1 : c(β) ≠ 0

Idea: check whether unrestricted estimator (i.e. MLE) “satisfies” restrictions

Under H0, Wald statistic

W = c(b)′ { [∂c(b)/∂b′] [σ2 (X′X)^{-1}] [∂c(b)′/∂b] }^{-1} c(b) →d χ2(J)

where σ2 (X′X)^{-1} = Asy.Var[b], using the Delta method:

c(b) ≈ c(β) + [∂c(β)/∂β′] (b − β)

Asy.Var[c(b)] = [∂c(β)/∂β′] Asy.Var[b] [∂c(β)′/∂β]

and plim b = β, plim c(b) = c(β), plim ∂c(b)/∂b′ = ∂c(β)/∂β′

181
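
A sketch of the Wald statistic for one hypothetical nonlinear restriction, c(β) = β1 β2 − 1 = 0, on simulated data (the restriction and numbers are chosen only for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.3, 2.0, 0.5])        # satisfies beta1 * beta2 = 1
y = X @ beta_true + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)        # MLE / OLS
e = y - X @ b
sig2 = e @ e / n
avar_b = sig2 * np.linalg.inv(X.T @ X)       # Asy.Var[b]

c = np.array([b[1] * b[2] - 1.0])            # c(b), J = 1 restriction
dc = np.array([[0.0, b[2], b[1]]])           # dc(b)/db', used via the Delta method

W = c @ np.linalg.solve(dc @ avar_b @ dc.T, c)
print(W, stats.chi2.sf(W, df=1))             # Wald statistic and p-value
```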

MLE and Likelihood ratio test

Testing J (possibly nonlinear) restrictions, H0 : c(β) = 0 vs. H1 : c(β) ≠ 0

Idea: check whether unrestricted L “significantly” larger than restricted L∗

Likelihood ratio test: b unrestricted, b∗ restricted slopes

FOC of σ2 implies: Est.[σ2] = (y −Xβ)′(y −Xβ)/n, with β = b or b∗

LR = 2 [lnL − lnL∗]

   = [−n lnσ2 − (y − Xb)′(y − Xb)/σ2] − [−n lnσ2∗ − (y − Xb∗)′(y − Xb∗)/σ2∗]

   = n lnσ2∗ − n lnσ2  →d  χ2(J)

plugging σ2 into lnL and σ2∗ into lnL∗ (i.e. concentrating the log-likelihood)

⇒ the second terms in square brackets both equal n and cancel out

182
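
A sketch of the concentrated-likelihood LR statistic, testing one hypothetical zero restriction on simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)   # H0: last coefficient = 0

def sigma2_hat(Z):
    """Concentrated MLE of sigma^2 from regressing y on Z."""
    coef = np.linalg.lstsq(Z, y, rcond=None)[0]
    resid = y - Z @ coef
    return resid @ resid / n

s2_u = sigma2_hat(X)             # unrestricted
s2_r = sigma2_hat(X[:, :2])      # restricted: drop the last regressor

LR = n * np.log(s2_r) - n * np.log(s2_u)
print(LR, stats.chi2.sf(LR, df=1))
```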

MLE and Lagrange multiplier test

Testing J (possibly nonlinear) restrictions, H0 : c(β) = 0 vs. H1 : c(β) ≠ 0

Idea: gradient of lnL at restricted maximum, gR, should be “close” to zero

From the Lagrangean: gR(β) = ∂ lnL(β)/∂β = −[∂c(β)′/∂β] λ

Under H0 : λ = 0, E0[gR(β)] = E0[X′ε/σ2] = 0 ⇒ X′e∗ ≈ 0

Lagrange multiplier: apply Wald-type test to restricted gradient of lnL

LM = e′∗ X (Est.Var[X′ε])^{-1} X′ e∗

   = e′∗ X (σ2∗ X′X)^{-1} X′ e∗

   = [e′∗ X (X′X)^{-1} X′ e∗] / (e′∗ e∗/n) = n R2∗  →d  χ2(J)

R2∗ is the R2 from the regression of the restricted residuals e∗ = y − Xb∗ on X

Intuition: if restrictions not binding, b∗ = b, e∗ ⊥ X, LM = 0

183
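
A sketch of the nR2∗ computation for the same kind of hypothetical zero restriction (one restricted regression plus one auxiliary regression of the restricted residuals on the full X):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)   # H0: last coefficient = 0

# Restricted OLS: impose the restriction by dropping the last column
X_r = X[:, :2]
b_r = np.linalg.lstsq(X_r, y, rcond=None)[0]
e_r = y - X_r @ b_r                                      # restricted residuals e*

# Auxiliary regression of e* on the full X; LM = n * R^2
fitted = X @ np.linalg.solve(X.T @ X, X.T @ e_r)
R2 = (fitted @ fitted) / (e_r @ e_r)      # e* has zero mean (constant kept in X_r)
LM = n * R2
print(LM, stats.chi2.sf(LM, df=1))
```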

Pseudo Maximum Likelihood estimation

ML requires complete specification of f(yi|xi; θ)

What if the density is misspecified?

Under certain conditions, the estimator retains some good properties

even if the wrong likelihood is maximized

E.g., In the model yi = x′i β + εi, OLS is MLE when εi ∼ N(0, σ2)

but under certain conditions LS is still consistent, even when εi ≁ N(0, σ2)

When εi ≁ N(0, σ2), OLS is maximizing the wrong likelihood

Key point: OLS solves normal equations E[xi(yi − x′i β)] = 0

These equations might hold even when εi ≁ N(0, σ2)

184

Pseudo Maximum Likelihood estimator

θML = arg max_θ ∑_{i=1}^n ln f(yi|xi; θ), where f(yi|xi; θ) is the true p.d.f.

θPML = arg max_θ ∑_{i=1}^n ln h(yi|xi; θ), where h(yi|xi; θ) ∈ Exponential family

Key point: possibly h(yi|xi; θ) ≠ f(yi|xi; θ)

If h(yi|xi; θ) = f(yi|xi; θ), then θPML = θML

E.g., Estimate θ when f(yi|xi; θ) = N(x′i θ, σ2i) and h(yi|xi; θ) = N(x′i θ, σ2)

PML estimator solves first order conditions:

(1/n) ∑_{i=1}^n ∂ ln h(yi|xi; θPML)/∂θPML = 0

185

Asymptotic distribution of PML estimator

Usual method: first order Taylor expansion of FOC, mean value theorem,

rearrange to have (θPML − θ0) on the LHS, scale by √n, take limit n → ∞:

√n (θPML − θ0) = [−(1/n) ∑_{i=1}^n ∂2 ln h(yi|xi; θ̄)/∂θ ∂θ′]^{-1} √n (1/n) ∑_{i=1}^n ∂ ln h(yi|xi; θ0)/∂θ0

  →d  −H(θ0)^{-1} × N(0, Φ)  =d  N(0, H(θ0)^{-1} Φ H(θ0)^{-1})

θPML ∼a N(θ0, H(θ0)^{-1} Φ H(θ0)^{-1}/n)

If h(yi|xi; θ0) is true p.d.f., then information matrix equality holds,

Φ = −H(θ0), and θPML = θML

θML ∼a N(θ0, −H(θ0)^{-1}/n)

186

Estimator of Asy.Var[θPML]

Sandwich (or robust) estimator of

Asy.Var[θPML] = H(θ0)^{-1} Φ H(θ0)^{-1}/n

based on

• Empirical counterpart (no expectation) of the Hessian H(θ0):

Est.[H(θ0)] = (1/n) ∑_{i=1}^n ∂2 ln h(yi|xi; θPML)/∂θPML ∂θ′PML

• Sample variance of gradients:

Est.[Φ] = (1/n) ∑_{i=1}^n [∂ ln h(yi|xi; θPML)/∂θPML] [∂ ln h(yi|xi; θPML)/∂θ′PML]

187
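
A sketch of the sandwich estimator for the Gaussian pseudo-likelihood of a linear model with (simulated) heteroscedastic errors; under these particular assumptions it reduces to the familiar White-type robust covariance for OLS, while the naive MLE covariance does not account for the misspecification:

```python
import numpy as np

rng = np.random.default_rng(9)

# Hypothetical data: true errors are heteroscedastic, so h = N(xi'b, s2) is misspecified
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + (0.5 + np.abs(X[:, 1])) * rng.normal(size=n)

# PML under the (wrong) homoscedastic Gaussian likelihood = OLS
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / n

# Per-observation gradient of ln h w.r.t. beta: xi * ei / s2
G = X * (e / s2)[:, None]

# Sandwich: Est.[H]^{-1} Est.[Phi] Est.[H]^{-1} / n, with Est.[H] = -X'X / (n s2)
H_hat = -(X.T @ X) / (n * s2)
Phi_hat = G.T @ G / n
H_inv = np.linalg.inv(H_hat)
V_sandwich = H_inv @ Phi_hat @ H_inv / n

V_naive = s2 * np.linalg.inv(X.T @ X)     # valid only if h were the true density

print(np.sqrt(np.diag(V_sandwich)))       # robust (sandwich) standard errors
print(np.sqrt(np.diag(V_naive)))          # non-robust standard errors
```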

Remarks on PML estimation

In general, maximizing wrong likelihoods gives inconsistent estimates

(in those cases, sandwich estimator of Asy.Var[θPML] useless)

Under certain conditions, θPML robust to some model misspecification

Major advantage of PML: if h(yi|xi; θ0) is true p.d.f., then θPML = θML

(in those cases, sandwich estimator should not be used)

Typical application of PML in Finance: daily asset returns are not normal,

but GARCH volatility models typically estimated using Gaussian likelihoods

188

Summary of the course

• Linear regression model: OLS estimator, specification and hypothesis testing

• Generalized regression model: heteroscedastic data, GLS estimator

• Panel data model: Fixed and Random effects, Hausman’s specification test

• Instrumental variables: regressors correlated with disturbances

• Generalized method of moments: general framework for inference, weak

assumptions

• Maximum likelihood estimation: assume parametric DGP, best use of this

information

• Hypothesis testing: Likelihood ratio, Wald, Lagrange multiplier tests

189