Bradford S. Jones, UC-Davis, Dept. of Political Science
Statistical Inference for the Linear Regression Model
Brad Jones1
1 Department of Political Science, University of California, Davis
January 22, 2010
Jones POL 212: Research Methods
Today: Variance Components and Inference
Variance Components from a Regression Model
- Useful to think about the "standard error" of the regression.
- Quantity minimized: $\sum r_i^2$
- Suppose we compute the variance of the residuals:
$$\mathrm{var}(r) = \frac{\sum r_i^2}{n - k - 1}$$
- Why $n - k - 1$? [These are the consumed degrees of freedom. Note again what must happen as $k \to n$.]
Variance Components from a Regression Model
- Since the variance gives us the average squared deviation between the observed $Y$ and $\hat{Y}$, we take the square root:
$$\mathrm{s.e.}(r) = \sqrt{\frac{\sum r_i^2}{n - k - 1}}$$
- This gives us the standard error of the regression.
- . . . or the "average prediction error." The smaller the residual component, the smaller the s.e. of the regression.
- For the pedagogical regression using the calcount data, the s.e. is about 6.15.
- Average prediction error in the model is about 6 percentage points.
- The s.e. is in the units of $Y$, so it is easy to interpret.
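The standard error of the regression can be sketched numerically. The data below are hypothetical (not the calcount data), generated so that the true error s.d. is near the 6.15 reported on the slides; the slope/intercept formulas are the usual bivariate least squares estimators.

```python
import numpy as np

# Hypothetical bivariate data (not the calcount data from the slides).
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=58)
y = 100 - 0.9 * x + rng.normal(0, 6, size=58)

# Bivariate least squares fit: slope = Sxy / Sxx, intercept from the means.
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()

r = y - (b0 + b1 * x)                       # residuals
k = 1                                       # one regressor
var_r = np.sum(r**2) / (len(y) - k - 1)     # divide by n - k - 1
se_reg = np.sqrt(var_r)                     # standard error of the regression
```

With an intercept in the model, the residuals sum to zero exactly, and `se_reg` lands near the true error s.d. of 6.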
Variance Components from a Regression Model
- It is clear the residual sum of squares is just one piece of the overall variance in the regression model.
- If the RSS gives us "error variance," then what informs us about predictive improvement over and above the mean?
- Recall that if $\beta_j = 0$ then $\hat{\beta}_0 = \bar{Y}$.
- Deviations of the predictions $\hat{Y}$ from the mean $\bar{Y}$ tell us the improvement gained by using $X$ to predict $Y$ over simply guessing the mean every time.
- The calcount data:
Regression Model
Variance Components from a Regression Model
- So the deviation $\hat{Y}_i - \bar{Y}$ gives the signed difference between the predicted value and the mean.
- Intuition: if the fitted values do not depart from the mean, $X$ is not doing a "good job" of predicting $Y$.
- Square and sum:
$$\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$$
- This is the regression sum of squares (or sum of squares due to the regression). Fox refers to it as RegSS.
- It should be clear that the sum of RSS and RegSS accounts for the total variance in the model.
Variance Components from a Regression Model
- Total Sum of Squares (TSS):
$$\mathrm{TSS} = \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{Y})^2 = \sum (Y_i - \bar{Y})^2 = \mathrm{RSS} + \mathrm{RegSS}$$
- This shows us again that the regression function must pass through the point of averages.
- From these variance components, an intuitive fit measure emerges:
$$R^2 = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{Y})^2} = \frac{\mathrm{RegSS}}{\mathrm{TSS}}$$
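The decomposition can be checked numerically. A minimal sketch with hypothetical data: fit a bivariate OLS line with an intercept and confirm that TSS = RSS + RegSS and that $R^2$ = RegSS/TSS.

```python
import numpy as np

# Hypothetical data: verify TSS = RSS + RegSS and R^2 = RegSS / TSS
# for an OLS fit that includes an intercept.
rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 2 + 3 * x + rng.normal(size=50)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

rss = np.sum((y - yhat) ** 2)            # residual sum of squares
regss = np.sum((yhat - y.mean()) ** 2)   # regression sum of squares
tss = np.sum((y - y.mean()) ** 2)        # total sum of squares

r2 = regss / tss
```

The identity holds only because the model includes an intercept; dropping it breaks the decomposition.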
Variance Components from a Regression Model
- In multiple regression, this is the squared multiple correlation or, equivalently, the square of $r_{Y\hat{Y}}$.
- Obama model: RSS = 2119; RegSS = 7667.
- $R^2 = 7667/(2119 + 7667) \approx .78$.
- In terms of the total variance in the model, about 78 percent is accounted for by the linear regression of votes on Prop. 8 support (n.b.).
- Issues with the $R^2$:
Variance Components from a Regression Model
- The $R^2$ is nondecreasing in $X$ (why must this be the case? be able to show this mathematically).
- Usually better to use the "adjusted $R^2$":
$$\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - k - 1} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \times \frac{df_{\mathrm{TSS}}}{df_{\mathrm{RSS}}} = 1 - \frac{\mathrm{RSS}/(n - k - 1)}{\mathrm{TSS}/(n - 1)} \quad (1)$$
- The degrees of freedom are used as a correction factor.
- In passing, note that $\bar{R}^2$ can be negative (see next slide).
- If $R^2$ is nondecreasing, it is not very useful for model comparisons.
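The adjusted $R^2$ can be computed directly from the variance components reported in the slides' Obama-model output (RSS = 2119 on 56 df, RegSS = 7667, so n = 58 and k = 1):

```python
# Adjusted R^2 from the Obama-model ANOVA components on the slides:
# RSS = 2119, RegSS = 7667, n = 58, k = 1.
rss, regss = 2119.0, 7667.0
n, k = 58, 1
tss = rss + regss

r2 = regss / tss                                    # about 0.783
adj_r2 = 1 - (rss / (n - k - 1)) / (tss / (n - 1))  # about 0.780
```

Both values match the "Multiple R-squared: 0.783, Adjusted R-squared: 0.78" line in the R output, and the two forms of the adjustment in equation (1) agree term by term.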
Adj. R2 from Regression Model
Properties
- What are some properties of the model?
- Start with the assumption that the error is not systematic, implying:
$$E(\varepsilon_i) = E(\varepsilon_i \mid X) = 0 \quad (2)$$
- Linearity: $E(Y)$ is a linear function of $X_k$:
$$\mu_i = E(Y \mid x_i) = E(\beta_0 + \beta_1 x_i + \varepsilon_i) = \beta_0 + \beta_1 x_i + E(\varepsilon_i) = \beta_0 + \beta_1 x_i + 0 = \beta_0 + \beta_1 x_i \quad (3)$$
Assumptions and Properties
- Homoskedasticity (constant variance):
$$\mathrm{var}(\varepsilon_i \mid X_i) = E\{[\varepsilon_i - E(\varepsilon_i)]^2 \mid X_i\} = E(\varepsilon_i^2 \mid X_i) \;(\text{Why?})\; = \sigma^2 \quad (4)$$
- This implies the variance of $\varepsilon_i$ for each $X_i$ is equal to some positive constant (which is equal to $\sigma^2$).
- (Q: Since we usually do not observe $\sigma^2$ directly, what do you think is used as its estimator?)
- When this assumption does not hold, we have a condition known as heteroskedasticity, and the variance is equal to $\sigma_i^2$.
- Why might you care about this assumption?
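A quick simulation sketch makes the contrast concrete. All values below are hypothetical: one error series has constant variance, the other has a standard deviation that grows with $X$ (so its variance is $\sigma_i^2$), and we compare the error spread in the lower and upper halves of the $X$ range.

```python
import numpy as np

# Simulate homoskedastic vs. heteroskedastic errors (hypothetical values).
rng = np.random.default_rng(2)
x = np.linspace(1, 100, 500)

e_homo = rng.normal(0, 5, size=x.size)   # constant sd: var = sigma^2
e_hetero = rng.normal(0, 0.1 * x)        # sd grows with X: var = sigma_i^2

# Compare error spread in the lower and upper halves of the X range.
lo, hi = x < 50, x >= 50
spread_homo = e_homo[hi].std() / e_homo[lo].std()      # near 1
spread_hetero = e_hetero[hi].std() / e_hetero[lo].std()  # well above 1
```

Under homoskedasticity the two spreads are about equal; under heteroskedasticity the upper half is noticeably noisier, which is what residual-versus-$X$ diagnostic plots look for.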
Assumptions and Properties
- Independence assumptions:
$$\mathrm{cov}(\varepsilon_i, \varepsilon_j \mid X_i, X_j) = E[\varepsilon_i - E(\varepsilon_i) \mid X_i][\varepsilon_j - E(\varepsilon_j) \mid X_j] = E(\varepsilon_i \mid X_i)(\varepsilon_j \mid X_j) \;(\text{Why?})\; = 0 \quad (5)$$
- This implies that there is no correlation of the disturbances across the observations.
- Wrt sampling, the observations are sampled independently.
- Problem with time-series data: if $\varepsilon_{ti}$ and $\varepsilon_{t-1,i}$ are positively correlated, then $Y$ is a function of not only $X_i$ and $\varepsilon_{ti}$, but also $\varepsilon_{t-1,i}$.
Assumptions and Properties
I Xk are fixed in repeated sampling.
I Very strong assumption!
I Experimental designs (in principle) will satisfy this condition. . .
I Unfortunately, we often work with observational data.
- This is why causal inference is difficult (or at least one reason why).
Assumptions and Properties
- Covariance result: the covariance between $\varepsilon_i$ and $X_i$ is 0:
$$\mathrm{cov}(\varepsilon_i, X_i) = E[\varepsilon_i - E(\varepsilon_i)][X_i - E(X_i)] = E[\varepsilon_i (X_i - E(X_i))] \;(\text{Why?})\; = E(\varepsilon_i X_i) - E(X_i)E(\varepsilon_i) = E(\varepsilon_i X_i) = 0. \quad (6)$$
- The import of it is to say that the unsystematic component (given by $\varepsilon_i$) is not related to the systematic component (given by the $X_i$).
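The sample analogue of this result holds exactly for an OLS fit: with an intercept in the model, the residuals have zero sample covariance with $X$ by construction. A sketch with hypothetical data:

```python
import numpy as np

# For an OLS fit with an intercept, the sample covariance between the
# residuals and X is exactly zero (sample analogue of cov(eps_i, X_i) = 0).
rng = np.random.default_rng(3)
x = rng.normal(size=40)
y = 1 + 2 * x + rng.normal(size=40)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

cov_rx = np.cov(resid, x, ddof=1)[0, 1]   # numerically zero
```

This is why the in-sample covariance cannot be used to *test* the population assumption: least squares forces it to zero no matter what.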
Assumptions and Properties
- $X$ is not a constant.
- $n > k + 1$
- There is no perfect collinearity:
$$-1 < r_{X_i, X_j} < 1 \quad (7)$$
i.e., one variable is not a linear combination of another variable such that the correlation between the variables is 1 (or −1).
- The model is correctly specified.
- Note we have said nothing about distributions at this point.
Inference for the Regression Model
- The regression assumptions give us a baseline to evaluate the adequacy of the model.
- But we need more precision in connecting our estimates back to the population parameters.
- The $\hat{\beta}_k$ are derived from the sample data, so there will be sampling variability.
- We want to estimate the parameter's precision, or its reliability.
Inference for the Regression Model
- The usual measure of precision in statistics is the standard error.
- It is taken as the standard deviation of the sampling distribution of the estimator.
- Given that our estimator has a probability distribution (for a given sample size from a given population), it is natural to ask what the variance of that distribution is.
- This leads directly to the consideration of the variance of the estimators.
Inference for the Regression Model
- Bivariate model first (extension to the n-variable case is straightforward).
- The variance of the regression slope $\hat{\beta}$ is given by
$$\mathrm{var}(\hat{\beta}) = \frac{\sigma^2}{\sum (X_i - \bar{X})^2},$$
and the standard error is the square root of the variance, giving us
$$\mathrm{se}(\hat{\beta}) = \frac{\sigma}{\sqrt{\sum (X_i - \bar{X})^2}}.$$
Inference for the Regression Model
- The variance of the regression intercept $\hat{\beta}_0$ is given by
$$\mathrm{var}(\hat{\beta}_0) = \left(\frac{\sum X_i^2}{n \sum (X_i - \bar{X})^2}\right)\sigma^2,$$
and the standard error is given by
$$\mathrm{se}(\hat{\beta}_0) = \sqrt{\frac{\sum X_i^2}{n \sum (X_i - \bar{X})^2}}\;\sigma.$$
- In general, we will be more interested in the precision around the slope coefficient than the intercept.
Inference for the Regression Model
- We have seen $\sigma^2$ before: the variance of the error component.
- Which is assumed to be constant.
- We usually will not directly observe this term (why?) but instead must estimate it directly from the data.
- What is the estimator we use?
- Recall the "standard error of the estimate":
$$\mathrm{s.e.}(r) = \sqrt{\frac{\sum r_i^2}{n - k - 1}}$$
Inference for the Regression Model
- For the bivariate setting:
$$\mathrm{var}(r_i) = \frac{\sum (Y_i - \hat{Y}_i)^2}{n - 2} = \frac{\sum r_i^2}{n - 2} = \frac{\mathrm{SSE}}{n - 2},$$
which, after taking the square root, gives us
$$\sqrt{\frac{\mathrm{SSE}}{n - 2}}$$
- The square root is the s.e. of the estimate, aka the "root mean square error."
- . . . and the MSE is? $\sum (Y_i - \hat{Y}_i)^2/(n - 2)$
Call:
lm(formula = obamapercent ~ proportionforprop8)
Residuals:
Min 1Q Median 3Q Max
-8.795 -5.392 -0.669 4.117 19.317
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 102.3339 3.5434 28.9 <2e-16 ***
proportionforprop8 -0.8658 0.0608 -14.2 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.15 on 56 degrees of freedom
Multiple R-squared: 0.783, Adjusted R-squared: 0.78
F-statistic: 203 on 1 and 56 DF, p-value: <2e-16
> anova(regmod)
Analysis of Variance Table
Response: obamapercent
Df Sum Sq Mean Sq F value Pr(>F)
proportionforprop8 1 7667 7667 203 <2e-16 ***
Residuals 56 2119 38
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Inference for the Regression Model
- Model output:
  RSS: 2119
  RSS df: 56 (n − 2)
  MSE: 38
  RegSS: 7667
  RegSS df: 1 (k)
  MSR: 7667 (RegSS/df)
  TSS: RSS + RegSS (not shown)
  TSS df: 57 (n − 1)
  RMSE: 6.15
Inference for the Regression Model
- More model output . . .
- Standard error of the regression coefficient:
$$\mathrm{se}(\hat{\beta}) = \frac{\sigma}{\sqrt{\sum (X_i - \bar{X})^2}}.$$
- RMSE = 6.15 and $\sum (X_i - \bar{X})^2 \approx 10620$
- $\mathrm{s.e.}(\hat{\beta}) = 6.15/\sqrt{10620} \approx .06$
- You can verify the s.e. of the constant on your own.
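The back-of-the-envelope check above takes one line, using the RMSE and $\sum (X_i - \bar{X})^2$ values reported on the slide:

```python
import math

# Slide's check: se(beta) = RMSE / sqrt(sum((X_i - Xbar)^2)),
# with RMSE = 6.15 and sum((X_i - Xbar)^2) ~= 10620 (Obama model).
rmse = 6.15
sxx = 10620.0

se_beta = rmse / math.sqrt(sxx)   # about 0.06
```

This reproduces the 0.0608 standard error on the `proportionforprop8` line of the R output up to the rounding in the slide's inputs.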
Inference for the Regression Model
- The extension to multiple regressors is straightforward (although, like the least squares estimators, the presentation in scalar form gets ugly).
- A model with $\beta_0$, $\beta_1$, $\beta_2$ gives slope variances of:
$$\mathrm{var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum (X_1 - \bar{X}_1)^2 (1 - r_{1,2}^2)},$$
for $\beta_1$, and
$$\mathrm{var}(\hat{\beta}_2) = \frac{\sigma^2}{\sum (X_2 - \bar{X}_2)^2 (1 - r_{1,2}^2)},$$
for $\beta_2$.
- The variance function for the constant is a bit ugly; consult Fox to see it.
Inference for the Regression Model
- Standard errors:
$$\mathrm{se}(\hat{\beta}_1) = \frac{\sigma}{\sqrt{\sum (X_1 - \bar{X}_1)^2 (1 - r_{1,2}^2)}},$$
for $\beta_1$, and
$$\mathrm{se}(\hat{\beta}_2) = \frac{\sigma}{\sqrt{\sum (X_2 - \bar{X}_2)^2 (1 - r_{1,2}^2)}},$$
for $\beta_2$.
- The term $1 - r_{1,2}^2$ comes from the "auxiliary regression": the $r^2$ is obtained by the regression of $X_1$ on $X_2$.
- Equivalently, the square root of the $r^2$ term gives you the correlation coefficient between $X_2$ and $X_1$.
- It is a measure of how collinear the covariates are.
Inference for the Regression Model
- Fox uses this notation for the variance:
$$\mathrm{var}(\hat{\beta}_k) = \frac{1}{1 - R_k^2} \times \frac{\sigma^2}{\sum_{i=1}^{n}(x_{ik} - \bar{x}_k)^2}$$
- Obviously the same result will be obtained if you take the square root.
- When $k > 2$, $R_k^2$ is the squared multiple correlation from the regression of some $X$ on all the other $X$s.
- Note that the first factor is sometimes called the "variance inflation factor."
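A sketch of the variance inflation factor for the two-regressor case, where the auxiliary $R^2$ is just the squared correlation between $X_1$ and $X_2$. The data are hypothetical, constructed so the regressors are correlated:

```python
import numpy as np

# VIF via the auxiliary regression of X1 on X2 (two-regressor case,
# where the auxiliary R^2 equals the squared correlation).
rng = np.random.default_rng(4)
x2 = rng.normal(size=200)
x1 = 0.8 * x2 + rng.normal(0, 0.6, size=200)   # collinear by construction

r = np.corrcoef(x1, x2)[0, 1]
r2_aux = r ** 2            # squared correlation from the auxiliary regression
vif = 1.0 / (1.0 - r2_aux)  # variance inflation factor
```

With uncorrelated regressors the VIF is 1; as $r_{1,2}^2 \to 1$ it blows up, which is exactly how collinearity inflates $\mathrm{var}(\hat{\beta}_k)$ in the formula above.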
Inference for the Regression Model
- We now have $\hat{\beta}$ and we have the s.e.($\hat{\beta}$).
I What can we do in the way of inference?
I It’s time to overlay some distributional assumptions here.
I Conventional to assume normality.
Inference for the Regression Model
- Now understand, we've gotten pretty far without the normality assumption.
- The only assumptions regarding $\varepsilon_i$ have been:
  a. conditional mean is 0.
  b. variance is homoskedastic.
  c. 0 covariance with $x_i$.
- But now we need to go beyond point estimation and enter the world of hypothesis testing. This requires us to say something about the distribution of the error term.
- The regression coefficients are a linear function of $\varepsilon_i$ (recall the least squares estimator).
- Therefore, the sampling distribution of our least squares estimator will depend on the distribution of $\varepsilon$.
Inference for the Regression Model
- The assumptions:
$$E(\varepsilon_i) = 0$$
$$E(\varepsilon_i^2) = \sigma^2$$
$$E(\varepsilon_i \varepsilon_j) = 0, \quad i \neq j,$$
which are the assumptions discussed earlier.
- But in addition to this, we're going to assume the $\varepsilon$ is normally distributed.
- This leads to the following assumption:
$$\varepsilon_i \sim N(0, \sigma^2),$$
which says that $\varepsilon$ is normally distributed with mean 0 and variance $\sigma^2$.
Inference for the Regression Model
- We can state this more explicitly by recognizing that for any two normally distributed random variables, a zero covariance between them implies independence.
- This means that if $\varepsilon_i$ and $\varepsilon_j$ have a 0 covariance (which they do by assumption), then they can be said to be independently distributed, leading to:
$$\varepsilon_i \sim NID(0, \sigma^2),$$
where NID means normally and independently distributed.
- Why assume the normal?
Inference for the Regression Model
- The principal reason is given by the central limit theorem.
- Under the CLT, if there is a large number of iid random variables, the distribution of their sum will tend to a normal distribution as n increases.
- So it is the central limit theorem that provides us with a strong justification to assume normality.
- An important result of the normal distribution is that any linear function of normally distributed random variables is itself normally distributed.
- The regression coefficients are linear functions of $\varepsilon_i$, so it must be the case that the sampling distributions for the regression estimates are also normally distributed.
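This claim can be illustrated by simulation with hypothetical values: hold $X$ fixed (as the slides assume), draw normal errors repeatedly, and check that the OLS slope estimates center on the true $\beta_1$ with standard deviation $\sigma/\sqrt{\sum (X_i - \bar{X})^2}$.

```python
import numpy as np

# Sampling distribution of the OLS slope under normal errors
# (hypothetical beta0, beta1, sigma; X fixed in repeated sampling).
rng = np.random.default_rng(5)
x = np.linspace(0, 10, 30)
beta0, beta1, sigma = 1.0, 2.0, 1.5
sxx = np.sum((x - x.mean()) ** 2)

slopes = []
for _ in range(2000):
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=x.size)
    slopes.append(np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1))
slopes = np.array(slopes)

theory_sd = sigma / np.sqrt(sxx)   # theoretical sd of the slope estimator
```

The empirical mean and spread of `slopes` match the theoretical values, and a histogram of `slopes` would look normal, as the linearity argument predicts.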
Inference for the Regression Model
- For the multiple regression setting we now can say
$$\hat{\beta}_k \sim N(\beta_k, \sigma^2_{\hat{\beta}_k}).$$
- Additional results: under the normal distribution, we can define a distribution for our estimator $\hat{\sigma}^2$ as
$$\frac{(n - k - 1)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-k-1},$$
where $\chi^2$ denotes the chi-square distribution with $n - k - 1$ degrees of freedom. Use of the $\chi^2$ statistic will allow us to compute confidence intervals around the estimator $\hat{\sigma}^2$.
- Under the normal distribution, the regression estimates have minimum variance in the entire class of unbiased estimators.
- Finally, if $\varepsilon_i$ is distributed normally, then $Y_i$ itself must be normally distributed:
$$Y_i \sim N(\beta_0 + \beta_k X_i, \sigma^2).$$
Inference for the Regression Model
- Under the normality condition, we can specify $Z = \frac{\hat{\beta}_1 - \beta_1}{\sigma_{\hat{\beta}_1}}$.
- A fundamental problem exists because $\sigma$ is usually unknown. In its place, we estimate $\sigma$ by using the standard error of $\hat{\beta}_1$.
- This gives rise to a t statistic:
$$t = \frac{\hat{\beta}_1 - \beta_1}{\mathrm{s.e.}(\hat{\beta}_1)},$$
which follows the t distribution with $n - k - 1$ degrees of freedom.
- Now since $\frac{\hat{\beta}_1 - \beta_1}{\mathrm{s.e.}(\hat{\beta}_1)} \sim t(n - k - 1)$, we can use the t distribution to establish a confidence interval:
$$\Pr(-t_{\alpha/2} \leq t \leq t_{\alpha/2}) = 1 - \alpha.$$
The term $t_{\alpha/2}$ denotes our critical value and $\alpha$ denotes the significance level. The level $\alpha = .05$ is common, but the .01 and .10 levels are also used.
Inference for the Regression Model
- Substituting terms for the interval, we can rewrite the previous statement as
$$\Pr\left(-t_{\alpha/2} \leq \frac{\hat{\beta}_1 - \beta_1}{\mathrm{s.e.}(\hat{\beta}_1)} \leq t_{\alpha/2}\right) = 1 - \alpha.$$
- Rearranging gives
$$\Pr[\hat{\beta}_1 - t_{\alpha/2}\,\mathrm{s.e.}(\hat{\beta}_1) \leq \beta_1 \leq \hat{\beta}_1 + t_{\alpha/2}\,\mathrm{s.e.}(\hat{\beta}_1)] = 1 - \alpha,$$
which is the $100(1 - \alpha)$ percent confidence interval.
- Hence, $\alpha = .05$ yields a 95 percent confidence interval:
$$\hat{\beta}_1 \pm t_{\alpha/2}\,\mathrm{s.e.}(\hat{\beta}_1).$$
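The interval can be computed from the slides' R output for the Prop. 8 slope: $\hat{\beta}_1 = -0.8658$, s.e. $= 0.0608$, and 56 residual degrees of freedom.

```python
from scipy import stats

# 95% CI for the Prop. 8 slope using the values in the slides' R output:
# beta1-hat = -0.8658, se = 0.0608, df = 56.
b1, se, df = -0.8658, 0.0608, 56
t_crit = stats.t.ppf(0.975, df)   # two-sided critical value, about 2.0

ci = (b1 - t_crit * se, b1 + t_crit * se)
```

The interval excludes zero, consistent with the large t value (−14.2) reported in the output.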
Inference for the Regression Model
- One important thing to note is the fact that we're dividing the significance level by two.
- Note also that the width of the c.i. is proportional to the standard error of the coefficient.
- We can now see why the standard error is a measure of precision: it directly affects the interval in which the population parameter will probabilistically reside (over repeated samples).
I Simple tests-of-significance can now be done.
Inference for the Regression Model
- We can test the condition of the null hypothesis using a t statistic.
- The t:
$$t = \frac{\hat{\beta}_1 - \beta_1}{\mathrm{s.e.}(\hat{\beta}_1)}.$$
- We could state the null as
$$H_0: \beta_1 = 0,$$
but we could easily specify $\beta_1$ under the null as being equal to any hypothetical value (i.e. 1, .5, 3.14, etc.).
- Define $\beta_1^*$ as the value of $\beta_1$ under the null and rewrite t as
$$t = \frac{\hat{\beta}_1 - \beta_1^*}{\mathrm{s.e.}(\hat{\beta}_1)},$$
where $\beta_1^*$ now reflects the condition of the null (and $t_{\alpha/2}$ denotes the critical t values).
Inference for the Regression Model
- We can utilize p values to determine the probability of a t value.
- In consulting a t table, we can look up the appropriate degrees of freedom and derive the probability for a given t value.
- Suppose we have 8 degrees of freedom and obtain a t value of 2.306.
- In looking at the t table, we see that the probability of obtaining a t value of 2.306 or greater in absolute value is 5 percent. This means that this result could have occurred by chance alone only about 5 percent of the time.
- This is all based on classical statistics.
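The t-table lookup in this example can be reproduced directly; 2.306 is the two-tailed 5 percent critical value for 8 degrees of freedom:

```python
from scipy import stats

# Check the slide's t-table example: with 8 degrees of freedom,
# |t| >= 2.306 has probability about 5 percent under the null.
df, t_obs = 8, 2.306
p_two_sided = 2 * stats.t.sf(t_obs, df)   # about 0.05
```

Doubling the upper-tail probability reflects the two-sided alternative, i.e. the same division of $\alpha$ by two noted for the confidence interval.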
Inference for the Regression Model
- Joint tests-of-significance are possible.
- Omnibus F test: $H_0: \beta_1 = \beta_2 = \ldots = \beta_k = 0$
- Easy to compute using variance components:
$$F_0 = \frac{\mathrm{RegSS}/k}{\mathrm{RSS}/(n - k - 1)} = \mathrm{MSR}/\mathrm{MSE} \quad (8)$$
- Consult an F table for $k$ and $n - k - 1$ degrees of freedom and you can obtain a p value for the test.
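The omnibus F statistic can be recovered from the ANOVA table in the slides (RegSS = 7667 on k = 1 df; RSS = 2119 on 56 df):

```python
from scipy import stats

# Omnibus F test from the slides' ANOVA table:
# RegSS = 7667 on k = 1 df, RSS = 2119 on n - k - 1 = 56 df.
regss, rss = 7667.0, 2119.0
k, df_resid = 1, 56

F = (regss / k) / (rss / df_resid)   # about 203, matching the R output
p = stats.f.sf(F, k, df_resid)       # essentially zero
```

This matches the "F-statistic: 203 on 1 and 56 DF" line; with one regressor, the F statistic is also the square of the slope's t value.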
Inference for the Regression Model
- Next week: matrix form of the model, as well as further proofs of the assumptions.
I Model matrices and diagnostic methods.
I Review matrix algebra.