An Overview
Inferences about β1.
1 Sampling distribution of β̂1.
2 Sampling distribution of (β̂1 − β1)/s{β̂1}.
3 Confidence interval for β1.
4 Hypothesis testing.
Inferences about β0.
Estimation and prediction (with respect to some x0).
ANOVA approach.
Coefficient of determination: R2.
W. Zhou (Colorado State University) STAT 540 July 6th, 2015 1 / 63
Inferences about β1
Recall the simple linear regression model
    Yi = β0 + β1Xi + εi,   εi ∼ iid N(0, σ²),
for i = 1, . . . , n.
Recall that the LS and ML estimate of β1 is
    β̂1 = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^n (Xi − X̄)².
As we will show, β̂1 is normal with
    E(β̂1) = β1   and   Var(β̂1) = σ² / Σ_{i=1}^n (Xi − X̄)².
Preliminary Results Concerning β1
Note that
    β̂1 = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^n (Xi − X̄)²
       = Σ_{i=1}^n (Xi − X̄)Yi / Σ_{i=1}^n (Xi − X̄)²   (using Σ_{i=1}^n (Xi − X̄) = 0)
       = Σ_{i=1}^n [ (Xi − X̄) / Σ_{j=1}^n (Xj − X̄)² ] Yi.
Express β̂1 as a linear combination of the {Yi}:
    β̂1 = Σ_{i=1}^n ki Yi,   where   ki = (Xi − X̄) / Σ_{j=1}^n (Xj − X̄)².
Preliminary Results Concerning β1
Now we compute Σ_{i=1}^n ki, Σ_{i=1}^n ki Xi, and Σ_{i=1}^n ki².
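These sums are left blank for lecture; for reference, the standard results, which follow directly from the definition of ki and the fact that Σ_{i=1}^n (Xi − X̄) = 0, are:

```latex
\sum_{i=1}^n k_i = 0, \qquad
\sum_{i=1}^n k_i X_i = 1, \qquad
\sum_{i=1}^n k_i^2 = \frac{1}{\sum_{i=1}^n (X_i - \bar X)^2},
\quad \text{where } k_i = \frac{X_i - \bar X}{\sum_{j=1}^n (X_j - \bar X)^2}.
```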
Sampling Distribution of β̂1
β̂1 follows a normal distribution. Why?
The mean of β̂1, E(β̂1), is
The variance of β̂1, Var(β̂1), is
Completing the Proof of the Gauss–Markov Theorem
Theorem: Under the simple linear regression model with errors that have mean zero and constant variance (but are not necessarily normal), β̂0 and β̂1 are BLUE (Best Linear Unbiased Estimators): they have minimum variance among all linear unbiased estimators.
Proof:
We have already shown that the estimators are linear, since
We have already shown that β̂1 is unbiased, i.e., E(β̂1) = β1.
It remains to show minimum variance among all linear unbiased estimators.
Var(β̂1) is minimal among unbiased linear estimators
Let an arbitrary linear unbiased estimator be of the form β̃1 = Σ_{i=1}^n ci Yi,
where the ci are constants that satisfy
Note that Var{β̃1} is
Var(β̂1) is minimal among unbiased linear estimators
Also note that Σ_{i=1}^n ki di = 0, where di = ci − ki. Why? Show it below.
Thus
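For reference, a sketch of the standard completion of the argument, writing ci = ki + di (so the di measure the departure from the LS weights):

```latex
\operatorname{Var}(\tilde\beta_1)
= \sigma^2 \sum_{i=1}^n c_i^2
= \sigma^2 \Bigl( \sum_{i=1}^n k_i^2 + 2\sum_{i=1}^n k_i d_i + \sum_{i=1}^n d_i^2 \Bigr)
= \operatorname{Var}(\hat\beta_1) + \sigma^2 \sum_{i=1}^n d_i^2
\;\ge\; \operatorname{Var}(\hat\beta_1),
```

with equality if and only if all di = 0, i.e., β̃1 = β̂1.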
Estimate of Var(β̂1)
Recall that
    Var{β̂1} = σ² / Σ_{i=1}^n (Xi − X̄)².
The estimated variance is
    s²{β̂1} = [Σ_{i=1}^n (Yi − Ŷi)²/(n − 2)] / Σ_{i=1}^n (Xi − X̄)² = MSE / Σ_{i=1}^n (Xi − X̄)².
We will show
    (β̂1 − β1)/s{β̂1} ∼ t_{n−2}.
Sampling Distribution of (β̂1 − β1)/s{β̂1}
Definition: If Zi ∼ independent N(µi, σi²) for i = 1, . . . , k, then
    Σ_{i=1}^k [(Zi − µi)/σi]² ∼ χ²_k   (chi-square distribution with k degrees of freedom).
Definition (KNNL A.44): A t random variable with ν degrees of freedom results from the expression
    t_ν = z / √(q_ν/ν),
where z and q_ν are independent standard normal and χ²_ν random variables, respectively.
Sampling Distribution of (β̂1 − β1)/s{β̂1}
Note that
    (β̂1 − β1) / √[σ² / Σ_{i=1}^n (Xi − X̄)²] ∼ N(0, 1).
Also,
    s²{β̂1}/Var{β̂1} = [MSE/Σ_{i=1}^n (Xi − X̄)²] / [σ²/Σ_{i=1}^n (Xi − X̄)²] = MSE/σ² = SSE/[(n − 2)σ²] ∼ χ²_{n−2}/(n − 2).
Thus,
    (β̂1 − β1)/s{β̂1} = [(β̂1 − β1)/√Var{β̂1}] / √[s²{β̂1}/Var{β̂1}] ∼ t_{n−2}.
Sampling Distribution of (β̂1 − β1)/s{β̂1}
The last conclusion on the previous slide only holds if SSE/σ² is independent of β̂0 and β̂1.
This is given in Theorem (2.11) of KNNL for the simple linear regression model and is proven in general in STAT 640.
Confidence Interval for β1
Denote s{β̂1} = √(s²{β̂1}). Recall that
    (β̂1 − β1)/s{β̂1} ∼ t_{n−2}.
The (1 − α) CI for β1 is
    β̂1 ± t(1 − α/2; n − 2) s{β̂1},
where t(1 − α/2; n − 2) is the (1 − α/2)100th percentile of t_{n−2}.
See KNNL pp. 1317–1318, Table B.2. (Or use qt(0.975, df) in R for a 95% CI.)
Interpretation of CI. For example, a 95% CI for β1 is (–,–).
- If we repeated the study 100 times and created 100 CIs for β1, we would expect about 95 of these intervals to include the true value of β1.
- The method used to construct this interval has a 5% error rate.
Confidence Interval for β1
In the advertising example (n = 7), recall that
    Σ_{i=1}^n Xi = 24.40,   Σ_{i=1}^n Xi² = 107.42,   Σ_{i=1}^n XiYi = 154.07,
    Σ_{i=1}^n Yi = 35.50,   Σ_{i=1}^n Yi² = 222.03.
Thus
    β̂1 = (154.07 − 24.40 × 35.50/7) / (107.42 − 24.40 × 24.40/7) = 30.33/22.37 = 1.356,
    β̂0 = 35.50/7 − 1.356 × 24.40/7 = 0.345.
Compute MSE:
    MSE = SSE/(n − 2) = Σ_{i=1}^n (Yi − Ŷi)²/(n − 2) = 0.86/5 = 0.172.
Confidence Interval for β1
Compute the standard deviation estimate:
    s{β̂1} = √[MSE / Σ_{i=1}^n (Xi − X̄)²] = √(0.172/22.37) = 0.0877.
For a 95% CI, α = 0.05 and
    t(1 − α/2; n − 2) = t(0.975; 5) = 2.571.
The 95% CI for β1 is
    β̂1 ± t(1 − α/2; n − 2) s{β̂1} = 1.356 ± 2.571 × 0.0877 = 1.356 ± 0.225 = (1.13, 1.58).
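These hand computations are easy to check with a short script; a minimal sketch in Python, using the summary statistics and the critical value t(0.975; 5) = 2.571 from the slides:

```python
from math import sqrt

# Advertising example summary statistics (from the slides), n = 7
n = 7
Sx, Sxx, Sxy, Sy = 24.40, 107.42, 154.07, 35.50

Sxx_c = Sxx - Sx**2 / n        # centered sum of squares: sum (Xi - Xbar)^2
Sxy_c = Sxy - Sx * Sy / n      # centered cross products
b1 = Sxy_c / Sxx_c             # LS estimate of beta1
MSE = 0.172                    # from the slides: SSE/(n-2) = 0.86/5
se_b1 = sqrt(MSE / Sxx_c)      # estimated standard error of beta1-hat
t_crit = 2.571                 # t(0.975; 5), from Table B.2
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
print(round(b1, 3), round(se_b1, 4), [round(c, 2) for c in ci])
```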
Review of Hypothesis Testing
Recall the two types of errors.
- Type I: Reject H0 when H0 is true.
- Type II: Fail to reject H0 when H0 is false.
Level of significance: α = P(Type I error).
Power = 1 − P(Type II error) = 1 − β.
Recall the p-value.
- A p-value is the probability of observing a sample outcome as extreme as or more extreme than the observed outcome, under the assumption that H0 is true.
- A small p-value provides evidence against H0.
- It is misleading to say that p-value = 0. Report p-value ≤ 0.0001 instead.
When we choose α, we control P(Type I error), but the choice affects β too. We cannot choose both α and β (without manipulating n), so we choose to control α (the more important one).
Review of Hypothesis Testing
1-sided versus 2-sided tests.
CI versus hypothesis testing: for the t-test given above, we can state the conclusion of an α-level test in terms of a (1 − α)100% CI. If 0 is contained in the (1 − α)100% CI, then we fail to reject H0.
When writing up a hypothesis test for this class, always include:
- Hypotheses in statistical and practical terms.
- Test statistic.
- Decision rule and p-value.
- Conclusion in the context of the problem. Use wording such as “reject H0” or “fail to reject H0”. Do not use “accept H0”.
Hypothesis Testing for β1
A test of interest (why?) is:
    H0: β1 = 0   versus   Ha: β1 ≠ 0.
Recall that
    (β̂1 − β1)/s{β̂1} ∼ t_{n−2}.
Thus an α-level test is based on the test statistic
    t* = (β̂1 − 0)/s{β̂1}.
Decision rule: if |t*| > t(1 − α/2; n − 2), reject H0; otherwise do not reject H0.
p-value = 2 × P(T > |t*|), where T ∼ t_{n−2}.
Hypothesis Testing for β1
Revisit the advertising example and test
    H0: β1 = 0   versus   Ha: β1 ≠ 0.
The test statistic is
    t* = β̂1/s{β̂1} = 1.356/0.0877 = 15.46.
Compared with t5, the p-value is
    2 × P(t5 > 15.46) < 0.0001.
Thus we reject H0: there is strong evidence of a positive linear relationship between advertising expenditure and sales.
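As a quick check of the test statistic and decision rule, a sketch in Python with the values from the slides:

```python
# Test H0: beta1 = 0 for the advertising example (values from the slides)
b1, se_b1, n = 1.356, 0.0877, 7
t_star = b1 / se_b1              # test statistic; compare with t_{n-2} = t_5
t_crit = 2.571                   # t(0.975; 5)
reject = abs(t_star) > t_crit    # alpha = 0.05 decision rule
print(round(t_star, 2), reject)
```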
Hypothesis Testing for β1
In general, for testing H0: β1 = βh versus Ha: β1 ≠ βh, use the test statistic
    t* = (β̂1 − βh)/s{β̂1}
and proceed as before.
For testing H0: β1 ≤ βh versus Ha: β1 > βh:
- Decision rule: if t* > t(1 − α; n − 2), reject H0; otherwise do not reject H0.
- p-value = P(T > t*), where T ∼ t_{n−2}.
For testing H0: β1 ≥ βh versus Ha: β1 < βh:
- Decision rule: if t* < t(α; n − 2), reject H0; otherwise do not reject H0.
- p-value = P(T < t*), where T ∼ t_{n−2}.
Inferences about β0
Recall that
    β̂0 = Ȳ − β̂1X̄.
It can be shown that β̂0 is normal with
    E(β̂0) = β0   and   σ²{β̂0} = Var(β̂0) = σ² [1/n + X̄²/Σ_{i=1}^n (Xi − X̄)²].
(Left as HW.)
The estimated variance is
    s²{β̂0} = MSE [1/n + X̄²/Σ_{i=1}^n (Xi − X̄)²].
Inferences about β0
It can be shown that the sampling distribution is
    (β̂0 − β0)/s{β̂0} ∼ t_{n−2}.
CIs and hypothesis tests for β0 follow as those for β1.
Note the case of β0 = 0.
In practice, never drop β0 from the model unless there is a scientific reason. However, one is rarely interested in the actual value of β0.
Types of Prediction and Estimation
Estimate the mean response E(Yh) for a given level X = Xh.
Predict a new observation Yh(new) for a given level X = Xh.
Predict the mean of m new observations, all at a given level Xh.
Estimate a confidence band for the regression line at several (or all) values Xh.
Data Example: muscle mass (HW 1 (7))
Recall that Y = muscle mass and X = age, with the fitted regression line
    Ŷ = 156.35 − 1.19X.
What is the population mean muscle mass for a 55-year-old person?
What should we predict for the muscle mass of a 55-year-old person randomly selected from the population?
In both cases, the estimate is
    Ŷ = 156.35 − 1.19 × 55 = 90.9,
but the uncertainty is larger in the second case.
Estimation of E(Yh)
Let Xh be the level of X for which we want to estimate the mean response.
Xh may or may not be one of the observed values, but it should be within the range of the {Xi}.
E(Yh) = the mean response at Xh.
The estimate of E(Yh) is Ŷh = β̂0 + β̂1Xh.
Derivation of Var(Ŷh)
Three results are used in the derivation.
    Ŷh = Ȳ + β̂1(Xh − X̄)
    Var(a1Y1 + a2Y2) = a1² Var(Y1) + a2² Var(Y2) + 2a1a2 Cov(Y1, Y2)
    Cov(Ȳ, β̂1) = 0
Derivation of Var(Ŷh)
Thus, σ²{Ŷh} is
Inference for E(Yh)
The variance of Ŷh is
    σ²{Ŷh} = Var(Ŷh) = σ² [1/n + (Xh − X̄)²/Σ_{i=1}^n (Xi − X̄)²].
The estimated variance is
    s²{Ŷh} = MSE [1/n + (Xh − X̄)²/Σ_{i=1}^n (Xi − X̄)²].
Note that
    (Ŷh − E(Yh))/s{Ŷh} ∼ t_{n−2}.
The (1 − α) CI for E(Yh) is
    Ŷh ± t(1 − α/2; n − 2) s{Ŷh}.
Inference for E(Yh)
In the advertising example, suppose Xh = 6.
The estimate of the mean sales at Xh = 6 is
    Ŷh = β̂0 + β̂1Xh = 0.345 + 1.356 × 6 = 8.48.
The estimated standard deviation is
    s{Ŷh} = √{MSE [1/n + (Xh − X̄)²/Σ_{i=1}^n (Xi − X̄)²]}
          = √0.172 × √[1/7 + (6 − 3.486)²/22.37] = 0.271.
The 95% CI for the mean sales at Xh = 6 is
    Ŷh ± t(1 − α/2; n − 2) s{Ŷh} = 8.48 ± 2.571 × 0.271 = 8.48 ± 0.70 = (7.78, 9.18).
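A sketch of the same computation in Python (inputs from the slides; rounding at intermediate steps can shift the last digit of the interval endpoints relative to the hand calculation):

```python
from math import sqrt

# 95% CI for the mean sales at Xh = 6 (advertising example)
n, Xh, Xbar = 7, 6, 24.40 / 7
Sxx_c, MSE = 22.37, 0.172              # sum (Xi - Xbar)^2 and MSE, from the slides
Yh = 0.345 + 1.356 * Xh                # fitted mean response at Xh
se_mean = sqrt(MSE * (1 / n + (Xh - Xbar) ** 2 / Sxx_c))
t_crit = 2.571                         # t(0.975; 5)
ci = (Yh - t_crit * se_mean, Yh + t_crit * se_mean)
print(round(Yh, 2), round(se_mean, 3), ci)
```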
Inference for Yh(new)
Xh = the “new” value of X.
- In the previous case, Xh might also be “new” in the sense that it was not a value in the dataset. But here we are talking about a new or hypothetical single person (i.e., experimental/observational unit).
Yh(new) = the “new” response (as yet unobserved).
The best point prediction of Yh(new) is Ŷh = β̂0 + β̂1Xh.
That is, we predict the new individual's response to be the estimated mean response for everyone else with X = Xh.
Inference for Yh(new)
The prediction error is
    Yh(new) − Ŷh.
Note that Yh(new) and Ŷh are independent.
The variance of the prediction error, σ²{pred}, is
Inference for Yh(new)
The estimated variance is
    s²{pred} = MSE [1 + 1/n + (Xh − X̄)²/Σ_{i=1}^n (Xi − X̄)²].
Note that
    (Yh(new) − Ŷh)/s{pred} ∼ t_{n−2}.
The (1 − α) prediction interval (PI) for Yh(new) is
    Ŷh ± t(1 − α/2; n − 2) s{pred}.
Prediction of Yh(new)
In the advertising example, again suppose Xh = 6.
The predicted value of Yh(new) is
    Ŷh = β̂0 + β̂1Xh = 0.345 + 1.356 × 6 = 8.48.
The estimated standard deviation of the prediction error is
    s{pred} = √{MSE [1 + 1/n + (Xh − X̄)²/Σ_{i=1}^n (Xi − X̄)²]}
            = √0.172 × √[1 + 1/7 + (6 − 3.486)²/22.37] = 0.495.
The 95% PI for Yh(new) is
    Ŷh ± t(1 − α/2; n − 2) s{pred} = 8.48 ± 2.571 × 0.495 = 8.48 ± 1.27 = (7.21, 9.75).
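The PI differs from the CI for the mean only through the extra “1 +” term in the variance; a sketch in Python with the slides' inputs:

```python
from math import sqrt

# 95% prediction interval for a single new observation at Xh = 6
n, Xh, Xbar = 7, 6, 24.40 / 7
Sxx_c, MSE = 22.37, 0.172
Yh = 0.345 + 1.356 * Xh
# prediction-error variance adds the extra "1 +" term for the new observation
se_pred = sqrt(MSE * (1 + 1 / n + (Xh - Xbar) ** 2 / Sxx_c))
t_crit = 2.571                          # t(0.975; 5)
pi = (Yh - t_crit * se_pred, Yh + t_crit * se_pred)
print(round(se_pred, 3), [round(p, 2) for p in pi])
```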
Analysis of Variance (ANOVA) Approach
The idea is to partition the variation into
SS Total = SS Model + SS Error
Why partition the variation?
1 Weigh different sources of variation.
2 Hypothesis testing.
3 Comparison of models.
Partitioning Deviation of Each Observation
    Yi − Ȳ   =   (Ŷi − Ȳ)   +   (Yi − Ŷi)
    total dev    dev of fitted from mean    dev of obs from fitted
If the {Ŷi − Ȳ} are large in relation to the {Yi − Ŷi}, then the regression relation explains (or accounts for) a large proportion of the total variation in the {Yi}.
If the {Ŷi − Ȳ} are small in relation to the {Yi − Ŷi}, then the regression relation explains (or accounts for) a small proportion of the total variation in the {Yi}.
Partitioning Total Sum of Squares
    Σ_{i=1}^n (Yi − Ȳ)²  =  Σ_{i=1}^n (Ŷi − Ȳ)²  +  Σ_{i=1}^n (Yi − Ŷi)²
         SSTO                    SSR                     SSE
SSTO = Σ_{i=1}^n (Yi − Ȳ)² is the total sum of squares.
- A measure of total variation in the data (compare to variance).
SSR = Σ_{i=1}^n (Ŷi − Ȳ)² is the regression sum of squares.
- The larger SSR is in relation to SSTO, the larger the proportion of variability in the Yi's accounted for by the regression relation.
SSE = Σ_{i=1}^n (Yi − Ŷi)² is the error sum of squares.
- The greater the variation of the Yi's around the fitted regression line, the larger the SSE.
Partitioning Total Sum of Squares
    SSTO = SSR + SSE,
where
1 SSTO = Σ_{i=1}^n (Yi − Ȳ)² = Σ_{i=1}^n Yi² − (Σ_{i=1}^n Yi)²/n,   df = n − 1;
2 SSR = Σ_{i=1}^n (Ŷi − Ȳ)² = β̂1² Σ_{i=1}^n (Xi − X̄)² = β̂1² [Σ_{i=1}^n Xi² − (Σ_{i=1}^n Xi)²/n],   df = 1;
3 SSE = Σ_{i=1}^n (Yi − Ŷi)² = SSTO − SSR,   df = n − 2.
Partitioning Degrees of Freedom
    Σ_{i=1}^n (Yi − Ȳ)²  =  Σ_{i=1}^n (Ŷi − Ȳ)²  +  Σ_{i=1}^n (Yi − Ŷi)²
       df = n − 1             df = 1                  df = n − 2
SSTO df = n − 1: Ȳ is used to estimate µY.
SSE df = n − 2: β̂0, β̂1 are used to estimate β0, β1.
Reasons to partition df?
- Compute MSE and MSR.
- See STAT 640.
Expected Mean Squares: E(MSE)
Define
    MSE = SSE/(n − 2).
Since SSE/σ² ∼ χ²(n − 2), we have
    E(MSE) = σ².
Expected Mean Squares: E(MSR)
Define MSR = SSR/1. Recalling SSR = β̂1² Σ_{i=1}^n (Xi − X̄)², we have
    E(β̂1²) = σ²/Σ_{i=1}^n (Xi − X̄)² + β1².
Why?
Thus
    E(MSR) = σ² + β1² Σ_{i=1}^n (Xi − X̄)².
- Observe that when β1 = 0, E(MSR) = σ².
Expected Mean Squares
Thus, for testing H0: β1 = 0 versus Ha: β1 ≠ 0, use the test statistic
    F* = (SSR/1) / (SSE/(n − 2)) = MSR/MSE.
It can be shown that under H0: β1 = 0,
    F* = MSR/MSE ∼ F_{1,n−2}.
Thus we can perform an F-test instead of a t-test.
In fact, if T ∼ t_ν, then T² ∼ F_{1,ν}.
- Thus the F-test is equivalent to the two-sided t-test for H0: β1 = 0 versus Ha: β1 ≠ 0.
Example: SSTO
    SSTO = Σ_{i=1}^n (Yi − Ȳ)²
         = Σ_{i=1}^n Yi² − (Σ_{i=1}^n Yi)²/n
         = 222.03 − (35.50)²/7
         = 41.99,
    df = n − 1 = 6.
Example: SSR and SSE
    SSR = Σ_{i=1}^n (Ŷi − Ȳ)²
        = β̂1² [Σ_{i=1}^n Xi² − (Σ_{i=1}^n Xi)²/n]
        = 1.356² × (107.42 − 24.40²/7)
        = 41.13,
    df = 1.
    SSE = Σ_{i=1}^n (Yi − Ŷi)²
        = SSTO − SSR
        = 41.99 − 41.13 = 0.86,
    df = n − 2 = 5.
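The partition can be verified numerically from the summary sums; a sketch in Python:

```python
# ANOVA partition for the advertising example (summary sums from the slides)
n = 7
Sy, Syy = 35.50, 222.03
Sx, Sxx = 24.40, 107.42
b1 = 1.356

SSTO = Syy - Sy**2 / n              # total sum of squares, df = n - 1
SSR = b1**2 * (Sxx - Sx**2 / n)     # regression sum of squares, df = 1
SSE = SSTO - SSR                    # error sum of squares, df = n - 2
print(round(SSTO, 2), round(SSR, 2), round(SSE, 2))
```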
General Linear Test Approach
Consider the full model
    Yi = β0 + β1Xi + εi,   εi ∼ iid N(0, σ²),
and obtain SSE(F).
Consider the reduced model when β1 = 0,
    Yi = β0 + εi,   εi ∼ iid N(0, σ²),
and obtain SSE(R).
It can be shown that SSE(F) ≤ SSE(R) (intuitively, why?).
In addition, under H0: β1 = 0,
    F* = {[SSE(R) − SSE(F)]/(dfR − dfF)} / [SSE(F)/dfF] ∼ F(dfR − dfF, dfF).
Example
To test H0: β1 = 0, the F test statistic is
    F* = MSR/MSE = 41.13/0.172 = 239.13.
Compare with F(1, 5); the p-value is
    P(F(1, 5) > F*) = P(F(1, 5) > 239.13) < 0.0001.
Same conclusion as in the t test.
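The t/F equivalence can be seen numerically; a sketch in Python with values from the slides (they agree up to rounding of the inputs):

```python
# F test for H0: beta1 = 0 (advertising example; MSR, MSE from the slides)
MSR, MSE = 41.13, 0.172
F_star = MSR / MSE      # compare with F(1, 5)
t_star = 15.46          # two-sided t statistic from the earlier slide
# Up to rounding of the inputs, F* equals the square of the t statistic
print(round(F_star, 2), round(t_star**2, 2))
```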
ANOVA Table
Summarize results using an ANOVA table.
    Source           SS     df     MS           F
    Regression (X)   SSR    1      SSR/1        MSR/MSE
    Error            SSE    n − 2  SSE/(n − 2)  –
    Total            SSTO   n − 1  –            –
For the advertising example, n = 7:
    Source           SS     df  MS     F
    Ad expenditure   41.13  1   41.13  239.13
    Error            0.86   5   0.172  –
    Total            41.99  6   –      –
Coefficient of Determination R²
Recall that
- SSTO measures the variation in the Yi about Ȳ (which does not take Xi into account).
- SSE measures the variation in the Yi after accounting for the linear relationship between X and Y.
- SSTO − SSE = SSR is a measure of the reduction in variation due to the regression of Y on X.
Define the coefficient of determination as
    R² = SSR/SSTO = 1 − SSE/SSTO.
Coefficient of Determination R²
In the advertising example,
    R² = 41.13/41.99 = 0.9791.
Interpret R² as the proportion of variation in the Yi's explained by the linear regression relationship between X and Y.
0 ≤ R² ≤ 1.
Reported as the “Multiple R-squared” in R summary output.
Relation to the sample correlation coefficient, for the simple linear regression model (only):
    r = sign(β̂1) √R².
Can you show that?
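A sketch of this relation in Python; note the slides' 0.9791 uses unrounded sums of squares, so the rounded inputs here give approximately 0.9795:

```python
from math import sqrt, copysign

# R^2 and its relation to the sample correlation (advertising example)
SSR, SSTO, b1 = 41.13, 41.99, 1.356   # rounded values from the slides
R2 = SSR / SSTO                       # = 1 - SSE/SSTO; slides report 0.9791
r = copysign(sqrt(R2), b1)            # r = sign(b1-hat) * sqrt(R^2)
print(round(R2, 4), round(r, 3))
```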
Limitations of R²
A high R² does not guarantee that useful predictions can be made.
A low R² does not imply a lack of association.
- You can get R² near zero even when there is a strong (or perfect) relationship between X and Y.
  * The sample correlation only measures a LINEAR relationship.
  * E.g., X ∼ N(0, 1), Y = sin(X²); try it yourself.
- Outliers.
Alternative measures: Spearman correlation (using nonparametric statistics), lowess R², etc.
Correlation Analysis
Correlation analysis (Section 2.11 of KNNL) is closely related to regression analysis.
Regression analysis:
- One variable is the response Y.
- One variable is the predictor X.
- Model the conditional distribution of Y given X.
- The distribution of X is not relevant.
- Y given X and X given Y are not the same.
Correlation analysis:
- Both variables are response variables.
- Want to measure the association between two variables.
- ρX,Y and ρY,X are the same.
Bivariate Normal Distribution
The bivariate normal distribution is an example of a joint distribution for two continuous random variables.
We say X and Y have a bivariate normal distribution with parameters µx, µy, σx², σy², ρ if the probability density is
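The density is left blank on the slide; for reference, the standard bivariate normal density is:

```latex
f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}}
\exp\!\left\{ -\frac{1}{2(1-\rho^2)}
\left[ \left(\frac{x-\mu_x}{\sigma_x}\right)^2
- 2\rho \left(\frac{x-\mu_x}{\sigma_x}\right)\!\left(\frac{y-\mu_y}{\sigma_y}\right)
+ \left(\frac{y-\mu_y}{\sigma_y}\right)^2 \right] \right\}.
```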
Interpretation of parameters:
1 µx = mean of X
2 µy = mean of Y
3 σx² = variance of X
4 σy² = variance of Y
5 ρ = correlation of X and Y
Definition of the Correlation Coefficient
ρ = Cov(X, Y)/(σxσy) is called the correlation coefficient.
Recall that Var(X) = E{(X − µx)²} and Cov(X, Y) = E{(X − µx)(Y − µy)}.
Properties:
1 ρ ∈ [−1, 1].
2 |ρ| = 1 implies that X and Y have a perfect linear relationship (perfect correlation).
3 Independence of X and Y implies that ρ = 0.
4 Conversely, ρ = 0 implies independence only when (X, Y) is bivariate normal.
Bivariate Normal Distribution
The density is constant on ellipses (contour plots).
The marginal distributions are normal: X ∼ N(µx, σx²) and Y ∼ N(µy, σy²).
The conditional distributions are normal: Y | X = x ∼ N(µy + ρ(σy/σx)(x − µx), σy²(1 − ρ²)), and symmetrically for X | Y = y.
Note the relationship of the conditional distribution of Y given X = x to the simple linear regression model.
Bivariate Normal Distribution
Two motivations for simple linear regression:
1 Bivariate normal observations.
2 x is fixed (not necessarily normal) and Y | x is normal.
We can relate the bivariate normal and simple linear regression parameters:
    β1 = ρ σy/σx,   β0 = µy − β1µx,   σ² = σy²(1 − ρ²).
Inference for a Correlation Coefficient
The maximum likelihood estimate is the sample correlation coefficient (Pearson correlation).
This estimate replaces population quantities with sample quantities.
Test H0: ρ = 0 versus H1: ρ ≠ 0.
- Equivalent to testing β1 = 0 in regression.
- t = r√(n − 2)/√(1 − r²) has a t-distribution with n − 2 df.
- This is exactly the t-test for H0: β1 = 0.
Inference for a Correlation Coefficient
The remaining inference procedures assume bivariate normality and rely on Fisher's z-transformation:
    Z = (1/2) log((1 + r)/(1 − r)),
    Z ∼ N( (1/2) log((1 + ρ)/(1 − ρ)), 1/(n − 3) ).
Var(Z) does not depend on ρ!
Good approximation when n > 25. Why?
Inference for a Correlation Coefficient
Construct a 100(1 − α)% confidence interval for ρ.
The CI for (1/2) log((1 + ρ)/(1 − ρ)) is Z ± z(1 − α/2) √(1/(n − 3)).
Obtain an approximate confidence interval for ρ by applying the inverse transformation to the endpoints of the previous confidence interval.
Inference for a Correlation Coefficient
Test H0: ρ = ρ0 versus H1: ρ ≠ ρ0.
The test statistic is
    √(n − 3) [ (1/2) log((1 + r)/(1 − r)) − (1/2) log((1 + ρ0)/(1 − ρ0)) ].
Obtain the p-value from comparison to a standard normal distribution.
Example: Yields of Broadbalk Wheat (bu/acre)
Source: R. A. Fisher, Statistical Methods for Research Workers, 14th ed., p. 137.
The same two plots were used in each of n = 12 years.
- Plot 1: fertilized with nitrate of soda; Xi = yield in the i-th year.
- Plot 2: same amount of N as sulfate of ammonia; Yi = yield in the i-th year.
Grain Example
Summary statistics:
Sample size: n = 12.
Sample means: x̄ = 35.1825 and ȳ = 29.3541.
Sums of squares: Σ_{i=1}^{12} (xi − x̄)² = 346.184, Σ_{i=1}^{12} (yi − ȳ)² = 612.285.
Sum of cross products: Σ_{i=1}^{12} (xi − x̄)(yi − ȳ) = 238.5449.
Sample correlation: r = 0.518.
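The sample correlation follows directly from these sums; a sketch in Python:

```python
from math import sqrt

# Sample correlation from the summary statistics (grain example)
Sxx = 346.184        # sum of (xi - xbar)^2
Syy = 612.285        # sum of (yi - ybar)^2
Sxy = 238.5449       # sum of (xi - xbar)(yi - ybar)
r = Sxy / sqrt(Sxx * Syy)
print(round(r, 3))
```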
Test H0: ρ = 0 versus H1: ρ ≠ 0
t = (0.518 √(12 − 2))/√(1 − 0.518²) = 1.92; compare with t_{12−2} = t_{10}; the p-value is 0.0844.
Conclusion: there is some evidence of a positive correlation in yields, but it is not conclusive. Why?
- n = 12 is a small sample size.
- The pair of yields observed in 1877 is somewhat inconsistent with the pattern observed in other years. Check the accuracy of the 1877 data.
Confidence Interval for ρ
Apply the Fisher z-transformation:
    z = (1/2) log((1 + 0.518)/(1 − 0.518)) = 0.5736,
    z_lower = 0.5736 − 1.96 √(1/9) = −0.0797,
    z_upper = 0.5736 + 1.96 √(1/9) = 1.2269.
Apply the inverse transformation:
    ( [−1 + exp(2 × (−0.0797))]/[1 + exp(2 × (−0.0797))], [−1 + exp(2 × 1.2269)]/[1 + exp(2 × 1.2269)] ) ⇒ (−0.0795, 0.8417).
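The full interval computation can be scripted; a sketch in Python (r = 0.518 and n = 12 from the example):

```python
from math import log, exp, sqrt

# 95% CI for rho via Fisher's z-transformation (grain example)
r, n = 0.518, 12
z = 0.5 * log((1 + r) / (1 - r))     # Fisher z = atanh(r)
half = 1.96 * sqrt(1 / (n - 3))      # normal critical value times SD of Z
lo, hi = z - half, z + half

def inv(z):                          # inverse transform: tanh(z)
    return (exp(2 * z) - 1) / (exp(2 * z) + 1)

print(round(z, 4), round(inv(lo), 4), round(inv(hi), 4))
```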