
Chapter 10 Simple Linear Regression and Correlationjqfan/fan/classes/245/chap10.pdfExample 10.3...

Date post: 30-May-2020

Chapter 10

Simple Linear Regression and Correlation

10.1 Introduction

Aim:

♠ To study the association among variables;

♠ To predict the outcome given covariates.

Example 10.1 Altitude and Boiling point.


ORF 245: Correlation and Simple Linear Regression – J.Fan 235

In the 1840s and 1850s, Forbes wanted to be able to determine

the altitude from measurements of the boiling point (BP) of water.

Altitude can be determined from atmospheric pressure. He collected

17 data points from different locations.

Boiling point   pressure   100*log10(pressure)
194.5           20.79      131.8
194.3           20.79      131.8
197.9           22.4       135
198.4           22.67      135.5
199.4           23.15      136.5
199.9           23.35      136.8
200.9           23.89      137.8
201.1           23.99      138
201.4           24.02      138.1
201.3           24.01      138
203.6           25.14      140
204.6           26.57      142.4
209.5           28.49      145.5
208.6           27.76      144.3
210.7           29.04      146.3
211.9           29.88      147.5
212.2           30.06      147.8

[Two scatter plots. Left panel, "Boiling Point versus Pressure": BoilingPoint (x-axis, 195 to 210) against pressure (y-axis, 22 to 30). Right panel, "Boiling Point versus log-Pressure": BoilingPoint (x-axis, 195 to 210) against 100 * log10(pressure) (y-axis, 135 to 145).]

Figure 10.1: Boiling point versus pressure and 100*log(pressure)


Questions:

♠How are pressure and BP related?

♠Can pressure be predicted from BP and how well?

Example 10.2 Income and Education.

Figure 10.2: Year of education versus income (r = 0.63)

An observational study shows the data given in Figure 10.2.


Questions:

♠ If one’s education level is 15 years, what would you estimate his/her

income to be?

♠ How much is one extra year of education worth?

A professional claimed she is underpaid. How do we adjust for other factors? Other possible variables include years of work, gender, rank, achievements, etc. This is a multiple regression problem.


Example 10.3 Prediction of Housing Value

Zillow.com: What is a fair market value of a house? An important

variable is the size of the house or its proxy, the number of rooms.

[Scatter plot, "No. Rooms versus housing value": rm, the number of rooms (x-axis, 4 to 8), against housing value (y-axis, 10 to 50).]

Figure 10.3: No. of rooms versus value of house (in thousand USD) in Boston in 1970.


Questions:

— If a house has 7.5 rooms, what is the fair market value?

— What is a reasonable range for the value of the house?

— On average, how much is an extra room worth?

In reality, housing value depends on the age, distance to the business center, crime rate, pollution level, tax, pupil-teacher ratio, and recent sales, among others. It is a multiple regression problem.

In Examples 10.1 to 10.3, we can see from the scatter plots that the data:

— are somewhat noisy;

— have an overall linear pattern;

— are far from a defined functional form.


Example 10.4 In health science studies, many explanatory variables are collected in addition to the response. E.g.

Response: Survival time after transformation

Covariates: age, gender, blood pressure, waiting time, race, etc.

Purpose: Identify the risk factors and describe the association.

Purpose of regression:

♠ quantify the contribution of X to Y

♠ summarize the association (screening variables)

♠ given x, predict the mean response and its associated SD


10.2 Model and Summary Statistics

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.

Generic pair: $(X, Y)$

♠ $X$: independent variable, covariate, predictor;

♠ $Y$: dependent variable, response.

Simple linear regression:

$$Y = \beta_0 + \beta_1 X + \varepsilon,$$

♠ $\beta$'s: regression coefficients; $\beta_0$: intercept; $\beta_1$: slope.

♠ $\varepsilon$: measurement error / the part that cannot be explained by $x$.

[Figure: the regression line $y = \beta_0 + \beta_1 x$, with the distribution of $Y$ drawn at $x_1$, $x_2$, $x_3$ and centered at $\beta_0 + \beta_1 x_1$, $\beta_0 + \beta_1 x_2$, $\beta_0 + \beta_1 x_3$. Copyright (c) 2004 Brooks/Cole, a division of Thomson Learning, Inc.]

Figure 10.4: Distributions of Y for different given x.

Data: The $i$th observation is generated from

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \ldots, n.$$

We might assume that the $\{\varepsilon_i\}$ are i.i.d. $N(0, \sigma^2)$ (oval shape in the scatter plot).


Group mean: For the linear model, we have

$$E(Y \mid x) = E(\beta_0 + \beta_1 x + \varepsilon) = \beta_0 + \beta_1 x,$$

which is the average for the group with covariate $\approx x$ (blue line).

Group SD: Similarly, we have

$$\mathrm{var}(Y \mid x) = \sigma^2,$$

which is the variance for the group with covariate $\approx x$, so the group SD is $\sigma$.

The summary statistics are

♠ x-variable: $\bar{x}$ and $SD_x = \sqrt{\frac{S_{xx}}{n-1}}$, or $S_{xx} = \sum (x_i - \bar{x})^2$.

♠ y-variable: $\bar{y}$ and $SD_y = \sqrt{\frac{S_{yy}}{n-1}}$, or $S_{yy} = \sum (y_i - \bar{y})^2$.

♠ strength of linear association: $r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$, the sample correlation coefficient, where

$$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n \bar{x} \bar{y}.$$
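The summary statistics can be computed directly from their definitions. The following is a minimal R sketch, assuming the Forbes data of Example 10.1 are typed in by hand from the table; the variable names (x, y, Sxx, and so on) are illustrative choices rather than anything fixed by the notes.

```r
# Forbes' data as tabulated in Example 10.1 (entered by hand)
BoilingPoint = c(194.5, 194.3, 197.9, 198.4, 199.4, 199.9, 200.9, 201.1, 201.4,
                 201.3, 203.6, 204.6, 209.5, 208.6, 210.7, 211.9, 212.2)
pressure = c(20.79, 20.79, 22.40, 22.67, 23.15, 23.35, 23.89, 23.99, 24.02,
             24.01, 25.14, 26.57, 28.49, 27.76, 29.04, 29.88, 30.06)
x = BoilingPoint
y = 100 * log10(pressure)
n = length(x)

# Sums of squares and cross-products
Sxx = sum((x - mean(x))^2)
Syy = sum((y - mean(y))^2)
Sxy = sum((x - mean(x)) * (y - mean(y)))

# Standard deviations and the sample correlation coefficient
SDx = sqrt(Sxx / (n - 1))
SDy = sqrt(Syy / (n - 1))
r = Sxy / sqrt(Sxx * Syy)

c(n = n, xbar = mean(x), ybar = mean(y), Sxx = Sxx, Syy = Syy, Sxy = Sxy, r = r)
```

The built-in functions sd(x) and cor(x, y) return the same SDx and r, so the hand computation is mainly useful for seeing how the pieces fit together.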

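Example 10.5 Two-sample problems and regression

Let $Y_1, \ldots, Y_m$ and $Y_{m+1}, \ldots, Y_{m+n}$ be random samples from the first and second populations, respectively, with population means $\mu_1$ and $\mu_2$. Let $x_1 = \cdots = x_m = 0$ and $x_{m+1} = \cdots = x_{m+n} = 1$ be the indicator distinguishing the first and second populations. Consider the linear model

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i = \begin{cases} \beta_0 + \varepsilon_i & \text{if } i \le m \\ \beta_0 + \beta_1 + \varepsilon_i & \text{if } i > m \end{cases}$$

Thus, $\beta_0 = \mu_1$ and $\beta_1 = \mu_2 - \mu_1$.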
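To see this correspondence numerically, here is a small R sketch on simulated data. The sample sizes, means and seed are hypothetical values chosen only for illustration; the point is that the fitted intercept equals the first sample mean and the fitted slope equals the difference of the two sample means.

```r
set.seed(1)                  # arbitrary seed, for reproducibility only
m = 8                        # hypothetical size of the first sample
n = 10                       # hypothetical size of the second sample
y = c(rnorm(m, mean = 5), rnorm(n, mean = 7))   # simulated responses, not real data
x = rep(c(0, 1), times = c(m, n))               # 0 = first population, 1 = second

fit = lm(y ~ x)
b = unname(coef(fit))        # b[1] = intercept estimate, b[2] = slope estimate
mean1 = mean(y[x == 0])
mean2 = mean(y[x == 1])

c(intercept = b[1], mean1 = mean1, slope = b[2], diff = mean2 - mean1)
```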


Example 10.6 Time series and regression

Presented in Figure 10.5 are the US monthly unemployment rates. To predict this month's rate $x_t$ using the previous month's rate $x_{t-1}$, we take the variables $y = x_t$ and $x = x_{t-1}$ and model them as (autoregressive or AR model)

$$x_t = \beta_0 + \beta_1 x_{t-1} + \varepsilon_t$$

[Two panels. (a) Monthly unemployment rates: time series of Unrate against Month, 1960 to 2010, with rates between about 4 and 10. (b) x(t−1) versus x(t) for the unemployment data: scatter plot of Unrate[2:n] against Unrate[1:(n − 1)].]

Figure 10.5: (a) Unemployment rates. (b) $x_{t-1}$ versus $x_t$
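The AR model is fitted as an ordinary simple linear regression of the series on its own lag. The sketch below assumes a simulated series in place of the unemployment data (which are not reproduced in these notes); the coefficients 0.3 and 0.95, the noise level and the seed are hypothetical. The indexing mirrors the axis labels Unrate[1:(n − 1)] and Unrate[2:n] in panel (b).

```r
set.seed(245)                       # arbitrary seed
n = 300                             # hypothetical series length
x = numeric(n)
x[1] = 6
for (i in 2:n) {
  # simulated AR(1) series; 0.3 and 0.95 are illustrative values
  x[i] = 0.3 + 0.95 * x[i - 1] + rnorm(1, sd = 0.2)
}

y    = x[2:n]                       # response: this month, x_t
xlag = x[1:(n - 1)]                 # predictor: previous month, x_{t-1}
fit  = lm(y ~ xlag)
coef(fit)                           # estimates of beta0 and beta1
```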


10.3 Estimation of Model Parameters

[Figure, "Linear Regression Model": the data cloud with the true regression line $y = \beta_0 + \beta_1 x$; the point $(x_1, y_1)$ and the value $x_1$ are marked.]

Figure 10.6: Finding a line to pass through the data cloud.

Method of least-squares: Find $\beta_0$ and $\beta_1$ to minimize the part that cannot be explained:

$$\underbrace{SSE}_{\text{Sum of Square Errors}}(\beta_0, \beta_1) = \sum_{i=1}^n (\underbrace{y_i - \beta_0 - \beta_1 x_i}_{\varepsilon_i})^2.$$

The minimizers are also the MLEs for normal errors.

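Solution: Setting derivatives to zero,

$$\frac{\partial}{\partial \beta_0} SSE(\beta_0, \beta_1) = \sum_{i=1}^n -2 (y_i - \beta_0 - \beta_1 x_i) = 0;$$

$$\frac{\partial}{\partial \beta_1} SSE(\beta_0, \beta_1) = \sum_{i=1}^n -2 (y_i - \beta_0 - \beta_1 x_i) x_i = 0,$$

we have

$$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}, \qquad \hat\beta_1 = \frac{S_{xy}}{S_{xx}} = r \frac{SD_y}{SD_x}.$$

Fitted value: $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$

Residual: $\hat\varepsilon_i = y_i - \hat{y}_i$

$SSE = \sum_{i=1}^n \hat\varepsilon_i^2$, also called RSS (Residual Sum of Squares).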
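The closed-form solution can be checked against R's built-in least-squares fit. This is a sketch under the assumption that the Forbes data of Example 10.1 are entered by hand; b0, b1, res and SSE are illustrative names. The two sums printed near the end are the left-hand sides of the two normal equations (up to the factor −2) and should be zero up to rounding error.

```r
# Forbes' data as tabulated in Example 10.1 (entered by hand)
x = c(194.5, 194.3, 197.9, 198.4, 199.4, 199.9, 200.9, 201.1, 201.4,
      201.3, 203.6, 204.6, 209.5, 208.6, 210.7, 211.9, 212.2)
pressure = c(20.79, 20.79, 22.40, 22.67, 23.15, 23.35, 23.89, 23.99, 24.02,
             24.01, 25.14, 26.57, 28.49, 27.76, 29.04, 29.88, 30.06)
y = 100 * log10(pressure)

Sxx = sum((x - mean(x))^2)
Sxy = sum((x - mean(x)) * (y - mean(y)))

b1 = Sxy / Sxx                  # slope estimate
b0 = mean(y) - b1 * mean(x)     # intercept estimate
fitted_y = b0 + b1 * x          # fitted values
res = y - fitted_y              # residuals
SSE = sum(res^2)                # residual sum of squares

# Normal equations: residuals sum to zero and are orthogonal to x
c(sum(res), sum(res * x))

# Same coefficients as R's built-in fit
rbind(by_hand = c(b0, b1), lm = coef(lm(y ~ x)))
```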


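Regression Line: $\hat{y} = \hat\beta_0 + \hat\beta_1 x$. It is used to predict the mean response $y$ for a given $x$ value.

Reg. Principle: if $x$ increases by one standard unit (SU), $\hat{y}$ increases by $r$ SU:

$$\frac{\hat{y} - \bar{y}}{SD_y} = r \, \frac{x - \bar{x}}{SD_x}.$$

Proof. It is the same as the regression equation:

$$\hat{y} = \bar{y} + \underbrace{r \frac{SD_y}{SD_x}}_{\hat\beta_1} (x - \bar{x}) = \hat\beta_0 + \hat\beta_1 x.$$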


Example 10.7 A large scale study of math ($x$) and physics ($y$) test scores shows

avg math score ($\bar{x}$) = 75, $SD_x$ = 10, $r$ = 0.8

avg physics score ($\bar{y}$) = 70, $SD_y$ = 8.

The overall pattern of the data is of oval shape.

a) If a student's math score is 80, guess his/her physics score;

b) What is the average physics score (and standard deviation) of the group having math score about 80?

c) If a student's math score is 60, predict his/her physics score.


♠ Problem a) = problem b)

Method 1: Regression principle

a) $\frac{80 - 75}{10} = 0.5$ SU in $x$ $\Longrightarrow$ $0.8 \times 0.5 = 0.4$ SU in $y$, and

regression est. $= \bar{y} + 0.4 \, SD_y = 70 + 0.4 \times 8 = 73.2$

c) $\frac{60 - 75}{10} = -1.5$ SU in $x$ $\Longrightarrow$ $0.8 \times (-1.5) = -1.2$ SU in $y$, and

regression est. $= 70 - 1.2 \times 8 = 60.4$.

Regression Effect: In the test and retest situation, the bottom

group shows overall improvement while the top group

deteriorates somewhat.


E.g. Heights between two generations: tall fathers tend to associate

with tall sons, but not as tall as their fathers.

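Method 2: Regression equation

$$\hat\beta_1 = r \frac{SD_y}{SD_x} = 0.8 \times \frac{8}{10} = 0.64.$$

(an increase of 1 point in math is associated with an increase of about 0.64 points in physics, on average)

$$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = 70 - 0.64 \times 75 = 22.$$

a) regression est. $= 22 + 0.64 \times 80 = 73.2$

c) regression est. $= 22 + 0.64 \times 60 = 60.4$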
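Both methods need only the five summary statistics, so they are easy to script. A minimal R sketch of Example 10.7, where the function names predict_su and predict_eq are hypothetical helpers introduced for illustration:

```r
xbar = 75; SDx = 10      # math scores
ybar = 70; SDy = 8       # physics scores
r = 0.8

# Method 1: regression principle, working in standard units (SU)
predict_su = function(x) {
  su_x = (x - xbar) / SDx      # x in standard units
  ybar + r * su_x * SDy        # move r SU in y for each SU in x
}

# Method 2: regression equation
b1 = r * SDy / SDx
b0 = ybar - b1 * xbar
predict_eq = function(x) b0 + b1 * x

predict_su(c(80, 60))    # parts a) and c) by method 1
predict_eq(c(80, 60))    # parts a) and c) by method 2
```

The two functions describe the same line, so they agree for every math score, not just for 80 and 60.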


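10.4 Estimating $\sigma^2$

The regression equation gives the mean of the group having covariate $x$. What is the SD $\sigma$ of this group?

A natural estimate of $\sigma^2$ is

$$\hat\sigma^2 = \frac{1}{n-2} \sum_{i=1}^n \hat\varepsilon_i^2 \equiv \frac{SSE}{n-2},$$

where the divisor $n - 2$ reflects the loss of 2 degrees of freedom from estimating $\beta_0$ and $\beta_1$.

Computation of SSE: It can be shown that

$$SSE = S_{yy}(1 - r^2).$$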


Estimator of $\sigma$:

$$\hat\sigma = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{S_{yy}(1 - r^2)}{n-2}} = \sqrt{\frac{n-1}{n-2}} \, SD_y \sqrt{1 - r^2},$$

which is smaller than $SD_y$ whenever $r^2 > 1/(n-1)$: the subgroup has a smaller variance.

Example 10.7 (cont.) What is the likely size of the prediction error?

Solution: $\hat\sigma \approx SD_y \sqrt{1 - r^2} = 8 \times \sqrt{1 - 0.8^2} = 4.8.$

(b) The group with math score about 80 has average physics score 73.2 and SD 4.8.
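The identity SSE = Syy(1 − r²) and the resulting formula for the estimate of σ can be verified numerically. The sketch below assumes the Forbes data of Example 10.1 entered by hand; sigma_hat and sigma_alt are illustrative names.

```r
# Forbes' data as tabulated in Example 10.1 (entered by hand)
x = c(194.5, 194.3, 197.9, 198.4, 199.4, 199.9, 200.9, 201.1, 201.4,
      201.3, 203.6, 204.6, 209.5, 208.6, 210.7, 211.9, 212.2)
pressure = c(20.79, 20.79, 22.40, 22.67, 23.15, 23.35, 23.89, 23.99, 24.02,
             24.01, 25.14, 26.57, 28.49, 27.76, 29.04, 29.88, 30.06)
y = 100 * log10(pressure)
n = length(x)

fit = lm(y ~ x)
SSE = sum(resid(fit)^2)          # residual sum of squares from the fit
Syy = sum((y - mean(y))^2)
r   = cor(x, y)

c(SSE = SSE, shortcut = Syy * (1 - r^2))       # the two should agree

sigma_hat = sqrt(SSE / (n - 2))                               # definition
sigma_alt = sqrt((n - 1) / (n - 2)) * sd(y) * sqrt(1 - r^2)   # via SDy and r
c(sigma_hat = sigma_hat, sigma_alt = sigma_alt)
```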


10.5 Goodness-of-Fit

Question: How well is $x$ related to $y$?

If there is no relationship, $y_i = \beta_0 + \varepsilon_i$ and the LS estimator of $\beta_0$ minimizes

$$\sum_{i=1}^n (y_i - \beta_0)^2 \Longrightarrow \hat\beta_0 = \bar{y}.$$

The sum of squared errors (SSE) in this case is $\sum_{i=1}^n (y_i - \bar{y})^2 = S_{yy}$.

If there is a linear relationship, the unexplained variability is

$$SSE = S_{yy} - \frac{S_{xy}^2}{S_{xx}}.$$

Thus, the reduction in unexplained variability is

$$SS_{reg} = S_{yy} - SSE = \frac{S_{xy}^2}{S_{xx}}.$$


It is called the sum of squares due to regression (SSR).

The Coefficient of Determination $R^2$:

$$R^2 = \frac{SS_{reg}}{S_{yy}} = 1 - \frac{SSE}{S_{yy}}.$$

It gives the percentage of the variability of $Y$ explained by the regression on $X$. The larger $R^2$, the better the fit. Note that

$$R^2 = \frac{S_{xy}^2}{S_{xx} S_{yy}} = r^2.$$
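As a quick numerical check that the three expressions for the coefficient of determination coincide, here is a short R sketch on the Forbes data of Example 10.1 (entered by hand; SSreg and R2 are illustrative names):

```r
# Forbes' data as tabulated in Example 10.1 (entered by hand)
x = c(194.5, 194.3, 197.9, 198.4, 199.4, 199.9, 200.9, 201.1, 201.4,
      201.3, 203.6, 204.6, 209.5, 208.6, 210.7, 211.9, 212.2)
pressure = c(20.79, 20.79, 22.40, 22.67, 23.15, 23.35, 23.89, 23.99, 24.02,
             24.01, 25.14, 26.57, 28.49, 27.76, 29.04, 29.88, 30.06)
y = 100 * log10(pressure)

fit   = lm(y ~ x)
SSE   = sum(resid(fit)^2)        # unexplained variability
Syy   = sum((y - mean(y))^2)     # total variability
SSreg = Syy - SSE                # reduction due to the regression
R2    = SSreg / Syy

c(R2 = R2, r_squared = cor(x, y)^2, from_lm = summary(fit)$r.squared)
```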


10.6 Inference of model parameters

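Standard errors: Since the estimators are linear in the $y_i$,

$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{S_{xx}} = \sum \frac{x_i - \bar{x}}{S_{xx}} \, y_i$$

$$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = \sum \left( \frac{1}{n} - \frac{\bar{x}(x_i - \bar{x})}{S_{xx}} \right) y_i$$

we have:

$$\mathrm{var}(\hat\beta_1) = \frac{\sigma^2}{S_{xx}}, \qquad SE(\hat\beta_1) = \hat\sigma / \sqrt{S_{xx}}$$

$$\mathrm{var}(\hat\beta_0) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right), \qquad SE(\hat\beta_0) = \hat\sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$$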


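Given that the estimators are normally distributed, it follows that:

$$\frac{\hat\beta_0 - \beta_0}{SE(\hat\beta_0)} \sim t_{n-2}, \qquad \frac{\hat\beta_1 - \beta_1}{SE(\hat\beta_1)} \sim t_{n-2}$$

Confidence intervals:

♠ Intercept $\beta_0$: $\hat\beta_0 \pm t_{\alpha/2, n-2} \, SE(\hat\beta_0)$.

♠ Slope $\beta_1$: $\hat\beta_1 \pm t_{\alpha/2, n-2} \, SE(\hat\beta_1)$.

The same principle applies to hypothesis tests.

Example 10.1 (cont.) For Forbes' data, take $x$ = boiling point and $y = 100 \log_{10}(\text{pressure})$.

$n = 17$, $\bar{x} = 202.95$, $\bar{y} = 139.60$

$S_{xx} = 530.78$, $S_{yy} = 427.91$, $S_{xy} = 475.38$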


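(a) Construct the 95% CI for $\beta_1$.

$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{475.38}{530.78} = 0.8956$$

$$\hat\sigma = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{S_{yy} - S_{xy}^2 / S_{xx}}{n-2}} = \sqrt{\frac{2.148}{15}} = 0.3784$$

$$SE(\hat\beta_1) = \frac{\hat\sigma}{\sqrt{S_{xx}}} = \frac{0.3784}{\sqrt{530.78}} = 0.0164$$

df $= n - 2 = 15$, $t_{0.025, 15} = 2.13$.

Thus the 95% CI for $\beta_1$ is $0.896 \pm 2.13 \times 0.0164 = (0.86, 0.93)$.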

(b) Test $H_0: \beta_1 = 0.95 \longleftrightarrow H_1: \beta_1 \ne 0.95$

♠ Method 1: 0.95 is not in the 95% CI, reject $H_0$;

♠ Method 2: $t = \frac{0.896 - 0.95}{0.0164} = -3.29$, reject $H_0$ as $|t| \ge 2.13$;

♠ Method 3: P-value $= 2 P(T_{15} > 3.29) = 0.5\%$, the evidence is very strong, reject $H_0$.

> logP = 100*log10(pressure)      #define log-pressure
> fit = lsfit(BoilingPoint, logP) #least-squares fit
> ls.print(fit)                   #print the summary result
Residual Standard Error=0.3792
R-Square=0.995
F-statistic (df=1, 15)=2961.547
p-value=0

          Estimate Std.Err  t-value Pr(>|t|)
Intercept -42.1642  3.3414 -12.6189        0
X           0.8956  0.0165  54.4201        0

(R reports 0.3792 rather than the hand-computed 0.3784 because the hand computation uses rounded values of $S_{xx}$, $S_{yy}$ and $S_{xy}$.)
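The confidence interval and the test for the slope can also be reproduced step by step. The following R sketch assumes the Forbes data are entered by hand and uses illustrative names (se_b1, tcrit, t_stat, p_val); confint() is R's built-in equivalent of the interval computed here.

```r
# Forbes' data as tabulated in Example 10.1 (entered by hand)
x = c(194.5, 194.3, 197.9, 198.4, 199.4, 199.9, 200.9, 201.1, 201.4,
      201.3, 203.6, 204.6, 209.5, 208.6, 210.7, 211.9, 212.2)
pressure = c(20.79, 20.79, 22.40, 22.67, 23.15, 23.35, 23.89, 23.99, 24.02,
             24.01, 25.14, 26.57, 28.49, 27.76, 29.04, 29.88, 30.06)
y = 100 * log10(pressure)
n = length(x)

fit = lm(y ~ x)
b1 = unname(coef(fit)[2])                       # slope estimate
sigma_hat = sqrt(sum(resid(fit)^2) / (n - 2))   # estimate of sigma
se_b1 = sigma_hat / sqrt(sum((x - mean(x))^2))  # SE of the slope

# (a) 95% confidence interval for the slope
tcrit = qt(0.975, df = n - 2)                   # critical value t_{0.025, n-2}
ci = b1 + c(-1, 1) * tcrit * se_b1

# (b) two-sided test of H0: beta1 = 0.95
t_stat = (b1 - 0.95) / se_b1
p_val = 2 * pt(-abs(t_stat), df = n - 2)

ci
confint(fit)[2, ]                               # built-in interval, for comparison
c(t = t_stat, p = p_val)
```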

10.7 Predictions

Given the new value $x^*$, we would like to predict its response

$$Y^* = \beta_0 + \beta_1 x^* + \varepsilon^*, \quad \text{with } \mathrm{var}(\varepsilon^*) = \sigma^2.$$

The expectation $E(Y^* \mid x^*) = \beta_0 + \beta_1 x^*$ is estimated as

$$\hat{y}^* = \hat\beta_0 + \hat\beta_1 x^* = \bar{y} + \hat\beta_1 (x^* - \bar{x}) = \sum \left( \frac{1}{n} + \frac{(x^* - \bar{x})(x_i - \bar{x})}{S_{xx}} \right) y_i$$

The variance of the prediction error $(Y^* - \hat{y}^*)$ is

$$\mathrm{var}(Y^* - \hat{y}^*) = \mathrm{var}(Y^*) + \mathrm{var}(\hat{y}^*) = \sigma^2 + \sigma^2 \left[ \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} \right],$$

coming from two sources: $\varepsilon^*$ and the estimated coefficients $\hat\beta_0$ and $\hat\beta_1$.

SE of prediction

$$SE_{pred}(\hat{y}^* \mid x^*) = \hat\sigma \left[ 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} \right]^{1/2}.$$

A $100(1 - \alpha)\%$ predictive interval of $Y^*$ is

$$\hat{y}^* \pm t_{\alpha/2, n-2} \, SE_{pred}(\hat{y}^* \mid x^*).$$


Ex. 10.1 (cont.) Construct the 95% predictive interval for the log-pressure at $x^* = 205$.

Recall that $\hat\beta_1 = 0.896$. The estimated intercept is

$$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = 139.60 - 0.896 \times 202.95 = -42.24$$

The predicted value is

$$\hat{y}^* = -42.24 + 0.896 \times 205 = 141.44.$$

The SE of the prediction is given by

$$SE_{pred}(\hat{y}^* \mid x^*) = \hat\sigma \left[ 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} \right]^{1/2} = 0.3784 \times \left[ 1 + \frac{1}{17} + \frac{(205 - 202.95)^2}{530.78} \right]^{1/2} = 0.391$$

The 95% predictive interval at $x^* = 205$ is

$$141.44 \pm 2.13 \times 0.391 = (140.6, 142.3)$$

> fit = lm(logP ~ BoilingPoint) #a different way to fit the model
> summary(fit)                  #a different way to summarize

Call:
lm(formula = logP ~ BoilingPoint)

Residuals:
     Min       1Q   Median       3Q      Max
-0.31974 -0.14707 -0.06890  0.01877  1.35994

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  -42.16418    3.34136  -12.62 2.17e-09 ***
BoilingPoint   0.89562    0.01646   54.42  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3792 on 15 degrees of freedom
Multiple R-squared:  0.995, Adjusted R-squared:  0.9946
F-statistic:  2962 on 1 and 15 DF,  p-value: < 2.2e-16

> new = data.frame(BoilingPoint = seq(200, 210, 5)) #variable name has to be the same
> predict(fit, new, interval = "prediction")        #prediction interval
       fit      lwr      upr
1 136.9594 136.1214 137.7974
2 141.4375 140.6028 142.2721   #This is for BoilingPoint = 205
3 145.9155 145.0480 146.7831

### try also the following
> predict(fit, new, se.fit = TRUE)           #give the SE of the fit
> predict(fit, new, interval = "confidence") #confidence interval for the group mean
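To connect the predict() output with the formulas of Section 10.7, the prediction interval at BoilingPoint = 205 can be rebuilt by hand. This is a sketch assuming the Forbes data are entered by hand; x_star, se_mean, se_pred and pred_int are illustrative names. Because it works with unrounded quantities, it should match the predict() row for 205 more closely than the rounded hand calculation does.

```r
# Forbes' data as tabulated in Example 10.1 (entered by hand)
x = c(194.5, 194.3, 197.9, 198.4, 199.4, 199.9, 200.9, 201.1, 201.4,
      201.3, 203.6, 204.6, 209.5, 208.6, 210.7, 211.9, 212.2)
pressure = c(20.79, 20.79, 22.40, 22.67, 23.15, 23.35, 23.89, 23.99, 24.02,
             24.01, 25.14, 26.57, 28.49, 27.76, 29.04, 29.88, 30.06)
y = 100 * log10(pressure)
n = length(x)

Sxx = sum((x - mean(x))^2)
b1 = sum((x - mean(x)) * (y - mean(y))) / Sxx
b0 = mean(y) - b1 * mean(x)
sigma_hat = sqrt(sum((y - b0 - b1 * x)^2) / (n - 2))

x_star = 205
y_star = b0 + b1 * x_star                      # predicted value
leverage = 1 / n + (x_star - mean(x))^2 / Sxx  # part due to estimated coefficients
se_mean = sigma_hat * sqrt(leverage)           # SE of the estimated group mean
se_pred = sigma_hat * sqrt(1 + leverage)       # SE of prediction
tcrit = qt(0.975, df = n - 2)

pred_int = y_star + c(-1, 1) * tcrit * se_pred # 95% predictive interval
conf_int = y_star + c(-1, 1) * tcrit * se_mean # 95% CI for the group mean
c(fit = y_star, lwr = pred_int[1], upr = pred_int[2])
```

The confidence interval for the group mean is much narrower than the predictive interval, since it omits the variance of the new error term.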

Ex. 10.7 (cont.)

avg math score ($\bar{x}$) = 75, $SD_x$ = 10, $r$ = 0.8

avg physics score ($\bar{y}$) = 70, $SD_y$ = 8, $n$ = 900

(a) Predict a student's physics score if his/her math score is 85 and attach the size of the prediction error.

$\frac{85 - 75}{10} = 1$ SU in $x$ $\to$ $0.8 \times 1 = 0.8$ SU in $y$

regression estimate: $\hat{y} = 70 + 0.8 \times 8 = 76.4$.


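$$\hat\sigma = \sqrt{\frac{n-1}{n-2}} \, SD_y \sqrt{1 - r^2} = \sqrt{\frac{899}{898}} \times 8 \times \sqrt{1 - 0.8^2} = 4.80$$

$$SE_{pred} = \hat\sigma \left[ 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} \right]^{1/2} = 4.80 \left[ 1 + \frac{1}{n} + \frac{(85 - 75)^2}{(n-1) \, SD_x^2} \right]^{1/2} = 4.80 \left[ 1 + \frac{1}{900} + \frac{1}{899} \right]^{1/2} = 4.81.$$

Thus, the prediction is 76.4, give or take 4.81.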

(b) Of those whose math score is about 85, what percentage scored above average in physics?

The physics scores of the students in this subgroup are $\approx N(76.4, 4.80^2)$. The percentage of this subgroup above the average is given by:

$$P\{PS > 70\} = 1 - \Phi\left( \frac{70 - 76.4}{4.80} \right) = 1 - \Phi(-1.333) = 90.9\%.$$

(Note that the overall percentage of students who scored over 70 in physics was 50%.)

(c) On average, how much does each one-point increase in math contribute to the physics score? Answer this question through a 95% CI.

$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = r \frac{SD_y}{SD_x} = 0.8 \times \frac{8}{10} = 0.64.$$

$$SE(\hat\beta_1) = \frac{\hat\sigma}{\sqrt{S_{xx}}} = \frac{4.80}{SD_x \sqrt{n-1}} = \frac{4.80}{10 \sqrt{899}} = 0.0160$$

Thus, the 95% CI for $\beta_1$ is

$$0.64 \pm t_{0.025, 898} \times 0.0160 = 0.64 \pm 0.0314 = [0.61, 0.67].$$

(d) Would increasing the math score by 10 points increase the physics score by more than 6 points?

$$H_0: \beta_1 \le 0.6 \longleftrightarrow H_1: \beta_1 > 0.6.$$

$$t = \frac{\hat\beta_1 - \beta_{1,0}}{SE} = \frac{0.64 - 0.6}{0.0160} = 2.50$$

P-value = 1-pt(2.5, 898) = 0.63%.

Strong evidence against $H_0$ $\Longrightarrow$ the answer is "yes."

(e) Is it reasonable to assume that increasing the math score by 10 points increases the physics score by about 6.5 points on average?

$$H_0: \beta_1 = 0.65 \longleftrightarrow H_1: \beta_1 \ne 0.65$$

$t = \frac{0.64 - 0.65}{0.0160} = -0.625$, P-value = 53.2%.

Weak evidence against $H_0$ $\Longrightarrow$ do not reject $H_0$ $\Longrightarrow$ the answer is "yes."
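Parts (a) to (e) use only the summary statistics, so the whole example can be scripted in a few lines. A hedged R sketch follows; the variable names are illustrative, and because it keeps full precision its results may differ from the hand calculations in the last reported digit.

```r
n = 900; xbar = 75; SDx = 10; ybar = 70; SDy = 8; r = 0.8

sigma_hat = sqrt((n - 1) / (n - 2)) * SDy * sqrt(1 - r^2)
Sxx = (n - 1) * SDx^2
b1 = r * SDy / SDx

# (a) prediction at math score 85 and its standard error
y_star = ybar + b1 * (85 - xbar)
se_pred = sigma_hat * sqrt(1 + 1 / n + (85 - xbar)^2 / Sxx)

# (b) share of the subgroup scoring above the overall physics average
p_above = 1 - pnorm((ybar - y_star) / sigma_hat)

# (c) 95% confidence interval for the slope
se_b1 = sigma_hat / sqrt(Sxx)
ci = b1 + c(-1, 1) * qt(0.975, df = n - 2) * se_b1

# (d) one-sided test of H0: beta1 at most 0.6
p_d = 1 - pt((b1 - 0.60) / se_b1, df = n - 2)

# (e) two-sided test of H0: beta1 = 0.65
p_e = 2 * pt(-abs((b1 - 0.65) / se_b1), df = n - 2)

c(y_star = y_star, se_pred = se_pred, p_above = p_above)
ci
c(p_d = p_d, p_e = p_e)
```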

10.8 Correlation

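Recall that the covariance and correlation between $X$ and $Y$ are

$$\mathrm{cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] \quad \text{and} \quad \rho = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}.$$

Sample covariance $= \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1} = \frac{S_{xy}}{n-1}$.

Sample correlation: $r = \frac{S_{xy}}{\sqrt{S_{xx}} \sqrt{S_{yy}}}$.

Properties: Like the population correlation, $r$ has the following properties:

♠ $-1 \le r \le 1$;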


♠ The larger the |r|, the stronger the linear association;

♠ r > 0: positive association; r < 0: negative association;

♠ r is independent of unit.

Figure 10.7: Simulated data with sample correlations 0.026, -0.595, -0.990, 0.241, 0.970, 0.853.


Some Caveats:

— r ≈ 0 does not mean no association.

Figure 10.8: Two cases of null sample correlation (from G.Dallal c1999).

— r does not provide evidence of causation.

♣ The correlation between shoe size and reading skills is high for school kids. Buying larger shoes? Confounding factor: age.

♣ Beer consumption and lake water level. Confounding factor: temperature.

— Correlation is sometimes computed based on averages. Such a

correlation is called ecological correlation. It overstates the

strength of the association.

Figure 10.9: Left panel: Income vs level of education averaged in groups (3, r=0.902) instead of individuals (30,r=0.572). Right panel: Overall cancer risk versus per capita daily food intake (from G.Dallal c1999).

— When the bivariate data are not normal, the correlation and other summary statistics are not sufficient. In addition, the correlation might not be very meaningful.

Figure 10.10: The scatter plot for the hypothetical data presented in Figure 10.2, with education between 12 years and 16 years removed. The correlation increases from 0.63 to 0.73. The marginal distributions are not normal either.

Figure 10.11: Each data set has a correlation coefficient of 0.7 (from G.Dallal c1999).

Questions: Are $X$ and $Y$ correlated? The null hypothesis is $H_0: \rho = 0$. There are three kinds of alternative hypotheses:

(a) $H_1: \rho > 0$. (b) $H_1: \rho < 0$. (c) $H_1: \rho \ne 0$.

Exact test: Intuitively, reject $H_0$ in (a) when $r$ is large and reject $H_0$ in (c) when $|r|$ is large. This is equivalent to using

$$T = \frac{r \sqrt{n-2}}{\sqrt{1 - r^2}},$$

which is monotonic in $r$.

Null distribution: When the bivariate data are normal, under $H_0: \rho = 0$, $T \sim t_{n-2}$.

Example 10.8 A random sample of 45 measurements shows the correlation between the $x$ (expression of a protein) and $y$ (survival time) variables to be 0.29.

(a) Is there substantial evidence that $x$ and $y$ are correlated at the 5% significance level?

Testing problem: $H_0: \rho = 0 \longleftrightarrow H_1: \rho \ne 0$. The test statistic is

$$t = \frac{\sqrt{45 - 2} \times 0.29}{\sqrt{1 - 0.29^2}} = 2.0.$$

Table 10.1: Summary of the exact test for $H_0: \rho = 0$, with test statistic $t = \frac{r \sqrt{n-2}}{\sqrt{1 - r^2}}$. In the original, each P-value is illustrated by a tail area under the $t_{n-2}$ density curve.

Problem   Rejection region             P-value
(a)       $t > t_{\alpha, n-2}$        $P(T_{n-2} > t)$
(b)       $t < -t_{\alpha, n-2}$       $P(T_{n-2} < t)$
(c)       $|t| > t_{\alpha/2, n-2}$    $P(|T_{n-2}| > |t|)$

Hence, the P-value $= P(|T_{43}| > 2.0) = 5.2\%$. At the 5% level, we do not have strong enough evidence against $H_0$, namely that $x$ and $y$ are uncorrelated.

(b) What assumption(s) did you make in the above inference?

The data are a random sample from a bivariate normal distribution.
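The exact test needs only r and n. A minimal R sketch of Example 10.8, with illustrative variable names, covering the two-sided alternative (c) used in the example and the one-sided alternative (a) for comparison:

```r
n = 45
r = 0.29

t_stat = r * sqrt(n - 2) / sqrt(1 - r^2)         # test statistic, about 2.0

p_two_sided = 2 * pt(-abs(t_stat), df = n - 2)   # alternative (c): rho not equal to 0
p_greater   = 1 - pt(t_stat, df = n - 2)         # alternative (a): rho greater than 0

c(t = t_stat, p_two_sided = p_two_sided, p_greater = p_greater)
```

With the unrounded statistic the two-sided P-value differs slightly from the value obtained with t rounded to 2.0, but it stays just above 5%, so the conclusion is unchanged.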


Fisher transformation: When $(X_1, Y_1), \ldots, (X_n, Y_n)$ is a sample from a bivariate normal distribution, then

$$V = \frac{1}{2} \ln \frac{1 + r}{1 - r} \overset{a}{\sim} N\left( \mu_\rho, \frac{1}{n-3} \right), \quad \text{where } \mu_\rho = \frac{1}{2} \ln \frac{1 + \rho}{1 - \rho}.$$

$(1 - \alpha)$-CI for $\mu_\rho$: $v \pm z_{\alpha/2} / \sqrt{n-3} = (c_1, c_2)$, or for $\rho$:

$$\left( \frac{\exp(2 c_1) - 1}{\exp(2 c_1) + 1}, \; \frac{\exp(2 c_2) - 1}{\exp(2 c_2) + 1} \right).$$

Note that $V$ is a monotonic function of $r$, the sample correlation. One can test $H_0: \rho = \rho_0$ against

(a) $H_1: \rho > \rho_0$, (b) $H_1: \rho < \rho_0$, (c) $H_1: \rho \ne \rho_0$.

The test statistic is

$$Z = \sqrt{n-3} \, (V - \mu_0); \quad \mu_0 = \frac{1}{2} \ln \frac{1 + \rho_0}{1 - \rho_0}.$$

Example 10.9 Based on a sample of size 103, the sample correlation between income and education was found to be 0.4.


(a) Construct the 90% CI for the population correlation.

The Fisher transformation is

$$v = \frac{1}{2} \ln \frac{1 + 0.4}{1 - 0.4} = 0.4236$$

$\alpha = 0.1$, $z_{\alpha/2} = 1.645$, and the 90% CI for the Fisher transform of $\rho$ is

$$0.4236 \pm 1.645 / \sqrt{100} = (0.2591, 0.5881).$$

Hence, by converting this into $\rho$, the 90% CI for $\rho$ is

$$\left( \frac{\exp(2 \times 0.2591) - 1}{\exp(2 \times 0.2591) + 1}, \; \frac{\exp(2 \times 0.5881) - 1}{\exp(2 \times 0.5881) + 1} \right) = (0.2535, 0.5285).$$

(b) Is there any substantial evidence that the population correlation between income and education is greater than 0.2?

The problem: $H_0: \rho \le 0.20 \longleftrightarrow H_1: \rho > 0.20$. Note that $\mu_0 = \frac{1}{2} \ln \frac{1 + 0.2}{1 - 0.2} = 0.2027$. Hence, the test statistic is

$$z = (0.4236 - 0.2027) \sqrt{103 - 3} = 2.209.$$

Thus, the P-value is $P(Z > 2.209) = 1 - \Phi(2.209) = 1.36\%$. We have substantial evidence that the population correlation is greater than 0.2.
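The Fisher transformation is the inverse hyperbolic tangent, so both parts of Example 10.9 reduce to a few calls to atanh() and tanh(). In the R sketch below, fisher_ci and fisher_z are hypothetical helper functions written for illustration, not functions from any package.

```r
# Confidence interval for rho via the Fisher transformation
fisher_ci = function(r, n, level = 0.90) {
  v = atanh(r)                          # equals 0.5 * log((1 + r) / (1 - r))
  z = qnorm(1 - (1 - level) / 2)        # z_{alpha/2}
  # back-transform: tanh(c) = (exp(2c) - 1) / (exp(2c) + 1)
  tanh(v + c(-1, 1) * z / sqrt(n - 3))
}

# Test statistic for H0: rho = rho0
fisher_z = function(r, n, rho0) {
  sqrt(n - 3) * (atanh(r) - atanh(rho0))
}

ci = fisher_ci(0.4, 103, level = 0.90)  # part (a)
z = fisher_z(0.4, 103, rho0 = 0.2)      # part (b)
p = 1 - pnorm(z)                        # one-sided P-value for H1: rho greater than 0.2

ci
c(z = z, p = p)
```

Using tanh for the back-transformation avoids writing out the exponential ratio and is algebraically identical to it.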


Ex. 10.1 (cont.) Test the null hypothesis that BoilingPoint and logP are uncorrelated and construct a 95% confidence interval for the population correlation.

> cor.test(BoilingPoint, logP)

        Pearson's product-moment correlation

data:  BoilingPoint and logP
t = 54.42, df = 15, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9928242 0.9991143
sample estimates:
      cor
0.9974771

