
CHAPTER 7: Linear Correlation & Regression Methods

• 7.1 - Motivation

• 7.2 - Correlation / Simple Linear Regression

• 7.3 - Extensions of Simple Linear Regression

Parameter Estimation via SAMPLE DATA … Testing for association between two POPULATION variables X and Y…

• Categorical variables → Chi-squared Test on the table of Categories of X vs. Categories of Y

  Examples:  X = Disease status (D+, D–), Y = Exposure status (E+, E–)
             X = # children in household (0, 1-2, 3-4, 5+), Y = Income level (Low, Middle, High)

• Numerical variables → ???????

POPULATION PARAMETERS

  Means:       μ_X = E[X],  μ_Y = E[Y]
  Variances:   σ_X² = E[(X − μ_X)²],  σ_Y² = E[(Y − μ_Y)²]
  Covariance:  σ_XY = E[(X − μ_X)(Y − μ_Y)]

Parameter Estimation via SAMPLE DATA …

SAMPLE STATISTICS, from n data points x₁, x₂, x₃, x₄, …, xₙ and y₁, y₂, y₃, y₄, …, yₙ:

  Means:       x̄,  ȳ
  Variances:   s_x² = Σ(x − x̄)² / (n − 1),  s_y² = Σ(y − ȳ)² / (n − 1)
  Covariance:  s_xy = Σ(x − x̄)(y − ȳ) / (n − 1)   (can be +, –, or 0)

These are the sample estimates of the corresponding population parameters μ_X, μ_Y, σ_X², σ_Y², and σ_XY.

Arranging the sample as paired columns x₁ x₂ x₃ x₄ … xₙ over y₁ y₂ y₃ y₄ … yₙ and plotting the n data points gives a Scatterplot (JAMA. 2003;290:1486-1493).

Does this suggest a linear trend between X and Y? If so, how do we measure it?

Testing for association between two numerical POPULATION variables X and Y…

Linear Correlation Coefficient:   ρ_XY = σ_XY / (σ_X σ_Y)

Always between –1 and +1. It measures specifically LINEAR association.

Parameter Estimation via SAMPLE DATA …

The sample estimate of ρ_XY = σ_XY / (σ_X σ_Y) is

  r = s_xy / (s_x s_y)

which is likewise always between –1 and +1.


Example in R (reformatted for brevity):

> pop = seq(0, 20, 0.1)
> x = sort(sample(pop, 10))   # 1.1 1.8 2.1 3.7 4.0 7.3 9.1 11.9 12.4 17.1
> y = sample(pop, 10)         # 13.1 18.3 17.6 19.1 19.3 3.2 5.6 13.6 8.0 3.0
> n = 10
> plot(x, y, pch = 19)
> c(mean(x), mean(y))         # 7.05 12.08
> var(x)                      # 29.48944
> var(y)                      # 43.76178
> cov(x, y)                   # -25.86667
> cor(x, y)                   # -0.7200451
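For readers working outside R, the same summary statistics can be reproduced from the raw sample; this is a minimal Python sketch, assuming only the ten data values from the R session above:

```python
# Sample data from the R example above.
x = [1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1]
y = [13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0]
n = len(x)

mean_x = sum(x) / n                                     # 7.05
mean_y = sum(y) / n                                     # 12.08
var_x = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)   # 29.48944
var_y = sum((yi - mean_y) ** 2 for yi in y) / (n - 1)   # 43.76178
cov_xy = sum((xi - mean_x) * (yi - mean_y)
             for xi, yi in zip(x, y)) / (n - 1)         # -25.86667

# Linear correlation coefficient r = s_xy / (s_x * s_y).
r = cov_xy / (var_x ** 0.5 * var_y ** 0.5)              # -0.7200451
```

The values match the R output above because `var` and `cov` in R use the same n − 1 denominators.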

r measures the strength of linear association:

  negative linear correlation  ←  –1 ———— 0 ———— +1  →  positive linear correlation

Here r = cor(x, y) = –0.7200451, a negative linear correlation between X and Y.

Testing for linear association between two numerical population variables X and Y…

Linear Correlation Coefficient  ρ = σ_XY / (σ_X σ_Y)

  H₀: ρ = 0    "No linear association between X and Y."
  H_A: ρ ≠ 0   "Linear association between X and Y."

Now that we have r = s_xy / (s_x s_y), we can conduct HYPOTHESIS TESTING on ρ.

Test Statistic for p-value:

  T = r √(n − 2) / √(1 − r²)  ~  t_{n−2}

  t = (–0.72) √(10 − 2) / √(1 − (–0.72)²) = –2.935 on t₈

  p-value = 2 * pt(-2.935, 8) = .0189 < .05
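As a quick arithmetic check, the t-score can be recomputed directly; a Python sketch, taking r = −0.7200451 from the R session above:

```python
import math

r = -0.7200451   # sample correlation from the R session above
n = 10

# Test statistic T = r * sqrt(n - 2) / sqrt(1 - r^2), compared against t_{n-2}.
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
# t is about -2.935 on 8 degrees of freedom; the two-sided p-value
# 2 * pt(-2.935, 8) = .0189 is taken from R above.
```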

Parameter Estimation via SAMPLE DATA …

If such an association between X and Y exists, then we model it with an intercept β₀ and slope β₁:

  Y = β₀ + β₁ X + ε    "Response = Model + Error"

Find estimates β̂₀ and β̂₁ for the "best" line  Ŷ = β̂₀ + β̂₁ X  ... in what sense??? The residual at each data point (xᵢ, yᵢ), with fitted point (xᵢ, ŷᵢ), is eᵢ = yᵢ − ŷᵢ; the "Least Squares Regression Line" is the line that minimizes Σ eᵢ².

SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES:

  β̂₁ = s_xy / s_x² = –25.86667 / 29.48944 = –0.87715

  β̂₀ = ȳ − β̂₁ x̄ = 12.08 − (–0.87715)(7.05) = 18.26391

  Ŷ = 18.26391 − 0.87715 X

Check: (x̄, ȳ) is on the line.

The fitted line applied to the sample:

  predictor          X    1.1  1.8  2.1  3.7  4.0  7.3  9.1  11.9  12.4  17.1
  observed response  Y   13.1 18.3 17.6 19.1 19.3  3.2  5.6  13.6   8.0   3.0
  fitted response    Ŷ = 18.26391 − 0.87715 X     ~ E X E R C I S E ~
  residuals          Y − Ŷ                        ~ E X E R C I S E ~
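The least-squares estimates can also be reproduced outside R; a minimal Python sketch using only the sample data (its fitted responses and residuals are the quantities the exercise above asks for):

```python
x = [1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1]
y = [13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

s_xx = sum((xi - mx) ** 2 for xi in x) / (n - 1)                    # s_x^2
s_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

b1 = s_xy / s_xx      # slope:     -0.87715
b0 = my - b1 * mx     # intercept: 18.26391

y_hat = [b0 + b1 * xi for xi in x]             # fitted responses
resid = [yi - yh for yi, yh in zip(y, y_hat)]  # residuals e_i = y_i - y_hat_i

# The residuals of the least-squares line always sum to (essentially) zero.
```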

Testing for linear association between two numerical population variables X and Y…

Linear Regression Coefficients: now that we have β̂₀ and β̂₁, we can conduct HYPOTHESIS TESTING on β₀ and β₁.

  H₀: β₁ = 0    "No linear association between X and Y."
  H_A: β₁ ≠ 0   "Linear association between X and Y."

Test Statistic for p-value:

  T = (β̂₁ − 0) √((n − 1) s_x²) / √(MS_Err)  ~  t_{n−2},   where  MS_Err = SS_Err / (n − 2)  and  SS_Err = Σ (y − ŷ)².

Here SS_Err = 189.6555, so

  t = (–0.87715 − 0) √((9)(29.48944)) / √(189.6555 / 8) = –2.935 on t₈

Same t-score as H₀: ρ = 0!  p-value = .0189

> plot(x, y, pch = 19)
> lsreg = lm(y ~ x)   # or lsfit(x, y)
> abline(lsreg)
> summary(lsreg)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-8.6607 -3.2154  0.8954  3.4649  5.7742

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  18.2639     2.6097   6.999 0.000113 ***
x            -0.8772     0.2989  -2.935 0.018857 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.869 on 8 degrees of freedom
Multiple R-squared: 0.5185, Adjusted R-squared: 0.4583
F-statistic: 8.614 on 1 and 8 DF, p-value: 0.01886

BUT WHY HAVE TWO METHODS FOR THE SAME PROBLEM??? Because this second method generalizes…

Recall the ANOVA table for testing equality of treatment-group means:

  Source      df    SS    MS    F-ratio    p-value
  Treatment
  Error
  Total                   –

For the regression model Y = β₀ + β₁ X + ε, testing H₀: β₁ = 0 vs. H_A: β₁ ≠ 0 uses the same layout, with a Regression row (df = 1) in place of Treatment:

  Source      df    SS    MS    F-ratio    p-value
  Regression   1     ?     ?       ?          ?
  Error        ?     ?     ?
  Total        ?     ?           –

With n = 10 data points, the Error degrees of freedom are df_Err = n − 2 = 8:

  Source      df    SS    MS    F-ratio    p-value
  Regression   1     ?     ?       ?          ?
  Error        8     ?     ?
  Total        ?     ?           –

Parameter Estimation via SAMPLE DATA …

To fill in the SS column, define three sums of squares over the n data points:

  SS_Tot = Σ (y − ȳ)² = (n − 1) s_y²

SS_Tot is a measure of the total amount of variability in the observed responses (i.e., before any model-fitting).

  SS_Reg = Σ (ŷ − ȳ)²

SS_Reg is a measure of the total amount of variability in the fitted responses (i.e., after model-fitting).

  SS_Err = Σ (y − ŷ)²

SS_Err is a measure of the total amount of variability in the resulting residuals (i.e., after model-fitting).

For our fitted line Ŷ = 18.26391 − 0.87715 X:

  SS_Tot = Σ (y − ȳ)² = (n − 1) s_y² = 9 (43.76178) = 393.856
  SS_Reg = Σ (ŷ − ȳ)² = 204.2
  SS_Err = Σ (y − ŷ)² = 189.656   (the minimum achievable, by least squares)

Note that SS_Tot = SS_Reg + SS_Err.

  Source      df    SS       MS       F-ratio    p-value
  Regression   1    204.200  204.200  8.61349    0.018857
  Error        8    189.656   23.707
  Total        9    393.856      –

Here MS = SS / df, and the F-ratio MS_Reg / MS_Err is compared against F_{k−1, n−k} (here F_{1, 8}), with 0 < p < 1. The p-value 0.018857 is the same as before!

> summary(aov(lsreg))
            Df  Sum Sq  Mean Sq  F value   Pr(>F)
x            1  204.20  204.201   8.6135  0.01886 *
Residuals    8  189.66   23.707
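The whole decomposition can be verified numerically; a Python sketch, again assuming only the ten data values from the R session:

```python
x = [1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1]
y = [13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Least-squares fit, as before.
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
     / sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx
y_hat = [b0 + b1 * xi for xi in x]

ss_tot = sum((yi - my) ** 2 for yi in y)                  # ~ 393.856
ss_reg = sum((yh - my) ** 2 for yh in y_hat)              # ~ 204.20
ss_err = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # ~ 189.66

# SS_Tot = SS_Reg + SS_Err, and the F-ratio MS_Reg / MS_Err equals the
# square of the (unrounded) t-score from the coefficient test.
f_ratio = (ss_reg / 1) / (ss_err / (n - 2))               # ~ 8.6135
```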

Moreover,

  Coefficient of Determination:   r² = SS_Reg / SS_Tot = 204.2 / 393.856 = 0.5185 = (–0.72)²

The least squares regression line accounts for 51.85% of the total variability in the observed response, with 48.15% remaining. (Compare "Multiple R-squared: 0.5185" in the summary(lsreg) output above.)

Summary of Linear Correlation and Simple Linear Regression

Given: sample data x₁ x₂ x₃ x₄ … xₙ and y₁ y₂ y₃ y₄ … yₙ on numerical variables X and Y, with Means x̄, ȳ, Variances s_x², s_y², and Covariance s_xy (Scatterplot: JAMA. 2003;290:1486-1493).

• Linear Correlation Coefficient:  r = s_xy / (s_x s_y), with –1 ≤ r ≤ +1; measures the strength of linear association.

• Least Squares Regression Line:  Ŷ = β̂₀ + β̂₁ X, with β̂₁ = s_xy / s_x² and β̂₀ = ȳ − β̂₁ x̄; it minimizes SS_Err = Σ (y − ŷ)² = SS_Tot − SS_Reg.

• Coefficient of Determination:  r² = SS_Reg / SS_Tot = proportion of total variability modeled by the regression line's variability.

All point estimates can be upgraded to CIs for hypothesis testing, etc. (ANOVA), giving upper and lower 95% confidence bands around the fitted line (see notes for "95% prediction intervals").

Multilinear Regression: testing for linear association between a population response variable Y and multiple predictor variables X₁, X₂, X₃, … etc.

  Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + … + β_{k−1}X_{k−1} + ε    "Response = Model + Error"

The terms β₁X₁, …, β_{k−1}X_{k−1} are the "main effects." For now, assume the "additive model," i.e., main effects only. From the sample,

  Ŷ = β̂₀ + β̂₁X₁ + β̂₂X₂ + … + β̂_{k−1}X_{k−1}

  H₀: β₁ = β₂ = β₃ = … = β_{k−1} = 0   "No linear association between Y and any of its predictors X₁, X₂, X₃, …, X_{k−1}."
  H_A: βᵢ ≠ 0 for some i = 1, 2, …, k − 1   "Linear association between Y and at least one of its predictors."

Geometrically (two predictors X₁, X₂): each observation has predictors (x₁ᵢ, x₂ᵢ), true response yᵢ, fitted response ŷᵢ on the plane Ŷ = β̂₀ + β̂₁X₁ + β̂₂X₂, and residual eᵢ = yᵢ − ŷᵢ; the point (x̄₁, x̄₂, ȳ) lies on the plane.

Least Squares calculation of the regression coefficients is computer-intensive; the formulas require Linear Algebra (matrices)! Once calculated, how do we then test the null hypothesis? ANOVA.
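The point that multilinear least squares reduces to linear algebra can be sketched directly: form the normal equations XᵀX β = Xᵀy from the design matrix and solve them. A minimal Python illustration; the `solve` helper and the two-predictor data below are hypothetical, invented so that the exact coefficients are known in advance:

```python
def solve(A, b):
    """Solve A beta = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    beta = [0.0] * n
    for i in range(n - 1, -1, -1):
        beta[i] = (M[i][n] - sum(M[i][j] * beta[j]
                                 for j in range(i + 1, n))) / M[i][i]
    return beta

# Hypothetical data generated from Y = 2 + 3*X1 - 1*X2 with no error term,
# so least squares should recover the coefficients exactly.
x1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.0, 0.0, 2.0, 1.0, 3.0, 2.0]
y = [2 + 3 * a - 1 * b for a, b in zip(x1, x2)]

X = [[1.0, a, b] for a, b in zip(x1, x2)]   # design matrix with intercept column
XtX = [[sum(X[i][r] * X[i][c] for i in range(len(X)))
        for c in range(3)] for r in range(3)]
Xty = [sum(X[i][r] * y[i] for i in range(len(X))) for r in range(3)]

beta = solve(XtX, Xty)   # recovers [2.0, 3.0, -1.0] up to rounding
```

In practice R's lm does this (more stably, via a QR decomposition) behind the scenes.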

“main effects”

R code example (main effects):  lsreg = lm(y ~ x1 + x2 + x3)

Powers of the predictors can be added to the model as well:

  Y = β₀ + β₁X₁ + β₂X₂ + … + β_{k−1}X_{k−1} + β₁,₁X₁² + β₂,₂X₂² + … + β_{k−1,k−1}X_{k−1}² + cubes + … + ε

R code example:  lsreg = lm(y ~ x + I(x^2) + I(x^3))
(Note: inside an R formula a power must be wrapped in I(), since a bare ^ is interpreted as formula crossing, not arithmetic.)

So can "interactions" between predictors:

  Y = β₀ + (main effects) + (squares, cubes, …)
        + β₁,₂X₁X₂ + β₁,₃X₁X₃ + … + β₁,k−1X₁X_{k−1}
        + β₂,₃X₂X₃ + β₂,₄X₂X₄ + … + β₂,k−1X₂X_{k−1} + … + ε

R code example:  lsreg = lm(y ~ x1 + x2 + x1:x2), equivalently lsreg = lm(y ~ x1*x2)

Recall the earlier scatterplot. Suppose these are actually two subgroups, requiring two distinct linear regressions! Multiple linear regression with interaction, using an indicator ("dummy") variable:

Example in R (reformatted for brevity):

> I = c(1,1,1,1,1,0,0,0,0,0)
> lsreg = lm(y ~ x*I)
> summary(lsreg)

Coefficients:
             Estimate
(Intercept)  6.56463
x            0.00998
I            6.80422
x:I          1.60858

  Ŷ = β̂₀ + β̂₁X + β̂₂I + β̂₃X·I = 6.56 + 0.01X + 6.80I + 1.61X·I

  I = 0:  Ŷ = 6.56 + 0.01X
  I = 1:  Ŷ = (6.56 + 6.80) + (0.01 + 1.61)X = 13.36 + 1.62X

ANOVA Table (revisited)

From a sample of n data points, fit  Ŷ = β̂₀ + β̂₁X₁ + β̂₂X₂ + … + β̂_{k−1}X_{k−1}  for the model  Y = β₀ + β₁X₁ + β₂X₂ + … + β_{k−1}X_{k−1} + ε. But how are these regression coefficients calculated in general? "Normal equations," solved via computer (intensive).

  H₀: β₁ = β₂ = β₃ = … = β_{k−1} = 0   "No linear association between Y and any of its predictors X₁, X₂, X₃, …, X_{k−1}."
  Note that if true, then it would follow that μ_Y = β₀, estimated by ŷ = β̂₀.

  H_A: βᵢ ≠ 0 for some i = 1, 2, …, k − 1   "Linear association between Y and at least one of its predictors."

  Source      df      SS              MS                F                  p-value
  Regression  k − 1   Σᵢ (ŷᵢ − ȳ)²    SS_Reg / df       MS_Reg / MS_Err    0 < p < 1
  Error       n − k   Σᵢ (yᵢ − ŷᵢ)²   SS_Err / df       ~ F_{k−1, n−k}
  Total       n − 1   Σᵢ (yᵢ − ȳ)²        –

*** How are only the statistically significant variables determined? ***

"MODEL SELECTION" (Backward Elimination)

Step 0. Conduct an overall F-test of significance (via ANOVA) of the full model
        Y = β₁X₁ + β₂X₂ + β₃X₃ + β₄X₄ + …. If significant, then…

Step 1. Conduct individual t-tests:  H₀: β₁ = 0, H₀: β₂ = 0, H₀: β₃ = 0, H₀: β₄ = 0, ….
        p-values:  p₁ < .05 (Reject H₀), p₂ < .05 (Reject H₀), p₃ ≥ .05 (Accept H₀), p₄ < .05 (Reject H₀), ….

Step 2. Are all coefficients significant at level α? If not, delete that term (here β₃X₃), and recompute new coefficients!

Step 3. Repeat 1-2 as necessary until all coefficients are significant → reduced model  Y = β₁X₁ + β₂X₂ + β₄X₄ + ….
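The loop in Steps 1-3 can be written out schematically. Everything here is hypothetical scaffolding, not lecture code: `refit` stands in for whatever routine refits the model on the remaining terms and returns their p-values.

```python
def backward_eliminate(terms, refit, alpha=0.05):
    """Drop the least significant term (largest p >= alpha), refit, repeat."""
    terms = list(terms)
    while terms:
        pvalues = refit(terms)        # {term: p-value} for the current model
        worst = max(terms, key=lambda t: pvalues[t])
        if pvalues[worst] < alpha:    # all terms significant -> reduced model
            return terms
        terms.remove(worst)           # delete that term and recompute
    return terms

# Toy stand-in: fixed p-values as on the slide, so x3 (p3 >= .05) is dropped.
fake_p = {"x1": 0.01, "x2": 0.03, "x3": 0.40, "x4": 0.02}
reduced = backward_eliminate(["x1", "x2", "x3", "x4"],
                             lambda ts: {t: fake_p[t] for t in ts})
# reduced is ['x1', 'x2', 'x4']
```

A real refit would recompute p-values after each deletion, which is exactly why the slide insists on recomputing the coefficients at Step 2.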

Recall ~ Analysis of Variance (ANOVA): k ≥ 2 independent, equivariant, normally-distributed "treatment groups" with means μ₁, μ₂, …, μ_k, tested via H₀: μ₁ = μ₂ = … = μ_k.

MODEL ASSUMPTIONS? "Regression Diagnostics"

If the scatterplot is not linear: re-plot the data on a "log-log" scale, or on a "log" scale (of Y only).

Binary outcome, e.g., "Have you ever had surgery?" (Yes / No), with π = P(Yes).

Simple logistic model:

  ln( π̂ / (1 − π̂) ) = β̂₀ + β̂₁ X    ⇔    π̂ = 1 / (1 + e^−(β̂₀ + β̂₁X))

The coefficients are found via "MAXIMUM LIKELIHOOD ESTIMATION," not least squares. (Note: not being based on LS implies "pseudo-R²," etc.) The "log-odds" ("logit") is an example of a general "link function" g(μ).

With multiple predictors:

  ln( π̂ / (1 − π̂) ) = β̂₀ + β̂₁X₁ + β̂₂X₂ + … + β̂ₖXₖ    ⇔    π̂ = 1 / (1 + e^−(β̂₀ + β̂₁X₁ + β̂₂X₂ + … + β̂ₖXₖ))
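The logit link and its inverse are easy to state in code; a generic Python sketch, not tied to any particular fitted model (the coefficients b0, b1 below are hypothetical):

```python
import math

def logit(p):
    """Log-odds: ln(p / (1 - p)) for 0 < p < 1."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Inverse link 1 / (1 + e^(-z)), mapping log-odds back to a probability."""
    return 1 / (1 + math.exp(-z))

# The two functions are inverses, so a fitted log-odds b0 + b1*x converts
# directly into a predicted probability.
b0, b1 = -2.0, 0.5            # hypothetical coefficients
p_hat = inv_logit(b0 + b1 * 3.0)   # predicted P(Yes) at x = 3
```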

Suppose one of the predictor variables is binary:  X₁ = 1 if Age ≥ 50, X₁ = 0 if Age < 50.

  X₁ = 1:   ln( π̂₁ / (1 − π̂₁) ) = β̂₀ + β̂₁(1) + β̂₂X₂ + … + β̂ₖXₖ
  X₁ = 0:   ln( π̂₀ / (1 − π̂₀) ) = β̂₀ + β̂₁(0) + β̂₂X₂ + … + β̂ₖXₖ

SUBTRACT:

  ln( π̂₁ / (1 − π̂₁) ) − ln( π̂₀ / (1 − π̂₀) ) = β̂₁

i.e.,

  β̂₁ = ln[ (π̂₁ / (1 − π̂₁)) / (π̂₀ / (1 − π̂₀)) ] = ln( odds of surgery given Age ≥ 50 / odds of surgery given Age < 50 )

So  β̂₁ = ln(OR)  implies  OR = e^β̂₁,  the ODDS RATIO.
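Numerically, the subtraction argument says the coefficient of the binary predictor is the log of the odds ratio. A sketch with made-up group probabilities (p1, p0 are hypothetical, not from the lecture):

```python
import math

# Hypothetical probabilities of surgery in the two age groups.
p1 = 0.40   # P(surgery | Age >= 50)
p0 = 0.25   # P(surgery | Age < 50)

odds1 = p1 / (1 - p1)
odds0 = p0 / (1 - p0)
OR = odds1 / odds0                       # odds ratio

b1 = math.log(odds1) - math.log(odds0)   # difference of log-odds = beta_1
# exp(b1) recovers OR, i.e., OR = e^(beta_1).
```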

Recall from population dynamics…

Unrestricted population growth (e.g., bacteria): population size y obeys the law dy/dt = a y, with constant a > 0.

  ∫ (1/y) dy = ∫ a dt   →   ln|y| = at + b   →   y = e^b e^at = C e^at

With initial condition y(0) = y₀:   y = y₀ e^at    (Exponential growth)
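A quick numerical sanity check that y = y₀ e^(at) solves dy/dt = ay; a Python sketch with arbitrary illustrative constants:

```python
import math

a, y0 = 0.8, 5.0                 # arbitrary growth rate and initial size
y = lambda t: y0 * math.exp(a * t)

# A central-difference estimate of dy/dt at t = 2 should match a * y(2).
t, h = 2.0, 1e-6
dydt = (y(t + h) - y(t - h)) / (2 * h)
```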

Restricted population growth (disease, predation, starvation, etc.): with constant a > 0 and "carrying capacity" M, population size y obeys the law dy/dt = a y (1 − y/M).

Let survival probability π = y / M, so that dπ/dt = a π (1 − π). Then

  ∫ dπ / [π(1 − π)] = ∫ a dt   →   ln|π| − ln|1 − π| = at + b   →   ln( π / (1 − π) ) = at + b

With initial condition π(0) = π₀:   π = π₀ / (π₀ + (1 − π₀) e^−at)    (Logistic growth)
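Likewise, the closed-form logistic solution can be checked against the ODE dπ/dt = aπ(1 − π); a Python sketch with arbitrary illustrative constants:

```python
import math

a, pi0 = 0.8, 0.1   # arbitrary growth rate and initial proportion

def pi(t):
    """Logistic solution pi(t) = pi0 / (pi0 + (1 - pi0) * e^(-a t))."""
    return pi0 / (pi0 + (1 - pi0) * math.exp(-a * t))

# Central-difference derivative at t = 1.5 should match a * pi * (1 - pi).
t, h = 1.5, 1e-6
dpidt = (pi(t + h) - pi(t - h)) / (2 * h)
# pi(0) = pi0, and pi(t) -> 1 as t grows: the carrying capacity is approached.
```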
