Lecture 4, part 1: Linear Regression Analysis: Two Advanced Topics
July 14, 2011
Karen Bandeen-Roche, PhD, Department of Biostatistics, Johns Hopkins University
Introduction to Statistical Measurement and Modeling
Transcript
Page 1

Lecture 4, part 1: Linear Regression Analysis: Two Advanced Topics

July 14, 2011

Karen Bandeen-Roche, PhD
Department of Biostatistics
Johns Hopkins University

Introduction to Statistical Measurement and Modeling

Page 2

Data examples: Boxing and neurological injury

Scientific question: Does amateur boxing lead to decline in neurological performance?

Some related statistical questions:

Is there a dose-response increase in the rate of cognitive decline with increased boxing exposure?

Is boxing-associated decline independent of initial cognition and age?

Is there a threshold of boxing that initiates harm?

Page 3

Boxing data

[Figure: scatterplot of blkdiff (y-axis) versus blbouts (x-axis, 0 to 400), with a lowess smoother overlaid, bandwidth = 0.8.]

Page 4

Outline

Topic #1: Confounding
Handling this is crucial if we are to draw correct conclusions about risk factors

Topic #2: Signal/noise decomposition
Signal: regression model predictions
Noise: residual variation
Another way of approaching inference and the precision of prediction

Page 5

Topic #1: Confounding

"Confound" means to confuse.

Confounding arises when the comparison is between groups that are otherwise not similar in ways that affect the outcome.

Related terms: lurking variables, ...

Page 6

Confounding Example: Drowning and Eating Ice Cream

[Figure: scatterplot of drowning rate (y-axis) versus ice cream eaten (x-axis), showing an apparent overall association.]

Page 7


Confounding

Epidemiology definition: A characteristic "C" is a confounder if it is associated with both the outcome (Y: drowning) and the risk factor (X: ice cream) and is not causally in between.

[Diagram: Ice cream consumption → Drowning rate, with the causal arrow marked "?".]

Page 8

Confounding

Statistical definition: A characteristic "C" is a confounder if the strength of the relationship between the outcome (Y: drowning) and the risk factor (X: ice cream) differs with, versus without, adjustment for C.

[Diagram: Outdoor temperature is associated with both ice cream eaten and drowning rate, confounding the relationship between them.]
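To make the statistical definition concrete, here is a minimal simulation sketch (illustrative, not from the lecture): a common cause induces an apparent X-Y association that disappears once we adjust for it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Common cause: outdoor temperature drives both variables.
temp = rng.normal(size=n)
ice_cream = 0.8 * temp + rng.normal(size=n)   # X: risk factor
drowning = 0.8 * temp + rng.normal(size=n)    # Y: outcome; no direct effect of X

def ols_slope(y, *covariates):
    """First-covariate coefficient from a least-squares fit with intercept."""
    X = np.column_stack([np.ones(len(y)), *covariates])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

print(ols_slope(drowning, ice_cream))        # approx. 0.39: spurious association
print(ols_slope(drowning, ice_cream, temp))  # approx. 0: vanishes after adjustment
```

The coefficient on ice cream changes markedly with, versus without, adjustment for temperature, which is exactly the statistical definition above.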

Page 9

Confounding Example: Drowning and Eating Ice Cream

[Figure: the same scatterplot with points separated into cool-temperature and warm-temperature groups; within each temperature group, the apparent ice cream-drowning association is much weaker.]

Page 10


Effect modification

A characteristic "E" is an effect modifier if the strength of the relationship between the outcome (Y: drowning) and the risk factor (X: ice cream) differs within levels of E.

[Diagram: Ice cream consumption → Drowning rate, with outdoor temperature modifying the strength of the relationship.]
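In regression terms, effect modification corresponds to an interaction term: the X-Y slope differs by level of E. A minimal sketch (illustrative, continuing the simulated example above):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

warm = rng.integers(0, 2, size=n).astype(float)   # E: 1 = warm day, 0 = cool day
ice_cream = rng.normal(size=n)                    # X: risk factor
# The X-Y slope is 0.6 on warm days and 0 on cool days: effect modification.
drowning = 0.6 * warm * ice_cream + rng.normal(size=n)

# Interaction model: Y = b0 + b1*X + b2*E + b3*(X*E) + error
X = np.column_stack([np.ones(n), ice_cream, warm, ice_cream * warm])
b = np.linalg.lstsq(X, drowning, rcond=None)[0]
print(b[1])   # slope on cool days, approx. 0
print(b[3])   # additional slope on warm days, approx. 0.6
```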

Page 11

Effect Modification: Drowning and Eating Ice Cream

[Figure: scatterplot of drowning rate versus ice cream eaten, stratified by temperature; the slope of the relationship differs between the cool-temperature and warm-temperature groups.]

Page 12

Topic #2: Signal/Noise Decomposition

Lovely due to geometry of least squares

Facilitates testing involving multiple parameters at once

Provides insight into R-squared

Page 13

Signal/Noise Decomposition

First step: decomposition of variance

"Regression" part: variance of the fitted values Ŷᵢ
"Error" or "Residual" part: variance of the residuals eᵢ
Together these determine the "total" variance of the Yᵢ

We work with "sums of squares" (SS) rather than variance per se:

Regression SS (SSR): Σᵢ (Ŷᵢ − Ȳ)²
Error SS (SSE): Σᵢ (Yᵢ − Ŷᵢ)²
Total SS (SST): Σᵢ (Yᵢ − Ȳ)²
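A minimal numeric sketch of these quantities (illustrative data; any least-squares fit will do):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p predictors
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]   # fitted values

SSR = np.sum((y_hat - y.mean()) ** 2)   # regression ("signal")
SSE = np.sum((y - y_hat) ** 2)          # residual ("noise")
SST = np.sum((y - y.mean()) ** 2)       # total

print(np.isclose(SST, SSR + SSE))       # True
print(SSR / SST)                        # R-squared
```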

Page 14

Signal/Noise Decomposition

Properties:

SST = SSR + SSE

SSR/SST = "proportion of variance explained" by regression = R-squared

Both follow from the geometry of least squares.

SSR and SSE are independent (assuming A1-A5) and have easily characterized probability distributions, which provides convenient testing methods. This follows from the geometry plus the assumptions.

Page 15

Signal/Noise Decomposition

SSR and SSE are independent

Define M = span(X) and take Y as centered at Ȳ.

It is possible to orthogonally rotate the coordinate axes so that the first p axes lie in M and the remaining n−p−1 axes lie in M⊥ (Gram-Schmidt orthogonalization accomplishes this).

Doing so transforms Y into Z := TY for some orthonormal matrix T with columns {e_1,...,e_{n-1}}.

Distribution: Z ~ N(T E[Y|X], σ²I).
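A sketch of this rotation using a QR decomposition in place of explicit Gram-Schmidt (equivalent for this purpose); illustrative code, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # n x (p+1) design
y = X @ rng.normal(size=p + 1) + rng.normal(size=n)

# Full QR decomposition: the first p+1 columns of Q form an orthonormal
# basis for M = span(X); the remaining n-p-1 columns span M-perp.
Q, _ = np.linalg.qr(X, mode="complete")
z = Q.T @ y                                # coordinates after rotation

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat

# SSE equals the squared length of the coordinates lying in M-perp.
print(np.isclose(np.sum(z[p + 1:] ** 2), np.sum(resid ** 2)))   # True
```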

Page 16

Signal/Noise Decomposition

SSR and SSE are independent, continued

TY = Z implies Y = T′Z.

SSE = squared length of Y − Ŷ = Z_{p+1}² + ... + Z_{n-1}²

SSR = squared length of Ŷ − Ȳ = Z_1² + ... + Z_p²

The claim now follows: SSR and SSE are independent because (Z_1,...,Z_p) and (Z_{p+1},...,Z_{n-1}) are independent.

Page 17

Signal/Noise Decomposition

Under A1-A5, SSE, SSR and their scaled ratio have convenient distributions.

Under A1-A2: E[Y|X] ∈ M, so E[Z_j|X] = 0 for all j > p.

Recall {Z_1,...,Z_{n-1}} are mutually independent normal with variance σ².

Thus SSE = Z_{p+1}² + ... + Z_{n-1}² ~ σ² χ²_{n-p-1} under A1-A5

(a sum of k independent squared N(0,1) variables is χ²_k).
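A quick Monte Carlo check of this distributional fact (illustrative sketch, simulated under a correctly specified normal linear model):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 40, 3, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = rng.normal(size=p + 1)

sse_over_sigma2 = []
for _ in range(5000):
    y = X @ beta + sigma * rng.normal(size=n)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    sse_over_sigma2.append(np.sum((y - X @ b) ** 2) / sigma ** 2)

# A chi-square with n-p-1 df has mean n-p-1 and variance 2(n-p-1).
print(np.mean(sse_over_sigma2), n - p - 1)        # both approx. 36
print(np.var(sse_over_sigma2), 2 * (n - p - 1))   # both approx. 72
```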

Page 18

Signal/Noise Decomposition

Under A1-A5, SSE, SSR and their scaled ratio have convenient distributions.

For j ≤ p, E[Z_j|X] ≠ 0 in general.

Exception: under H0: β_1 = ... = β_p = 0,

SSR = Z_1² + ... + Z_p² ~ σ² χ²_p under A1-A5, and

(SSR/p) / (SSE/(n−p−1)) ~ (χ²_p/p) / (χ²_{n-p-1}/(n−p−1)) = F_{p,n-p-1},

with numerator and denominator independent.

Page 19

Signal/Noise Decomposition

An organizational tool: the analysis of variance (ANOVA) table

SOURCE       Sum of Squares (SS)   Degrees of freedom (df)   Mean square (SS/df)
Regression   SSR                   p                         MSR = SSR/p
Error        SSE                   n−p−1                     MSE = SSE/(n−p−1) = σ̂²
Total        SST = SSR + SSE       n−1

F = MSR/MSE
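A sketch of computing the table entries and the overall F test (illustrative; scipy is used only for the F tail probability):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p = 80, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([0.5, 1.0, 0.0]) + rng.normal(size=n)

y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
SSR = np.sum((y_hat - y.mean()) ** 2)
SSE = np.sum((y - y_hat) ** 2)

MSR, MSE = SSR / p, SSE / (n - p - 1)
F = MSR / MSE
p_value = stats.f.sf(F, p, n - p - 1)   # P(F_{p, n-p-1} > F)

print(f"Regression: SS = {SSR:8.1f}  df = {p}      MS = {MSR:.1f}")
print(f"Error:      SS = {SSE:8.1f}  df = {n - p - 1}   MS = {MSE:.2f}")
print(f"F = {F:.1f}, p-value = {p_value:.3g}")
```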

Page 20

“Global” hypothesis tests

These involve sets of parameters: hypotheses of the form

H0: β_j = 0 for all j in a defined subset of {1,...,p}
vs. H1: β_j ≠ 0 for at least one such j

Example 1: H0: β_LATITUDE = 0 and β_LONGITUDE = 0
Example 2: H0: all polynomial or spline coefficients involving a given variable = 0
Example 3: H0: all coefficients involving a variable = 0

Page 21

“Global” hypothesis tests

Testing method: sequential decomposition of sums of squares

The hypothesis to be tested is H0: β_{j1} = ... = β_{jk} = 0 in the full model.

Fit the model excluding x_{j1},...,x_{jk}; save its error sum of squares as SSE_S.

Fit the "full" (or larger) model adding x_{j1},...,x_{jk} to the smaller model; save its SSE as SSE_L (often the overall SSE).

Test statistic: S = [(SSE_S − SSE_L)/k] / [SSE_L/(n−p−1)]

Distribution under the null: F_{k, n-p-1}

Define the rejection region based on this distribution, compute S, and reject or not according to whether S falls in the rejection region.
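A minimal sketch of this nested-model test (illustrative data and names; here x2 and x3 are the variables under test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 120
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 0.8 * x1 + rng.normal(size=n)   # true coefficients of x2, x3 are 0

def sse(X, y):
    """Error sum of squares from a least-squares fit of y on X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - X @ b) ** 2)

ones = np.ones(n)
SSE_S = sse(np.column_stack([ones, x1]), y)           # smaller model: excludes x2, x3
SSE_L = sse(np.column_stack([ones, x1, x2, x3]), y)   # full model: p = 3

k, df_err = 2, n - 3 - 1                  # testing k = 2 coefficients
S = ((SSE_S - SSE_L) / k) / (SSE_L / df_err)
print(S, stats.f.sf(S, k, df_err))        # large p-value here: do not reject H0
```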

Page 22

Signal/Noise Decomposition

An augmented version for global testing

SOURCE       Sum of Squares (SS)   Degrees of freedom (df)   Mean square (SS/df)
Regression   SSR                   p                         SSR/p
  X1         SST − SSE_S           p1                        (SST − SSE_S)/p1
  X2|X1      SSE_S − SSE_L         p2                        (SSE_S − SSE_L)/p2
Error        SSE_L                 n−p−1                     SSE_L/(n−p−1)
Total        SST = SSR + SSE       n−1

F = MSR(2|1)/MSE

Page 23

R-squared: another view

From last lecture: R² = Corr(Y, Ŷ)²

More conventional: R² = SSR/SST

Geometry justifies why these are the same:
Cov(Y, Ŷ) = Cov(Y − Ŷ + Ŷ, Ŷ) = Cov(e, Ŷ) + Var(Ŷ)
Covariance is an inner product, so the first term = 0; hence Cov(Y, Ŷ) = Var(Ŷ), and Corr(Y, Ŷ)² = Var(Ŷ)²/[Var(Y) Var(Ŷ)] = Var(Ŷ)/Var(Y) = SSR/SST.

R² is a measure of the precision with which the regression model describes individual responses.
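A quick numeric check of the identity on simulated data (illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.7, -0.4]) + rng.normal(size=n)

y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
r2_corr = np.corrcoef(y, y_hat)[0, 1] ** 2                             # Corr(Y, Y-hat)^2
r2_ss = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)  # SSR/SST
print(np.isclose(r2_corr, r2_ss))                                      # True
```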

Page 24

Outline: A few more topics

Collinearity

Overfitting

Influence

Mediation

Multiple comparisons

Page 25

Main points

Confounding occurs when an apparent association between a predictor and outcome reflects the association of each with a third variable. A primary goal of regression is to "adjust" for confounding.

The least squares decomposition of Y into fit and residual provides an appealing statistical testing framework. An association of an outcome with predictors is evidenced if the SS due to regression is large relative to SSE.

Geometry: the orthogonal decomposition provides convenient sampling distributions and a view of R².

The ANOVA table organizes this decomposition.

