Page 1: Introduction to regression

Introduction to regression

July 13, 2011

Karen Bandeen-Roche, PhD
Department of Biostatistics
Johns Hopkins University

Introduction to Statistical Measurement and Modeling

Page 2: Introduction to regression

Data motivation

Temperature modeling

Scientific question: Can we accurately and precisely model geographic variation in temperature?

Some related statistical questions:

How does temperature vary as a function of latitude and longitude (what is the “shape”)?

Does the temperature variation with latitude differ by longitude? Or vice versa?

Once we have a set of predictions: How accurate and precise are they?

Page 3: Introduction to regression

United States temperature map

http://green-enb150.blogspot.com/2011/01/isorhythmic-map-united-states-weather.html

Page 4: Introduction to regression

Temperature data

[Scatterplot matrix of temperature (temp) versus longitude and latitude.]

Page 5: Introduction to regression

Data examples: Boxing and neurological injury

Scientific question: Does amateur boxing lead to decline in neurological performance?

Some related statistical questions:

Is there a dose-response increase in the rate of cognitive decline with increased boxing exposure?

Is boxing-associated decline independent of initial cognition and age?

Is there a threshold of boxing that initiates harm?

Page 6: Introduction to regression

Boxing data

[Scatterplot of cognitive score change (blkdiff) versus number of bouts (blbouts), with lowess smoother, bandwidth = 0.8.]

Page 7: Introduction to regression

Outline: Regression

Model / interpretation

Estimation / parameter inference

Direct / indirect effects

Nonlinear relationships / splines

Model checking

Precision of prediction

Page 8: Introduction to regression

Regression

Studies the mean of an outcome or response variable "Y" as a function of predictor or covariate variables "X": $E[Y|X]$

Note the directionality: dependent Y, independent X

Contrast with covariance and correlation, which are symmetric: $\mathrm{Cov}(X,Y) = \mathrm{Cov}(Y,X)$

[Boxing scatterplot with lowess smoother, repeated from the earlier slide.]

Page 9: Introduction to regression

Regression: Purposes

Describing relationships between variables

Making inferences about relationships between variables

Predicting future outcomes

Investigating interplay among multiple variables

Page 10: Introduction to regression

Regression Model

Y = systematic part g(x) + random error

$Y = E[Y|x] + \varepsilon$ (population model)

$Y_i = E[Y|x_i] + \varepsilon_i,\ i = 1,\ldots,n$ (sample from population)

$Y_i = \mathrm{fit}(x_i) + e_i$ (sample model)

This course: the linear regression model

$\mathrm{fit}(X_1,\ldots,X_p;\beta) = E[Y|X_1,\ldots,X_p] = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$

Simple linear regression: one X, yielding $E[Y|X] = \beta_0 + \beta_1 X$
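As an illustration beyond the slides, here is a minimal Python sketch that simulates data from a simple linear regression model and recovers the coefficients by least squares; the intercept, slope, noise level, and predictor range are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate from E[Y|X] = beta0 + beta1 * X with N(0, sigma^2) errors
# (beta0 = 2.0, beta1 = 0.5, sigma = 1.0 are arbitrary example values)
n = 200
x = rng.uniform(80, 160, size=n)          # a longitude-like predictor
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=n)

# Least squares fit of the simple linear regression
X = np.column_stack([np.ones(n), x])      # design matrix: intercept + x
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                           # approximately [2.0, 0.5]
```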

Page 11: Introduction to regression

Regression Model

$E[Y|X_1,\ldots,X_p] = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$

Interpretation of $\beta_0$: mean response in the subpopulation for whom all covariate values equal 0

Interpretation of $\beta_j$: the difference in mean response comparing subpopulations whose $x_j$ values differ by one unit and whose values on the other covariates do not differ

Page 12: Introduction to regression

Modeling geographical variation:

Latitude and Longitude

http://www.enchantedlearning.com/usa/activity/latlong/

Page 13: Introduction to regression

Regression – Matrix Specification

Notation:

n-vector $v = (v_1,\ldots,v_n)'$ (a column vector), where $'$ denotes transpose

$n \times p$ matrix $A$ = (n rows, p columns) = $[a_{ij}]$

$E[Z] = (E[Z_1],\ldots,E[Z_M])'$

Model: $Y = X\beta + \varepsilon$, where $Y$, $\varepsilon$ = $n \times 1$ vectors (n "records," "people," "units," etc.)

$X$ = $n \times (p+1)$ "design matrix"

Columns: a column of ones (coding the intercept) and then the p X's

$\beta$ = (p+1)-vector (intercept and then coefficients $\beta_1,\ldots,\beta_p$)
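A minimal sketch of the design-matrix construction just described, using hypothetical latitude/longitude values (not the course data); the column of ones codes the intercept.

```python
import numpy as np

# Hypothetical predictor values (not the course data)
latitude  = np.array([25.0, 32.5, 40.0, 47.5])
longitude = np.array([81.0, 97.0, 112.0, 122.0])

# n x (p+1) design matrix: column of ones, then the p covariates
X = np.column_stack([np.ones(len(latitude)), latitude, longitude])
print(X.shape)   # (4, 3): n = 4 records, p = 2 covariates plus intercept
```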

Page 14: Introduction to regression

Estimation: Principle of least squares

Independently discovered by Gauss (1803) & Legendre (1805)

$\hat{\beta}$ minimizes

$\sum_{i=1}^{n} \left[Y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip})\right]^2 = (Y - X\beta)'(Y - X\beta)$

Resulting estimator: $\hat{\beta} = (X'X)^{-1}X'Y$

"Fit" $= \hat{Y} = X\hat{\beta} = X(X'X)^{-1}X'Y = HY$, where $H = X(X'X)^{-1}X'$

"Residual" $= e = Y - \hat{Y} = (I - H)Y$
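A sketch of these least squares formulas in code, on simulated data with arbitrary example coefficients; note that in practice np.linalg.lstsq or a QR decomposition is preferred to forming $(X'X)^{-1}$ explicitly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -0.5])       # arbitrary example values
y = X @ beta_true + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                 # (X'X)^{-1} X'Y
H = X @ XtX_inv @ X.T                        # hat matrix
y_fit = H @ y                                # "fit" = HY
e = y - y_fit                                # "residual" = (I - H)Y
```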

Page 15: Introduction to regression

Least squares fit - Intuition

Simple linear regression (SLR): the fitted slope is a weighted average of the slopes between each $(x_i, Y_i)$ and $(\bar{x}, \bar{Y})$, i.e.

$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \sum_{i=1}^{n} w_i \left(\frac{Y_i - \bar{Y}}{x_i - \bar{x}}\right), \quad \text{where } w_i = \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2}$

Page 16: Introduction to regression

Least squares fit - Intuition

The weight $w_i$ increases with the distance of $x_i$ from the sample mean of the x's.

The LS line must go through $(\bar{x}, \bar{Y})$: picture equally loaded springs attaching each point to the line, with fulcrum at $(\bar{x}, \bar{Y})$.

Multiple linear regression "projects" Y into the model space = all linear combinations of the columns of X.

Page 17: Introduction to regression

Why does LS yield a good estimator?

Need some preliminaries and assumptions

A preliminary – the variance of a random m-vector $z$ is an $(m \times m)$ matrix $\Sigma = \mathrm{Var}(z)$:

1. Diagonal elements = variances: $\Sigma_{ii} = \mathrm{Var}(Z_i)$

2. $(i, j)$ off-diagonal element = $\mathrm{Cov}(Z_i, Z_j)$

3. $\Sigma_{ij} = \Sigma_{ji}$ (symmetric)

4. $\Sigma$ is positive semi-definite: $a'\Sigma a \ge 0$ for any $a \ne 0$.

5. For an (n x m) matrix, A, $\mathrm{Var}(Az) = A\,\mathrm{Var}(z)\,A'$

Page 18: Introduction to regression

Why does LS yield a good estimator?

Assumptions – Random Part of the Model

A1: Linear model: $Y = X\beta + \varepsilon$

A2: Mean-0, exogenous errors: $E[\varepsilon|X] = E[\varepsilon] = 0$

A3: Constant variance: $\mathrm{Var}(\varepsilon|X)$ has all diagonal entries $= \sigma^2$

A4: Random sampling: Errors are statistically independent.

A5: Probability distribution: $\varepsilon$ is distributed as (~) multivariate normal (MVN)

Page 19: Introduction to regression

Why does LS yield a good estimator?

1. Unbiasedness ("accuracy"): $E[\hat{\beta}] = E[(X'X)^{-1}X'Y] = \beta$

2. Standard errors for beta:

a. $\mathrm{Var}(\hat{\beta}) = \sigma^2 (X'X)^{-1}$, so $\mathrm{Var}(\hat{\beta}_j) = \sigma^2 \left[(X'X)^{-1}\right]_{jj}$

b. To estimate: Will need an estimate for $\sigma^2$

3. BLUE: LS produces the Best Linear Unbiased Estimator

a. SLR meaning: If $\tilde{\beta}$ is a competing linear unbiased estimator, then $\mathrm{Var}(\tilde{\beta}) \ge \mathrm{Var}(\hat{\beta})$

b. MLR, matrix statement: $\mathrm{Var}(\tilde{\beta}) - \mathrm{Var}(\hat{\beta})$ is positive semidefinite

i. e.g. $\mathrm{Var}\!\left(\sum_{j=0}^{p} c_j \tilde{\beta}_j\right) \ge \mathrm{Var}\!\left(\sum_{j=0}^{p} c_j \hat{\beta}_j\right)$ for any choice of $c_0, \ldots, c_p$
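A quick simulation sketch (not from the slides) illustrating both properties: across repeated samples from the same design, the average of the LS estimates approaches $\beta$ and their empirical covariance approaches $\sigma^2 (X'X)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 50, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # fixed design
beta = np.array([1.0, 2.0])                             # example values

# Repeated sampling: regenerate errors, re-fit, collect estimates
estimates = []
for _ in range(5000):
    y = X @ beta + rng.normal(0, sigma, size=n)
    b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    estimates.append(b_hat)
estimates = np.array(estimates)

print(estimates.mean(axis=0))            # approximately beta (unbiasedness)
print(np.cov(estimates.T))               # approximately sigma^2 (X'X)^{-1}
print(sigma**2 * np.linalg.inv(X.T @ X))
```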

Page 20: Introduction to regression

Inference on coefficients

We have $\mathrm{Var}(\hat{\beta}_j) = \sigma^2 \left[(X'X)^{-1}\right]_{jj}$. To assess this we will need an estimate for $\sigma^2 = \mathrm{Var}(\varepsilon_i)$.

$e_i$ approximates $\varepsilon_i$.

Estimate $\sigma^2$ by the ECDF variance of the $e_i$:

$\hat{\sigma}^2 = \frac{1}{n-p}\sum_{i=1}^{n} e_i^2$

(Will justify "n-p" shortly.)

Estimated standard error of $\hat{\beta}_j$: $\widehat{SE}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 \left[(X'X)^{-1}\right]_{jj}}$

Page 21: Introduction to regression

Inference on coefficients

$SE(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 \left[(X'X)^{-1}\right]_{jj}}$

$100(1-\alpha)\%$ confidence interval for $\beta_j$: $\hat{\beta}_j \pm t_{1-\alpha/2}(n-p-1) \cdot SE(\hat{\beta}_j)$

Hypothesis testing: $H_0: \beta_j = \beta_j^*$ versus $H_1: \beta_j \ne \beta_j^*$

Test statistic, under $H_0$:

$T = \frac{\hat{\beta}_j - \beta_j^*}{SE(\hat{\beta}_j)} = \frac{\hat{\beta}_j - \beta_j^*}{\sqrt{\hat{\sigma}^2 \left[(X'X)^{-1}\right]_{jj}}} \sim t(n-p-1),$

i.e., T follows a t distribution with n-p-1 degrees of freedom.
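A sketch of these inference formulas in code, on simulated data; the 0.05 level and the null value 0 are arbitrary choices for the example, and the degrees of freedom are taken as n minus the number of estimated parameters (p covariates plus intercept), matching the t distribution on this slide.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 80, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat
df = n - p - 1                                   # n - (p + 1) parameters
sigma2_hat = e @ e / df
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))

# 95% confidence intervals and t tests of H0: beta_j = 0
t_crit = stats.t.ppf(0.975, df)
ci = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])
T = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(T), df)
print(ci, p_values)
```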

Page 22: Introduction to regression

Direct and Indirect Associations in Regression

Coefficients' magnitude and interpretation may vary according to what other predictors are in the model.

Terminology: consider the two-covariate model

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$

$\beta_1$ = "direct association" of $x_1$ with Y, controlling for $x_2$

The SLR coefficient $\beta_1^*$ in $Y = \beta_0^* + \beta_1^* x_1 + \varepsilon^*$ = "total association" of $x_1$ with Y

The difference $\beta_1^* - \beta_1$ = "indirect association" of $x_1$ with Y through $x_2$.

Page 23: Introduction to regression

Direct and Indirect Associations in Regression

The fitted total association of $x_1$ with Y involves the fitted direct association and the regression of $x_2$ on $x_1$. For SLR:

$\hat{\beta}_1^* = \hat{\beta}_1 + \hat{\beta}_2 \hat{\gamma}_1,$

where $\hat{\gamma}_1$ is the slope in the LS SLR of $x_2$ on $x_1$: $\hat{\gamma}_1 = S_{x_1 x_2} / S_{x_1 x_1}$, with $S_{x_1 x_2} = \sum_{i=1}^{n} (x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2)$.

The total and direct associations are equal only if $\hat{\beta}_2 = 0$ or $S_{x_1 x_2} = 0$.

This generalizes to more than two covariates.
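A simulation sketch (example values, not course data) confirming the decomposition: the SLR "total" slope of Y on $x_1$ equals the MLR "direct" slope plus $\hat{\beta}_2$ times the slope from regressing $x_2$ on $x_1$.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)           # x2 correlated with x1
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

def ls(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
b_direct = ls(np.column_stack([ones, x1, x2]), y)   # two-covariate model
b_total = ls(np.column_stack([ones, x1]), y)        # SLR of y on x1
g = ls(np.column_stack([ones, x1]), x2)             # SLR of x2 on x1

print(b_total[1])                         # total association
print(b_direct[1] + b_direct[2] * g[1])   # direct + indirect: the same
```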

Page 24: Introduction to regression

Model Checking with Residuals

Two areas of concern:

Widespread failure of assumptions A1-A5

Isolated points with the potential to unduly influence the fit

Model checking method: fit the full model, calculate the residuals

$e_i = y_i - \hat{y}_i,\ i = 1,\ldots,n,$

and plot the $e_i$ in various ways.

Rationale: if the model fits, $e_i$ mimics $\varepsilon_i$, so the residuals should have mean 0, equal variance, etc.

Page 25: Introduction to regression

Model Checking with Residuals

Checking A1-A3: scatterplots of residuals vs. (i) fitted values $\hat{y}_i$; (ii) each continuous covariate $x_{ij}$, with a reference line at y = 0

Good fit: flat, unpatterned cloud about the reference line

A1-A2 violation: curvilinear pattern about the reference line

Influential points: close to the reference line at the expense of the overall shape

A3 violation: tendency for the degree of vertical "spread" about the reference line to vary systematically with $\hat{y}_i$ or $x_{ij}$

Checking A5: stem-and-leaf plot, ECDF-versus-normal plot
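A sketch of the two A1-A3 diagnostic plots on simulated, well-specified data (so the clouds should look flat and unpatterned); matplotlib is used only for the scatterplots and reference lines.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)       # example data satisfying A1-A5

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_fit = X @ beta_hat
e = y - y_fit

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(y_fit, e)                    # residuals vs. fitted values
axes[0].axhline(0, color="red")              # reference line at y = 0
axes[0].set(xlabel="fitted", ylabel="residual")
axes[1].scatter(x, e)                        # residuals vs. covariate
axes[1].axhline(0, color="red")
axes[1].set(xlabel="x", ylabel="residual")
plt.show()                                   # good fit: flat, unpatterned clouds
```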

Page 26: Introduction to regression

Nonlinear relationships: Curve fitting

Method 1: Polynomials. Replace $\beta_j x_j$ by

$\beta_{j1} x_j + \beta_{j2} x_j^2 + \cdots + \beta_{jK} x_j^K$

Method 2: "Splines" = piecewise polynomials

Feature 1: "order" = order of the polynomials being joined

Feature 2: "knots" = where the polynomials are joined

Page 27: Introduction to regression

How to fit splines

Linear spline with K knots at positions $k_1, \ldots, k_K$:

$Y = \beta_0 + \beta_1 x + \beta_2 (x - k_1)_+ + \beta_3 (x - k_2)_+ + \cdots + \beta_{K+1} (x - k_K)_+ + \varepsilon,$

where $(x - k)_+ = x - k$ if $(x - k) > 0$ and $= 0$ otherwise.

Spline of order M:

$Y = \beta_0 + \sum_{m=1}^{M} \sum_{p=0}^{K} \beta_{pm} (x - k_p)_+^m + \varepsilon, \quad \text{where } (x - k_0)_+ := x.$

To ensure a smooth curve, only include plus functions for the highest-order terms.
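A sketch of the linear spline basis from this slide, with one arbitrary knot location: each column $(x - k)_+$ is the "plus function" defined above, and the basis is then fit by ordinary least squares.

```python
import numpy as np

def linear_spline_basis(x, knots):
    """Design matrix for a linear spline: intercept, x, and (x - k)_+ per knot."""
    cols = [np.ones_like(x), x]
    for k in knots:
        cols.append(np.maximum(x - k, 0.0))   # plus function (x - k)_+
    return np.column_stack(cols)

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=300)
# Piecewise-linear truth: slope 2 below x = 5, slope -0.5 above
y = np.where(x < 5, 1.0 + 2.0 * x, 11.0 - 0.5 * (x - 5)) + rng.normal(size=len(x))

X = linear_spline_basis(x, knots=[5.0])       # one knot at the true bend
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)   # beta_hat[2] estimates the slope change at the knot (about -2.5)
```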

Page 28: Introduction to regression

Spline interpretation, linear spline case

$\beta_0$ = mean outcome in the subpopulation with x = 0

$\beta_1$ = mean Y vs. x slope on $\{x: x < k_1\}$

$\beta_2$ = amount by which the mean Y vs. x slope on $\{x: k_1 \le x < k_2\}$ differs from the slope on $\{x: x < k_1\}$

$\beta_q$ = amount by which the mean Y vs. x slope on $\{x: k_{q-1} \le x < k_q\}$ differs from the slope on $\{x: k_{q-2} \le x < k_{q-1}\}$

Page 29: Introduction to regression

Checking goodness of fit of the curve: The partial residual plot

Larsen W, McCleary S. "The use of partial residual plots in regression analysis." Technometrics 1972; 14: 781-90.

Purpose: Assess the degree to which a fitted curve describing the association between Y and X matches the empirical shape of that association.

Basic idea: If there were no covariates except the one for which a curve is being fit, we'd plot Y versus X and overlay with $\hat{f}(X)$ versus X.

Proposal: Subtract out the contributions of the other covariates.

Page 30: Introduction to regression

The partial residual plot - Method

Suppose the model is $Y = \beta_0 + \beta_1 X_1 + f(x_2; \beta_2) + \varepsilon$, where

$x_2$ = variable of interest & $f(x_2)$ = vector of terms involving $x_2$,

$\beta_0 + \beta_1 X_1$ represents the "adjustment" for covariates other than $x_2$

Fit the whole model (e.g. get $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2$);

Calculate the partial residuals $r_p = Y - \hat{\beta}_0 - \hat{\beta}_1 X_1$;

Plot $r_p$ versus $x_2$, overlaying with the plot of $\hat{f}(x_2; \hat{\beta}_2)$ versus $x_2$

Another way to think about it: the partial residuals are $\hat{f}(x_2; \hat{\beta}_2)$ + the overall model residuals.
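A sketch of the partial residual computation on simulated data, taking $f$ to be a quadratic in $x_2$ purely as an example: fit the full model, subtract the fitted contributions of the other covariates, and plot what is left against $x_2$ with the fitted curve overlaid.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
n = 300
x1 = rng.normal(size=n)
x2 = rng.uniform(-2, 2, size=n)
y = 1.0 + 0.5 * x1 + (2.0 * x2 - 1.0 * x2**2) + rng.normal(size=n)

# Full model: intercept, x1, and quadratic terms f(x2) = b2*x2 + b3*x2^2
X = np.column_stack([np.ones(n), x1, x2, x2**2])
b = np.linalg.lstsq(X, y, rcond=None)[0]

r_p = y - b[0] - b[1] * x1                   # partial residuals for x2
f_hat = b[2] * x2 + b[3] * x2**2             # fitted curve in x2

order = np.argsort(x2)
plt.scatter(x2, r_p, s=10)
plt.plot(x2[order], f_hat[order], color="red")   # overlaid fitted curve
plt.xlabel("x2"); plt.ylabel("partial residual")
plt.show()
```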

Page 31: Introduction to regression

United States temperature map

http://green-enb150.blogspot.com/2011/01/isorhythmic-map-united-states-weather.html

Page 32: Introduction to regression

Data motivation

Temperature modeling - Statistical questions:

How does temperature vary as a function of latitude and longitude (what is the “shape”)? - DONE

Does the temperature variation with latitude differ by longitude? Or vice versa?

Once we have a set of predictions: How accurate and precise are they?

Page 33: Introduction to regression

Does the temperature variation with latitude differ by longitude?

Preliminary: Does temperature vary across latitude and longitude categories?

Usual device = dummy variables:

Suppose the categorical covariate (X) has K categories.

Choose one category (say, the Kth) as the "reference."

For the other K-1 categories create new variables $X_1, \ldots, X_{K-1}$ with $X_{ki} = 1$ if $X_i$ = category k; 0 otherwise.

Rather than include $X_i$ in the MLR, include $X_{i1}, \ldots, X_{i,K-1}$; then $\beta_k = E[Y|\text{category } k] - E[Y|\text{category } K]$ (see the sketch after this list).

If there is more than one categorical variable, create as many sets of dummy variables as necessary (one per variable).
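A minimal dummy-coding sketch, using hypothetical latitude bands as the categorical covariate: K-1 indicator columns, with the last category serving as the reference.

```python
import numpy as np

# Hypothetical categorical covariate: latitude band per record
bands = np.array(["south", "middle", "north", "middle", "south", "north"])
categories = ["south", "middle", "north"]     # "north" (the Kth) = reference

# K-1 dummy variables: X_k = 1 if record is in category k, else 0
dummies = np.column_stack(
    [(bands == c).astype(float) for c in categories[:-1]]
)
print(dummies)   # columns: indicator for "south", indicator for "middle"
```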

Page 34: Introduction to regression

Does the temperature variation with latitude differ by longitude?

Formal wording: Is there interaction in the association of latitude and longitude with temperature? Synonyms: effect modification, moderation.

Covariates A and B interact if the magnitude of the association between A and Y varies across levels of B

i.e. B “modifies” or “moderates” the association between A & Y

Page 35: Introduction to regression

Interaction Modeling

Interactions are typically coded as "main effects" + "interaction terms"

Example: Two categorical predictors (factors). Model:

$Y = \beta_0 + \beta_{11} x_{11} + \beta_{12} x_{12} + \beta_2 x_2$ (main effects) $+\ \beta_{31} x_{11} x_2 + \beta_{32} x_{12} x_2$ (interaction terms) $+\ \varepsilon$

Mean Y given the predictors:

Predictor 1                          Predictor 2: x2 = 0   Predictor 2: x2 = 1
Factor level 1 (x11 = 0, x12 = 0)    β0                    β0 + β2
Factor level 2 (x11 = 1, x12 = 0)    β0 + β11              β0 + β11 + β2 + β31
Factor level 3 (x11 = 0, x12 = 1)    β0 + β12              β0 + β12 + β2 + β32

Page 36: Introduction to regression

Interaction Modeling: Interpretations

$Y = \beta_0 + \beta_{11} x_{11} + \beta_{12} x_{12} + \beta_2 x_2 + \beta_{31} x_{11} x_2 + \beta_{32} x_{12} x_2 + \varepsilon$

$\beta_0 = E[Y|x_{11}=0, x_{12}=0, x_2=0]$ = mean response at the "reference" factor level combination

$\beta_{11} = E[Y|x_{11}=1, x_{12}=0, x_2=0] - E[Y|x_{11}=0, x_{12}=0, x_2=0]$ = amount by which the mean response at the second level of the first factor differs from the mean response at the first level of the first factor, among those at the first level of the second factor

Page 37: Introduction to regression

Interaction Modeling: Interpretations

$\beta_{31} = (\beta_0 + \beta_{11} + \beta_2 + \beta_{31}) - (\beta_0 + \beta_{11}) - \left[(\beta_0 + \beta_2) - \beta_0\right]$

$= \left\{E[Y|x_{11}=1, x_{12}=0, x_2=1] - E[Y|x_{11}=1, x_{12}=0, x_2=0]\right\} - \left\{E[Y|x_{11}=0, x_{12}=0, x_2=1] - E[Y|x_{11}=0, x_{12}=0, x_2=0]\right\}$

= amount by which the (difference in mean response across levels of the second factor) differs across levels of the first factor

= amount by which the second factor's effect varies over levels of the first factor.
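A sketch of this interaction coding on simulated data with arbitrary example coefficients: the design matrix carries the main-effect dummies plus their products with $x_2$, and the fitted coefficient on $x_{11} x_2$ recovers the difference-in-differences described above.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 600
level = rng.integers(0, 3, size=n)               # factor 1: three levels
x2 = rng.integers(0, 2, size=n).astype(float)    # factor 2: binary

x11 = (level == 1).astype(float)                 # dummies; level 0 = reference
x12 = (level == 2).astype(float)

# Main effects + interaction terms, example coefficients
y = (1.0 + 2.0 * x11 - 1.0 * x12 + 0.5 * x2
     + 3.0 * x11 * x2 - 2.0 * x12 * x2 + rng.normal(size=n))

X = np.column_stack([np.ones(n), x11, x12, x2, x11 * x2, x12 * x2])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)   # approximately [1, 2, -1, 0.5, 3, -2]
```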

Page 38: Introduction to regression

How precise are our predictions?

A natural measure: the ECDF correlation of outcome and prediction, $R = \widehat{\mathrm{Corr}}(Y, \hat{Y})$

"R-squared" = $R^2$

Temperatures example: $R^2 = 0.90$ (spline model), $R = 0.95$

An extremely precise prediction (nearly a straight line)

Caution: this reflects precision for the SAME data used to build the model, and is generally an overestimate of precision for predicting future data.

Cross-validation needed: evaluate the fit in a separate sample (see the sketch below).
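A sketch of the in-sample versus out-of-sample contrast on simulated data (the 300/100 split is arbitrary): $R^2$ computed on the fitting data versus on a held-out sample, the simplest form of the cross-validation the slide recommends.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 400
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)

train, test = np.arange(n) < 300, np.arange(n) >= 300   # simple holdout split
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X[train], y[train], rcond=None)[0]

def r_squared(y_obs, y_pred):
    # Squared ECDF correlation of observed and predicted outcomes
    return np.corrcoef(y_obs, y_pred)[0, 1] ** 2

print(r_squared(y[train], X[train] @ beta_hat))   # in-sample R^2
print(r_squared(y[test], X[test] @ beta_hat))     # out-of-sample R^2, typically lower
```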

Page 39: Introduction to regression

Main Points: Linear regression model

Random part: errors distributed as independent, identically distributed $N(0, \sigma^2)$

Systematic part: $E[Y] = X\beta$

Direct versus total effects

Differences in means

Nonlinear relationships: polynomials, splines

Interactions

Page 40: Introduction to regression

Main Points: Estimation by least squares

Unbiased

BLUE: Best Linear Unbiased Estimator

Inference: $\mathrm{Var}(\hat{\beta}) = \sigma^2 (X'X)^{-1}$

Standard errors = square roots of the diagonal elements

Testing, confidence intervals as in Lecture 2

Model fit: residual analysis

$R^2$ estimates the precision of predicting individual outcomes

