
Statistics 262: Intermediate Biostatistics

Regression & Survival Analysis

Jonathan Taylor & Kristin Cobb

Statistics 262: Intermediate Biostatistics – p.1/??


Introduction

This course is an applied course, and we will look at the following types of data (to be explained, of course!)

Linear Regression Models

ANOVA

Mixed Effects

Generalized Linear Models

Survival Models: (Non-,Semi-)Parametric


Logistics – People

Instructor: Jonathan Taylor

Instructor: Kristin Cobb

TA: Pei Wang

TA: Eric Bair


Logistics: Computer labs

The following times are reserved for us (all the way over at the Fleischmann computer lab...)

April 20th, 11:30am-1:00pm

April 27th, Noon-1:00pm

May 11th, 10:30am-12:30pm

May 18th, 10:30am-12:30pm


Logistics: Evaluation

This is our proposed marking scheme:

6 assignments: 60%.

Take-home final (40%) OR project (40%).


Logistics: Computing environment

Although I prefer R, I will use SAS for examples in class.

Students can submit assignments in either R or SAS.


What is a linear regression model?

Dumb example: Suppose you wanted to test the following hypothesis

H: Tall people tend to marry tall people.

How could we do this?


Husband & Wife height

Collect height information from married couples in the form of $(M_i, F_i)$, $1 \le i \le n$ (husband's height $M_i$, wife's height $F_i$), for some number $n$ of couples.

Plot the pairs $(M_i, F_i)$.

If there is a “relationship” between the $M$’s and $F$’s, you might conclude that H is true.

What is a relationship? A linear regression model says that

$F_i = \beta_0 + \beta_1 M_i + \varepsilon_i$

where $\varepsilon_i$ is “random error.”


Raw Data

[Figure: scatter plot of $F$ (vertical axis, 140 to 180) against $M$ (horizontal axis, 160 to 190).]


Fitting a model

The data seem to indicate that a linear relationship is a good description.

How good is the fit?

Can we “quantify” the quality of fit, as in a two-sample t-test?

This is the basis of simple linear regression.


Least Squares Fit

[Figure: scatter plot of $F$ (vertical axis, 140 to 180) against $M$ (horizontal axis, 160 to 190) with the least squares fit.]


More variables: multiple linear regression models

Usually, we care about more than one “predictor” at a time: multiple linear regression.

Example: predicting the birth weight of babies given the mother’s habits during pregnancy, i.e. smoking / alcohol / exercise / diet.

We can estimate “effects” for many variables: are some of them less important than others?

With many variables there are many possible models: which is the “best”?


Analysis of Variance: ANOVA

Suppose you work for a pharmaceutical company developing a drug to lower cholesterol and you want to test the hypothesis

Does the drug affect cholesterol, i.e. does it decrease cholesterol on average?

To test its effect, you might consider a case / control experiment with a placebo on a genetically pure strain of mice.

Observe $(D_i, P_i)$, $1 \le i \le n$ (could have different numbers of cases / placebos).


ANOVA: more general

Question: is $\mu_D$ (average cholesterol in cases) different from $\mu_P$ (average cholesterol in placebos)?

This example is just a two-sample t-test, but what if you had different dose levels / more than one drug / more than one strain?

Other issues: in genetically pure mice, the “average” effect is well-defined. What about in human studies?

How do we generalize the results of the study to the entire population? (random, mixed vs. fixed effects models).


Generalized Linear Models

In HRP/STATS 261 you have seen some examples of a generalized linear model:

logistic regression: trying to predict probabilities based on covariates;

log-linear models: modelling count data / contingency tables;

multiple linear regression models are also “generalized linear models”;

these models can be put into a general class: GLMs.


Poisson regression

In a given HIV+ population, we might obtain the viral genotype of patients at two different time points.

As HIV mutates in response to drug treatment, the virus undergoes mutations in between the two visits.

How quickly do mutations develop? Are there some covariates that influence this rate of mutation?

How do we model this?


Raw Data

[Figure: scatter plot of $N$ (vertical axis, 0 to 8) against $T$ (horizontal axis, 5 to 40).]


Poisson regression model

Basic Model (no other covariates):

$N_i \sim \text{Poisson}(e^{\beta_0 + \beta_1 T_i})$, $1 \le i \le n$

where $N_i$ is the number of mutations for subject $i$, and $T_i$ is the time between visits.

This implies that

$\log(E(N_i)) = \beta_0 + \beta_1 T_i$, $\text{Var}(N_i) = e^{\beta_0 + \beta_1 T_i}$.

This is a “log” link, with “variance function” $V(x) = x$.
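As a rough illustration of how such a model could be fit, here is a minimal Python sketch of Newton-Raphson for the two-parameter Poisson model above. The data are made up for illustration (they are not the course data), and the function name `fit_poisson` is hypothetical; in practice one would use a GLM routine in SAS or R.

```python
import math

def fit_poisson(T, N, tol=1e-10, max_iter=50):
    """Newton-Raphson for N_i ~ Poisson(exp(b0 + b1 * T_i))."""
    b0, b1 = math.log(sum(N) / len(N)), 0.0  # start from the intercept-only fit
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * t) for t in T]
        # Score: derivative of the log-likelihood in (b0, b1).
        u0 = sum(n - m for n, m in zip(N, mu))
        u1 = sum((n - m) * t for n, m, t in zip(N, mu, T))
        # Fisher information (2 x 2), weighted by the variance function V(mu) = mu.
        i00 = sum(mu)
        i01 = sum(m * t for m, t in zip(mu, T))
        i11 = sum(m * t * t for m, t in zip(mu, T))
        det = i00 * i11 - i01 * i01
        # Newton step: solve the 2 x 2 system (information) * d = score.
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Made-up illustrative data: T = time between visits, N = number of mutations.
T = [5, 10, 15, 20, 25, 30, 35, 40]
N = [0, 1, 1, 2, 2, 4, 5, 8]
b0, b1 = fit_poisson(T, N)
fitted = [math.exp(b0 + b1 * t) for t in T]
print(round(b0, 3), round(b1, 3))
```

At the maximum likelihood estimate the score is zero, so the fitted means add up to the observed total count.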


Poisson regression fit

[Figure: plot of $N$ (vertical axis, 0 to 8) against $T$ (horizontal axis, 5 to 40) with the Poisson regression fit.]


Survival analysis

The “main” topic of this course is survival data. Survival data refers to a fairly broad class of data: basically, the objects of study are “survival times” of individuals and the observations have three components:

An observed time, $T_i$, $1 \le i \le n$.

Observed covariates $X_{ij}$, $1 \le i \le n$, $1 \le j \le p$.

A “censoring” variable $\delta_i$, $1 \le i \le n$.


Censored data

What is “$\delta$”? In a given study, not everyone “fails”: eventually funding dries up and the data have to be published!

Alternatively, people leave the area, or are otherwise “lost to follow-up.” However, knowing that they left the study “alive” has some information in it.

With much at stake, we do not want to throw away this data.

The difficulty comes in incorporating $\delta$ into the “likelihood.”
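One common way to write such a likelihood is sketched below. It assumes a convention the slides do not state: $\delta_i = 1$ if the failure of subject $i$ is observed and $\delta_i = 0$ if the subject is censored, with censoring independent of the failure time. Here $f$ is the density and $S(t) = P(T > t)$ the survival function of a parametric model with parameter $\theta$.

```latex
L(\theta) \;=\; \prod_{i=1}^{n} f(T_i;\theta)^{\delta_i}\, S(T_i;\theta)^{1-\delta_i}
```

An observed failure contributes its density, while a censored subject contributes only the probability of surviving past the observed time.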


Kidney infection data

Given two different techniques of catheter placement in patients (surgical, percutaneous), physicians want to decide if one technique is better at holding off infections.

Observations: time on study, presence of infection, catheter placement.

Censoring can bias the usual estimates of the CDF: the Kaplan-Meier estimate corrects for this and is (approximately) unbiased.

Can we determine whether the “survival” experiences are different in the two groups? (log-rank test)
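To make the Kaplan-Meier idea concrete, here is a minimal Python sketch of the product-limit estimate. The data are made up (not the kidney data), the function name `kaplan_meier` is hypothetical, and the coding of events (1 = infection observed, 0 = censored) is an assumption.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of S(t) = P(T > t).

    events[i] is 1 if the failure was observed at times[i], 0 if censored.
    Returns a list of (t, S(t)) at each distinct observed failure time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        failures = 0
        removed = 0
        # Collect everyone whose observed time equals t.
        while i < len(data) and data[i][0] == t:
            failures += data[i][1]
            removed += 1
            i += 1
        if failures > 0:
            # Multiply by the conditional probability of surviving past t.
            surv *= 1 - failures / n_at_risk
            curve.append((t, surv))
        # Failures and censored subjects both leave the risk set after t.
        n_at_risk -= removed
    return curve

# Made-up illustrative data.
times = [1, 2, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 4))
```

Censored subjects never cause a drop in the curve, but they do count in the risk set up to the time they leave, which is how the information in “left the study alive” is used.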


Kidney data

[Figure: plot of $P(T > t)$ (vertical axis, 0 to 1) against $t$ (horizontal axis, 0 to 25).]


Other survival models

We could model time as coming from a parametric family of distributions: Gumbel, Weibull, etc.

Alternatively, we can fit a semi-parametric model like Cox’s proportional hazards model (which is actually not appropriate in this case...)
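For reference, the proportional hazards model is usually written in terms of the hazard function. In the sketch below, $h_0$ denotes an unspecified baseline hazard (the nonparametric part) and $\beta_1, \dots, \beta_p$ the regression coefficients (the parametric part); the notation is an assumption, chosen to match the covariates $X_{ij}$ used earlier.

```latex
h(t \mid X_{i1}, \dots, X_{ip}) \;=\; h_0(t)\, \exp\!\Big( \sum_{j=1}^{p} \beta_j X_{ij} \Big)
```

The ratio of hazards for two subjects does not depend on $t$, which is the “proportional hazards” assumption that has to be checked for a given data set.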


Moving on

End of introduction.

After break: regression!


Simple linear regression

Specifying the model.

Fitting the model: least squares.

Inference about parameters.

Diagnostics.


Specifying the model

Returning to husband and wife data:

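$F_i = \beta_0 + \beta_1 M_i + \varepsilon_i$

Assumption: $E(\varepsilon_i | M_i) = 0$ (often: $\varepsilon_i | M_i \sim N(0, \sigma^2)$)

Regression equation: $E(F_i | M_i) = \beta_0 + \beta_1 M_i$

Errors are independent across couples.

Under the normality assumption, this fully specifies the joint distribution of $(F_1, \dots, F_n)$ given $(M_1, \dots, M_n)$.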


Least Squares

Least squares regression chooses the line that minimizes

$SSE(\beta_0, \beta_1) = \sum_{i=1}^{n} (F_i - \beta_0 - \beta_1 M_i)^2.$

The (conditional) mean can be estimated for any given height $M$ as

$\hat{F}(M) = \hat{\beta}_0 + \hat{\beta}_1 \cdot M$

where $(\hat{\beta}_0, \hat{\beta}_1)$ are the minimizers of SSE.
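For simple linear regression the minimizers of SSE have a well-known closed form, sketched below in Python. The heights are made-up illustrative numbers (not the course data), and the function names are hypothetical.

```python
def least_squares(M, F):
    """Closed-form minimizers (b0, b1) of SSE for the line F = b0 + b1 * M."""
    n = len(M)
    m_bar = sum(M) / n
    f_bar = sum(F) / n
    sxx = sum((m - m_bar) ** 2 for m in M)
    sxy = sum((m - m_bar) * (f - f_bar) for m, f in zip(M, F))
    b1 = sxy / sxx            # slope
    b0 = f_bar - b1 * m_bar   # intercept: the line passes through the means
    return b0, b1

def sse(b0, b1, M, F):
    """Sum of squared errors for a candidate line."""
    return sum((f - b0 - b1 * m) ** 2 for m, f in zip(M, F))

# Made-up illustrative heights: M = husbands, F = wives.
M = [165, 170, 175, 180, 185]
F = [155, 158, 164, 165, 170]
b0, b1 = least_squares(M, F)
print(round(b0, 4), round(b1, 4), round(sse(b0, b1, M, F), 4))
```

Any other choice of intercept or slope gives a larger SSE on the same data, which is a quick sanity check on the formulas.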


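Estimating $\sigma^2$

Strength of association will depend on the variability in the data, as in the two-sample t-test.

Natural estimate of $\sigma^2$:

$\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^{n} \left(F_i - \hat{F}(M_i)\right)^2.$

Under the normality assumption,

$\frac{\hat{\sigma}^2}{\sigma^2} \sim \frac{\chi^2_{n-2}}{n-2}.$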


SAS code to fit basic model

/* Library in which the imported data set is stored */

LIBNAME DATADIR 'C:\sas';

/* Read the CSV file into DATADIR.height */

PROC IMPORT OUT=DATADIR.height

DATAFILE='C:\heights.csv' DBMS=CSV REPLACE;

RUN;

PROC REG DATA=DATADIR.height;

MODEL WIFE=HUSBAND;

RUN;


Inference: CIs

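Under the normality assumption, any linear combination of $(\hat{\beta}_0, \hat{\beta}_1)$ is also normally distributed.

This can be used to construct a confidence interval for the predicted mean:

$\hat{F}(M) \pm t_{n-2, 1-\alpha/2} \cdot SD(\hat{F}(M))$

as well as a prediction interval for a new observation:

$\hat{F}(M) \pm t_{n-2, 1-\alpha/2} \cdot \sqrt{SD(\hat{F}(M))^2 + \hat{\sigma}^2}$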
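The slide does not spell out the standard deviation of the predicted mean. For simple linear regression the usual estimate, written here with $\bar{M}$ for the sample mean of the $M_i$ (a symbol not used elsewhere in these slides), is:

```latex
SD\big(\hat{F}(M)\big) \;=\; \hat{\sigma}\,
  \sqrt{\frac{1}{n} + \frac{(M - \bar{M})^2}{\sum_{i=1}^{n} (M_i - \bar{M})^2}}
```

Both intervals are therefore narrowest at $M = \bar{M}$ and widen as $M$ moves away from the bulk of the data.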


SAS: predicted mean, CIs

PROC REG DATA=DATADIR.height;

MODEL WIFE=HUSBAND / P CLM CLI;

RUN;

P requests the “predicted” values.

CLM requests confidence limits for the “mean” at each observation.

CLI requests prediction limits for a new “individual.”


Inference: Hypothesis tests

We can test our “hypothesis” formally in terms of the coefficient $\beta_1$.

If the regression function is truly linear, then $\beta_1 = 0$ means the height of the husband has no effect on the height of the wife.

Under $H_0 : \beta_1 = 0$,

$\frac{\hat{\beta}_1}{SD(\hat{\beta}_1)} \sim t_{n-2}.$
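The test statistic can be computed by hand, as in the Python sketch below. It uses made-up illustrative heights (not the course data) and the standard estimate $SD(\hat{\beta}_1) = \hat{\sigma} / \sqrt{\sum_i (M_i - \bar{M})^2}$, which the slides do not write out; the function name `slope_t_statistic` is hypothetical.

```python
import math

def slope_t_statistic(M, F):
    """Return (b1, se_b1, t) for testing H0: beta_1 = 0 in F = b0 + b1 * M + error."""
    n = len(M)
    m_bar = sum(M) / n
    f_bar = sum(F) / n
    sxx = sum((m - m_bar) ** 2 for m in M)
    sxy = sum((m - m_bar) * (f - f_bar) for m, f in zip(M, F))
    b1 = sxy / sxx
    b0 = f_bar - b1 * m_bar
    # Estimate of sigma^2: residual sum of squares over n - 2 degrees of freedom.
    sse = sum((f - b0 - b1 * m) ** 2 for m, f in zip(M, F))
    sigma2_hat = sse / (n - 2)
    se_b1 = math.sqrt(sigma2_hat / sxx)
    return b1, se_b1, b1 / se_b1

# Made-up illustrative heights: M = husbands, F = wives.
M = [165, 170, 175, 180, 185]
F = [155, 158, 164, 165, 170]
b1, se_b1, t_stat = slope_t_statistic(M, F)
print(round(b1, 4), round(se_b1, 4), round(t_stat, 3))
```

The resulting statistic is compared with the quantiles of the $t_{n-2}$ distribution (here $n - 2 = 3$ degrees of freedom); SAS reports the corresponding p-value directly.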


SAS: hypothesis tests

PROC REG DATA=DATADIR.height;

MODEL WIFE=HUSBAND;

MYTEST: TEST HUSBAND=0;

OTHERTEST: TEST HUSBAND=1;

RUN;

Each TEST statement refers to the coefficient of the named regressor (here HUSBAND).

Output has a page for MYTEST and OTHERTEST.

If there is more than one covariate, more than one variable can be specified per TEST statement.


SAS: plots

PROC REG DATA=DATADIR.height;

MODEL WIFE=HUSBAND;

PLOT WIFE*HUSBAND;

RUN;


Least Squares Fit

[Figure: scatter plot of $F$ (vertical axis, 140 to 180) against $M$ (horizontal axis, 160 to 190) with the least squares fit.]


Diagnostics

How do we tell if our assumptions are justified?

Can we check if we have left a higher order term out?

Is the variance constant across couples?

Are the residuals close to being normally distributed?


SAS: residual plot

PROC REG DATA=DATADIR.height;

MODEL WIFE=HUSBAND;

PLOT R.*P. / VREF=0;

RUN;

R. refers to the “residual” $R_i = F_i - \hat{F}(M_i)$.

P. refers to the “predicted” value $\hat{F}(M_i)$.

Can detect missing higher order terms and nonconstant variance.


Residual Plot: Count data

[Figure: residuals R (vertical axis, 0 to 5) against predicted values P (horizontal axis, 1.0 to 3.0).]


Residual Plot

[Figure: residuals R (vertical axis, −20 to 10) against predicted values P (horizontal axis, 150 to 175).]


SAS: QQplot

PROC REG DATA=DATADIR.height;

MODEL WIFE=HUSBAND;

PLOT R.*NQQ.;

RUN;

NQQ. refers to the normal quantiles used in the “quantile-quantile” plot.

Can detect departures from normality.
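The points of such a plot can be computed directly, as in the Python sketch below. It pairs the sorted residuals with standard normal quantiles at the plotting positions $(i - 0.5)/n$; this is one common convention and an assumption here, since SAS may use slightly different positions. The residuals are made-up numbers and the function name `normal_qq` is hypothetical.

```python
from statistics import NormalDist

def normal_qq(residuals):
    """Return (theoretical quantile, sample quantile) pairs for a normal Q-Q plot."""
    n = len(residuals)
    sample = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n for i = 1..n (an assumed convention).
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sample))

# Made-up illustrative residuals.
residuals = [0.0, -0.7, 1.6, -1.1, 0.2]
pairs = normal_qq(residuals)
for q, r in pairs:
    print(round(q, 3), r)
```

If the residuals are close to normal, the points fall near a straight line; curvature or heavy tails show up as systematic departures from it.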


[Figure: normal quantile-quantile plot of sample quantiles (vertical axis, −20 to 10) against theoretical quantiles (horizontal axis, −2 to 2).]
