The Cox model in R


Gardar Sveinbjörnsson, Jongkil Kim, Yongsheng Wang


OUTLINE

Recidivism data

Cox PH Model for Time-Independent Variables in R

Model Selection

Model Diagnostics

Cox PH Model for Time-Dependent Variables in R

Summary

18 April 2011, Department of Mathematics, ETHZ


Recidivism data

The data is from an experimental study of recidivism of 432 male prisoners, who were observed for a year after being released from prison.

Half of the prisoners were randomly given financial aid when they were released.


Variables in Recidivism Data

week: week of first arrest after release, or censoring time
arrest: the event indicator, 1 = arrested, 0 = not
fin: 1 = received financial aid, 0 = not
age: in years at the time of release
race: 1 = black, 0 = others
wexp: 1 = had full-time work experience, 0 = not
mar: 1 = married, 0 = not
paro: 1 = released on parole, 0 = not
prio: number of prior convictions
educ: codes 2 (grade 6 or less), 3 (grades 6 through 9), 4 (grades 10 and 11), 5 (grade 12), or 6 (some post-secondary)
emp1 to emp52: 1 = employed in the corresponding week, 0 = not


Recidivism Data

> Rossi <- read.table('Rossi.txt', header=T)
> Rossi[1:5, 1:10] ## omitting the variables emp1 to emp52
  week arrest fin age race wexp mar paro prio educ
1   20      1   0  27    1    0   0    1    3    3
2   17      1   0  18    1    0   0    1    8    4
3   25      1   0  19    0    1   0    1   13    3
4   52      0   1  23    1    1   1    1    1    5
5   52      0   0  19    0    1   0    1    3    3


Cox PH Model for Time-Independent Variables in R


Surv and coxph function in R

Cox Regression

Adjusted survival curve


Surv function in R

> Surv(time, time2, event, type=c('right', 'left', 'interval', 'counting'), origin=0)

Right-censored data

Surv(time, event)
time: survival or censoring time
event: the status indicator, 0 = censored, 1 = observed

Left-truncated and right-censored data

Surv(time, time2, event)
time: left-truncation time
time2: survival or censoring time
event: the status indicator, 0 = censored, 1 = observed
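As a minimal sketch of the two forms of Surv described above, the following builds one right-censored object and one in start/stop (counting-process) form. The four observations are made-up toy values, not rows of the Rossi data, and the variable names are chosen only for illustration.

```r
library(survival)

# Toy data (hypothetical values, not taken from the Rossi data set)
week   <- c(20, 17, 52, 52)   # week of arrest or censoring
arrest <- c(1, 1, 0, 0)       # 1 = event observed, 0 = censored

# Right-censored data: Surv(time, event)
s_right <- Surv(week, arrest)
print(s_right)                # censored times are printed with a trailing "+"

# Start/stop form: Surv(time, time2, event), here with entry at time 0
start   <- rep(0, 4)
s_count <- Surv(start, week, arrest)
print(s_count)
```

The start/stop form is the one used later in the slides for time-dependent covariates.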


Coxph function in R

> coxph(formula, data=, weights, subset, na.action, init, control, method=c("efron","breslow","exact"), singular.ok=TRUE, robust=FALSE, model=FALSE, x=FALSE, y=TRUE, ...)

Most of the arguments are similar to lm


Coxph function in R

Formula

The right-hand side: the same as a linear model

The left-hand side: a survival object

Method: the method for handling ties. If there are no tied survival times, all the methods are equivalent.

Breslow: the default in many other Cox PH programs

Efron: the default in coxph; more accurate than Breslow when there are many tied survival times

Exact: computes the exact partial likelihood
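To illustrate how the tie-handling method matters, here is a small sketch on simulated data in which event times are rounded up to whole weeks so that many ties occur. The data, the seed and the covariate x are assumptions made for this example only; the exact method is left out because it can be slow with many ties.

```r
library(survival)

set.seed(1)
n <- 200
x <- rnorm(n)
# Event times rounded up to whole weeks, which creates many tied times
time   <- ceiling(rexp(n, rate = 0.1 * exp(0.5 * x)))
status <- rep(1, n)           # no censoring in this toy example

fit_efron   <- coxph(Surv(time, status) ~ x, method = "efron")
fit_breslow <- coxph(Surv(time, status) ~ x, method = "breslow")

# The two approximations give different coefficient estimates when ties are present
coefs <- c(efron = unname(coef(fit_efron)), breslow = unname(coef(fit_breslow)))
print(coefs)
```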


Cox regression

> mod.allison <- coxph(Surv(week, arrest) ~ fin + age + race + wexp + mar + paro + prio + as.factor(educ), data=Rossi)

> mod.allison


Cox regression

Call:
coxph(formula = Surv(week, arrest) ~ fin + age + race + wexp + mar + paro + prio + as.factor(educ), data = Rossi)

                    coef exp(coef) se(coef)      z      p
fin              -0.4027     0.669   0.1930 -2.087 0.0370
age              -0.0514     0.950   0.0222 -2.316 0.0210
race              0.3615     1.435   0.3122  1.158 0.2500
wexp             -0.1200     0.887   0.2135 -0.562 0.5700
mar              -0.4236     0.655   0.3822 -1.108 0.2700
paro             -0.0982     0.906   0.1959 -0.501 0.6200
prio              0.0794     1.083   0.0293  2.707 0.0068
as.factor(educ)3  0.5934     1.810   0.5196  1.142 0.2500
as.factor(educ)4  0.3284     1.389   0.5437  0.604 0.5500
as.factor(educ)5 -0.1210     0.886   0.6752 -0.179 0.8600
as.factor(educ)6 -0.4070     0.666   1.1233 -0.362 0.7200

Likelihood ratio test=38.7 on 11 df, p=6.01e-05
n= 432, number of events= 114


Adjusted survival curve

> plot(survfit(mod.allison), ylim=c(.7, 1), xlab='Weeks', ylab='Proportion Not Rearrested')


Adjusted survival curve

We may wish to display how estimated survival depends upon the value of a covariate. Because the principal purpose of the recidivism study was to assess the impact of financial aid on rearrest, let us focus on this covariate.

We construct a new data frame with two rows, one for each value of fin; the other covariates are fixed at their medians.


Adjusted survival curve

> Rossi.fin <- with(Rossi, data.frame(fin=c(0,1), age=rep(median(age),2), race=rep(median(race),2), wexp=rep(median(wexp),2), mar=rep(median(mar),2), paro=rep(median(paro),2), prio=rep(median(prio),2), educ=as.factor(rep(median(educ),2))))

> plot(survfit(mod.allison, newdata=Rossi.fin), conf.int=T, lty=c(1,2), col=c('red', 'blue'), ylim=c(.5, 1), xlab='Weeks', ylab='Proportion Not Rearrested')


Model Selection


Why variable selection?

Purposeful selection

Stepwise selection

Best Subset Selection of Covariates


Why variable selection?

We generally want to explain the data in the simplest way.

Unnecessary predictors in a model will affect the estimation of other quantities. That is to say, degrees of freedom will be wasted.

If the model is to be used for prediction, we will save effort, time and/or money if we do not have to collect data for predictors that are redundant.


Why variable selection?

We must decide on a method to select a subset of variables.

Purposeful selection

Stepwise selection

- using P-values

- using AIC

Best subset selection


Purposeful selection

1. We fit a multivariable model containing all variables that were significant in a univariable analysis at the 20-25% level.

2. We use the p-values from the Wald statistic to remove variables from our model. We also confirm the non-significance by a likelihood ratio test.

3. We check whether the removal has produced an “important” change in coefficients of other variables.

4. We check again all the variables that we removed.

5. We check for nonlinearity.

6. We look for interactions.

7. We check assumptions.


Stepwise selection

Stepwise selection is a mix between forward and backward selection.

We can either start with an empty model or a full model and add/remove predictors according to some criterion.

At each step we reconsider terms that were added or removed earlier.

→ Often applied in practice
→ Done with the step() function in R
→ In practice often based on AIC/BIC


Stepwise selection

The AIC is a measure of the relative goodness of fit of a statistical model.

It does not only reward goodness of fit, but also includes a penalty that is an increasing function of the number of parameters.

AIC = 2k - 2 log(L), where k is the number of parameters in the model and L is the maximized value of the likelihood (for a Cox model, the partial likelihood).

This means the smaller the AIC, the better the model.
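The AIC formula can be checked by hand against R's AIC() for a fitted Cox model. The sketch below uses simulated data (the data frame dat and the covariates x1 and x2 are assumptions for illustration); for coxph the likelihood involved is the partial likelihood.

```r
library(survival)

set.seed(2)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$time   <- rexp(n, rate = exp(0.7 * dat$x1))   # x2 has no effect on the hazard
dat$status <- rbinom(n, 1, 0.8)                   # roughly 20% censored

fit <- coxph(Surv(time, status) ~ x1 + x2, data = dat)

k       <- length(coef(fit))   # number of estimated coefficients
loglik  <- fit$loglik[2]       # maximized log partial likelihood
aic_man <- 2 * k - 2 * loglik

c(manual = aic_man, builtin = AIC(fit))
```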


Stepwise selection using our data

Step:  AIC=1327.35
Surv(week, arrest) ~ fin + age + mar + prio

       Df    AIC
<none>    1327.3
- mar   1 1327.7
- fin   1 1329.0
- age   1 1335.4
- prio  1 1336.2
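A stepwise search like the one reported above can be sketched with step() applied to a coxph fit. The example runs on simulated data in which only x1 affects the hazard; the data and variable names are assumptions, and this is not the call that produced the slide output.

```r
library(survival)

set.seed(3)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$time   <- rexp(n, rate = exp(0.8 * dat$x1))   # only x1 affects the hazard
dat$status <- rbinom(n, 1, 0.8)

full <- coxph(Surv(time, status) ~ x1 + x2 + x3, data = dat)

# Stepwise search by AIC, starting from the full model;
# set trace = 1 to print a table of candidate steps at each stage
sel <- step(full, direction = "both", trace = 0)
formula(sel)
```

step() requires complete data: rows with missing values should be removed beforehand so that every candidate model is fitted to the same observations.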


Best Subset Selection

Stepwise only considers a small number of all the possible models

Best subset provides a way to check all the possible models

The same as in linear regression: need a criterion to judge the models

Idea: the criterion is not only based on goodness of fit, but also penalizes the model size.


Best Subset Selection

Mallows' C: C = W + (p - 2q); a smaller C is better

p: number of variables under consideration

q: number of variables not included in the subset model

W = W(p) - W(p-q), where W(p) is the Wald statistic for the model containing all p variables and W(p-q) is the Wald statistic for the subset model
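The definition above can be turned into a short calculation. This is only a sketch that follows the formula exactly as stated on the slide, using the overall Wald statistic stored in the wald.test component of each coxph fit; the simulated data and the choice of subset are assumptions for illustration.

```r
library(survival)

set.seed(4)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$time   <- rexp(n, rate = exp(0.8 * dat$x1))
dat$status <- rbinom(n, 1, 0.8)

full_fit <- coxph(Surv(time, status) ~ x1 + x2 + x3, data = dat)  # all p variables
sub_fit  <- coxph(Surv(time, status) ~ x1, data = dat)            # subset model

p <- 3   # variables under consideration
q <- 2   # variables left out of the subset model (x2 and x3)

W <- full_fit$wald.test - sub_fit$wald.test   # W = W(p) - W(p - q)
C <- W + (p - 2 * q)
C
```

Repeating this for every candidate subset and sorting by C gives a table like the one on the next slide.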


Best Subset Selection of Covariates

Check the model

Variables                          Mallows' C
fin, age, mar, prio                      6.56
fin, age, mar, prio, race                7.22
fin, age, mar, prio, wexp                7.81
fin, age, mar, prio, paro                8.47
fin, age, mar, prio, educ                5.39
fin, age, mar, prio, race, paro          9.09
fin, age, mar, prio, wexp, paro          9.75
fin, age, mar, prio, race, wexp          8.53
fin, mar, prio                          11.77
fin, age, prio                           6.60
fin, age, mar                           17.34
age, mar, prio                           8.28


Model Diagnostics


Analyze PH assumption with residuals

Influential observations

Checking nonlinearity


Analyze PH assumption with residuals

There is strong evidence that the PH assumption is violated for age.

Plotting a cox.zph object shows the scaled Schoenfeld residuals against time.

> cox.zph(mod.allison.4)
             rho    chisq       p
fin    -0.000159 2.99e-06 0.99862
age    -0.221020 7.38e+00 0.00659
prio   -0.077930 7.32e-01 0.39237
mar     0.131485 2.08e+00 0.14937
GLOBAL        NA 8.88e+00 0.06406


Analyze PH assumption with residuals

> plot(cox.zph(mod.allison.4))
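The same check can be reproduced on any coxph fit. The sketch below uses simulated data that satisfy proportional hazards (data and variable names are assumptions). Note that recent versions of the survival package print chisq, df and p rather than the rho column shown on the slide.

```r
library(survival)

set.seed(5)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rbinom(n, 1, 0.5))
dat$time   <- rexp(n, rate = exp(0.5 * dat$x1 - 0.5 * dat$x2))
dat$status <- rbinom(n, 1, 0.8)

fit <- coxph(Surv(time, status) ~ x1 + x2, data = dat)

zph <- cox.zph(fit)   # one test per covariate plus a GLOBAL test
print(zph)
plot(zph)             # one panel per covariate: scaled Schoenfeld residuals vs. time
```

A roughly horizontal smooth in a panel is consistent with proportional hazards for that covariate; a clear trend suggests a time-varying effect.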


Analyze PH assumption with residuals

For the variable age, the pattern of the residuals changes over time.

There are two possible solutions.

The effect of age is not proportional over time: treat age as a strata variable, so that each age group has its own baseline hazard.

The effect of age declines over time: include an interaction between age and time.


Analyze PH assumption (Strata)

We treat age as a "strata" variable and separate the observations into 4 groups by age:

[19 or younger] [20 to 25] [26 to 30] [31 or older]

> library(car)
> Rossi$age.cat <- recode(Rossi$age, "lo:19=1; 20:25=2; 26:30=3; 31:hi=4")
> table(Rossi$age.cat)
  1   2   3   4
 66 236  66  64


Analyze PH assumption (Strata)

Use these age groups as strata and check the PH assumption again.

> mod.allison.6 <- coxph(Surv(week, arrest) ~ fin + prio + strata(age.cat) + mar, data=Rossi)

> cox.zph(mod.allison.6)
           rho  chisq     p
fin    -0.0164 0.0315 0.859
prio   -0.0721 0.5946 0.441
mar     0.1337 2.0830 0.149
GLOBAL      NA 2.7478 0.432
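The grouping and the stratified fit can also be sketched without the car package, using base R's cut() with the same cut points. The data here are simulated (ages, fin, times are all assumptions for illustration), so the numbers will not match the slide.

```r
library(survival)

set.seed(6)
n <- 400
dat <- data.frame(age = sample(17:44, n, replace = TRUE),
                  fin = rbinom(n, 1, 0.5))
dat$time   <- rexp(n, rate = exp(-0.4 * dat$fin))
dat$status <- rbinom(n, 1, 0.7)

# Same cut points as on the slide: up to 19, 20 to 25, 26 to 30, 31 and older
dat$age.cat <- cut(dat$age, breaks = c(-Inf, 19, 25, 30, Inf), labels = 1:4)
table(dat$age.cat)

# Each age group gets its own baseline hazard; no coefficient is estimated for age
fit <- coxph(Surv(time, status) ~ fin + strata(age.cat), data = dat)
names(coef(fit))
```

The price of stratifying is that the effect of age can no longer be read off as a hazard ratio.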


Analyze PH assumption (Interaction)

To analyze the interaction between age and time, we have to transform the data so that each subject contributes one row per week at risk.

Example: for the 1st observation, we split the interval (0, 20] into (0, 1+], (1, 2+], (2, 3+], . . . , (18, 19+], (19, 20], where + marks an interval that ends without an arrest.

> Rossi[1, 1:10]
  week arrest fin age race wexp mar paro prio educ
1   20      1   0  27    1    0   0    1    3    3

> Rossi.2[1:20, 1:10]
     start stop arrest.time week arrest fin age . . . prio educ
1.1      0    1           0   20      1   0  27 . . .    3    3
. . .
1.19    18   19           0   20      1   0  27 . . .    3    3
1.20    19   20           1   20      1   0  27 . . .    3    3
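The transformation to one row per subject-week can be sketched in base R as follows. The three subjects are made-up toy values and the helper expand_subject is a hypothetical name introduced here; the slides do not show how Rossi.2 was actually built.

```r
# Toy wide data: one row per subject (hypothetical values)
wide <- data.frame(id = 1:3, week = c(3, 1, 4), arrest = c(1, 0, 1), age = c(27, 18, 19))

# Turn one subject into one row per week at risk
expand_subject <- function(i) {
  n_weeks <- wide$week[i]
  data.frame(
    id          = wide$id[i],
    start       = 0:(n_weeks - 1),
    stop        = 1:n_weeks,
    # the event indicator is 0 in every interval except possibly the last one
    arrest.time = c(rep(0, n_weeks - 1), wide$arrest[i]),
    age         = wide$age[i]
  )
}

long <- do.call(rbind, lapply(seq_len(nrow(wide)), expand_subject))
long
```

The result has as many rows as the total number of weeks at risk, and the number of events is unchanged.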


Analyze PH assumption (Interaction)

There is a significant interaction between time and age.

coxph(formula = Surv(start, stop, arrest.time) ~ fin + age + age:stop + prio + mar, data = Rossi.2)

             coef exp(coef) se(coef)     z       p
fin      -0.35971     0.698  0.19042 -1.89 0.05900
age       0.03552     1.036  0.03899  0.91 0.36000
prio      0.09868     1.104  0.02721  3.63 0.00029
mar      -0.50512     0.603  0.37302 -1.35 0.18000
age:stop -0.00371     0.996  0.00145 -2.56 0.01100

Likelihood ratio test=38.1 on 5 df, p=3.55e-07  n= 19809

Age initially has a positive partial effect on the hazard, but this effect gets smaller with time and becomes negative after about 10 weeks.
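The "about 10 weeks" figure follows directly from the two age coefficients in the output: the log hazard ratio per year of age at week t is 0.03552 - 0.00371 t. A short calculation, using only the numbers reported above:

```r
# Coefficients copied from the output above
b_age      <-  0.03552
b_age_stop <- -0.00371

# Log hazard ratio per year of age at week t
age_effect <- function(t) b_age + b_age_stop * t

age_effect(c(1, 10, 52))

# Week at which the effect changes sign
t_zero <- -b_age / b_age_stop
t_zero
```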


Influential observations

For each covariate we look at how much the regression coefficients change if we remove one observation.

In R, passing type="dfbeta" to the residuals() function produces a matrix of estimated changes in the regression coefficients upon deleting each observation in turn.

We then plot these changes.
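A minimal sketch of this check, on simulated data (the data frame and covariate names are assumptions for illustration):

```r
library(survival)

set.seed(7)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$time   <- rexp(n, rate = exp(0.5 * dat$x1))
dat$status <- rbinom(n, 1, 0.8)

fit <- coxph(Surv(time, status) ~ x1 + x2, data = dat)

# One row per observation, one column per coefficient
dfb <- residuals(fit, type = "dfbeta")
dim(dfb)

# Index plot for the first coefficient; points far from zero are influential
plot(dfb[, 1], xlab = "Observation", ylab = "Change in coefficient of x1")
abline(h = 0, lty = 2)
```

Comparing the size of these changes with the standard error of the coefficient helps judge whether an observation matters in practice.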


Influential observations (Just for fun)

Let us see what happens if I change some observations.

I take the first age observation and change it. First to age=60 and then to age=110.

See R


Checking nonlinearity

Nonlinearity is a problem in Cox regression as it is in linear and generalized linear models.

To detect nonlinearity we plot the Martingale residuals against covariates.

We add a smooth produced by local linear regression using the loess function and try to detect deviations from zero.
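The plot described above can be sketched as follows. The data are simulated so that the true effect of x is quadratic while the fitted model is linear, and lowess() is used for the smooth instead of loess() purely to keep the example short; both choices are assumptions of this sketch.

```r
library(survival)

set.seed(8)
n <- 300
dat <- data.frame(x = runif(n, 0, 4))
# True effect of x on the log hazard is quadratic, so a linear term is misspecified
dat$time   <- rexp(n, rate = exp(0.5 * (dat$x - 2)^2))
dat$status <- rbinom(n, 1, 0.8)

fit  <- coxph(Surv(time, status) ~ x, data = dat)
mres <- residuals(fit, type = "martingale")

plot(dat$x, mres, xlab = "x", ylab = "Martingale residual")
abline(h = 0, lty = 2)
lines(lowess(dat$x, mres), col = "red")   # a curved trend suggests nonlinearity
```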


Martingale residuals

The martingale residual for individual i at time $t_i$ is

$$\hat{M}_i = \delta_i - \hat{H}_i(t_i)$$

where $\delta_i$ is the event indicator, $\hat{H}_i$ is the estimated cumulative hazard function for individual i, and $t_i$ is the time at the end of follow-up for individual i.


In case of nonlinearity

We try to transform our covariate which is not linear.

We can try several transformations, e.g. log or sqrt.

We can also include higher order terms in our model and compare with the original model using likelihood ratio test.


Cox PH Model for Time-Dependent Variables in R


Time-Dependent Variables

Cox PH Model for Time-Dependent Variables

Data Transformation

Model with Time-Dependent Variables

Model with Lagged Time-Dependent Variable


Cox PH Model for Time-Dependent Variables

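Recall the Cox PH model for time-dependent variables:

$$h(t, X(t)) = h_0(t) \exp\Big[\sum_i \beta_i X_i + \sum_i \delta_i X_i \, g_i(t)\Big]$$

The function $g_i(t)$, which depends on time t, can be:

0 (time-independent variable)

t, ln(t), etc.

One variable at a time: $g_i(t) = 1$ if $t = t_0, t_1, t_2, \ldots$ and $g_i(t) = 0$ otherwise

Heaviside function: $g_i(t) = 1$ if $t \ge t_0$ and $g_i(t) = 0$ if $t < t_0$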


Data Transformation

Now, we want to assess the effect of weekly employment on the rearrest hazard.

The weekly employment indicators are stored in 52 columns of a single row per subject. We reshape them so that each week is a separate row.

> Rossi[1,]
  week arrest fin age race wexp mar paro prio educ emp1 emp2
1   20      1   0  27    1    0   0    1    3    3    0    0
  emp3 emp4 emp5 emp6 emp7 emp8 emp9 emp10 emp11 emp12 emp13
1    0    0    0    0    0    0    0     0     0     0     0
  emp14 emp15 emp16 emp17 emp18 emp19 emp20 emp21 emp22 emp23
1     0     0     0     0     0     0     0    NA    NA    NA
  emp24 emp25 emp26 emp27 emp28 emp29 emp30 emp31 emp32 emp33
1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
  emp34 emp35 emp36 emp37 emp38 emp39 emp40 emp41 emp42 emp43
1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
  emp44 emp45 emp46 emp47 emp48 emp49 emp50 emp51 emp52
1    NA    NA    NA    NA    NA    NA    NA    NA    NA


Data Transformation

Transformed data (weekly employment indicators in rows)

> Rossi.2[1:50,]
     start stop arrest.time week arrest fin age race wexp mar paro prio educ employed
1.1      0    1           0   20      1   0  27    1    0   0    1    3    3        0
. . .
1.19    18   19           0   20      1   0  27    1    0   0    1    3    3        0
1.20    19   20           1   20      1   0  27    1    0   0    1    3    3        0
2.1      0    1           0   17      1   0  18    1    0   0    1    8    4        0
2.2      1    2           0   17      1   0  18    1    0   0    1    8    4        0
. . .
2.16    15   16           0   17      1   0  18    1    0   0    1    8    4        0
2.17    16   17           1   17      1   0  18    1    0   0    1    8    4        0
. . .
7.1      0    1           0   23      1   0  19    1    1   1    1    0    4        1
7.2      1    2           0   23      1   0  19    1    1   1    1    0    4        1
. . .
7.22    21   22           0   23      1   0  19    1    1   1    1    0    4        1
7.23    22   23           1   23      1   0  19    1    1   1    1    0    4        0
. . .


Model with Time-Dependent Variables

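We treat weekly employment as a time-dependent predictor of the time to rearrest.

Suggested model:

$$h(t, X(t)) = h_0(t) \exp\Big[\sum_i \beta_i X_i + \delta_{employed} X_{employed}(t)\Big]$$

$X_{employed}(t)$ indicates whether the person is employed in week t (1) or not (0).

We estimate the coefficients $\beta_i$ and $\delta_{employed}$.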


Model with Time-Dependent Variable

The weekly employment variable has an apparently large effect.

The hazard of rearrest is smaller by a factor of exp(-1.3289) = 0.265 (a decline of 73.5%) during weeks in which the person is employed.

coxph(formula = Surv(start, stop, arrest.time) ~ fin + age + mar + prio + employed, data = Rossi.2)

  n= 19809

             coef exp(coef) se(coef)      z Pr(>|z|)
fin      -0.33898   0.71250  0.19037 -1.781  0.07498 .
age      -0.04598   0.95507  0.02059 -2.233  0.02552 *
mar      -0.36119   0.69684  0.37334 -0.967  0.33331
prio      0.08419   1.08784  0.02775  3.034  0.00241 **
employed -1.32897   0.26475  0.24979 -5.320 1.04e-07 ***


Model with Lagged Time-Dependent Variables

Claim: The direction of causality is not clear, because a person cannot work when he is in jail.

[Diagram: weekly employment at time t and arrest at time t (ambiguous causality); weekly employment at time t-1 and arrest at time t]


Model with Lagged Time-Dependent Variables

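Use a lagged value of employment, taken from the previous week.

Model with lagged time: the hazard at week t depends on employment in week t-1 instead of week t.

We apply the lag to the data and then use the same command and arguments in R. The employment values are shifted by one week relative to arrest.time.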
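Lagging a weekly covariate within each subject can be sketched in base R as below. The long-format toy data, the helper lag1 and the column name employed.lag are assumptions for illustration; the slides do not show how Rossi.3 was constructed.

```r
# Toy long-format data: one row per subject-week (hypothetical values)
long <- data.frame(
  id       = c(1, 1, 1, 2, 2),
  start    = c(0, 1, 2, 0, 1),
  stop     = c(1, 2, 3, 1, 2),
  employed = c(0, 1, 1, 1, 0)
)

# Within each subject, replace employment by its value in the previous week
lag1 <- function(v) c(NA, head(v, -1))
long$employed.lag <- ave(long$employed, long$id, FUN = lag1)

# The first week of each subject has no previous week and is dropped
lagged <- long[!is.na(long$employed.lag), ]
lagged
```

Dropping one row per subject is consistent with the row counts reported on the slides: n = 19809 before lagging and n = 19377 after, a difference of 432, the number of subjects.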


Model with Lagged Time-Dependent Variables

The coefficient for the lagged employment variable is still significant, but the estimated effect is much smaller: exp(-0.7891) = 0.45

coxph(formula = Surv(start, stop, arrest.time) ~ fin + age + mar + prio + employed, data = Rossi.3)

  n= 19377

             coef exp(coef) se(coef)      z Pr(>|z|)
fin      -0.33471   0.71554  0.19104 -1.752 0.079758 .
age      -0.05014   0.95110  0.02075 -2.416 0.015674 *
mar      -0.41269   0.66187  0.37385 -1.104 0.269650
prio      0.09217   1.09655  0.02752  3.350 0.000809 ***
employed -0.78918   0.45422  0.21704 -3.636 0.000277 ***


Summary


Final model

After we introduced the weekly employment into our model the marriage variable has become non-significant. We therefore remove it.

We also choose to have age as a strata variable for ease of interpretation because it does not satisfy the PH assumption.


Final model

coxph(formula = Surv(start, stop, arrest.time) ~ fin + strata(age.cat) + prio + employed, data = Rossi.3)

             coef exp(coef) se(coef)      z Pr(>|z|)
fin      -0.33454   0.71567  0.19078 -1.754 0.079502 .
prio      0.08984   1.09400  0.02707  3.319 0.000902 ***
employed -0.82758   0.43710  0.21583 -3.834 0.000126 ***
---
         exp(coef) exp(-coef) lower .95 upper .95
fin         0.7157      1.397    0.4924    1.0402
prio        1.0940      0.914    1.0375    1.1536
employed    0.4371      2.288    0.2863    0.6673

Rsquare= 0.002 (max possible= 0.053 )
Likelihood ratio test= 30.08 on 3 df, p=1.325e-06
Wald test            = 30.32 on 3 df, p=1.182e-06
Score (logrank) test = 31.42 on 3 df, p=6.933e-07
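The hazard ratios and their 95% confidence limits in this output follow from the coefficients and standard errors as exp(coef ± z · se). The short calculation below uses only the numbers reported above and reproduces the printed limits to rounding.

```r
# Coefficients and standard errors copied from the output above
coefs <- c(fin = -0.33454, prio = 0.08984, employed = -0.82758)
ses   <- c(fin =  0.19078, prio = 0.02707, employed =  0.21583)

z <- qnorm(0.975)   # about 1.96 for a 95% interval

hr <- cbind(
  "exp(coef)" = exp(coefs),
  "lower .95" = exp(coefs - z * ses),
  "upper .95" = exp(coefs + z * ses)
)
round(hr, 4)

# Percentage change in the hazard per unit increase of each covariate
round(100 * (exp(coefs) - 1), 1)
```

The interval for fin includes 1, which matches its p-value of about 0.08.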


Final model

Financial aid

        coef exp(coef) se(coef)      z Pr(>|z|)
fin -0.33454   0.71567  0.19078 -1.754 0.079502 .

The estimated hazard ratio for receiving financial aid is 0.71567.

This means that, holding the other covariates constant, the hazard of rearrest is about 28% lower for subjects who received financial aid.


Final model

Number of prior convictions

        coef exp(coef) se(coef)     z Pr(>|z|)
prio 0.08984   1.09400  0.02707 3.319 0.000902 ***

The estimated hazard ratio is 1.094. Holding the other covariates constant, each additional prior conviction increases the weekly hazard of rearrest by about 9 percent.


Final model

Employment

             coef exp(coef) se(coef)      z Pr(>|z|)
employed -0.82758   0.43710  0.21583 -3.834 0.000126 ***

The estimated hazard ratio is 0.43710. This means that the hazard of rearrest is about 56 percent lower in a week that follows a week in which the former inmate was employed (the employment variable is lagged by one week).


Summary

Cox PH Model for Time-Independent Variables in R

Surv and coxph function in R
Cox Regression
Adjusted survival curve

Model Selection

Why variable selection?
Purposeful selection
Stepwise selection
Best Subset Selection


Model Diagnostics

Analyze PH assumption with residuals
Influential observations
Checking nonlinearity

Cox PH Model for Time-Dependent Variables in R

Model description
Analysis for the result
Lagged variables

Final model

