Logistic Regression versus Cox Regression
Ch. Mélot, MD, PhD, MSciBiostat
Service des Soins Intensifs, Hôpital Universitaire Erasme
ESP, 26 February 2008
Why do we need multivariable analyses?
We live in a multivariable world. Most events, whether medical, political, social, or personal, have multiple causes. And these causes are related to one another.
Definition
Multivariable analysis is a tool for determining the relative contributions of different causes to a single event.
Note: the terms “multivariate analysis” and “multivariable analysis” are often used interchangeably. In the strict sense, multivariate analysis refers to simultaneously predicting multiple outcomes.
Multivariable approach
Y = f(X1, X2, X3, …)
Y: single dependent variable (outcome), e.g., dead or alive
X1, X2, X3, …: independent variables (risk factors, predictors), e.g., age, gender, …
Multivariate approach
$y_j = \beta_{0j} + \beta_{1j} x_1 + \beta_{2j} x_2 + \beta_{3j} x_3, \quad j = 1, 2, 3$
y1, y2, y3: multiple dependent variables (outcomes), e.g., countries
x1, x2, x3: independent variables (risk factors, predictors), e.g., drugs, …
MULTIVARIATE ANALYSIS
[Figure: two-dimensional plot, both axes from -2 to 2, positioning countries (Belgium-Luxembourg, France, Germany, Holland, Switzerland, Italy, Finland, UK, Ireland, Norway, Austria, Sweden, Spain, Portugal, Denmark) and drugs (midazolam, morphine, propofol, sufentanil, fentanyl).]
Soliman H.M., Mélot C., et al. Br. J. Anaesth. 2001;87:186-192
Definition
Outcome: n. That which comes out of, or follows from; issue; result. (Webster’s dictionary)
Outcome: status of the patient at the end of an episode of care: presence of symptoms, level of activity, and mortality.
Types of regression
If y = continuous -> linear regression
If y = categorical (1 or 0) -> logistic regression
If y = count of events in a period of time -> Poisson regression
If y = time to event (censored data) -> Cox regression
MULTIVARIABLE REGRESSION
If y = continuous variable: multiple regression
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$
If y = dichotomous variable: multiple logistic regression
$y = \dfrac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3}}$
$\mathrm{Logit}(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$
If y = count of events during a given period of time ($t_i$): multivariable Poisson’s regression
$y = t_i \, e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3}$
$\ln(y / t_i) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$
If y = time to event: multivariable Cox’s regression
$y = h_0(t) \, e^{\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3}$
$\ln(y / h_0(t)) = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$
| | Multiple linear regression | Multiple logistic regression | Proportional hazards analysis | Multiple Poisson’s regression |
|---|---|---|---|---|
| What is being modeled? | The mean value of the outcome | The logarithm of the odds of the outcome (logit) | The logarithm of the relative hazard | The logarithm of the count of the events |
| Relationship of multiple independent variables (X’s) to outcome | The mean value of the outcome changes linearly with X’s | The logit of the outcome changes linearly with X’s | The logarithm of the relative hazard changes linearly with X’s | The logarithm of the count of the events changes linearly with X’s |
| Distribution of the outcome variable | Normal | Binomial | None specified | Poisson |
| Variance of outcome variable | Equal around the mean | Depends only on the mean | None specified | Mean equals variance |
| Relative hazard over time | Not applicable | Not applicable | Constant | Not applicable |
Expression of the results
If y = continuous variable: multiple regression
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$
$\beta_1$ = "slope" for the risk factor $x_1$
If y = dichotomous variable: multivariable logistic regression
$\mathrm{Logit}(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$
$e^{\beta_1}$ = odds ratio for the risk factor $x_1$
If y = count of events during a given period of time ($t_i$): multivariable Poisson’s regression
$\ln(y / t_i) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$
$e^{\beta_1}$ = relative risk of the occurrence of the event during the period of time, or relative incidence
If y = time to event: multivariable Cox’s regression
$\ln(y / h_0(t)) = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$
$e^{\beta_1}$ = hazard ratio for the risk factor $x_1$, or incidence rate ratio
Cox versus Logistic Regression
Cox regression vs logistic regression
Distinction between rate and proportion:
– Incidence (hazard) rate: number of new cases of disease per population at risk per unit time (or mortality rate, if outcome is death)
– Cumulative incidence: proportion of new cases that develop in a given time period
Distinction between hazard/rate ratio and odds ratio/risk ratio:
– Hazard/rate ratio: ratio of incidence rates
– Odds/risk ratio: ratio of proportions
By taking time into account, you use more information than just binary yes/no: gain in power/precision.
Logistic regression aims to estimate the odds ratio; Cox regression aims to estimate the hazard ratio.
Risks vs Rates
Relationship between risk and rates:
$R(t) = 1 - e^{-ht}$
h = constant hazard rate
R(t) = probability of disease in time t
For example, if the rate is 5 cases/1000 person-years, then the chance of developing disease over 10 years is:
$R(t) = 1 - e^{-(0.005)(10)} = 1 - e^{-0.05} = 1 - 0.9512 = 0.0488$
Compare to 0.005 × 10 = 5%. The loss of persons at risk because they have developed disease within the period of observation is small relative to the size of the total group.
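The risk formula above can be checked numerically. The following is a minimal sketch, assuming a constant hazard; the function name `risk_from_rate` is illustrative and not part of the slides.

```python
import math

def risk_from_rate(rate, years):
    """Cumulative risk R(t) = 1 - exp(-h * t) under a constant hazard rate h."""
    return 1.0 - math.exp(-rate * years)

# 5 cases per 1000 person-years, followed for 10 years
exact = risk_from_rate(0.005, 10)
naive = 0.005 * 10  # simple multiplication, ignoring depletion of the at-risk group

print(f"R(10) from the formula: {exact:.4f}")
print(f"rate x time:            {naive:.4f}")
```

At this low rate the two figures are close, which is the point made on the slide.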
Risks vs Rates
If the rate is 50 cases/1000 person-years, then the chance of developing disease over 10 years is:
$R(t) = 1 - e^{-(0.05)(10)} = 1 - e^{-0.5} = 1 - 0.61 = 0.39$
Compare to 0.05 × 10 = 50%.

| Year | Persons at risk | Incidence: 0.050 |
|---|---|---|
| 1 | 1000 | 50 |
| 2 | 950 | 48 |
| 3 | 902 | 45 |
| 4 | 857 | 43 |
| 5 | 814 | 41 |
| 6 | 773 | 39 |
| 7 | 734 | 37 |
| 8 | 697 | 35 |
| 9 | 662 | 33 |
| 10 | 629 | 32 |
| Total | | 403 |
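The year-by-year depletion of the at-risk group can be sketched as follows. This is an illustration only: it keeps fractional persons, whereas the slide table rounds to whole persons, so its total (about 401 cases) differs slightly from the table's 403.

```python
import math

RATE = 0.05   # 50 cases per 1000 person-years
YEARS = 10

at_risk = 1000.0
total_cases = 0.0
print("year  at risk  cases")
for year in range(1, YEARS + 1):
    cases = at_risk * RATE          # new cases among those still at risk
    print(f"{year:>4} {at_risk:>8.1f} {cases:>6.1f}")
    total_cases += cases
    at_risk -= cases                # cases leave the at-risk group

discrete_risk = total_cases / 1000.0            # yearly steps
continuous_risk = 1.0 - math.exp(-RATE * YEARS)  # R(t) = 1 - e^{-ht}
print(f"discrete: {discrete_risk:.4f}  continuous: {continuous_risk:.4f}")
```

Both figures are well below the naive 0.05 × 10 = 50%, because the group at risk shrinks as cases occur.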
Risk vs Rates
Relationship between risk and rates (derivation):
Exponential density function for waiting time until the event (constant hazard rate):
$r(t) = h e^{-ht}$
$R(t) = \int_0^t h e^{-hu} \, du = \left[ -e^{-hu} \right]_0^t = -e^{-ht} - (-e^{0}) = 1 - e^{-ht}$
The waiting time distribution will change if the hazard rate changes as a function of time: h(t).
LOGISTIC REGRESSION
Data set (CHD: Coronary Heart Disease) (Yes: 1/No: 0)
PATID AGEGRP AGE CHD
1 1 20 0
2 1 23 0
3 1 24 0
4 1 25 0
5 1 25 1
6 1 26 0
7 1 26 0
8 1 28 0
9 1 28 0
10 1 29 0
11 2 30 0
12 2 30 0
13 2 30 0
14 2 30 0
15 2 30 0
16 2 30 1
17 2 32 0
18 2 32 0
19 2 33 0
20 2 33 0
21 2 34 0
22 2 34 0
23 2 34 1
24 2 34 0
25 2 34 0
26 3 35 0
27 3 35 0
28 3 36 0
29 3 36 1
30 3 36 0
31 3 37 0
32 3 37 1
33 3 37 0
34 3 38 0
35 3 38 0
36 3 39 0
37 3 39 1
38 4 40 0
39 4 40 1
40 4 41 0
41 4 41 0
42 4 42 0
43 4 42 0
44 4 42 0
45 4 42 1
46 4 43 0
47 4 43 0
48 4 43 1
49 4 44 0
50 4 44 0
51 4 44 1
52 4 44 1
53 5 45 0
54 5 45 1
55 5 46 0
56 5 46 1
57 5 47 0
58 5 47 0
59 5 47 1
60 5 48 0
61 5 48 1
62 5 48 1
63 5 49 0
64 5 49 0
65 5 49 1
66 6 50 0
67 6 50 1
68 6 51 0
69 6 52 0
70 6 52 1
71 6 53 1
72 6 53 1
73 6 54 1
74 7 55 0
75 7 55 1
76 7 55 1
77 7 56 1
78 7 56 1
79 7 56 1
80 7 57 0
81 7 57 0
82 7 57 1
83 7 57 1
84 7 57 1
85 7 57 1
86 7 58 0
87 7 58 1
88 7 58 1
89 7 59 1
90 7 59 1
91 8 60 0
92 8 60 1
93 8 61 1
94 8 62 1
95 8 62 1
96 8 63 1
97 8 64 0
98 8 64 1
99 8 65 1
100 8 69 1
LINEAR REGRESSION
[Figure: scatter plot of CHD (0 = No, 1 = Yes) against Age, yrs (0 to 80), with fitted line y = 0.0218 x - 0.538, R² = 0.264.]
LOGISTIC REGRESSION
[Figure: bar chart of number of patients (0 to 20) by age group (yrs: 20-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-69), CHD=0 (n = 57) vs CHD=1 (n = 43).]
[Figure: the same age groups shown as percentages (0% to 100%) of patients with CHD=0 (n = 57) and CHD=1 (n = 43).]
LOGISTIC REGRESSION
[Figure: proportion with CHD (0.00 to 1.00) against Age, yrs (20 to 70).]
[Figure: proportion with CHD (0 to 1.0) against Age, yrs (0 to 100), with the fitted logistic curve:]
$\pi(x) = \dfrac{e^{-5.31 + 0.111\,\mathrm{Age}}}{1 + e^{-5.31 + 0.111\,\mathrm{Age}}}$
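A fit of this kind can be sketched in plain Python. The slides do not say which software or algorithm produced the coefficients -5.31 and 0.111; the sketch below assumes maximum likelihood by Newton-Raphson, with AGE and CHD transcribed from the data set in PATID order. The function name `fit_logistic` and the iteration count are illustrative choices.

```python
import math

# AGE and CHD (1 = yes, 0 = no) for the 100 patients, in PATID order.
AGES = [20, 23, 24, 25, 25, 26, 26, 28, 28, 29,
        30, 30, 30, 30, 30, 30, 32, 32, 33, 33,
        34, 34, 34, 34, 34, 35, 35, 36, 36, 36,
        37, 37, 37, 38, 38, 39, 39, 40, 40, 41,
        41, 42, 42, 42, 42, 43, 43, 43, 44, 44,
        44, 44, 45, 45, 46, 46, 47, 47, 47, 48,
        48, 48, 49, 49, 49, 50, 50, 51, 52, 52,
        53, 53, 54, 55, 55, 55, 56, 56, 56, 57,
        57, 57, 57, 57, 57, 58, 58, 58, 59, 59,
        60, 60, 61, 62, 62, 63, 64, 64, 65, 69]
CHD = [int(c) for c in (
    "0000100000" "0000010000" "0010000010" "0100001010" "0000100100"
    "1101010010" "1100101001" "1110111110" "0111101111" "0111110111")]

def fit_logistic(x, y, iterations=25):
    """Fit logit(p) = b0 + b1 * x by Newton-Raphson on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient with respect to b0
            g1 += (yi - p) * xi     # gradient with respect to b1
            h00 += w                # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det   # Newton step: inverse information x gradient
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0, b1 = fit_logistic(AGES, CHD)
print(f"intercept = {b0:.3f}, slope per year of age = {b1:.4f}")
```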
LOGISTIC TRANSFORMATION
$\mathrm{Logit}[\pi(x)] = \ln\left[ \dfrac{\pi(x)}{1 - \pi(x)} \right]$
$\pi(x) = \dfrac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$
LOGISTIC REGRESSION
$\mathrm{Logit}\,\pi(x) = -5.31 + 0.111\,\mathrm{age}$
[Figure: logit of the proportion with CHD (-3 to 3) against Age, yrs (0 to 100).]
Odds and Probability
Probability = 1/6 = 0.167
Odds in favour = (1/6) / (5/6) = 1/5 = 0.20
Odds against = (5/6) / (1/6) = 5 against 1
In general, odds = $\pi(x) / (1 - \pi(x))$
ODDS RATIO AND LOGISTIC REGRESSION
$OR = \dfrac{\pi(x=1) / [1 - \pi(x=1)]}{\pi(x=0) / [1 - \pi(x=0)]} = e^{\beta_1}$
Example: $OR = e^{0.111 \times 10} = 3.03$ (10-year increase in age)
$\ln(OR) = \beta_1$
95 % CI for OR = $e^{\beta_1 \pm 1.96\,SE(\beta_1)}$
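The link between the logit, the predicted probability and the odds ratio can be illustrated with the coefficients reported in these slides (-5.31 and 0.111 per year of age). This is a minimal sketch; the function names are illustrative.

```python
import math

B0, B1 = -5.31, 0.111  # intercept and age coefficient reported in the slides

def logit(age):
    """Linear predictor: log-odds of CHD at a given age."""
    return B0 + B1 * age

def probability(age):
    """Inverse logit: predicted proportion with CHD at a given age."""
    return 1.0 / (1.0 + math.exp(-logit(age)))

def odds_ratio(coef, delta=1.0):
    """Odds ratio for an increase of `delta` units in the risk factor."""
    return math.exp(coef * delta)

for age in (30, 50, 70):
    print(f"age {age}: logit = {logit(age):+.2f}, P(CHD) = {probability(age):.3f}")

or_10_years = odds_ratio(B1, 10)
print(f"OR for a 10-year increase in age = {or_10_years:.2f}")
```

Because the model is linear on the logit scale, the odds ratio for 10 years is the same whatever the starting age.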
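Forest plot: Odds Ratio with 95 % confidence interval
2 × 2 table:
a b
c d
$OR = \dfrac{a\,d}{b\,c}$
$SE(\ln(OR)) = \sqrt{\dfrac{1}{a} + \dfrac{1}{b} + \dfrac{1}{c} + \dfrac{1}{d}}$
95 % CI = $e^{\ln(OR) \pm 1.96\,SE(\ln(OR))}$
[Figure: forest plot on an OR axis (0, 0.5, 1, 2, 3). The position of each point shows the amplitude of the observed effect; the width of its interval shows the precision of the observed effect. OR = 1: Trt A = Trt B; either side of 1: Trt A > Trt B or Trt B > Trt A (favours active, favours placebo). Intervals excluding 1: p < 0.05; interval including 1: p = ns.]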
COX’s REGRESSION
Cox model
A Cox model is a well-recognized statistical technique for exploring the relationship between the occurrence of an event (e.g., death, relapse, …) in a patient and several explanatory variables.
Survival analysis is concerned with studying the time between entry to a study and a subsequent event (such as death).
Censored survival times occur if the event of interest does not occur for a patient during the study period.
Survival Analysis: Terms
Time-to-event: the time from entry into a study until a subject has a particular outcome.
Censoring: subjects are said to be censored if they are lost to follow-up or drop out of the study, or if the study ends before they die or have an outcome of interest. They are counted as alive or disease-free for the time they were enrolled in the study.
– If dropout is related to both outcome and treatment, dropouts may bias the results.
Right Censoring (T>t)
Common examples:
– Termination of the study
– Death due to a cause that is not the event of interest
– Loss to follow-up
We know that the subject survived at least to time t.
Left censoring (T<t)
The event time is known only to be less than some value.
For example, if you are studying menarche and you begin following girls at age 12, you may find that some of them have already begun menstruating. Unless you can obtain information about the start date for those girls, the age of menarche is left-censored at age 12.
Interval censoring (a<T<b)
When we know the event has occurred between two time points, but don’t know the exact dates.
For example, if you’re screening subjects for HIV infection yearly, you may not be able to determine the exact date of infection.
Data Structure: survival analysis
Time variable: $t_i$ = time at last disease-free observation or time at event.
Censoring variable: $c_i$ = 1 if had the event; $c_i$ = 0 if no event by time $t_i$.
Introduction to Kaplan-Meier
Non-parametric estimate of survivor function.
Commonly used to describe survivorship of study populations.
Commonly used to compare two study populations.
Intuitive graphical presentation.
Survival Data (right-censored)
[Figure: follow-up lines for Subjects A to E, from beginning of study to end of study, time in months.]
1. Subject E dies at 4 months
2. Subject A drops out after 6 months
3. Subject C dies at 7 months
4. Subjects B and D survive for the whole year-long study period
Corresponding Kaplan-Meier Curve
[Figure: step curve starting at 100%, time in months.]
Subject E dies at 4 months: probability of surviving to just before 4 months is 100% = 5/5; fraction surviving this death = 4/5.
Subject A drops out after 6 months.
Subject C dies at 7 months: fraction surviving this death = 2/3.
Product limit estimate of survival = P(surviving/at-risk through failure 1) × P(surviving/at-risk through failure 2) = 4/5 × 2/3 = 0.5333
The product limit estimate
The probability of surviving the entire year, taking into account censoring = (4/5) (2/3) = 53%
NOTE:
– > 40% (2/5) because the one drop-out survived at least a portion of the year.
– < 60% (3/5) because we don’t know if the one drop-out would have survived until the end of the year.
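The product limit calculation for the five subjects can be sketched as below. The function `kaplan_meier` is a minimal illustration written for this example, not a library routine; the 12-month follow-up for subjects B and D is taken from the year-long study period.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times:  follow-up time of each subject
    events: 1 if the event was observed at that time, 0 if censored
    """
    survival = 1.0
    curve = []
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)   # still followed just before t
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= (at_risk - deaths) / at_risk     # fraction surviving this death
        curve.append((t, survival))
    return curve

# Subjects A, B, C, D, E (time in months)
times = [6, 12, 7, 12, 4]
events = [0, 0, 1, 0, 1]   # A dropped out; B and D were alive at study end

curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"month {t}: S(t) = {s:.4f}")
```

Subject A contributes to the risk set at month 4 but not at month 7, which is why the second factor is 2/3 rather than 3/4.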
Cox model
$h(t) = h_0(t) \, e^{b_1 x_1 + b_2 x_2 + \dots}$
$h_0(t)$: non-parametric part; $e^{b_1 x_1 + b_2 x_2 + \dots}$: parametric part
h(t) = hazard function, i.e. the probability of death at time t
h0(t) = baseline or underlying hazard function, and corresponds to the probability of dying when all the explanatory variables are zero. The baseline hazard function is analogous to the intercept in ordinary regression (since $e^0 = 1$).
The risk of dying at time t (h0(t)) is equal to the number of deaths divided by the number of patients at risk of dying at time t (risk set).
Survival analysis takes into account patients who did not reach time t. They are subtracted from the number of patients at risk of dying.
What is a hazard function?
The hazard function (h(t)) is the probability that an individual will experience an event (e.g., death) within a small time interval, given that the individual has survived up to the beginning of the interval. It can therefore be interpreted as the risk of dying at time t.
$h(t) = \dfrac{\text{N of individuals experiencing an event in interval beginning at } t}{(\text{N of individuals surviving at time } t) \times (\text{interval width})}$
Assumptions in a Cox model
The relationship between the dependent variable (outcome) and the explanatory variables must be constant over time. It is called the proportional hazards assumption.
VALsartan In Acute myocardial iNfarcTion (VALIANT study)
14,808 patients randomized
Informed consent not ensured: 105 patients
14,703 patients

| Group | N | Vital status unknown | Vital status known |
|---|---|---|---|
| Captopril | 4909 | 38 (0.8%) | 4871 (99.2%) |
| Valsartan | 4909 | 53 (1.1%) | 4856 (98.9%) |
| Combination | 4885 | 48 (1.0%) | 4837 (99.0%) |

Median follow-up: 24.7 months
Pfeffer, McMurray, Velazquez, et al. N Engl J Med 2003;349:1893–1906
Testing PH: VALIANT example.
[Figure: log(-log(S(t))) (-7 to -2) against Days (from AMI) (1 to 15), for No HXMI and HXMI.]
Under PH, curves should be parallel (should not cross).
Real change in effect of HXMI (History of MI) over time?
History of MI good for early survival, bad for later survival?
Cox’s regression model
$h(t) = h_0(t) \, e^{b_1 x_1 + b_2 x_2 + \dots}$
$\log_e \dfrac{h(t)}{h_0(t)} = b_1 x_1 + b_2 x_2 + \dots$
Hazard ratio for $x_1$ = $e^{b_1}$
HR for $x_2$ = $e^{b_2}$
Example
[Figure: estimates from the PROGRESS trial, each labelled "significant" or "non significant".]
PROGRESS, Lancet 2001;358:1033-1041
Survival after hepatic surgery for cancer

| Pat ID | Age (years) | A: Time (weeks) | B: Number at risk at start of study | C: Number of deaths | D: Number censored | E: Proportion surviving until end of week | F: Cumulative proportion surviving |
|---|---|---|---|---|---|---|---|
| | | 0 | 18 | - | - | | 1.000 |
| 1 | 59 | 10 | 18 | 1 | 0 | 1 – 1/18 = 0.944 | 0.944 |
| 2 | 56 | 13* | 17 | 0 | 1 | 1 – 0/17 = 1.000 | 0.944 |
| 3 | 54 | 18* | 16 | 0 | 1 | 1 – 0/16 = 1.000 | 0.944 |
| 4 | 67 | 19 | 15 | 1 | 0 | 1 – 1/15 = 0.933 | 0.882 |
| 5 | 37 | 23* | 14 | 0 | 1 | 1 – 0/14 = 1.000 | 0.882 |
| 6 | 55 | 30 | 13 | 1 | 0 | 1 – 1/13 = 0.923 | 0.8137 |
| 7 | 65 | 36 | 12 | 1 | 0 | 1 – 1/12 = 0.917 | 0.7459 |
| 8 | 60 | 38* | 11 | 0 | 1 | 1 – 0/11 = 1.000 | 0.7459 |
| 9 | 58 | 54* | 10 | 0 | 1 | 1 – 0/10 = 1.000 | 0.7459 |
| 10 | 57 | 56* | 9 | 0 | 1 | 1 – 0/9 = 1.000 | 0.7459 |
| 11 | 52 | 59 | 8 | 1 | 0 | 1 – 1/8 = 0.875 | 0.6526 |
| 12 | 46 | 75 | 7 | 1 | 0 | 1 – 1/7 = 0.857 | 0.5594 |
| 13 | 43 | 93 | 6 | 1 | 0 | 1 – 1/6 = 0.833 | 0.4662 |
| 14 | 58 | 97 | 5 | 1 | 0 | 1 – 1/5 = 0.800 | 0.3729 |
| 15 | 39 | 104* | 4 | 0 | 1 | 1 – 0/4 = 1.000 | 0.3729 |
| 16 | 43 | 107 | 3 | 1 | 0 | 1 – 1/3 = 0.667 | 0.2486 |
| 17 | 45 | 107* | 2 | 0 | 1 | 1 – 0/2 = 1.000 | 0.2486 |
| 18 | 37 | 107* | 2 | 0 | 1 | 1 – 0/2 = 1.000 | 0.2486 |

* censored observation
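Column F of the table can be reproduced row by row. The sketch below follows the table's own layout, one patient per row with the number at risk falling by one each time; the helper names are illustrative, and the median is taken as the first time at which the cumulative proportion reaches 50% or less.

```python
# (time in weeks, died) for the 18 patients, in table order; died = 0 means censored (*)
FOLLOW_UP = [(10, 1), (13, 0), (18, 0), (19, 1), (23, 0), (30, 1),
             (36, 1), (38, 0), (54, 0), (56, 0), (59, 1), (75, 1),
             (93, 1), (97, 1), (104, 0), (107, 1), (107, 0), (107, 0)]

def cumulative_survival(rows):
    """Return [(time, number at risk, cumulative proportion surviving)] per row."""
    at_risk = len(rows)
    cumulative = 1.0
    table = []
    for time, died in rows:
        cumulative *= 1 - died / at_risk   # column E multiplied into column F
        table.append((time, at_risk, cumulative))
        at_risk -= 1                       # one patient leaves the risk set per row
    return table

def median_survival(table):
    """First time at which the cumulative proportion surviving is 50% or less."""
    for time, _, cumulative in table:
        if cumulative <= 0.5:
            return time
    return None

km_table = cumulative_survival(FOLLOW_UP)
n_deaths = sum(died for _, died in FOLLOW_UP)
median_weeks = median_survival(km_table)
for time, at_risk, cumulative in km_table:
    print(f"week {time:>3}  at risk {at_risk:>2}  cumulative {cumulative:.4f}")
print(f"{n_deaths} deaths / {len(FOLLOW_UP)}, median survival {median_weeks} weeks")
```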
Kaplan-Meier estimate of the survivor function
[Figure: survival probability (%) (0 to 100) against time in weeks to death following surgery for liver cancer (0 to 120).]
9 deaths / 18
Median survival: 93 wks
Cox’s regression model
$h(t) = h_0(t) \, e^{0.13\,\mathrm{age}}$
$\log_e \dfrac{h(t)}{h_0(t)} = 0.13\,\mathrm{age}$
HR = $e^{0.13}$ = 1.14
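The coefficient 0.13 per year of age translates into hazard ratios as sketched below. The 10-year comparison and the two example ages (59 and 37, both present in the patient table) are chosen here for illustration; the function name is not from the slides.

```python
import math

AGE_COEF = 0.13  # Cox coefficient for age (per year) reported in the slides

def hazard_ratio(coef, delta=1.0):
    """Hazard ratio for a difference of `delta` units in the covariate."""
    return math.exp(coef * delta)

hr_per_year = hazard_ratio(AGE_COEF)
hr_per_decade = hazard_ratio(AGE_COEF, 10)
hr_59_vs_37 = hazard_ratio(AGE_COEF, 59 - 37)

print(f"HR per year of age:     {hr_per_year:.2f}")
print(f"HR per 10 years of age: {hr_per_decade:.2f}")
print(f"HR, age 59 vs age 37:   {hr_59_vs_37:.1f}")
```

Under the proportional hazards assumption these ratios hold at every time t, since the baseline hazard cancels out.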
Cox’s regression model
[Figure: survival probability (%) (0 to 100) against time in weeks to death following surgery for liver cancer (0 to 120); the plot also carries the label HR = 1.39.]
HR = 1.14 (1.03 – 1.26), p = 0.0098
Logistic or Cox model to identify risk factors?
Cox model versus logistic regression
Logistic regression requires a similar period of observation for all subjects, to avoid the influence of time on the outcome.
Cox regression can take into account different periods of observation.
Comparison with complete follow-up (fictitious example)
Generated survival times from exponential distribution
Assume 100% follow-up for 365 days
Survival times > 365 days are right censored
Analysis with logistic and proportional hazard regressions

| | Placebo | Treat |
|---|---|---|
| N | 1,000 | 1,000 |
| Median survival time (uncensored) | 1,435 days | 2,195 days |
| Deaths in 365 days | 158 (15.8%) | 112 (11.2%) |
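A data set of this kind can be sketched as follows. The slides do not give the generating parameters, so this sketch assumes that the reported medians (1,435 and 2,195 days) are the true medians of the two exponential distributions; the seed is arbitrary, and the simulated counts will not match the slide's 158 and 112 exactly.

```python
import math
import random

FOLLOW_UP_DAYS = 365
N_PER_ARM = 1000

def hazard_from_median(median_days):
    """Constant hazard of an exponential distribution with the given median."""
    return math.log(2) / median_days

def simulate_arm(hazard, n, rng):
    """Return (deaths within follow-up, person-days observed up to censoring)."""
    times = [rng.expovariate(hazard) for _ in range(n)]
    deaths = sum(1 for t in times if t <= FOLLOW_UP_DAYS)
    person_days = sum(min(t, FOLLOW_UP_DAYS) for t in times)  # right censoring at 365
    return deaths, person_days

h_placebo = hazard_from_median(1435)   # assumption: reported median = true median
h_treat = hazard_from_median(2195)

# Expected proportion dying within 365 days, R(t) = 1 - exp(-h t)
expected_placebo = 1 - math.exp(-h_placebo * FOLLOW_UP_DAYS)
expected_treat = 1 - math.exp(-h_treat * FOLLOW_UP_DAYS)
true_hr = h_treat / h_placebo

rng = random.Random(2008)              # arbitrary seed, for reproducibility
d_p, pd_p = simulate_arm(h_placebo, N_PER_ARM, rng)
d_t, pd_t = simulate_arm(h_treat, N_PER_ARM, rng)
rate_ratio = (d_t / pd_t) / (d_p / pd_p)   # crude incidence rate ratio

print(f"expected deaths: placebo {expected_placebo:.1%}, treat {expected_treat:.1%}")
print(f"simulated deaths: placebo {d_p}, treat {d_t}")
print(f"true HR {true_hr:.3f}, simulated rate ratio {rate_ratio:.3f}")
```

Under these assumptions the expected proportions (about 16% and 11%) are close to the 15.8% and 11.2% shown in the table.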
Kaplan Meier curves for fictitious example
[Figure: survival (0.8 to 1) against days from randomization to death (0 to 350), Placebo vs Treatment.]
log-rank p-value = 0.0026
Logistic vs Cox PH regression

| Method | Result | p-value |
|---|---|---|
| Contingency table | Placebo: 158/1000, Treat: 112/1000; OR = (a × d) / (b × c) = 0.672 | χ² p-value = 0.0026 |
| Logistic regression | OR (for treat) = 0.672 (0.518, 0.872) | 0.0027 |
| Cox PH regression (no censoring) | HR (for treat) = 0.648 (0.592, 0.708) | < 0.0001 |
| Cox PH regression (with censoring) | HR (for treat) = 0.691 (0.543, 0.881) | 0.0028 |
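The odds ratio and its interval in the table above can be recomputed from the 2 × 2 counts. This is a sketch using the usual log-scale interval, with SE(ln OR) = sqrt(1/a + 1/b + 1/c + 1/d); the cell labels a to d follow the contingency table layout (treated deaths, treated survivors, placebo deaths, placebo survivors).

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio (a*d)/(b*c) with a confidence interval computed on the log scale."""
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Treat: 112 deaths, 888 survivors; Placebo: 158 deaths, 842 survivors
or_treat, ci_low, ci_high = odds_ratio_ci(112, 888, 158, 842)
print(f"OR = {or_treat:.3f} ({ci_low:.3f}, {ci_high:.3f})")
```

The interval is not symmetric around the OR, because it is symmetric on the logarithmic scale.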
Cox model vs Logistic regression

| Variable | Multivariable Cox model: HR | Low 95 % CI | High 95 % CI | p | Multivariable logistic regression: OR | Low 95 % CI | High 95 % CI | p |
|---|---|---|---|---|---|---|---|---|
| SAPS II score >39 | 2.05 | 1.68 | 2.50 | <0.0001 | 1.39 | 1.02 | 1.89 | 0.04 |
| No ultimately fatal disease (McCabe 1) | 0.48 | 0.39 | 0.58 | <0.0001 | 0.43 | 0.32 | 0.58 | <0.0001 |
| Chronic liver disease | 1.46 | 1.09 | 1.95 | 0.01 | 1.90 | 1.12 | 3.22 | 0.02 |
| Decisions to forego life-sustaining therapy | 1.89 | 1.42 | 2.51 | <0.0001 | 16.56 | 11.00 | 24.90 | <0.0001 |
| Worsening of the LOD score within the first week after ICU admission | 1.36 | 1.32 | 1.39 | <0.0001 | 1.60 | 1.48 | 1.72 | <0.0001 |

Azoulay E., et al. Intens. Care Med 2003;29:1895-1901
Cox model vs Logistic regression

| Variable | Cox model: HR | Low 95 % CI | High 95 % CI | p | Logistic regression: OR | Low 95 % CI | High 95 % CI | p |
|---|---|---|---|---|---|---|---|---|
| Uncensored database (survival at day 28): Albumin (yes = 1) | 1.44 | 1.21 | 1.73 | <0.0001 | 2.58 | 2.05 | 3.25 | <0.0001 |
| Censored database (ICU survival): Albumin (yes = 1) | 1.11 | 0.91 | 1.36 | 0.307 | 2.77 | 2.18 | 3.52 | <0.0001 |

SOAP study
Conclusions: Cox regression vs logistic regression
Cox model: estimates hazard/rate ratio: ratio of incidence rates
Logistic model: estimates odds/risk ratio: ratio of proportions
By taking time into account, you use more information than just binary yes/no.
Logistic regression requires a similar period of observation for all subjects, to avoid the influence of time on the outcome.