SURVIVAL ANALYSIS
Development
Workshop
What is endogeneity and why do we not like it? [REPETITION]
Three causes:
– X influences Y, but Y reinforces X too
– Z causes both X and Y fairly contemporaneously
– X causes Y, but we cannot observe X, and Z (which we do observe) is influenced by X but also by Y
Consequences:
– No matter how many observations we have, the estimators remain biased (this is called inconsistency)
– Ergo: whatever point estimates we find, we cannot even tell whether they are positive, negative or significant, because we do not know the size of the bias and have no way to estimate it
The mystery of staying alive
Everything started in medicine and biology
Key question: can we talk about determinants of survival from t1 to t2, knowing at least that part of the people survived from t0 to t1?
No magic wands, we cannot "guess" the future, but:
– Surviving till time T means S(T) = P(Y>T)
– We can estimate P(Y>T) on our sample (what is random here?)
– Time is discrete (eventually there is nobody left to die…)
– P(surviving till T) = S(T) = P(Y>T) = p(t1) · p(t2) · ... · p(tN)
[Figure: timeline running from t0 through t1, t2, t3, ... to tN, with T marked]
Technically speaking
Each period we may estimate (e.g. with a probit) the probability of surviving till t, given being still alive in t-1:
P(live in t | survived till t-1)
De facto, this is a sequence of estimations:
p(live_t|live_t-1), p(live_t+1|live_t), p(live_t+2|live_t+1), etc.
For each t_i we may specify:
– n_{i-1} – no. of people at risk in t_{i-1}, i.e. "momentarily earlier"
– d_i – no. of people who died between t_{i-1} and t_i
– c_i – no. of people censored between t_{i-1} and t_i (they leave the sample without a recorded death)
n_i = n_{i-1} – d_i – c_i, n_0 = N
Probability of staying alive between t_{i-1} and t_i:
p(t_i) = P(t_i|Y>t_{i-1}) = (n_{i-1} – d_i)/n_{i-1} = 1 – d_i/n_{i-1}
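Chaining the period-by-period probabilities gives the product form of the estimator. One compact way to write it, using the same n and d as in the definitions above (the hat on S is notation added here to mark an estimate):

```latex
\[
\hat S(t_k) \;=\; \prod_{i=1}^{k} p(t_i)
\;=\; \prod_{i=1}^{k}\left(1 - \frac{d_i}{n_{i-1}}\right),
\qquad \hat S(t_0) = 1 .
\]
```

Periods without a death have d_i = 0 and contribute a factor of 1, so only death times move the curve.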
Example
– Clinical data
– 20 observations, 10 deaths, 10 censored observations (people still alive when the observation window ended)
– Observation period (follow-up, FU) counted in months since treatment ended
Kaplan-Meier estimator
S(t1) = P(Y>t1) = P(t1|Y>t0)*P(Y>t0) = (1 – 1/20)*1 = 0.95
t0 = 0, t1 = 2.3655
n0 = 20, d1 = 1, c1 = 0, n1 = 20 – 1 = 19
(i = observation, t = time in months, d = death, c = censored)
i     1     2     3     4     5     6     7     8     9    10
t  2.37  2.40  2.79  3.19  3.91  6.64  7.10  8.02  8.05  8.21
d     1     1     0     1     1     0     1     0     1     0
c     0     0     1     0     0     1     0     1     0     1
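The recursion can be checked by hand for the first ten rows of the table. The sketch below is a minimal Stata illustration, not the workshop's own code. It types in the ten rows shown above and starts from n0 = 20. It assumes each row removes exactly one person from the risk set, which holds in this table since d + c = 1 in every row. The variable names n_prev, p and S are chosen here for illustration.

```stata
clear
input i t d c
 1 2.37 1 0
 2 2.40 1 0
 3 2.79 0 1
 4 3.19 1 0
 5 3.91 1 0
 6 6.64 0 1
 7 7.10 1 0
 8 8.02 0 1
 9 8.05 1 0
10 8.21 0 1
end

* People at risk just before t_i: 20 at the start, one fewer after each row
gen n_prev = 20 - (i - 1)

* Conditional probability of surviving past t_i: 1 - d_i/n_{i-1}
gen p = 1 - d/n_prev

* Kaplan-Meier estimate: running product of p, computed as exp of a running sum of logs
gen S = exp(sum(ln(p)))

list i t d c n_prev p S, clean
```

The first value of S reproduces the 0.95 computed above. Censored rows leave S unchanged and only shrink the risk set for later rows.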
Kaplan-Meier estimator
i     14     15     16     17     18     19     20
t  11.47  11.79  15.64  15.70  19.70  21.94  24.30
d      1      0      0      1      0      1      0
c      0      1      1      0      1      0      1
S(t19) = P(Y>t19) = P(t19|Y>t18)*P(Y>t18) = (1 – 1/2)*0.3863 = 0.5*0.3863 ≈ 0.19
t18 = 19.7043, t19 = 21.9425
n18 = 2, d19 = 1, c19 = 0, n19 = 2 – 1 = 1
Kaplan-Meier estimator
[Figure: Kaplan-Meier survival estimate; x-axis: analysis time, 0 to 24; y-axis: survival probability, 0.00 to 1.00]
Additional pitfalls
Assume that survival is conditional on something (not just numbers thrown now and then)
– Crucial assumption: the distribution of survival
– Exponential: λ(t) = λ (constant within the sample?)
– Weibull: λ(t) = λ^p · p · t^(p-1) (variable, but how?)
– Gompertz: λ(t) = e^(α+βt) (also variable…?); Gompertz-Makeham adds a constant term
– Gamma: S(t) = 1 – I_k(λt) (also variable…?)
– A whole variety…
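Each assumed hazard implies a survival function through the standard relation between the two. For the exponential and Weibull cases the integral has a closed form; this is a general identity rather than anything specific to the workshop data:

```latex
\[
S(t) = \exp\!\left(-\int_0^t \lambda(u)\,du\right)
\]
\[
\text{Exponential: } \lambda(t)=\lambda \;\Rightarrow\; S(t)=e^{-\lambda t},
\qquad
\text{Weibull: } \lambda(t)=\lambda^{p}\, p\, t^{p-1} \;\Rightarrow\; S(t)=e^{-(\lambda t)^{p}}
\]
```

With p = 1 the Weibull collapses to the exponential; p > 1 means a hazard that rises with time, p < 1 one that falls.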
Pros and cons of KM estimator
Advantages:
– Intuitive
– Little data needs
– Computed on data (can always get it)
Disadvantages:
– Cannot know if some characteristics help/inhibit survival
– No tests, statistical hypotheses, etc.
Overall: nice drawing tool, poor analytical tool => need an analytical tool
Other similar estimators
Nelson-Aalen
– Similar to KM
– Starts from the hazard function and not the survival function
– Computed on data: nothing more than visualisation
You could test for two (or more) groups as well
– Mantel-Haenszel: if the probability of death were similar across two groups, the no. of observations still alive at each point in time should keep the same proportion => testable
– Cox
– Combinations of different tests
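Both ideas can be sketched in a few lines of Stata. The workshop data are not available here, so this sketch assumes the drug-trial example dataset that ships with Stata (sysuse cancer) as a stand-in. The variables studytime, died and drug belong to that dataset, not to the slides.

```stata
sysuse cancer, clear

* Declare survival-time data: time variable plus failure indicator
stset studytime, failure(died)

* Nelson-Aalen cumulative hazard, one curve per treatment group
sts graph, cumhaz by(drug)

* Log-rank (Mantel-Haenszel type) test of equal survivor functions across groups
sts test drug
```

A small p-value from sts test suggests the groups do not share one survivor function. Like the KM curve, it says nothing about which other characteristics drive the difference.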
Cox model
Define
– h(t) – hazard function: probability of dying at t if you survived until t
– x1, x2, ..., xk – set of risk factors
– h0(t) – baseline hazard function in the base group, t – observation time
– β1, β2, ..., βk – coefficients of the model
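\[ h(t, x_1, \dots, x_k) = h_0(t)\, e^{\beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k} \]
For a single binary factor x with coefficient b:
\[ \frac{h(t, x=1)}{h(t, x=0)} = \frac{h_0(t)\, e^{b \times 1}}{h_0(t)\, e^{b \times 0}} = e^{b} \]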
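To see what the hazard ratio means in practice, take a purely hypothetical coefficient b = 0.7 on a binary factor x (the number is made up for illustration and does not come from the workshop data):

```latex
\[
\frac{h(t, x=1)}{h(t, x=0)} = e^{b} = e^{0.7} \approx 2.01
\]
```

At every point in time the hazard in the x = 1 group is then about twice the hazard in the x = 0 group, because the baseline hazard h0(t) cancels out. A negative b would give a ratio below 1, i.e. a protective factor.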
Pros and cons for Cox model
Disadvantages:
– the ratio of the hazard functions is CONSTANT OVER TIME!
– no (direct) information on h0(t)
– simple hypotheses only (are two groups different?)
Advantages:
– graphical test: curves of ln(-ln(S(t))) for the groups compared
– can condition on some characteristics
Frailty – a big issue
How do we do this in STATA?
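Survival functions
twoway line S age
Hazard functions
* assumes one row per age, sorted by age, with S the survival function
gen H = -log(S)                  // cumulative hazard
gen h = H[_n] - H[_n-1]          // hazard over each interval
gen logh = log(h)
gen agem = age - 0.5 if h < .    // interval midpoint, only where h is not missing
twoway line logh agem, xtitle("age")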
How do we do this in STATA?
Generally, two approaches:
– From data (nonparametric)
stcox, sts graph
– Assuming something about the hazard distribution (parametric)
streg, stcurve
First we have to declare the data to be in survival form:
– Variable declaring "death" + variable declaring "time"
stset time, failure(death)
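As a minimal sketch of this declaration step, assume the example dataset shipped with Stata (sysuse cancer) in place of the workshop data. Its variables studytime and died play the roles of "time" and "death".

```stata
sysuse cancer, clear

* Time variable first, failure (death) indicator in the option
stset studytime, failure(died)

* stset creates _t (analysis time), _d (failure indicator), _t0 and _st
stdescribe
list studytime died _t _d in 1/5
```

Every st command run afterwards (sts, stcox, streg) reads these settings, which is why none of them takes the time or failure variable as an argument.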
2011-05-12 Master's seminar - class 7
How do we do this in STATA?
Parametrically
stset time, fail(death)
streg all_conditioning_variables, distribution(your_selected_distribution)
Nonparametrically
sts graph                    // Kaplan-Meier
sts graph, by(group)         // Kaplan-Meier by group
sts test group               // Mantel-Haenszel (log-rank)
stcox group                  // Cox
estat phtest, plot(group)    // testing whether we can use the Cox model (stphtest in older Stata versions)
stphplot, by(treated)        // graphical confirmation for the PH test
And that's all, folks
Sample streg
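The original output for this slide is not reproduced here. A comparable run can be sketched on Stata's bundled example data. The dataset (sysuse cancer), the covariates age and drug, and the two distributions compared are all assumptions made for illustration.

```stata
sysuse cancer, clear
stset studytime, failure(died)

* Exponential model: constant baseline hazard
streg age i.drug, distribution(exponential)
estat ic

* Weibull model: baseline hazard may rise or fall with time
* (both models report hazard ratios by default)
streg age i.drug, distribution(weibull)
estat ic

* Fitted hazard from the last model, covariates at their means
stcurve, hazard
```

Comparing the AIC/BIC values from estat ic is one simple way to choose between candidate distributions. It does not tell you whether either one is right.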
Sample stcox
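The original output for this slide is not reproduced here. A minimal sketch of a Cox regression follows, again assuming Stata's bundled sysuse cancer data and its variables age and drug rather than the workshop data.

```stata
sysuse cancer, clear
stset studytime, failure(died)

* Cox proportional-hazards model; output shows hazard ratios exp(b)
stcox age i.drug

* Same model, reporting the raw coefficients b instead of hazard ratios
stcox age i.drug, nohr
```

A hazard ratio below 1 for a drug category means a lower hazard than in the base category at every point in time, holding age fixed.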
Sample proportionality test
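The original output for this slide is not reproduced here. A minimal sketch of the proportionality checks follows, under the same assumptions as before (Stata's bundled sysuse cancer data, covariates age and drug).

```stata
sysuse cancer, clear
stset studytime, failure(died)
stcox age i.drug

* Test of the proportional-hazards assumption based on Schoenfeld residuals
estat phtest, detail

* Graphical check: -ln(-ln(S(t))) against ln(time), one line per group;
* roughly parallel lines are consistent with proportional hazards
stphplot, by(drug)
```

A small p-value in estat phtest is evidence against proportional hazards for that covariate, in which case the single hazard ratio from stcox is hard to interpret.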
Summary
– Not a very sophisticated tool
– How sophisticated the question is depends on us
– With large samples, nonparametric methods have a serious edge to rely on
– If samples are small, parametric methods may be less reliable
– How about "direction of causality" here?
– Do we run the risk of endogeneity bias?