Estimating heterogeneous treatment effects with
right-censored data via causal survival forests ∗
Yifan Cui† Michael R. Kosorok‡ Stefan Wager § Ruoqing Zhu ¶
January 28, 2020
Abstract
There is a fast-growing literature on estimating heterogeneous treatment effects
via random forests in observational studies. However, there are few approaches
available for right-censored survival data. In clinical trials, right-censored
survival data are frequently encountered. Quantifying the causal relationship
between a treatment and the survival outcome is of great interest. Random forests
provide a robust, nonparametric approach to statistical estimation. In addition,
recent developments allow forest-based methods to quantify the uncertainty of the
estimated heterogeneous treatment effects. We propose causal survival forests that
directly target the treatment effect in an observational study. We establish
consistency and asymptotic normality of the proposed estimators and provide an
estimator of the asymptotic variance that enables valid confidence intervals for
the estimated treatment effect. The performance of our approach is demonstrated
via extensive simulations and data from an HIV study.
∗Alphabetical order.
†Department of Statistics, The Wharton School, University of Pennsylvania, 3730 Walnut Street,
Philadelphia, Pennsylvania 19104, USA; email: cuiy@wharton.upenn.edu.
‡Department of Biostatistics, University of North Carolina, 3101 McGavran-Greenberg Hall,
Chapel Hill, North Carolina 27599, USA; email: kosorok@unc.edu.
§Stanford Graduate School of Business, 655 Knight Way, Stanford, California 94035, USA; email:
swager@stanford.edu.
¶Department of Statistics, University of Illinois at Urbana-Champaign, 725 South Wright Street,
Champaign, Illinois 61820, USA; email: rqzhu@illinois.edu.
arXiv:2001.09887v1 [stat.ME] 27 Jan 2020
Keywords: Precision medicine, Random forests, Right-censored data, Heterogeneous
treatment effects, Confidence intervals
1 Introduction
Recently, random forests have been considered to estimate heterogeneous treatment
effects in observational studies. Several examples in statistical and biomedical settings
include Athey et al. (2019), Friedberg et al. (2018), Lu et al. (2018), Kunzel et al.
(2019) and Oprescu et al. (2019). The advantages of forest and tree-based methods,
as described in Breiman (2001) and Athey et al. (2019), are two-fold. First, random
forests provide a robust nonparametric approach for estimating heterogeneous treat-
ment effects. Second, random forests can quantify the uncertainty of the estimated
heterogeneous treatment effects through recent developments (Mentch and Hooker,
2014; Wager and Athey, 2018).
In clinical trials and other biomedical research studies, right-censored survival
data are frequently encountered. The literature on random forests is well developed
for survival data, e.g., Leblanc and Crowley (1993); Hothorn et al. (2006b) studied
survival tree models in the context of conditional inference trees; Hothorn et al. (2006a)
proposed using inverse probability of censoring weighting to compensate for
censoring in forest models; Ishwaran et al. (2008) proposed so-called random survival
forests, which extend random forests to handle survival data by using a log-rank test
at each split of an individual survival tree (Ciampi et al., 1986; Segal, 1988); Zhu
and Kosorok (2012) studied the impact of recursive imputation of survival forests
on model fitting; Steingrimsson et al. (2016) proposed doubly robust survival trees
by constructing doubly robust loss functions that use more information to improve
efficiency; Steingrimsson et al. (2019) constructed censoring unbiased regression trees
and forests by considering a class of censoring unbiased loss functions. However,
none of these methods were directly targeted at heterogeneous treatment effects in
observational studies.
Another topic related to estimating heterogeneous treatment effects is optimal
treatment regimes. A significant amount of work has been devoted to estimating opti-
mal treatment rules in randomized trials with complete data (Murphy, 2003; Qian and
Murphy, 2011; Zhang et al., 2012; Zhao et al., 2012; Laber and Zhao, 2015). Adapt-
ing the outcome weighted learning framework (Zhao et al., 2012), Zhao et al. (2015)
proposed two new approaches, inverse censoring weighted outcome weighted learning,
and doubly robust outcome weighted learning, both of which require semi-parametric
estimation of the conditional censoring probability given the patient characteristics
and treatment choice. Zhu et al. (2017) adopted the accelerated failure time model to
estimate an interpretable single-tree treatment decision rule. Cui et al. (2017a) pro-
posed a random forest approach for right-censored outcome weighted learning, which
avoids both the inverse probability of censoring weighting and restrictive modeling
assumptions. However, for observational studies with censored survival outcomes,
these methods may suffer from confounding and selection bias. In addition, none of
the existing approaches estimates the heterogeneous treatment effect and the associated
confidence interval. Hence, it is challenging to provide valid inference for the
suggested treatment.
To address these limitations in the existing literature, the proposed random forest
approach, namely the causal survival forest, aims at estimating heterogeneous
treatment effects from right-censored observational survival data. Compared with
existing approaches, the proposed causal survival forest enjoys two advantages. First,
our random forest and its associated splitting rules directly target the treatment
effect while adjusting for the bias caused by confounding variables and censoring.
Second, we can provide valid inference for the heterogeneous treatment effect,
which is much needed in the practical implementation of
precision medicine. Our approach is motivated by classical results on semiparametric
efficiency theory and survival analysis (Tsiatis, 2007). We construct local estimating
equations for the conditional average treatment effect (Athey et al., 2019), which leads
to an unbiased splitting rule that addresses the difference between the two potential
treatments.
The proposed approach is quite general in the sense that it includes many missing
data and causal inference problems. The proposed method can deal with coarsening
and missing data as long as the functional of interest admits an unbiased estimating
equation. We consider this broader class to demonstrate the flexibility of our method
for various functionals that are generally of interest. Several functionals, such as the
marginal mean of an outcome subject to missingness and the closely related
marginal mean of a counterfactual outcome, are within our framework. Thus, the
conditional average treatment effect under unconfoundedness can be viewed as a running
example to develop the proposed methodology.
2 Causal survival forests
Our goal is to construct a survival forest model (Ishwaran et al., 2008) that can over-
come the potential bias caused by observational data and right censoring. Suppose
X is a p dimensional covariate, W ∈ {0, 1} is the treatment label, T is the survival
time, and C is the censoring time. As we focus on the observational study setting,
throughout the paper, we make the following three assumptions on the failure time
T for identifiability of the conditional average treatment effect.
Assumption 1. (Consistency) T = T (W ) almost surely.
Assumption 2. (Unconfoundedness) {T (0), T (1)} ⊥ W | X.
Assumption 3. (Positivity) pr(W = w|X = x) > 0 almost surely if f(x) > 0, where
w = 0, 1.
The consistency assumption states that we observe a realization of T (w) only if the
treatment w is equal to a subject’s actual treatment assignment W . This assumption
links the potential outcomes to the observed data because for any failure event, only
one of the two potential outcomes T (1) and T (0) is observed. The unconfoundedness
assumption basically states that conditioning on covariate vector X, treatment W
is independent of potential outcomes. This assumption is satisfied if all prognostic
factors used to determine the treatment label W are recorded in X. Finally, this
positivity assumption essentially states that any subject with an observed value of x
has a positive probability of receiving both values of the treatment.
2.1 Causal forests without censoring
Before discussing treatment effect estimation with censored outcomes, we briefly re-
view the causal forest approach to treatment heterogeneity without censoring. Sup-
pose we want to estimate a treatment effect θ(x) = E[y(T (1)) − y(T (0))|X = x],
where y(T ) is some deterministic transformation of the survival time T . Typical
choices of outcome function include expected thresholded survival time y(T ) = T ∧ τ
and the survivor function y(T ) = 1({T ≥ τ}).
If we knew that the treatment effects were constant, i.e., θ(x) = θ for all x, then
the following estimator $\hat\theta$ due to Robinson (1988) attains $\sqrt{n}$ rates of convergence,
provided the three assumptions detailed above hold and that we estimate nuisance
components sufficiently fast (Robins et al., 2017; Chernozhukov et al., 2018):
$$
\begin{aligned}
&\sum_{i=1}^{n} \psi^{(c)}_{\hat\theta}(X_i, T_i, W_i; \hat e, \hat m) = 0, \\
&\psi^{(c)}_{\theta}(X_i, T_i, W_i; \hat e, \hat m) = (W_i - \hat e(X_i))\,\bigl(y(T_i) - \hat m(X_i) - \theta\,(W_i - \hat e(X_i))\bigr),
\end{aligned} \tag{1}
$$
where $e(x) = \mathrm{pr}[W = 1 \mid X = x]$, $m(x) = E[y(T) \mid X = x]$, and $\hat e(X_i)$ and $\hat m(X_i)$
are estimates of these quantities derived via cross-fitting (Schick, 1986). We use the
superscript (c) to remind ourselves that this estimator requires access to the complete
(uncensored) data.
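Because the score in (1) is linear in θ, the estimating equation has a closed-form solution: the regression of outcome residuals on treatment residuals. The following is a minimal sketch of that calculation, not the authors' implementation. The function name `robinson_theta` and the toy data-generating process are illustrative assumptions, and the nuisance values are taken as given (oracle values here) rather than cross-fitted.

```python
import numpy as np

def robinson_theta(y, w, e_hat, m_hat):
    """Solve sum_i (W_i - e_i) * (y_i - m_i - theta * (W_i - e_i)) = 0 for theta."""
    w_res = w - e_hat   # treatment residual W_i - e(X_i)
    y_res = y - m_hat   # outcome residual y(T_i) - m(X_i)
    return float(np.sum(w_res * y_res) / np.sum(w_res ** 2))

# Toy check with a known constant effect (hypothetical data, oracle nuisances).
rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(size=n)
e = 0.3 + 0.4 * x                       # propensity score e(x)
w = rng.binomial(1, e)
theta_true = 2.0
y = x + theta_true * w + rng.normal(scale=0.5, size=n)
m = x + theta_true * e                  # m(x) = E[y | X = x]
theta_hat = robinson_theta(y, w, e, m)
```

In practice `e_hat` and `m_hat` would come from cross-fitted learners, as the text describes; the closed form itself is unchanged.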
Here, however, our goal is not to estimate a constant treatment effect θ, but
rather to fit covariate-dependent treatment heterogeneity θ(x). For this purpose, we
use forests. As background, recall that given a target point x, tree-based methods
seek to find training examples that are close to x and use local kernel weights to
obtain a weighted averaging estimator. An essential ingredient of tree-based methods
is recursive partitioning on the covariate space X , which induces the local weighting.
When the splitting variables are adaptively chosen, the width of a leaf can be narrower
along the directions where the causal effect is changing faster. After the tree fitting is
completed, the closest points to x are those that fall into the same terminal node as
x. The observations that fall into the same node as the target point x can be treated
asymptotically as coming from a homogeneous group.
Athey et al. (2019) generalize the use of random forest-based weights to generic kernel
estimation. The most closely related precedents from the perspective of adaptive
nearest neighbor estimation are quantile regression forests (Meinshausen, 2006) and
bagging survival trees (Hothorn et al., 2004), which can be viewed as special cases
of generalized random forests. The idea of adaptive nearest neighbors also underlies
theoretical analyses of random forests such as Lin and Jeon (2006); Biau et al. (2008);
Arlot and Genuer (2014). The random forest-based weights αi are derived from the
fraction of trees in which an observation appears in the same terminal node as the
target point. Specifically, given a test point x, the weights αi(x) are the frequency
with which the i-th training example falls in the same leaf as x, i.e.,
$$
\alpha_i(x) = \frac{1}{B} \sum_{b=1}^{B} \frac{I\{X_i \in N_b(x)\}}{|N_b(x)|},
$$
where Nb(x) is the terminal node that contains x in the b-th tree, and | · | denotes
the cardinality.
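The weight formula can be computed directly from leaf memberships. The sketch below assumes that each tree has already assigned every training point and the test point x to a leaf id; the function name `forest_weights` and the array layout are hypothetical, and subsampling and honesty are ignored for simplicity.

```python
import numpy as np

def forest_weights(train_leaves, test_leaves):
    """Forest weights alpha_i(x).

    train_leaves: (B, n) integer array, leaf id of each training point in each tree.
    test_leaves:  (B,) integer array, leaf id of the test point x in each tree.
    """
    train_leaves = np.asarray(train_leaves)
    test_leaves = np.asarray(test_leaves)
    same_leaf = train_leaves == test_leaves[:, None]      # I{X_i in N_b(x)}
    leaf_sizes = same_leaf.sum(axis=1, keepdims=True)     # |N_b(x)|
    # A tree whose leaf holds no training point contributes zero weight.
    per_tree = np.divide(same_leaf, leaf_sizes,
                         out=np.zeros(same_leaf.shape), where=leaf_sizes > 0)
    return per_tree.mean(axis=0)                          # average over the B trees

# Two trees, four training points.
alpha = forest_weights([[0, 0, 1, 1], [0, 1, 1, 1]], [0, 1])
```

When every tree places at least one training point in the leaf of x, the weights are non-negative and sum to one, which is what allows them to act as a kernel.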
The crux of the causal forest algorithm presented in Athey et al. (2019) is to
pair the kernel-based view of forests with Robinson’s estimating equation (1). Causal
forests seek to grow a forest such that the resulting weighting function αi(x) can
be used to express heterogeneity in θ(·), meaning that θ(·) is roughly constant over
observations given positive weight αi(x) for predicting at x. Then, we estimate θ(x)
by solving a localized version of (1):
$$
\sum_{i=1}^{n} \alpha_i(x)\, \psi^{(c)}_{\hat\theta(x)}(X_i, T_i, W_i; \hat e, \hat m) = 0. \tag{2}
$$
For further details, including the choice of a splitting rule targeted for treatment
heterogeneity, see Athey et al. (2019); further examples are given in Athey and Wager
(2019). The idea of using Robinson’s transformation to fit treatment heterogeneity
is further explored by Nie and Wager (2017).
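Since the complete-data score is linear in θ, the localized equation (2) also has a closed form: a weighted residual-on-residual regression with the forest weights. A minimal sketch follows, in which `local_theta` is a hypothetical helper and the weights and nuisance values are taken as given.

```python
import numpy as np

def local_theta(alpha, y, w, e_hat, m_hat):
    """Solve sum_i alpha_i * (W_i - e_i) * (y_i - m_i - theta * (W_i - e_i)) = 0."""
    alpha = np.asarray(alpha, dtype=float)
    w_res = np.asarray(w, dtype=float) - e_hat
    y_res = np.asarray(y, dtype=float) - m_hat
    return float(np.sum(alpha * w_res * y_res) / np.sum(alpha * w_res ** 2))

# Toy data: the first two points have effect 2, the last two have effect 4.
y = np.array([1.0, 3.0, 2.0, 6.0])
w = np.array([0, 1, 0, 1])
e_hat = np.full(4, 0.5)
m_hat = np.array([2.0, 2.0, 4.0, 4.0])
theta_near_first = local_theta([0.5, 0.5, 0.0, 0.0], y, w, e_hat, m_hat)
theta_near_last = local_theta([0.0, 0.0, 0.5, 0.5], y, w, e_hat, m_hat)
```

Shifting the weights from one group of observations to the other changes the estimate from 2 to 4, which is how the forest weights express heterogeneity in θ(x).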
2.2 Adjusting for censoring
The central goal of this paper is to develop a causal forest algorithm that can be
used despite censoring, i.e., despite the fact that we sometimes do not observe T,
and instead only observe U = T ∧ C together with the event indicator ∆ = 1(T ≤ C).
Throughout, we assume that T is the survival time up to a fixed maximum follow-up
time τ and that the censoring is noninformative
conditionally on X (Fleming and Harrington, 2011).
Assumption 4. (Conditionally independent censoring) T ⊥ C | X.
Our approach builds on the following recipe for making estimating equations robust
to censoring, described in Tsiatis (2007, Chapter 10.4). If the true value θ
of our parameter of interest is identified by a complete data estimating equation,
$E[\psi^{(c)}_\theta(X_i, T_i, W_i)] = 0$, then θ is also identified via the following estimating
equation, which generalizes the celebrated augmented inverse-propensity weighting estimator
of Robins et al. (1994): $E[\psi_\theta(X_i, U_i, W_i, \Delta_i)] = 0$ with
$$
\begin{aligned}
\psi_\theta(X_i, U_i, W_i, \Delta_i)
&= \frac{\Delta_i\, \psi^{(c)}_\theta(X_i, U_i, W_i)}{S^C(U_i \mid X_i)}
 + \frac{(1 - \Delta_i)\, E[\psi^{(c)}_\theta(X_i, T_i, W_i) \mid T_i \ge U_i, X_i]}{S^C(U_i \mid X_i)} \\
&\quad - \int_0^{U_i} \frac{\lambda^C(s \mid X_i)}{S^C(s \mid X_i)}\,
   E[\psi^{(c)}_\theta(X_i, T_i, W_i) \mid T_i \ge s, X_i]\, ds,
\end{aligned} \tag{3}
$$
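where $S^C(s \mid x) = \mathrm{pr}[C_i \ge s \mid X_i = x]$ is the conditional survival function for the
censoring process, and $\lambda^C(s \mid x) = -\frac{d}{ds} \log S^C(s \mid x)$ is the associated conditional
hazard function. In our case, the specific form of our complete data estimating
equation enables us to simplify this expression, resulting in
$$
\begin{aligned}
&\psi_\theta(X_i, U_i, W_i, \Delta_i; \hat e, \hat m, \hat\lambda^C, \hat S^C, \hat Q) \\
&\quad = \Biggl( \frac{\hat Q_{W_i}(U_i \mid X_i) + \Delta_i\,[y(T_i) - \hat Q_{W_i}(U_i \mid X_i)] - \hat m(X_i) - \theta\,(W_i - \hat e(X_i))}{\hat S^C_{W_i}(U_i \mid X_i)} \\
&\qquad - \int_0^{U_i} \frac{\hat\lambda^C_{W_i}(s \mid X_i)}{\hat S^C_{W_i}(s \mid X_i)}\,
   \bigl[\hat Q_{W_i}(s \mid X_i) - \hat m(X_i) - \theta\,(W_i - \hat e(X_i))\bigr]\, ds \Biggr)\,(W_i - \hat e(X_i)),
\end{aligned} \tag{4}
$$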
where $Q_{W_i}(t \mid X_i) = E[T_i \mid X_i, W_i, T_i > t]$, and $\hat S^C_{W_i}(s \mid x)$ and $\hat\lambda^C_{W_i}(s \mid x)$ are the
estimated conditional survival function and hazard function of the censoring time,
respectively. The associated estimator $\hat\theta$ is characterized by
$\sum_{i=1}^{n} \psi_{\hat\theta}(X_i, U_i, W_i, \Delta_i; \hat e, \hat m, \hat\lambda^C, \hat S^C, \hat Q) = 0$. This estimator
attains $\sqrt{n}$ rates for θ under the setting of Chernozhukov et al. (2018), i.e., with
cross-fitting and 4-th root rates for the nuisance components, provided Assumptions 1-3
and the conditionally independent censoring assumption (Assumption 4) hold.
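As a worked step, the score (4) is linear in θ, so the estimating equation can be solved in closed form. The derivation below is a sketch; the shorthand symbols $r_i$, $A_i$ and $B_i$ are introduced here for illustration only and do not appear in the original text.

```latex
\begin{align*}
% Shorthand (illustrative notation):
r_i &= W_i - \hat e(X_i), \\
A_i &= \frac{\hat Q_{W_i}(U_i \mid X_i) + \Delta_i\,[y(T_i) - \hat Q_{W_i}(U_i \mid X_i)] - \hat m(X_i)}
            {\hat S^C_{W_i}(U_i \mid X_i)}
      - \int_0^{U_i} \frac{\hat\lambda^C_{W_i}(s \mid X_i)}{\hat S^C_{W_i}(s \mid X_i)}\,
        [\hat Q_{W_i}(s \mid X_i) - \hat m(X_i)]\, ds, \\
B_i &= \frac{1}{\hat S^C_{W_i}(U_i \mid X_i)}
      - \int_0^{U_i} \frac{\hat\lambda^C_{W_i}(s \mid X_i)}{\hat S^C_{W_i}(s \mid X_i)}\, ds. \\
% Collecting the terms of (4) that multiply theta:
\psi_\theta &= r_i A_i - \theta\, r_i^2 B_i
\quad\Longrightarrow\quad
\hat\theta = \frac{\sum_{i=1}^{n} r_i A_i}{\sum_{i=1}^{n} r_i^2 B_i}. \\
% If lambda^C = -(d/ds) log S^C exactly and S^C(0) = 1, then (d/ds)(1/S^C) = lambda^C / S^C, so
\int_0^{U} \frac{\lambda^C(s \mid X)}{S^C(s \mid X)}\, ds &= \frac{1}{S^C(U \mid X)} - 1
\quad\Longrightarrow\quad B_i = 1.
\end{align*}
```

The last line holds only when the hazard and survival function are exactly compatible; with separately estimated $\hat\lambda^C$ and $\hat S^C$, the factor $B_i$ need not equal one. The same factor reappears in the normalizing term of the pseudo-outcomes and in V(x).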
2.3 Proposed causal survival forests
We now return to our main proposal, i.e., estimating heterogeneous treatment effects
using causal survival forests. Following our discussion above, we proceed as follows.
First, we estimate the nuisance components $e, m, \lambda^C, S^C, Q$ required to form the
score (4). Then, however, instead of estimating a constant parameter, we pair this
estimating equation with the forest weighting scheme (2), resulting in estimates $\hat\theta(x)$
characterized by
$$
\sum_{i=1}^{n} \alpha_i(x)\, \psi_{\hat\theta(x)}(X_i, U_i, W_i, \Delta_i; \hat e, \hat m, \hat\lambda^C, \hat S^C, \hat Q) = 0. \tag{5}
$$
In order to use this estimator, we of course need to specify how to grow the forest,
so that the resulting forest weights αi(x) adequately express heterogeneity in the
underlying signal θ(x). Here, for the splitting rule, we use the ∆-criterion proposed
in Athey et al. (2019). In particular, we generate pseudo-outcomes by the following
relabeling strategy at each internal node.
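$$
\rho_i = \psi^{i}_{\theta(x)} \times
\left\{ \sum_{j=1}^{l} (W_j - \hat e(X_j))^2
\left( \frac{1}{\hat S^C_{W_j}(U_j \mid X_j)}
 - \int_0^{U_j} \frac{\hat\lambda^C_{W_j}(t \mid X_j)}{\hat S^C_{W_j}(t \mid X_j)}\, dt \right) \right\}^{-1},
$$
where the sum runs over the $l$ observations in the node and $\psi^{i}_{\theta(x)}$ is shorthand for
$\psi_{\theta(x)}(X_i, U_i, W_i, \Delta_i; \hat e, \hat m, \hat\lambda^C, \hat S^C, \hat Q)$. Next, the splitting
criterion proceeds exactly as in a regression tree (Breiman et al., 1984)
problem, treating the pseudo-outcomes $\rho_i$ as a continuous outcome variable.
Specifically, we split the parent node into two child nodes $N_L$ and $N_R$ so as to
maximize the following quantity:
$$
\Delta(L, R) = \frac{1}{|\{i : X_i \in N_L\}|} \Bigl( \sum_{i : X_i \in N_L} \rho_i \Bigr)^2
 + \frac{1}{|\{i : X_i \in N_R\}|} \Bigl( \sum_{i : X_i \in N_R} \rho_i \Bigr)^2 .
$$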
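The splitting criterion can be evaluated for every candidate cut point of a single covariate with one pass over the sorted pseudo-outcomes. The sketch below assumes the pseudo-outcomes ρ_i have already been computed for the parent node; `best_split` and `min_leaf` are hypothetical names, and the search over several covariates and random split selection are omitted.

```python
import numpy as np

def best_split(x, rho, min_leaf=1):
    """Return (threshold, criterion) maximizing Delta(L, R) for one covariate."""
    order = np.argsort(x, kind="stable")
    x_sorted = np.asarray(x, dtype=float)[order]
    rho_sorted = np.asarray(rho, dtype=float)[order]
    n = len(x_sorted)
    left_sum = np.cumsum(rho_sorted)[:-1]     # sum of rho over the first k points
    total = rho_sorted.sum()
    n_left = np.arange(1, n)
    n_right = n - n_left
    crit = left_sum ** 2 / n_left + (total - left_sum) ** 2 / n_right
    # Only cut between distinct covariate values and respect the minimum leaf size.
    valid = (x_sorted[1:] > x_sorted[:-1]) & (n_left >= min_leaf) & (n_right >= min_leaf)
    if not valid.any():
        return None, float("-inf")
    crit = np.where(valid, crit, -np.inf)
    k = int(np.argmax(crit))
    threshold = 0.5 * (x_sorted[k] + x_sorted[k + 1])   # midpoint between neighbours
    return threshold, float(crit[k])

# Pseudo-outcomes change sign between x = 0.2 and x = 0.8.
threshold, criterion = best_split([0.1, 0.2, 0.8, 0.9], [-1.0, -1.0, 1.0, 1.0])
```

In this toy example the criterion is largest for the cut that separates the negative from the positive pseudo-outcomes, which mirrors the goal of placing splits where the treatment effect changes.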
2.4 Confidence intervals for the estimated treatment effects
Establishing a confidence interval allows proper statistical inference on the suggested
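treatment strategy. To build asymptotically valid confidence intervals for $\theta(x)$
centered on $\hat\theta(x)$, it suffices to derive an estimator for $\mathrm{var}\{\hat\theta(x)\}$. As shown in Lemma 3.1
and Theorem 3.2 in Section 3, it is enough to study the variance of $\theta^*$, where
$$
\theta^*(x) = \theta(x) + \sum_{i=1}^{n} \alpha_i(x)\, \rho^*_i(x),
$$
and $\rho^*_i(x)$ is the influence function of the i-th observation with respect to the true
parameter value $\theta(x)$, i.e.,
$$
\rho^*_i(x) = \psi^{i}_{\theta(x)}\, V(x)^{-1},
$$
and
$$
V(x) = E\left[ \{W - e(X)\}^2 \left\{ \frac{1}{S^C_W(U \mid X)}
 - \int_0^{U} \frac{\lambda^C_W(t \mid X)}{S^C_W(t \mid X)}\, dt \right\} \,\middle|\, X = x \right].
$$
It follows that
$$
\mathrm{var}\{\theta^*(x)\} = \mathrm{var}\left\{ \sum_{i=1}^{n} \alpha_i(x)\, \psi^{i}_{\theta(x)} \right\} V(x)^{-2}.
$$
We estimate the above variance by
$$
\hat\sigma_n^2 = H_n(x)\, V_n(x)^{-2},
$$
where $V_n(x)$ and $H_n(x)$ are consistent estimators of $V(x)$ and $\mathrm{var}\{\sum_{i=1}^{n} \alpha_i(x)\, \psi^{i}_{\theta(x)}\}$,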
respectively. Many strategies are available for estimating V (x). We estimate it by
fitting honest and regular regression forests. To obtain a valid estimator $H_n(x)$,
notice that the term $\sum_{i=1}^{n} \alpha_i(x)\, \psi^{i}_{\theta(x)}$ is equivalent to the output of a regression forest
with weights $\alpha_i(x)$ and effective outcomes $\psi^{i}_{\theta(x)}$. There are many methods which have
been developed to estimate the variance of a regression forest, including work by
Sexton and Laake (2009); Wager et al. (2014); Mentch and Hooker (2016); Wager and
Athey (2018); Athey et al. (2019). We follow Athey et al. (2019) and use bootstrap
of little bags in our implementation.
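Once estimates of the two variance components are available, the pointwise interval follows from the plug-in variance H_n(x) V_n(x)^{-2} and a normal quantile. The sketch below only illustrates this last step; `pointwise_ci` is a hypothetical helper, the numerical inputs are made up, and the estimation of H_n(x) by the bootstrap of little bags is not shown.

```python
import math
from statistics import NormalDist

def pointwise_ci(theta_hat, h_n, v_n, level=0.95):
    """Normal-approximation interval using sigma^2 = H_n(x) / V_n(x)^2."""
    sigma = math.sqrt(h_n / v_n ** 2)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.96 for a 95% interval
    return theta_hat - z * sigma, theta_hat + z * sigma

# Hypothetical values: estimate 0.1, H_n = 0.0004, V_n = 0.5, so sigma = 0.04.
lower, upper = pointwise_ci(0.1, 0.0004, 0.5)
```

The interval is symmetric around the point estimate, and its width scales with the inverse of V_n(x), so small values of V_n(x) lead to wide intervals.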
3 Theoretical results
In this section, we study the asymptotic normality of the estimated treatment effect
$\hat\theta(x)$. Throughout this section, we assume that the covariates $X \in [0, 1]^p$ are
distributed according to a density that is bounded away from zero and infinity. The
following assumption guarantees the smoothness of $E(\psi_{\theta(x)} \mid X = x)$.
Assumption 5. (Lipschitz continuity) The treatment effect function θ(x) is L′-Lipschitz
continuous in terms of x. In addition, the propensity score e(x), hazard
function $\lambda^C(t \mid X = x)$ and conditional survival function $S^C(t \mid X = x)$ are Lipschitz
continuous in terms of x.
In addition, our trees are symmetric, i.e., their output is invariant to permuting
the indices of training samples. Our algorithm also guarantees honesty (Wager and
Athey, 2018), and the following two conditions.
Random split tree: At each internal node, the probability of splitting at the j-th
dimension is greater than ς, where 0 < ς < 1 for j = 1, · · · , p.
Subsampling: Each child node contains at least a fraction w of the data points in
its parent node for some 0 < w < 0.5, and trees are grown on subsamples of size s
scaling as $s = n^{\gamma}$, where $\kappa < \gamma < 1$ and
$$
\kappa \equiv 1 - \left[ 1 + \varsigma^{-1}\, \frac{\log(w^{-1})}{\log\{(1 - w)^{-1}\}} \right]^{-1}.
$$
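To get a feel for how restrictive the subsampling condition is, the lower bound κ can be evaluated numerically. This is a small arithmetic sketch; the function name `kappa_lower_bound` and the example values of ς and w are illustrative choices, not values used in the paper.

```python
import math

def kappa_lower_bound(split_prob, w):
    """kappa = 1 - [1 + (1 / split_prob) * log(1 / w) / log(1 / (1 - w))]^(-1)."""
    ratio = math.log(1.0 / w) / math.log(1.0 / (1.0 - w))
    return 1.0 - 1.0 / (1.0 + ratio / split_prob)

# Example: split probability bound 0.1 and minimum child fraction 0.2.
kappa = kappa_lower_bound(split_prob=0.1, w=0.2)
# Any exponent gamma with kappa < gamma < 1 gives an admissible subsample size n ** gamma.
gamma = (kappa + 1.0) / 2.0
subsample_size = 600 ** gamma
```

For these example values κ is close to one, so the theory asks for subsamples that are nearly as large as the full sample.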
Furthermore, we need the following Assumptions 6-9 to couple $\hat\theta(x)$ and $\tilde\theta(x)$,
where $\tilde\theta(x)$ is an oracle estimator, with $e(x)$, $m(x)$, $\lambda^C(t \mid X = x)$, and $S^C(t \mid X = x)$
being the underlying truth. Assumptions 6-8 are commonly assumed in the causal
inference and survival analysis literature.
Assumption 6. The failure time and censoring time are bounded. The density func-
tions of the failure time T and censoring time C are both bounded above for all x ∈ X .
Assumption 7. Any individual has a positive probability of receiving both treatments.
Furthermore, 1−M1 < e(x) < M1 for any x ∈ X , and some M1 ∈ (1/2, 1).
Assumption 8. There exists a fixed positive constant M2 ∈ (0, 1), such that P (U ≥
τ |X = x) > M2 for all x ∈ X .
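Assumption 9. Consistency of the non-parametric plug-in estimators: we have the
following convergences in probability,
$$
\begin{aligned}
&|\hat e(x) - e(x)| \to 0, \qquad
 \sup_{t < \tau,\, w \in \{0,1\}} |\hat\lambda^C_w(t \mid x) - \lambda^C_w(t \mid x)| \to 0, \\
&\sup_{t < \tau,\, w \in \{0,1\}} |\hat S^C_w(t \mid x) - S^C_w(t \mid x)| \to 0, \qquad
 \sup_{t < \tau,\, w \in \{0,1\}} |\hat S^T_w(t \mid x) - S^T_w(t \mid x)| \to 0,
\end{aligned}
$$
for each x ∈ X. Furthermore, we assume that for each x ∈ X, $|\hat e(x) - e(x)| = O_p(b_n)$,
and for both W = 0, 1, $\sup_t |\hat S^C_W(t \mid x) - S^C_W(t \mid x)| = O_p(c_n)$,
$\sup_t |\hat S^T_W(t \mid x) - S^T_W(t \mid x)| = O_p(c_n)$, and
$\sup_t |\hat\lambda^C_W(t \mid x) - \lambda^C_W(t \mid x)| = O_p(d_n)$.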
Assumption 9 is quite general. Biau (2012); Wager and Walther (2015) show that
for the random forest models, $b_n$ can be faster than $n^{-2/(p+2)}$ as long as the intrinsic
signal dimension is less than 0.54p. As shown in Cui et al. (2017b), $c_n = n^{-1/(p+2)}$ is
achievable for survival forest models. Nonparametric kernel smoothing methods such
as Sun et al. (2019) provide estimation with $d_n = n^{-1/2+\kappa}$, where $\kappa < 1/2$ depends
on the dimension p.
Consequently, from Lemma in the Supplementary Material, we have for each x,
$|\hat m(x) - m(x)| = O_p(c_n)$ and $\sup_t |\hat Q_W(t, x) - Q_W(t, x)| = O_p(c_n)$ for W = 0, 1. The
following lemma provides an intermediate result for our main theorem.
Lemma 3.1. Suppose Assumptions 6-9 hold and $O_p\{kB \max(b_n, c_n, d_n)\}$ converges
to zero, where k is the minimum terminal node size and B is the number of
trees fitted in the forest model. Then for any x ∈ X, we have that
$|\hat\theta(x) - \tilde\theta(x)| = O_p\{kB \max(b_n, c_n, d_n)\}$.
The proof of Lemma 3.1 is collected in the Supplementary Material. The technical
results in Wager and Athey (2018); Athey et al. (2019) paired with Lemma 3.1 lead
to the following asymptotic Gaussianity result.
Theorem 3.2. Suppose Assumptions 5-9 hold and $O_p\{kB \max(b_n, c_n, d_n)\}$ converges to zero
faster than $\mathrm{polylog}(n/s)^{-1/2} (s/n)^{1/2}$, where polylog(n/s) is a function that is bounded
away from 0 and increases at most polynomially with the log-inverse sampling ratio
log(n/s). Then there exists a sequence $\sigma_n(x)$ such that for any x ∈ X,
$$
\{\hat\theta(x) - \theta(x)\} / \sigma_n(x) \to N(0, 1),
$$
where $\sigma_n^2(x) = \mathrm{polylog}(n/s)^{-1}\, s/n$.
The proof of Theorem 3.2 is deferred to the Appendix. This asymptotic Gaus-
sianity result yields valid asymptotic confidence intervals for the true treatment effect
θ(x).
4 Simulation studies
We perform simulation studies to compare the proposed method with existing alter-
natives, including the Cox proportional hazards model using covariates (X,W,XW ),
random survival forests using covariates (X,W ), (X,W,XW ), respectively, and ran-
dom survival forests fitted on treatment arm W separately to learn the optimal treat-
ment decision. Note that the last method essentially mimics the virtual twin method
(Foster et al., 2011) applied to observational survival data. There are many existing
implementations of random survival forests, including R packages randomForestSRC
(Ishwaran and Kogalur, 2019), party (Hothorn et al., 2006b), ranger (Wright and
Ziegler, 2017), RLT (Zhu, 2018), etc. However, to streamline our presentation and
highlight the causal approach in observational studies, we only compare with Ish-
waran and Kogalur (2019) to demonstrate the strength of the proposed causal survival
forests among the survival forests designed for randomized experiments.
For each of the simulation settings, the optimal treatment assignment was learned
based on the estimated treatment effects from a training dataset with sample size n =
600. A testing dataset with size 1000 was used to calculate the value function under
the estimated rule. Each simulation was repeated 500 times. Tuning parameters
need to be chosen for forest-based methods. The minimal number of observations
in each terminal node was chosen from 15 (Ishwaran and Kogalur, 2019), $\lceil p^{1/2} \rceil$, and
$\lceil \log(p) \rceil$, where $\lceil x \rceil$ denotes the least integer greater than or equal to x. The number
of variables available for splitting at each tree node was chosen from $\lceil p/3 \rceil$, $\lceil 2p/3 \rceil$
and $\lceil p^{1/2} \rceil$. The total number of trees was set to 500.
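For concreteness, the candidate tuning values described above can be tabulated for a given number of covariates p. The sketch below assumes the natural logarithm for log(p); the function name `tuning_grid` and the dictionary keys are hypothetical labels rather than arguments of any particular package.

```python
import math

def tuning_grid(p):
    """Candidate tuning values as functions of the number of covariates p."""
    node_sizes = [15, math.ceil(math.sqrt(p)), math.ceil(math.log(p))]
    split_vars = [math.ceil(p / 3), math.ceil(2 * p / 3), math.ceil(math.sqrt(p))]
    return {"min_node_size": node_sizes, "num_split_vars": split_vars, "num_trees": 500}

# With p = 5 covariates, as in the simulation settings.
grid = tuning_grid(5)
```

With p = 5 the candidate node sizes are 15, 3 and 2, and the candidate numbers of splitting variables are 2, 4 and 3.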
4.1 Simulation settings
We considered the following four scenarios. For each scenario, we generated covariates
independently from a uniform distribution on $[0, 1]^5$. In the first scenario, T was
generated from an accelerated failure time model, and C was generated from a Cox
model,
$$
\begin{aligned}
\log(T) &= -1.85 - 0.8\, I(X_{(1)} < 0.5) + 0.7\, X_{(2)}^{1/2} + 0.2\, X_{(3)}
          + \bigl(0.7 - 0.4\, I(X_{(1)} < 0.5) - 0.4\, X_{(2)}^{1/2}\bigr) W + \varepsilon, \\
\lambda^C(t \mid W, X) &= \lambda_0(t) \exp\bigl\{-1.75 - 0.5\, X_{(2)}^{1/2} + 0.2\, X_{(3)}
          + \bigl(1.15 + 0.5\, I(X_{(1)} < 0.5) - 0.3\, X_{(2)}^{1/2}\bigr) W\bigr\},
\end{aligned}
$$
where the baseline hazard function λ0(t) = 2t, and ε followed a standard normal distri-
bution. The follow-up time τ = 1.5, and propensity score e(x) = (1 + β(x(1); 2, 4))/4,
where β(·; a, b) is the density function of Beta distribution with shape parameters a
and b.
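The first scenario can be simulated directly from its definition. The sketch below is one possible reading of the setup, not the authors' simulation code: the censoring time is drawn by inverting the Cox survival function (the baseline hazard 2t gives cumulative hazard t² exp(η)), the survival time is truncated at τ as an assumption about how the follow-up limit is applied, and the Beta(2, 4) density is written out explicitly as 20 x (1 − x)³. The function name `simulate_scenario1` is hypothetical.

```python
import numpy as np

def simulate_scenario1(n, seed=0, tau=1.5):
    """Draw (X, W, U, delta, e) following the first simulation scenario."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, 5))
    x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]
    ind = (x1 < 0.5).astype(float)
    # Propensity score (1 + Beta(2, 4) density at x1) / 4, with B(2, 4) = 1/20.
    e = (1.0 + 20.0 * x1 * (1.0 - x1) ** 3) / 4.0
    W = rng.binomial(1, e)
    # Accelerated failure time model for T.
    log_T = (-1.85 - 0.8 * ind + 0.7 * np.sqrt(x2) + 0.2 * x3
             + (0.7 - 0.4 * ind - 0.4 * np.sqrt(x2)) * W + rng.normal(size=n))
    T = np.minimum(np.exp(log_T), tau)      # assumption: truncate at follow-up time tau
    # Cox model for C: S_C(t) = exp(-t^2 * exp(eta)); invert at a uniform draw V.
    eta = (-1.75 - 0.5 * np.sqrt(x2) + 0.2 * x3
           + (1.15 + 0.5 * ind - 0.3 * np.sqrt(x2)) * W)
    V = 1.0 - rng.uniform(size=n)           # lies in (0, 1], avoids log(0)
    C = np.sqrt(-np.log(V) * np.exp(-eta))
    U = np.minimum(T, C)                    # observed time
    delta = (T <= C).astype(int)            # event indicator
    return X, W, U, delta, e

X, W, U, delta, e = simulate_scenario1(600)
```

The propensity score stays between 0.25 and roughly 0.78 in this design, so the positivity assumption holds with a comfortable margin.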
In the second scenario, T was generated from a proportional hazards model with
a non-linear structure, and C was generated from an accelerated failure time model,
$$
\begin{aligned}
\lambda^T(t \mid W, X) &= \lambda_0(t) \exp\{X_{(1)} + (-0.4 + X_{(2)})\, W\}, \\
\log(C) &= X_{(1)} - X_{(3)} W + \varepsilon,
\end{aligned}
$$
where the baseline hazard function $\lambda_0(t) = \tfrac{1}{2} t^{-1/2}$, and ε followed a standard normal
distribution. The maximum follow-up time τ = 2, and propensity score e(x) =
(1 + β(x(2); 2, 4))/4.
In the third scenario, T was generated from a Poisson distribution with mean
$X_{(2)}^2 + X_{(3)} + 6 + 2(X_{(1)}^{1/2} - 0.3) W$, and C was generated from a Poisson distribution with
mean $12 + \log\{1 + \exp(X_{(3)})\}$. The maximum follow-up time τ = 15, and propensity
score e(x) = (1 + β(x(1); 2, 4))/4.
In the fourth scenario, T was generated from a Poisson distribution with mean
$\max(0, X_{(2)} + X_{(3)}) + \max(0, X_{(1)} - 0.3) W$, and C was generated from a Poisson
distribution with mean $1 + \log\{1 + \exp(X_{(3)})\}$. The maximum follow-up time τ = 5, and
propensity score $e(x) = [\{1 + \exp(-x_{(1)})\}\{1 + \exp(-x_{(2)})\}]^{-1}$. Note that for subjects
with $X_{(1)} < 0.3$, treatment does not affect survival time, and thus both w = 0, 1 are
defined as optimal treatment.
In addition, we evaluated the coverage of the proposed 95% confidence intervals
at points $x_1 = (0.1, \ldots, 0.1)^T$, $x_2 = (0.3, \ldots, 0.3)^T$, $x_3 = (0.5, \ldots, 0.5)^T$,
$x_4 = (0.7, \ldots, 0.7)^T$, $x_5 = (0.9, \ldots, 0.9)^T$ for the above four scenarios, respectively. The
true treatment effect was estimated by the Monte Carlo method with sample size
100000. We used the default tuning parameters: the minimal number of observations
in each terminal node was set to $\lceil p^{1/2} \rceil$, and the number of variables available for
splitting at each tree node was set to $\lceil 2p/3 \rceil$. In order to obtain valid confidence
intervals, we followed the suggestion of Athey et al. (2019) and fit a large enough
number of trees B = 100000. Each simulation was repeated 500 times. The numerical
results are summarized in Table 1.
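The coverage figures are simple proportions over the repeated simulations: the share of replications whose interval contains the true effect at a given test point. A minimal sketch follows, with a hypothetical helper `coverage_percent` and made-up interval endpoints.

```python
import numpy as np

def coverage_percent(lower, upper, truth):
    """Percentage of intervals [lower_r, upper_r] that contain the true effect."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    covered = (lower <= truth) & (truth <= upper)
    return 100.0 * covered.mean()

# Four hypothetical replications at one test point with true effect 0.35.
cov = coverage_percent([0.0, 0.2, -0.1, 0.4], [1.0, 0.6, 0.3, 0.9], truth=0.35)
```

With 500 replications, the Monte Carlo standard error of a coverage near 95% is about one percentage point (the square root of 0.95 × 0.05 / 500), which is useful to keep in mind when reading the reported coverages.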
4.2 Simulation results
Figures 1a-1d show the boxplots of correct classification rates in test samples with vir-
tual twins as the reference level. In all figures, “VT” denotes the virtual twin method
with random survival forests; “SRC1” denotes random survival forests using covari-
ates (X,W ); “SRC2” denotes random survival forests using covariates (X,W,XW );
“Cox” denotes Cox proportional hazard model using covariates (X,W,XW ); “CSF”
denotes the proposed causal survival forests.
In Scenarios 1, 3 and 4, the proposed causal survival forest achieves the best
performance among all competing methods. In Scenario 2, because the true model for
the failure time is the Cox model, it is not surprising that the Cox model performs the
best here. Our estimated treatment rule performs better than other random survival
forest approaches. In addition, because the proposed method requires the estimation
of nuisances and plug-in quantities, the standard deviations of the proposed causal
survival forests are slightly larger than other forest-based methods.
Overall, the proposed causal survival forest is superior to ordinary random survival
forests which do not target the causal parameter directly. The reason is that the
proposed forests directly model the causal effect of the treatment on the survival
time. Thus, more splits are placed on the covariates interacting with W rather than
on the covariates that appear only in the main effect. Furthermore, as shown in
Table 1, the proposed confidence intervals have relatively good coverage at different
testing points in various settings.
[Figure 1: Correct classification rate of different methods. Panels: (a) Scenario 1,
(b) Scenario 2, (c) Scenario 3, (d) Scenario 4. Each panel shows boxplots for VT, SRC1,
SRC2, Cox and CSF; vertical axis: correct classification rate.]
Table 1: Coverage in percent of the proposed 95% confidence intervals in the four
scenarios
Scenario    x1      x2      x3      x4      x5
1           89.2    79.0    83.8    96.0    97.0
2           95.2    96.2    96.6    96.6    98.0
3           82.8    96.8    95.6    92.4    86.6
4           86.2    81.0    97.8    93.0    84.8
5 HIV data analysis
We demonstrate the proposed method by an application to the data from AIDS
Clinical Trials Group Protocol 175 (ACTG175) (Hammer et al., 1996). The original
dataset consists of 2139 HIV-infected subjects. The enrolled subjects were random-
ized to four treatment groups: zidovudine (ZDV) monotherapy, ZDV+didanosine
(ddI), ZDV+zalcitabine, and ddI monotherapy. We focus on the subset of patients
receiving the treatment ZDV+ddI or ddI monotherapy as considered in Lu et al.
(2013). Treatment indicator W = 0 denotes the treatment ddI with 561 subjects,
and W = 1 denotes the treatment ZDV+ddI with 524 subjects. Though ACTG175 is
a randomized study, there seem to be some selection effects in the subsets used here.
For example, among subjects with race equal to 1, there are 138 receiving ZDV+ddI and 173
receiving ddI. A binomial test with null probability 0.5 gives p-value 0.05. For this
reason, we analyze the study as an observational rather than randomized study.
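The binomial check mentioned above can be reproduced with exact arithmetic. The sketch below assumes a two-sided exact test that doubles the smaller tail, which is appropriate for the symmetric null p = 0.5; the text does not say which variant of the test was used, and `binom_two_sided_p` is a hypothetical helper.

```python
from math import comb

def binom_two_sided_p(k, n):
    """Exact two-sided p-value for H0: success probability 0.5."""
    lower = sum(comb(n, j) for j in range(0, k + 1))   # count for P(X <= k)
    upper = sum(comb(n, j) for j in range(k, n + 1))   # count for P(X >= k)
    tail = min(lower, upper)
    return min(1.0, 2 * tail / 2 ** n)

# 138 of the 138 + 173 = 311 subjects with race equal to 1 received ZDV+ddI.
p_value = binom_two_sided_p(138, 138 + 173)
```

The resulting p-value is close to 0.05, in line with the value reported in the text.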
Here we are interested in the causal effect of ZDV+ddI relative to ddI on the survival
time of HIV-infected patients. Twelve selected baseline covariates were studied in Tsiatis
et al. (2008); Zhang et al. (2008); Lu et al. (2013); Fan et al. (2017). There are 5
continuous covariates: age (year), weight (kg), Karnofsky score (scale of 0-100), CD4
count (cells/mm³) at baseline, CD8 count (cells/mm³) at baseline. There are 7 binary
variables: gender (male = 1, female = 0), homosexual activity (yes = 1, no = 0), race
(non-white = 1, white = 0), symptomatic status (symptomatic = 1, asymptomatic =
0), history of intravenous drug use (yes = 1, no = 0), and hemophilia (yes = 1, no =
0). As the outcome considered here is the survival time, we also include CD4 count
(cells/mm³) at 20 ± 5 weeks and CD8 count (cells/mm³) at 20 ± 5 weeks as covariates,
in addition to the 12 covariates above.
We applied the proposed causal survival forest to this dataset. We used the
default tuning parameters: the minimal number of observations in each terminal
node was set to $\lceil p^{1/2} \rceil$, and the number of variables available for splitting at each tree
node was set to $\lceil 2p/3 \rceil$. We fit a large enough number of trees B = 100000.
The point estimates and 95% confidence intervals from the causal survival forest are
presented in Figure 2. The confidence intervals are wide, possibly because of the small
sample size. According to the estimated optimal treatment rule obtained from the heterogeneous
effect estimation, for the subgroup of patients with age less than 34 (median), 349
patients should be assigned to treatment ddI, while 205 patients should be assigned
to treatment ZDV+ddI; for the subgroup of patients with age larger than 34, 272
should be assigned to treatment ZDV+ddI, while 257 patients should be assigned to
treatment ddI. In addition, we varied the patients’ age while the other covariates were
set to their median values. The estimated treatment effects are all negative when
age is less than or equal to 48, while the estimated effects are positive when age is
larger than 48. The results suggest that ZDV+ddI is more favorable for older
HIV-infected patients. A similar finding was also observed in Lu et al. (2013) and Fan
et al. (2017). We note, however, that the confidence intervals for the pointwise effect
as reported in Figure 2 are wide, and so we ought not over-interpret the shape of the
fitted heterogeneity $\hat\theta(x)$.
[Figure 2: The point estimation (solid line) and 95% confidence intervals (dashed line)
from the proposed causal survival forest. Horizontal axis: Age; vertical axis: CATE (×10²).]
Acknowledgement
We thank Julie Tibshirani for helpful conversations and suggestions.
Appendix
A Proof of Theorem 3.2
Proof. Given the set of forest weights $\alpha_i(x)$ used to define the generalized random
forest estimate $\tilde\theta(x)$ with the unknown true nuisance parameters, we have the following
linear approximation
$$\tilde\theta^*(x) = \theta(x) + \sum_{i=1}^{n} \alpha_i(x)\rho_i^*(x),$$
where $\rho_i^*(x)$ denotes the influence function of the $i$-th observation with respect to the
true parameter value $\theta(x)$, and $\tilde\theta^*(x)$ is a pseudo-forest output with weights $\alpha_i(x)$
and outcomes $\theta(x) + \rho_i^*(x)$.
Note that Assumptions 2-6 in Athey et al. (2019) follow immediately from the
definition of the estimating equation $\psi_\theta(x)$. In particular, $\psi_\theta(x)$ is Lipschitz continuous
in $\theta(x)$, which gives their Assumption 4, and a solution of $\sum_{i=1}^{n} \alpha_i \psi^i_\theta(x) = 0$ always
exists, which gives their Assumption 5. By the results shown in Wager and Athey (2018),
there exists a sequence $\sigma_n(x)$ for which
$$\{\tilde\theta^*(x) - \theta(x)\}/\sigma_n(x) \to N(0, 1),$$
where $\sigma_n^2(x) = \mathrm{polylog}(n/s)^{-1} s/n$ and $\mathrm{polylog}(n/s)$ is a function that is bounded
away from 0 and increases at most polynomially with the log-inverse sampling ratio
$\log(n/s)$.
Furthermore, by Lemma 4 in Athey et al. (2019),
$$(n/s)^{1/2}\{\tilde\theta(x) - \tilde\theta^*(x)\} = O_p\left(\max\left\{ s^{-\frac{\pi \log((1-w)^{-1})}{2\log(w^{-1})}},\ \left(\frac{s}{n}\right)^{1/6} \right\}\right).$$
Following Lemma 3.1, as long as $O_p\{kB\max(b_n, c_n, d_n)\}$ converges to zero faster than $\mathrm{polylog}(n/s)^{-1/2}(s/n)^{1/2}$,
we have
$$\{\hat\theta(x) - \theta(x)\}/\sigma_n(x) \to N(0, 1).$$
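For readers tracing the last step, the argument can be read as a three-term decomposition. The sketch below assumes the notation $\hat\theta(x)$ for the proposed estimator with estimated nuisance parameters, $\tilde\theta(x)$ for the forest estimate computed with the true nuisance parameters, and $\tilde\theta^*(x)$ for the pseudo-forest output. These accents are our reading of the proof and are not spelled out in it, and the attribution of the first term to Lemma 3.1 is likewise inferred from the wording of the final sentence.

```latex
\[
\frac{\hat\theta(x) - \theta(x)}{\sigma_n(x)}
= \underbrace{\frac{\hat\theta(x) - \tilde\theta(x)}{\sigma_n(x)}}_{\text{nuisance estimation (Lemma 3.1)}}
+ \underbrace{\frac{\tilde\theta(x) - \tilde\theta^*(x)}{\sigma_n(x)}}_{\text{linearization (Lemma 4, Athey et al., 2019)}}
+ \underbrace{\frac{\tilde\theta^*(x) - \theta(x)}{\sigma_n(x)}}_{\to\, N(0,1)\ \text{(Wager and Athey, 2018)}}
\]
```

If the first two terms are $o_p(1)$, Slutsky's theorem yields the stated limit. The first term is negligible under the rate condition involving $b_n$, $c_n$, $d_n$, since $\sigma_n(x) = \mathrm{polylog}(n/s)^{-1/2}(s/n)^{1/2}$; the second is negligible provided the bound from Lemma 4, multiplied by $\mathrm{polylog}(n/s)^{1/2}$, still tends to zero.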
References
Arlot, S. and Genuer, R. (2014), “Analysis of purely random forests bias,” arXiv
preprint arXiv:1407.3939.
Athey, S., Tibshirani, J., and Wager, S. (2019), “Generalized Random Forests,” The
Annals of Statistics, 47(2).
Athey, S. and Wager, S. (2019), “Estimating Treatment Effects with Causal Forests:
An Application,” Observational Studies, 5, 36–51.
Biau, G. (2012), “Analysis of a random forests model,” Journal of Machine Learning
Research, 13, 1063–1095.
Biau, G., Devroye, L., and Lugosi, G. (2008), “Consistency of random forests and
other averaging classifiers,” Journal of Machine Learning Research, 9, 2015–2033.
Breiman, L. (2001), “Random forests,” Machine Learning, 45, 5–32.
Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. A. (1984), Classification and
Regression Trees, CRC Press.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey,
W., and Robins, J. (2018), “Double/debiased machine learning for treatment and
structural parameters,” The Econometrics Journal, 21, 1–68.
Ciampi, A., Thiffault, J., Nakache, J.-P., and Asselain, B. (1986), “Stratification by
stepwise regression, correspondence analysis and recursive partition: a comparison
of three methods of analysis for survival data with covariates,” Computational
Statistics & Data Analysis, 4, 185 – 204.
Cui, Y., Zhu, R., and Kosorok, M. (2017a), “Tree based weighted learning for estimating
individualized treatment rules with censored data,” Electronic Journal of
Statistics, 11, 3927–3953.
Cui, Y., Zhu, R., Zhou, M., and Kosorok, M. R. (2017b), “Consistency of survival
tree and forest models: splitting bias and correction,” arXiv:1707.09631.
Fan, C., Lu, W., Song, R., and Zhou, Y. (2017), “Concordance-assisted learning
for estimating optimal individualized treatment regimes,” Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 79, 1565–1582.
Fleming, T. R. and Harrington, D. P. (2011), Counting processes and survival analysis,
vol. 169, John Wiley & Sons.
Foster, J. C., Taylor, J. M., and Ruberg, S. J. (2011), “Subgroup identification from
randomized clinical trial data,” Statistics in Medicine, 30, 2867–2880.
Friedberg, R., Tibshirani, J., Athey, S., and Wager, S. (2018), “Local linear forests,”
arXiv preprint arXiv:1807.11408.
Hammer, S. M., Katzenstein, D. A., Hughes, M. D., Gundacker, H., Schooley, R. T.,
Haubrich, R. H., Henry, W. K., Lederman, M. M., Phair, J. P., Niu, M., et al.
(1996), “A trial comparing nucleoside monotherapy with combination therapy in
HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter,”
New England Journal of Medicine, 335, 1081–1090.
Hothorn, T., Bühlmann, P., Dudoit, S., Molinaro, A., and Van Der Laan, M. J.
(2006a), “Survival ensembles,” Biostatistics, 7, 355–373.
Hothorn, T., Hornik, K., and Zeileis, A. (2006b), “Unbiased Recursive Partitioning: A
Conditional Inference Framework,” Journal of Computational and Graphical Statistics,
15, 651–674.
Hothorn, T., Lausen, B., Benner, A., and Radespiel-Tröger, M. (2004), “Bagging
survival trees,” Statistics in Medicine, 23, 77–91.
Ishwaran, H. and Kogalur, U. (2019), Random Forests for Survival, Regression, and
Classification (RF-SRC), R package version 2.8.0.
Ishwaran, H., Kogalur, U. B., Blackstone, E. H., and Lauer, M. S. (2008), “Random
survival forests,” The Annals of Applied Statistics, 841–860.
Künzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. (2019), “Metalearners for
estimating heterogeneous treatment effects using machine learning,” Proceedings of
the National Academy of Sciences, 116, 4156–4165.
Laber, E. and Zhao, Y. (2015), “Tree-based methods for individualized treatment
regimes,” Biometrika, 102, 501–514.
Leblanc, M. and Crowley, J. (1993), “Survival Trees by Goodness of Split,” Journal
of the American Statistical Association, 88, 457–467.
Lin, Y. and Jeon, Y. (2006), “Random forests and adaptive nearest neighbors,” Journal
of the American Statistical Association, 101, 578–590.
Lu, M., Sadiq, S., Feaster, D., and Ishwaran, H. (2018), “Estimating Individual
Treatment Effect in Observational Data Using Random Forest Methods,” Journal
of Computational and Graphical Statistics, 1–11.
Lu, W., Zhang, H. H., and Zeng, D. (2013), “Variable selection for optimal treatment
decision,” Statistical Methods in Medical Research, 22, 493–504.
Meinshausen, N. (2006), “Quantile regression forests,” Journal of Machine Learning
Research, 7, 983–999.
Mentch, L. and Hooker, G. (2014), “Ensemble trees and CLTs: Statistical inference for
supervised learning,” stat, 1050, 25.
— (2016), “Quantifying Uncertainty in Random Forests via Confidence Intervals and
Hypothesis Tests,” Journal of Machine Learning Research, 17, 1–41.
Murphy, S. A. (2003), “Optimal dynamic treatment regimes,” Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 65, 331–355.
Nie, X. and Wager, S. (2017), “Quasi-Oracle Estimation of Heterogeneous Treatment
Effects,” arXiv:1712.04912.
Oprescu, M., Syrgkanis, V., and Wu, Z. S. (2019), “Orthogonal Random Forest for
Causal Inference,” in International Conference on Machine Learning, pp. 4932–4941.
Qian, M. and Murphy, S. A. (2011), “Performance guarantees for individualized treatment
rules,” The Annals of Statistics, 39, 1180.
Robins, J. M., Li, L., Mukherjee, R., Tchetgen, E. T., van der Vaart, A., et al. (2017),
“Minimax estimation of a functional on a structured high-dimensional model,” The
Annals of Statistics, 45, 1951–1987.
Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994), “Estimation of regression
coefficients when some regressors are not always observed,” Journal of the American
Statistical Association, 89, 846–866.
Robinson, P. M. (1988), “Root-N-Consistent Semiparametric Regression,” Econometrica,
56, 931–954.
Schick, A. (1986), “On asymptotically efficient estimation in semiparametric models,”
The Annals of Statistics, 14, 1139–1151.
Segal, M. R. (1988), “Regression Trees for Censored Data,” Biometrics, 44, 35–47.
Sexton, J. and Laake, P. (2009), “Standard errors for bagged and random forest
estimators,” Computational Statistics & Data Analysis, 53, 801–811.
Steingrimsson, J. A., Diao, L., Molinaro, A. M., and Strawderman, R. L. (2016),
“Doubly robust survival trees,” Statistics in Medicine, 35, 3595–3612.
Steingrimsson, J. A., Diao, L., and Strawderman, R. L. (2019), “Censoring Unbiased
Regression Trees and Ensembles,” Journal of the American Statistical Association,
114, 370–383.
Sun, Q., Zhu, R., Wang, T., and Zeng, D. (2019), “Counting process-based dimension
reduction methods for censored outcomes,” Biometrika, 106, 181–196.
Tsiatis, A. (2007), Semiparametric Theory and Missing Data, Springer Series in
Statistics, Springer New York.
Tsiatis, A. A., Davidian, M., Zhang, M., and Lu, X. (2008), “Covariate adjustment
for two-sample treatment comparisons in randomized clinical trials: a principled
yet flexible approach,” Statistics in Medicine, 27, 4658–4677.
Wager, S. and Athey, S. (2018), “Estimation and inference of heterogeneous treatment
effects using random forests,” Journal of the American Statistical Association, 113,
1228–1242.
Wager, S., Hastie, T., and Efron, B. (2014), “Confidence Intervals for Random Forests:
The Jackknife and the Infinitesimal Jackknife,” Journal of Machine Learning Research,
15, 1625–1651.
Wager, S. and Walther, G. (2015), “Adaptive Concentration of Regression Trees, with
Application to Random Forests,” arXiv preprint arXiv:1503.06388.
Wright, M. N. and Ziegler, A. (2017), “ranger: A Fast Implementation of Random
Forests for High Dimensional Data in C++ and R,” Journal of Statistical Software,
77, 1–17.
Zhang, B., Tsiatis, A. A., Laber, E. B., and Davidian, M. (2012), “A robust method
for estimating optimal treatment regimes,” Biometrics, 68, 1010–1018.
Zhang, M., Tsiatis, A. A., and Davidian, M. (2008), “Improving efficiency of inferences
in randomized clinical trials using auxiliary covariates,” Biometrics, 64,
707–715.
Zhao, Y., Zeng, D., Rush, A. J., and Kosorok, M. R. (2012), “Estimating individualized
treatment rules using outcome weighted learning,” Journal of the American
Statistical Association, 107, 1106–1118.
Zhao, Y.-Q., Zeng, D., Laber, E. B., Song, R., Yuan, M., and Kosorok, M. R. (2015),
“Doubly robust learning for estimating individualized treatment with censored
data,” Biometrika, 102, 151–168.
Zhu, R. (2018), Reinforcement Learning Trees, R package version 3.2.2.
Zhu, R. and Kosorok, M. R. (2012), “Recursively imputed survival trees,” Journal of
the American Statistical Association, 107, 331–340.
Zhu, R., Zhao, Y.-Q., Chen, G., Ma, S., and Zhao, H. (2017), “Greedy outcome
weighted tree learning of optimal personalized treatment rules,” Biometrics, 73,
391–400.