Estimating heterogeneous treatment effects with

right-censored data via causal survival forests ∗

Yifan Cui† Michael R. Kosorok‡ Stefan Wager § Ruoqing Zhu ¶

January 28, 2020

Abstract

There is a fast-growing literature on estimating heterogeneous treatment effects via random forests in observational studies. However, few approaches are available for right-censored survival data. In clinical trials, right-censored survival data are frequently encountered, and quantifying the causal relationship between a treatment and the survival outcome is of great interest. Random forests provide a robust, nonparametric approach to statistical estimation. In addition, recent developments allow forest-based methods to quantify the uncertainty of the estimated heterogeneous treatment effects. We propose causal survival forests that directly target the treatment effect in an observational study. We establish consistency and asymptotic normality of the proposed estimators and provide an estimator of the asymptotic variance that enables valid confidence intervals for the estimated treatment effect. The performance of our approach is demonstrated via extensive simulations and data from an HIV study.

∗Alphabetical order.

†Department of Statistics, The Wharton School, University of Pennsylvania, 3730 Walnut Street, Philadelphia, Pennsylvania 19104, USA; email: cuiy@wharton.upenn.edu.

‡Department of Biostatistics, University of North Carolina, 3101 McGavran-Greenberg Hall, Chapel Hill, North Carolina 27599, USA; email: kosorok@unc.edu.

§Stanford Graduate School of Business, 655 Knight Way, Stanford, California 94035, USA; email: swager@stanford.edu.

¶Department of Statistics, University of Illinois at Urbana-Champaign, 725 South Wright Street, Champaign, Illinois 61820, USA; email: rqzhu@illinois.edu.

Keywords: Precision medicine, Random forests, Right-censored data, Heterogeneous treatment effects, Confidence intervals

1 Introduction

Recently, random forests have been considered to estimate heterogeneous treatment

effects in observational studies. Several examples in statistical and biomedical settings

include Athey et al. (2019), Friedberg et al. (2018), Lu et al. (2018), Kunzel et al.

(2019) and Oprescu et al. (2019). The advantages of forest and tree-based methods,

as described in Breiman (2001) and Athey et al. (2019), are two-fold. First, random

forests provide a robust nonparametric approach for estimating heterogeneous treat-

ment effects. Second, random forests can quantify the uncertainty of the estimated

heterogeneous treatment effects through recent developments (Mentch and Hooker,

2014; Wager and Athey, 2018).

In clinical trials and other biomedical research studies, right-censored survival

data are frequently encountered. The literature on random forests is well developed

for survival data, e.g., Leblanc and Crowley (1993); Hothorn et al. (2006b) studied

survival tree models in the context of conditional inference trees; Hothorn et al. (2006a) proposed using inverse probability of censoring weighting to compensate for censoring in forest models; Ishwaran et al. (2008) proposed so-called random survival forests, which extend random forests to handle survival data by using a log-rank test at each split on an individual survival tree (Ciampi et al., 1986; Segal, 1988); Zhu

and Kosorok (2012) studied the impact of recursive imputation of survival forests

on model fitting; Steingrimsson et al. (2016) proposed doubly robust survival trees

by constructing doubly robust loss functions that use more information to improve

efficiency; Steingrimsson et al. (2019) constructed censoring unbiased regression trees

and forests by considering a class of censoring unbiased loss functions. However,

none of these methods were directly targeted at heterogeneous treatment effects in

observational studies.

Another topic related to estimating heterogeneous treatment effects is optimal

treatment regimes. A significant amount of work has been devoted to estimating opti-

mal treatment rules in randomized trials with complete data (Murphy, 2003; Qian and

Murphy, 2011; Zhang et al., 2012; Zhao et al., 2012; Laber and Zhao, 2015). Adapt-

ing the outcome weighted learning framework (Zhao et al., 2012), Zhao et al. (2015)

proposed two new approaches, inverse censoring weighted outcome weighted learning,

and doubly robust outcome weighted learning, both of which require semi-parametric

estimation of the conditional censoring probability given the patient characteristics

and treatment choice. Zhu et al. (2017) adopted the accelerated failure time model to

estimate an interpretable single-tree treatment decision rule. Cui et al. (2017a) pro-

posed a random forest approach for right-censored outcome weighted learning, which

avoids both the inverse probability of censoring weighting and restrictive modeling

assumptions. However, for observational studies with censored survival outcomes,

these methods may suffer from confounding and selection bias. In addition, none of

the existing approaches estimates the heterogeneous treatment effect and the associated confidence interval. Hence, it is challenging to provide valid inference for the suggested treatment.

To address these limitations in the existing literature, the proposed random forest

approach, namely causal survival forest, aims at estimating the heterogeneous treat-

ment effects from right-censored observational survival data. Compared with exist-

ing approaches, the proposed causal survival forest enjoys two advantages. Firstly,

our random forest and its associated splitting rules directly target the treatment effect while adjusting for the bias caused by confounding variables and censoring. Secondly, we can provide valid inference for the heterogeneous treatment effect, which is much needed in the practical implementation of

precision medicine. Our approach is motivated by classical results on semiparametric

efficiency theory and survival analysis (Tsiatis, 2007). We construct local estimating

equations for the conditional average treatment effect (Athey et al., 2019), which leads

to an unbiased splitting rule that addresses the difference between the two potential

treatments.

The proposed approach is quite general in the sense that it covers many missing data and causal inference problems. It can deal with coarsening and missing data as long as the functional of interest admits an unbiased estimating equation. We consider this broader class to demonstrate the flexibility of our method for various functionals that are generally of interest. Several functionals, such as the marginal mean of an outcome subject to missingness and the closely related marginal mean of a counterfactual outcome, fall within our framework. Thus, the conditional average treatment effect under unconfoundedness can be viewed as a running example for developing the proposed methodology.

2 Causal survival forests

Our goal is to construct a survival forest model (Ishwaran et al., 2008) that can over-

come the potential bias caused by observational data and right censoring. Suppose

X is a p dimensional covariate, W ∈ {0, 1} is the treatment label, T is the survival

time, and C is the censoring time. As we focus on the observational study setting,

throughout the paper, we make the following three assumptions on the failure time

T for identifiability of the conditional average treatment effect.

Assumption 1. (Consistency) T = T (W ) almost surely.

Assumption 2. (Unconfoundedness) {T (0), T (1)} ⊥ W | X.

Assumption 3. (Positivity) pr(W = w|X = x) > 0 almost surely if f(x) > 0, where

w = 0, 1.

The consistency assumption states that we observe a realization of T (w) only if the

treatment w is equal to a subject’s actual treatment assignment W . This assumption

links the potential outcomes to the observed data because for any failure event, only

one of the two potential outcomes T (1) and T (0) is observed. The unconfoundedness

assumption basically states that conditioning on covariate vector X, treatment W

is independent of potential outcomes. This assumption is satisfied if all prognostic

factors used to determine the treatment label W are recorded in X. Finally, the positivity assumption essentially states that any subject with an observed value of x has a positive probability of receiving each value of the treatment.

2.1 Causal forests without censoring

Before discussing treatment effect estimation with censored outcomes, we briefly re-

view the causal forest approach to treatment heterogeneity without censoring. Sup-

pose we want to estimate a treatment effect θ(x) = E[y(T (1)) − y(T (0))|X = x],

where y(T ) is some deterministic transformation of the survival time T . Typical

choices of outcome function include expected thresholded survival time y(T ) = T ∧ τ

and the survivor function y(T ) = 1({T ≥ τ}).

If we knew that the treatment effects were constant, i.e., $\theta(x) = \theta$ for all $x$, then the following estimator $\hat\theta$ due to Robinson (1988) attains $\sqrt{n}$ rates of convergence, provided the three assumptions detailed above hold and that we estimate nuisance components sufficiently fast (Robins et al., 2017; Chernozhukov et al., 2018):

$$\sum_{i=1}^{n} \psi^{(c)}_{\hat\theta}(X_i, T_i, W_i; \hat e, \hat m) = 0, \qquad \psi^{(c)}_{\theta}(X_i, T_i, W_i; \hat e, \hat m) = (W_i - \hat e(X_i))\,\bigl(y(T_i) - \hat m(X_i) - \theta\,(W_i - \hat e(X_i))\bigr), \tag{1}$$

where $e(x) = \mathrm{pr}[W = 1 \mid X = x]$, $m(x) = E[y(T) \mid X = x]$, and $\hat e(X_i)$ and $\hat m(X_i)$ are estimates of these quantities derived via cross-fitting (Schick, 1986). We use the superscript $(c)$ to remind ourselves that this estimator requires access to the complete (uncensored) data.
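As an illustration of Robinson's estimating equation, the following minimal sketch solves it in closed form for a constant effect on synthetic, uncensored data. The data-generating process, the function name `robinson_theta`, and the use of the true nuisance functions in place of cross-fitted estimates are all assumptions made for this example only.

```python
import numpy as np

def robinson_theta(y, w, e_hat, m_hat):
    """Solve sum_i (w_i - e_i) * (y_i - m_i - theta * (w_i - e_i)) = 0 for theta."""
    w_res = w - e_hat          # treatment residual W - e(X)
    y_res = y - m_hat          # outcome residual y(T) - m(X)
    return float(np.sum(w_res * y_res) / np.sum(w_res ** 2))

# Synthetic complete-data example (hypothetical design, not from the paper).
rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(size=n)
e = 0.25 + 0.5 * x                       # true propensity score
w = rng.binomial(1, e)
theta_true = 2.0
y = np.sin(3 * x) + theta_true * w + rng.normal(scale=0.5, size=n)
m = np.sin(3 * x) + theta_true * e       # true E[y | X]

theta_hat = robinson_theta(y, w, e, m)
```

In practice the nuisance components would be replaced by cross-fitted estimates, as described above; the closed form itself is unchanged.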

Here, however, our goal is not to estimate a constant treatment effect θ, but

rather to fit covariate-dependent treatment heterogeneity θ(x). For this purpose, we

use forests. As background, recall that given a target point x, tree-based methods

seek to find training examples which are close to x and use local kernel weights to

obtain a weighted averaging estimator. An essential ingredient of tree-based methods

is recursive partitioning on the covariate space X , which induces the local weighting.

When the splitting variables are adaptively chosen, the width of a leaf can be narrower

along the directions where the causal effect is changing faster. After the tree fitting is

completed, the closest points to x are those that fall into the same terminal node as

x. The observations that fall into the same node as the target point x can be treated

asymptotically as coming from a homogeneous group.

Athey et al. (2019) generalize the use of random forest-based weights for generic kernel estimation. The most closely related precedents from the perspective of adaptive nearest neighbor estimation are quantile regression forests (Meinshausen, 2006) and

bagging survival trees (Hothorn et al., 2004), which can be viewed as special cases

of generalized random forests. The idea of adaptive nearest neighbors also underlies

theoretical analyses of random forests such as Lin and Jeon (2006); Biau et al. (2008);

Arlot and Genuer (2014). The random forest-based weights αi are derived from the

fraction of trees in which an observation appears in the same terminal node as the

target point. Specifically, given a test point x, the weights αi(x) are the frequency

with which the i-th training example falls in the same leaf as x, i.e.,

$$\alpha_i(x) = \frac{1}{B} \sum_{b=1}^{B} \frac{I\{X_i \in N_b(x)\}}{|N_b(x)|},$$

where Nb(x) is the terminal node that contains x in the b-th tree, and | · | denotes

the cardinality.
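The weights described above can be computed directly from leaf memberships. The sketch below assumes that each tree has already assigned a leaf identifier to every training point and to the target point; the array layout and the name `forest_weights` are hypothetical and not tied to any particular forest library.

```python
import numpy as np

def forest_weights(train_leaves, target_leaves):
    """Forest weights alpha_i(x) from leaf memberships.

    train_leaves:  (B, n) array, leaf id of training point i in tree b.
    target_leaves: (B,) array, leaf id of the target point x in tree b.
    """
    train_leaves = np.asarray(train_leaves)
    target_leaves = np.asarray(target_leaves)
    same_leaf = (train_leaves == target_leaves[:, None]).astype(float)  # I{X_i in N_b(x)}
    leaf_sizes = same_leaf.sum(axis=1, keepdims=True)                   # |N_b(x)|
    per_tree = np.divide(same_leaf, leaf_sizes,
                         out=np.zeros_like(same_leaf), where=leaf_sizes > 0)
    return per_tree.mean(axis=0)                                        # average over the B trees

# Two trees, four training points.
alpha = forest_weights([[0, 0, 1, 1], [0, 1, 1, 1]], [0, 1])
```

When every tree places at least one training point in the leaf of x, the weights are nonnegative and sum to one.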

The crux of the causal forest algorithm presented in Athey et al. (2019) is to

pair the kernel-based view of forests with Robinson’s estimating equation (1). Causal

forests seek to grow a forest such that the resulting weighting function αi(x) can

be used to express heterogeneity in θ(·), meaning that θ(·) is roughly constant over

observations given positive weight αi(x) for predicting at x. Then, we estimate θ(x)

by solving a localized version of (1):

$$\sum_{i=1}^{n} \alpha_i(x)\, \psi^{(c)}_{\hat\theta(x)}(X_i, T_i, W_i; \hat e, \hat m) = 0. \tag{2}$$

For further details, including the choice of a splitting rule targeted for treatment

heterogeneity, see Athey et al. (2019); further examples are given in Athey and Wager

(2019). The idea of using Robinson’s transformation to fit treatment heterogeneity

is further explored by Nie and Wager (2017).
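Because Robinson's score is linear in the parameter, the localized estimating equation can be solved explicitly. The following worked step is a sketch that follows from the definitions above, assuming the weighted denominator is nonzero; it shows that the causal forest estimate is a weighted residual-on-residual regression.

```latex
\sum_{i=1}^{n} \alpha_i(x)\,(W_i - \hat e(X_i))
  \bigl(y(T_i) - \hat m(X_i) - \hat\theta(x)\,(W_i - \hat e(X_i))\bigr) = 0
\;\Longrightarrow\;
\hat\theta(x) =
  \frac{\sum_{i=1}^{n} \alpha_i(x)\,(W_i - \hat e(X_i))\,(y(T_i) - \hat m(X_i))}
       {\sum_{i=1}^{n} \alpha_i(x)\,(W_i - \hat e(X_i))^{2}}
```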

2.2 Adjusting for censoring

The central goal of this paper is to develop a causal forest algorithm that can be

used despite censoring, i.e., despite the fact that we sometimes do not observe T ,

and instead only observe U = T ∧ C. Throughout, we assume that T is the survival

time up to a fixed maximum follow-up time τ and the censoring is noninformative

conditionally on X (Fleming and Harrington, 2011).

Assumption 4. (Conditionally independent censoring) T ⊥ C | X.

Our approach builds on the following recipe to making estimating equations ro-

bust to censoring, described in Tsiatis (2007, Chapter 10.4). If the true value θ

of our parameter of interest is identified by a complete data estimating equation,

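$E[\psi^{(c)}_{\theta}(X_i, T_i, W_i)] = 0$, then $\theta$ is also identified via the following estimator that generalizes the celebrated augmented inverse-propensity weighting estimator of Robins et al. (1994): $E[\psi_{\theta}(X_i, U_i, W_i, \Delta_i)] = 0$ with

$$\psi_{\theta}(X_i, U_i, W_i, \Delta_i) = \frac{\Delta_i\, \psi^{(c)}_{\theta}(X_i, U_i, W_i)}{S^C(U_i \mid X_i)} + \frac{(1 - \Delta_i)\, E[\psi^{(c)}_{\theta}(X_i, T_i, W_i) \mid T_i \ge U_i, X_i]}{S^C(U_i \mid X_i)} - \int_0^{U_i} \frac{\lambda^C(s \mid X_i)}{S^C(s \mid X_i)}\, E[\psi^{(c)}_{\theta}(X_i, T_i, W_i) \mid T_i \ge s, X_i]\, ds, \tag{3}$$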

where SC(s|x) = pr[Ci ≥ s|Xi = x] is the conditional survival function for the

censoring process, and λC(s|x) = −d/ds logSC(s|x) is the associated conditional

hazard function. In our case, the specific form of our complete data estimating

equation enables us to simplify this expression resulting in

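$$\psi_{\theta}(X_i, U_i, W_i, \Delta_i; \hat e, \hat m, \hat\lambda^C, \hat S^C, \hat Q) = \Biggl( \frac{\hat Q_{W_i}(U_i \mid X_i) + \Delta_i [y(T_i) - \hat Q_{W_i}(U_i \mid X_i)] - \hat m(X_i) - \theta (W_i - \hat e(X_i))}{\hat S^C_{W_i}(U_i \mid X_i)} - \int_0^{U_i} \frac{\hat\lambda^C_{W_i}(s \mid X_i)}{\hat S^C_{W_i}(s \mid X_i)} \bigl[ \hat Q_{W_i}(s \mid X_i) - \hat m(X_i) - \theta (W_i - \hat e(X_i)) \bigr]\, ds \Biggr) (W_i - \hat e(X_i)), \tag{4}$$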

where $Q_{W_i}(t \mid X_i) = E[T_i \mid X_i, W_i, T_i > t]$, and $\hat S^C_{W_i}(s \mid x)$ and $\hat\lambda^C_{W_i}(s \mid x)$ are the estimated conditional survival function and hazard function, respectively. The associated estimator $\hat\theta$ is characterized by $\sum_{i=1}^{n} \psi_{\hat\theta}(X_i, U_i, W_i, \Delta_i; \hat e, \hat m, \hat\lambda^C, \hat S^C, \hat Q) = 0$. This estimator attains $\sqrt{n}$ rates for $\theta$ under the setting of Chernozhukov et al. (2018), i.e., with cross-fitting and fourth-root rates for the nuisance components, provided that Assumptions 1-3 and the conditionally independent censoring assumption hold.
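The simplified score is linear in θ, so for each subject it can be summarized by two numbers: a θ-free term and the coefficient multiplying θ(W − e). The sketch below evaluates both by trapezoidal integration on a time grid. It is only an illustration under stated assumptions: the censoring indicator is taken to be Δ = 1{T ≤ C}, the nuisance curves are supplied on a grid starting at 0, and the names `score_terms` and `solve_theta` are hypothetical.

```python
import numpy as np

def _trapezoid(values, nodes):
    """Trapezoidal rule on possibly uneven nodes."""
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(nodes)))

def score_terms(u, delta, y_obs, m_hat, grid, s_c, lam_c, q):
    """Return (a, d) such that psi_theta = (w - e_hat) * (a - theta * (w - e_hat) * d).

    grid, s_c, lam_c, q: censoring survival, censoring hazard, and Q curves for
    this subject's arm, evaluated on an increasing time grid starting at 0.
    """
    nodes = np.append(grid[grid < u], u)       # integration nodes on [0, u]
    s = np.interp(nodes, grid, s_c)            # S^C(s | X_i)
    lam = np.interp(nodes, grid, lam_c)        # lambda^C(s | X_i)
    qs = np.interp(nodes, grid, q)             # Q(s | X_i)
    ratio = lam / s
    # theta-free part: leading fraction minus the integral term
    a = (qs[-1] + delta * (y_obs - qs[-1]) - m_hat) / s[-1] \
        - _trapezoid(ratio * (qs - m_hat), nodes)
    # coefficient of theta * (w - e_hat)
    d = 1.0 / s[-1] - _trapezoid(ratio, nodes)
    return a, d

def solve_theta(alpha, w, e_hat, a, d):
    """Root of sum_i alpha_i * psi_theta,i = 0, using linearity in theta."""
    r = w - e_hat
    return float(np.sum(alpha * r * a) / np.sum(alpha * r ** 2 * d))

# Check with exponential censoring (rate 0.5), where d should be close to 1.
grid = np.linspace(0.0, 2.0, 2001)
a_i, d_i = score_terms(u=1.2, delta=1, y_obs=1.4, m_hat=1.0, grid=grid,
                       s_c=np.exp(-0.5 * grid), lam_c=np.full_like(grid, 0.5),
                       q=np.ones_like(grid))
```

With the true censoring distribution, the coefficient d equals one exactly, since the derivative of 1/S^C is λ^C/S^C; with estimated nuisances it is only approximately one.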

2.3 Proposed causal survival forests

We now return to our main proposal, i.e., estimating heterogeneous treatment effects

using causal survival forests. Following our discussion above, we proceed as follows.

First, we estimate the nuisance components e, m, λC , SC , Q required to form the

score (4). Then, however, instead of estimating a constant parameter, we pair this

estimating equation with the forest weighting scheme (2), resulting in estimates $\hat\theta(x)$ characterized by

$$\sum_{i=1}^{n} \alpha_i(x)\, \psi_{\hat\theta(x)}(X_i, U_i, W_i, \Delta_i; \hat e, \hat m, \hat\lambda^C, \hat S^C, \hat Q) = 0. \tag{5}$$

In order to use this estimator, we of course need to specify how to grow the forest,

so that the resulting forest weights αi(x) adequately express heterogeneity in the

underlying signal θ(x). Here, for the splitting rule, we use the ∆-criterion proposed

in Athey et al. (2019). In particular, we generate pseudo-outcomes by the following

relabeling strategy at each internal node.

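$$\rho_i = \psi^{i}_{\hat\theta(x)} \times \Biggl\{ (W_i - \hat e(X_i))^2 \Biggl( \frac{1}{\hat S^C_{W_i}(U_i \mid X_i)} - \int_0^{U_i} \frac{\hat\lambda^C_{W_i}(t \mid X_i)}{\hat S^C_{W_i}(t \mid X_i)}\, dt \Biggr) \Biggr\}^{-1},$$

where $\psi^{i}_{\hat\theta(x)}$ is shorthand for $\psi_{\hat\theta(x)}(X_i, U_i, W_i, \Delta_i; \hat e, \hat m, \hat\lambda^C, \hat S^C, \hat Q)$. Next, the splitting criterion proceeds exactly as in a regression tree (Breiman et al., 1984) problem by treating the pseudo-outcomes $\rho_i$ as a continuous outcome variable. Specifically, we split the parent node into two child nodes $N_L$ and $N_R$ so as to maximize the following quantity:

$$\Delta(L, R) = \frac{1}{|\{i : X_i \in N_L\}|} \Biggl( \sum_{i : X_i \in N_L} \rho_i \Biggr)^{2} + \frac{1}{|\{i : X_i \in N_R\}|} \Biggl( \sum_{i : X_i \in N_R} \rho_i \Biggr)^{2}.$$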
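To make the split search concrete, the following sketch scans all thresholds along a single covariate and picks the one maximizing the sum, over the two children, of the squared pseudo-outcome total divided by the child size. It is a simplified illustration: the function name `best_split` and the `min_child` argument are assumptions, and a full implementation would also loop over candidate covariates and enforce the subsampling and honesty rules.

```python
import numpy as np

def best_split(x_col, rho, min_child=1):
    """Best threshold on one covariate for the criterion
    (sum_L rho)^2 / |L| + (sum_R rho)^2 / |R|."""
    order = np.argsort(x_col, kind="stable")
    xs, rs = np.asarray(x_col)[order], np.asarray(rho)[order]
    n = len(rs)
    csum = np.cumsum(rs)
    total = csum[-1]
    best_thr, best_gain = None, -np.inf
    for k in range(min_child, n - min_child + 1):   # left child gets the first k points
        if xs[k - 1] == xs[k]:
            continue                                # cannot split between tied values
        left = csum[k - 1]
        gain = left ** 2 / k + (total - left) ** 2 / (n - k)
        if gain > best_gain:
            best_thr, best_gain = 0.5 * (xs[k - 1] + xs[k]), gain
    return best_thr, best_gain

thr, gain = best_split(np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9]),
                       np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]))
```

In this toy example the pseudo-outcomes change sign between 0.3 and 0.7, so the search places the threshold at the midpoint 0.5.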

2.4 Confidence intervals for the estimated treatment effects

Establishing a confidence interval allows proper statistical inference on the suggested

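treatment strategy. To build asymptotically valid confidence intervals for $\theta(x)$ centered on $\hat\theta(x)$, it suffices to derive an estimator for $\mathrm{var}\{\hat\theta(x)\}$. As shown in Lemma 3.1 and Theorem 3.2 in Section 3, it is enough to study the variance of $\theta^*$, where

$$\theta^*(x) = \theta(x) + \sum_{i=1}^{n} \alpha_i(x)\, \rho^*_i(x),$$

and $\rho^*_i(x)$ is the influence function of the $i$-th observation with respect to the true parameter value $\theta(x)$, i.e.,

$$\rho^*_i(x) = \psi^{i}_{\theta(x)}\, V(x)^{-1}, \qquad V(x) = E\Biggl[ \{W - e(X)\}^2 \Biggl\{ \frac{1}{S^C_W(U \mid X)} - \int_0^{U} \frac{\lambda^C_W(t \mid X)}{S^C_W(t \mid X)}\, dt \Biggr\} \Biggm| X = x \Biggr].$$

It is easy to see that

$$\mathrm{var}\{\theta^*(x)\} = \mathrm{var}\Biggl\{ \sum_{i=1}^{n} \alpha_i(x)\, \psi^{i}_{\theta(x)} \Biggr\} V(x)^{-2}.$$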

We estimate the above variance by

$$\hat\sigma_n^2 = H_n(x)\, V_n(x)^{-2},$$

where $V_n(x)$ and $H_n(x)$ are consistent estimators of $V(x)$ and $\mathrm{var}\{\sum_{i=1}^{n} \alpha_i(x)\, \psi^{i}_{\theta(x)}\}$, respectively. Many strategies are available for estimating $V(x)$. We estimate it by fitting honest and regular regression forests. To obtain a valid estimate $H_n(x)$, notice that the term $\sum_{i=1}^{n} \alpha_i(x)\, \psi^{i}_{\theta(x)}$ is equivalent to the output of a regression forest with weights $\alpha_i(x)$ and effective outcomes $\psi^{i}_{\theta(x)}$. Many methods have

been developed to estimate the variance of a regression forest, including work by

Sexton and Laake (2009); Wager et al. (2014); Mentch and Hooker (2016); Wager and

Athey (2018); Athey et al. (2019). We follow Athey et al. (2019) and use bootstrap

of little bags in our implementation.
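Given the point estimate and the two variance ingredients, a normal-approximation interval follows directly. The sketch below assumes that H_n(x) and V_n(x) have already been computed (for example by a bootstrap of little bags and a regression forest, respectively); the function name and the numerical inputs are purely illustrative.

```python
import math

def confidence_interval(theta_hat, h_n, v_n, z=1.959963984540054):
    """Normal-approximation interval using sigma_n^2 = H_n(x) * V_n(x)^(-2).

    z defaults to the 97.5% standard normal quantile (95% two-sided interval).
    """
    sigma = math.sqrt(h_n) / abs(v_n)
    return theta_hat - z * sigma, theta_hat + z * sigma

# Illustrative numbers only: sigma = sqrt(0.0004) / 0.5 = 0.04.
lower, upper = confidence_interval(theta_hat=0.3, h_n=0.0004, v_n=0.5)
```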

3 Theoretical results

In this section, we study the asymptotic normality of the estimated treatment effect $\hat\theta(x)$. Throughout this section, we assume that the covariates $X \in [0, 1]^p$ are distributed according to a density that is bounded away from zero and infinity. The following assumption guarantees the smoothness of $E(\psi_{\theta(x)} \mid X = x)$.

Assumption 5. (Lipschitz continuity) The treatment effect function θ(x) is L′-

Lipschitz continuous in terms of x. In addition, the propensity score e(x), hazard

function λC(t|X = x) and conditional survival function SC(t|X = x) are Lipschitz

continuous in terms of x.

In addition, our trees are symmetric, i.e., their output is invariant to permuting

the indices of training samples. Our algorithm also guarantees honesty (Wager and

Athey, 2018), and the following two conditions.

Random split tree: At each internal node, the probability of splitting at the $j$-th dimension is greater than $\varsigma$, where $0 < \varsigma < 1$ for $j = 1, \ldots, p$.

Subsampling: Each child node contains at least a fraction $w$ of the data points in its parent node for some $0 < w < 0.5$, and trees are grown on subsamples of size $s$ scaling as $s = n^{\gamma}$, where $\kappa < \gamma < 1$ and

$$\kappa \equiv 1 - \Biggl[ 1 + \varsigma^{-1} \Biggl\{ \frac{\log(w^{-1})}{\log((1 - w)^{-1})} \Biggr\} \Biggr]^{-1}.$$
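To get a feel for how restrictive the subsampling condition is, the lower bound κ on the subsample exponent can be evaluated numerically. The sketch below reads the bound as one minus the inverse of 1 + ς⁻¹ log(w⁻¹)/log((1 − w)⁻¹); the function name and the chosen values of ς and w are hypothetical.

```python
import math

def kappa(split_prob, min_frac):
    """Lower bound on the subsample exponent gamma.

    split_prob: per-coordinate splitting probability bound (varsigma), in (0, 1).
    min_frac:   minimum fraction of parent data in each child (w), in (0, 0.5).
    """
    ratio = math.log(1.0 / min_frac) / math.log(1.0 / (1.0 - min_frac))
    return 1.0 - 1.0 / (1.0 + ratio / split_prob)

k = kappa(split_prob=0.5, min_frac=0.2)
```

For these illustrative values the bound is already above 0.9, which means the subsample size must grow almost as fast as n.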

Furthermore, we need the following Assumptions 6-9 to couple $\hat\theta(x)$ and $\tilde\theta(x)$, where $\tilde\theta(x)$ is an oracle estimator, with $e(x)$, $m(x)$, $\lambda^C(t \mid X = x)$, and $S^C(t \mid X = x)$ being the underlying truth. Assumptions 6-8 are commonly assumed in the causal inference and survival analysis literature.

Assumption 6. The failure time and censoring time are bounded. The density func-

tions of the failure time T and censoring time C are both bounded above for all x ∈ X .

Assumption 7. Any individual has a positive probability of receiving both treatments.

Furthermore, 1−M1 < e(x) < M1 for any x ∈ X , and some M1 ∈ (1/2, 1).

Assumption 8. There exists a fixed positive constant M2 ∈ (0, 1), such that P (U ≥

τ |X = x) > M2 for all x ∈ X .

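Assumption 9. Consistency of the non-parametric plug-in estimators: we have the following convergences in probability,

$$|\hat e(x) - e(x)| \to 0, \qquad \sup_{t < \tau,\, w \in \{0,1\}} |\hat\lambda^C_w(t \mid x) - \lambda^C_w(t \mid x)| \to 0,$$

$$\sup_{t < \tau,\, w \in \{0,1\}} |\hat S^C_w(t \mid x) - S^C_w(t \mid x)| \to 0, \qquad \sup_{t < \tau,\, w \in \{0,1\}} |\hat S^T_w(t \mid x) - S^T_w(t \mid x)| \to 0,$$

for each $x \in \mathcal{X}$. Furthermore, we assume that for each $x \in \mathcal{X}$, $|\hat e(x) - e(x)| = O_p(b_n)$, $\sup_t |\hat S^T_W(t \mid x) - S^T_W(t \mid x)| = O_p(c_n)$, and $\sup_t |\hat\lambda^C_W(t \mid x) - \lambda^C_W(t \mid x)| = O_p(d_n)$.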

Assumption 9 is quite general. Biau (2012) and Wager and Walther (2015) show that for random forest models, $b_n$ can be faster than $n^{-2/(p+2)}$ as long as the intrinsic signal dimension is less than $0.54p$. As shown in Cui et al. (2017b), $c_n = n^{-1/(p+2)}$ is achievable for survival forest models. Nonparametric kernel smoothing methods such as Sun et al. (2019) provide estimation with $d_n = n^{-1/2+\kappa}$, where $\kappa < 1/2$ depends on the dimension $p$.

Consequently, from a lemma in the Supplementary Material, we have for each $x$, $|\hat m(x) - m(x)| = O_p(c_n)$ and $\sup_t |\hat Q_W(t, x) - Q_W(t, x)| = O_p(c_n)$ for $W = 0, 1$. The following lemma provides an intermediate result for our main theorem.

Lemma 3.1. Assume that Assumptions 6-9 hold and that $kB\max(b_n, c_n, d_n)$ converges to zero, where $k$ is the minimum terminal node size and $B$ is the number of trees fitted in the forest model. Then for any $x \in \mathcal{X}$, we have that $|\hat\theta(x) - \tilde\theta(x)| = O_p\{kB\max(b_n, c_n, d_n)\}$.

The proof of Lemma 3.1 is collected in the Supplementary Material. The technical

results in Wager and Athey (2018); Athey et al. (2019) paired with Lemma 3.1 lead

to the following asymptotic Gaussianity result.

Theorem 3.2. Assume that Assumptions 5-9 hold and that $kB\max(b_n, c_n, d_n)$ converges to zero faster than $\mathrm{polylog}(n/s)^{-1/2}(s/n)^{1/2}$, where $\mathrm{polylog}(n/s)$ is a function that is bounded away from 0 and increases at most polynomially with the log-inverse sampling ratio $\log(n/s)$. Then there exists a sequence $\sigma_n(x)$ such that for any $x \in \mathcal{X}$,

$$\{\hat\theta(x) - \theta(x)\}/\sigma_n(x) \to N(0, 1),$$

where $\sigma_n^2(x) = \mathrm{polylog}(n/s)^{-1} s/n$.

The proof of Theorem 3.2 is deferred to the Appendix. This asymptotic Gaus-

sianity result yields valid asymptotic confidence intervals for the true treatment effect

θ(x).

4 Simulation studies

We perform simulation studies to compare the proposed method with existing alter-

natives, including the Cox proportional hazards model using covariates (X,W,XW ),

random survival forests using covariates (X,W) and (X,W,XW), respectively, and random survival forests fitted separately on each treatment arm W to learn the optimal treatment decision. Note that the last method essentially mimics the virtual twin method

(Foster et al., 2011) applied to observational survival data. There are many existing

implementations of random survival forests, including R packages randomForestSRC

(Ishwaran and Kogalur, 2019), party (Hothorn et al., 2006b), ranger (Wright and

Ziegler, 2017), RLT (Zhu, 2018), etc. However, to streamline our presentation and

highlight the causal approach in observational studies, we only compare with Ish-

waran and Kogalur (2019) to demonstrate the strength of the proposed causal survival

forests among the survival forests designed for randomized experiments.

For each of the simulation settings, the optimal treatment assignment was learned

based on the estimated treatment effects from a training dataset with sample size n =

600. A testing dataset with size 1000 was used to calculate the value function under

the estimated rule. Each simulation was repeated 500 times. Tuning parameters

need to be chosen for forest-based methods. The minimal number of observations in each terminal node was chosen from 15 (Ishwaran and Kogalur, 2019), $\lceil p^{1/2} \rceil$, and $\lceil \log(p) \rceil$, where $\lceil x \rceil$ denotes the least integer greater than or equal to $x$. The number of variables available for splitting at each tree node was chosen from $\lceil p/3 \rceil$, $\lceil 2p/3 \rceil$ and $\lceil p^{1/2} \rceil$. The total number of trees was set to 500.

4.1 Simulation settings

We considered the following four scenarios. For each scenario, we generated covariates independently from a uniform distribution on $[0, 1]^5$. In the first scenario, $T$ was generated from an accelerated failure time model, and $C$ was generated from a Cox model,

$$\log(T) = -1.85 - 0.8\, I(X_{(1)} < 0.5) + 0.7 X_{(2)}^{1/2} + 0.2 X_{(3)} + \bigl(0.7 - 0.4\, I(X_{(1)} < 0.5) - 0.4 X_{(2)}^{1/2}\bigr) W + \varepsilon,$$

$$\lambda_C(t \mid W, X) = \lambda_0(t) \exp\bigl\{-1.75 - 0.5 X_{(2)}^{1/2} + 0.2 X_{(3)} + \bigl(1.15 + 0.5\, I(X_{(1)} < 0.5) - 0.3 X_{(2)}^{1/2}\bigr) W\bigr\},$$

where the baseline hazard function $\lambda_0(t) = 2t$, and $\varepsilon$ followed a standard normal distribution. The follow-up time $\tau = 1.5$, and propensity score $e(x) = (1 + \beta(x_{(1)}; 2, 4))/4$, where $\beta(\cdot; a, b)$ is the density function of the Beta distribution with shape parameters $a$ and $b$.
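The first scenario can be simulated with a few lines of NumPy. The sketch below is one possible reading of the design and makes several assumptions that are not stated in the text: the failure time is truncated at τ before censoring is applied, the censoring indicator is Δ = 1{T ≤ C}, and the censoring time is drawn by inverting the cumulative baseline hazard t². The function name and returned dictionary layout are hypothetical.

```python
import numpy as np

def simulate_scenario1(n, rng, tau=1.5):
    """Draw one dataset from a reading of Scenario 1 (see assumptions in the text)."""
    x = rng.uniform(size=(n, 5))
    x1, x2, x3 = x[:, 0], x[:, 1], x[:, 2]

    # Propensity score: Beta(2, 4) density is 20 * x * (1 - x)^3.
    e = (1.0 + 20.0 * x1 * (1.0 - x1) ** 3) / 4.0
    w = rng.binomial(1, e)

    # Failure time from the accelerated failure time model, truncated at tau.
    ind = (x1 < 0.5).astype(float)
    log_t = (-1.85 - 0.8 * ind + 0.7 * np.sqrt(x2) + 0.2 * x3
             + (0.7 - 0.4 * ind - 0.4 * np.sqrt(x2)) * w
             + rng.normal(size=n))
    t = np.minimum(np.exp(log_t), tau)

    # Censoring time from the Cox model with baseline hazard 2t:
    # cumulative hazard t^2 * exp(eta), so C = sqrt(E / exp(eta)) with E ~ Exp(1).
    eta = (-1.75 - 0.5 * np.sqrt(x2) + 0.2 * x3
           + (1.15 + 0.5 * ind - 0.3 * np.sqrt(x2)) * w)
    c = np.sqrt(rng.exponential(size=n) / np.exp(eta))

    u = np.minimum(t, c)               # observed time
    delta = (t <= c).astype(int)       # 1 if the (truncated) failure time is observed
    return {"x": x, "w": w, "u": u, "delta": delta, "e": e}

data = simulate_scenario1(200, np.random.default_rng(1))
```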

In the second scenario, $T$ was generated from a proportional hazards model with a non-linear structure, and $C$ was generated from an accelerated failure time model,

$$\lambda_T(t \mid W, X) = \lambda_0(t) \exp\{X_{(1)} + (-0.4 + X_{(2)}) W\},$$

$$\log(C) = X_{(1)} - X_{(3)} W + \varepsilon,$$

where the baseline hazard function $\lambda_0(t) = \tfrac{1}{2} t^{-1/2}$, and $\varepsilon$ followed a standard normal distribution. The maximum follow-up time $\tau = 2$, and propensity score $e(x) = (1 + \beta(x_{(2)}; 2, 4))/4$.

In the third scenario, $T$ was generated from a Poisson distribution with mean $X_{(2)}^2 + X_{(3)} + 6 + 2(X_{(1)}^{1/2} - 0.3) W$, and $C$ was generated from a Poisson distribution with mean $12 + \log\{1 + \exp(X_{(3)})\}$. The maximum follow-up time $\tau = 15$, and propensity score $e(x) = (1 + \beta(x_{(1)}; 2, 4))/4$.

In the fourth scenario, $T$ was generated from a Poisson distribution with mean $\max(0, X_{(2)} + X_{(3)}) + \max(0, X_{(1)} - 0.3) W$, and $C$ was generated from a Poisson distribution with mean $1 + \log\{1 + \exp(X_{(3)})\}$. The maximum follow-up time $\tau = 5$, and propensity score $e(x) = [\{1 + \exp(-x_{(1)})\}\{1 + \exp(-x_{(2)})\}]^{-1}$. Note that for subjects with $X_{(1)} < 0.3$, treatment does not affect survival time, and thus both $w = 0$ and $w = 1$ are defined as optimal treatments.

In addition, we evaluated the coverage of the proposed 95% confidence intervals at the points $x_1 = (0.1, \ldots, 0.1)^T$, $x_2 = (0.3, \ldots, 0.3)^T$, $x_3 = (0.5, \ldots, 0.5)^T$, $x_4 = (0.7, \ldots, 0.7)^T$, and $x_5 = (0.9, \ldots, 0.9)^T$ in each of the above four scenarios. The true treatment effect was estimated by the Monte Carlo method with sample size 100000. We used the default tuning parameters: the minimal number of observations in each terminal node was set to $\lceil p^{1/2} \rceil$, and the number of variables available for splitting at each tree node was set to $\lceil 2p/3 \rceil$. In order to obtain valid confidence intervals, we followed the suggestion of Athey et al. (2019) and fit a large number of trees, $B = 100000$. Each simulation was repeated 500 times. The numerical results are summarized in Table 1.

4.2 Simulation results

Figures 1a-1d show the boxplots of correct classification rates in test samples with vir-

tual twins as the reference level. In all figures, “VT” denotes the virtual twin method

with random survival forests; “SRC1” denotes random survival forests using covari-

ates (X,W ); “SRC2” denotes random survival forests using covariates (X,W,XW );

“Cox” denotes Cox proportional hazard model using covariates (X,W,XW ); “CSF”

denotes the proposed causal survival forests.

In Scenarios 1, 3 and 4, the proposed causal survival forest achieves the best

performance among all competing methods. In Scenario 2, because the true model for

the failure time is the Cox model, it is not surprising that the Cox model performs the

best here. Our estimated treatment rule performs better than other random survival

forest approaches. In addition, because the proposed method requires the estimation

of nuisances and plug-in quantities, the standard deviations of the proposed causal

survival forests are slightly larger than other forest-based methods.

Overall, the proposed causal survival forest is superior to ordinary random survival

forests which do not target the causal parameter directly. The reason is that the

proposed forests directly model the causal effect of the treatment on the survival

time. Thus, more splits are placed on the covariates interacting with W rather than on covariates that appear only in the main effects. Furthermore, as shown in Table 1, the proposed confidence intervals have relatively good coverage at most testing points, with coverage ranging from 79.0% to 98.0% across the four settings.

[Figure 1, panels (a) Scenario 1, (b) Scenario 2, (c) Scenario 3, (d) Scenario 4: boxplots of correct classification rates for VT, SRC1, SRC2, Cox, and CSF.]

Figure 1: Correct classification rate of different methods

Table 1: Coverage in percent of the proposed 95% confidence intervals in the four scenarios

Scenario    x1      x2      x3      x4      x5
1           89.2    79.0    83.8    96.0    97.0
2           95.2    96.2    96.6    96.6    98.0
3           82.8    96.8    95.6    92.4    86.6
4           86.2    81.0    97.8    93.0    84.8

5 HIV data analysis

We demonstrate the proposed method by an application to the data from AIDS

Clinical Trials Group Protocol 175 (ACTG175) (Hammer et al., 1996). The original

dataset consists of 2139 HIV-infected subjects. The enrolled subjects were random-

ized to four treatment groups: zidovudine (ZDV) monotherapy, ZDV+didanosine

(ddI), ZDV+zalcitabine, and ddI monotherapy. We focus on the subset of patients

receiving the treatment ZDV+ddI or ddI monotherapy as considered in Lu et al.

(2013). Treatment indicator W = 0 denotes the treatment ddI with 561 subjects,

and W = 1 denotes the treatment ZDV+ddI with 524 subjects. Though ACTG175 is

a randomized study, there seem to be some selection effects in the subsets used here.

For example, among subjects with the race covariate equal to 1, 138 received ZDV+ddI and 173 received ddI. A binomial test with null probability 0.5 gives a p-value of 0.05. For this reason, we analyze the study as an observational rather than a randomized study.
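The balance check above can be reproduced with an exact binomial test. The sketch below assumes a two-sided alternative, which is not stated in the text, and uses only the standard library; the helper name `binom_two_sided_p` is hypothetical.

```python
from math import comb

def binom_two_sided_p(k, n):
    """Exact two-sided p-value for H0: success probability 0.5.

    The null distribution is symmetric, so the two-sided p-value is twice
    the smaller tail probability, capped at 1.
    """
    lo = min(k, n - k)
    tail = sum(comb(n, j) for j in range(lo + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# 138 of the 311 subjects with race equal to 1 received ZDV+ddI.
p_value = binom_two_sided_p(138, 138 + 173)
```

The resulting value can be compared with the rounded p-value reported above.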

Here we are interested in the causal effect of ZDV+ddI relative to ddI on the survival time of HIV-infected patients. Twelve selected baseline covariates were studied in Tsiatis

et al. (2008); Zhang et al. (2008); Lu et al. (2013); Fan et al. (2017). There are 5

continuous covariates: age (year), weight (kg), Karnofsky score (scale of 0-100), CD4

count (cells/mm3) at baseline, CD8 count (cells/mm3) at baseline. There are 7 binary

variables: gender (male = 1, female = 0), homosexual activity (yes = 1, no = 0), race

(non-white = 1, white = 0), symptomatic status (symptomatic = 1, asymptomatic =

0), history of intravenous drug use (yes = 1, no = 0), and hemophilia (yes = 1, no =

0). As the outcome considered here is the survival time, we also include CD4 count (cells/mm3) at 20 ± 5 weeks and CD8 count (cells/mm3) at 20 ± 5 weeks as covariates, in addition to the 12 covariates above.

We applied the proposed causal survival forest to this dataset. We used the default tuning parameters: the minimal number of observations in each terminal node was set to $\lceil p^{1/2} \rceil$, and the number of variables available for splitting at each tree node was set to $\lceil 2p/3 \rceil$. We fit a large number of trees, $B = 100000$. The point estimates and 95% confidence intervals from the causal survival forest are presented in Figure 2. The confidence intervals are wide, possibly due to the small sample size.

According to the estimated optimal treatment rule obtained from the heterogeneous effect estimation, for the subgroup of patients with age less than 34 (median), 349 patients should be assigned to treatment ddI, while 205 patients should be assigned to treatment ZDV+ddI; for the subgroup of patients with age larger than 34, 272 should be assigned to treatment ZDV+ddI, while 257 patients should be assigned to treatment ddI. In addition, we varied the patients' age while holding the other covariates at their median values. The estimated treatment effects are all negative when age is less than or equal to 48, while the estimated effects are positive when age is larger than 48. The results suggest that ZDV+ddI is more favorable for older HIV-infected patients. A similar finding was also observed in Lu et al. (2013) and Fan et al. (2017). We note, however, that the confidence intervals for the pointwise effect as reported in Figure 2 are wide, and so we ought not over-interpret the shape of the fitted heterogeneity $\hat\theta(x)$.

[Figure 2, horizontal axis: Age (ticks at 20, 40, 60).]

Figure 2: The point estimation (solid line) and 95% confidence intervals (dashed line) from the proposed causal survival forest

Acknowledgement

We thank Julie Tibshirani for helpful conversations and suggestions.

Appendix

A Proof of Theorem 3.2

Proof. Given the set of forest weights $\alpha_i(x)$ used to define the generalized random forest estimate $\tilde\theta(x)$ with the unknown true nuisance parameters, we have the following linear approximation

$$\tilde\theta^*(x) = \theta(x) + \sum_{i=1}^{n} \alpha_i(x)\,\rho_i^*(x),$$

where $\rho_i^*(x)$ denotes the influence function of the $i$-th observation with respect to the true parameter value $\theta(x)$, and $\tilde\theta^*(x)$ is a pseudo-forest output with weights $\alpha_i(x)$ and outcomes $\theta(x) + \rho_i^*(x)$.

Note that Assumptions 2-6 in Athey et al. (2019) hold immediately from the definition of the estimating equation $\psi_{\theta(x)}$. In particular, $\psi_{\theta(x)}$ is Lipschitz continuous in $\theta(x)$, as required by their Assumption 4, and a solution of $\sum_{i=1}^{n} \alpha_i \psi^{i}_{\theta(x)} = 0$ always exists, as required by their Assumption 5. By the results shown in Wager and Athey (2018), there exists a sequence $\sigma_n(x)$ for which

$$\{\tilde\theta^*(x) - \theta(x)\}/\sigma_n(x) \to N(0, 1),$$

where $\sigma_n^2(x) = \mathrm{polylog}(n/s)^{-1}\, s/n$ and $\mathrm{polylog}(n/s)$ is a function that is bounded away from 0 and increases at most polynomially with the log-inverse sampling ratio $\log(n/s)$.

Furthermore, by Lemma 4 in Athey et al. (2019),

$$(n/s)^{1/2}\{\tilde\theta(x) - \tilde\theta^*(x)\} = O_p\left(\max\left\{ s^{-\frac{\pi \log((1-w)^{-1})}{2\log(w^{-1})}},\ (s/n)^{1/6} \right\}\right).$$

Following Lemma 3.1, as long as $O_p\{kB\max(b_n, c_n, d_n)\}$ converges to zero faster than $\mathrm{polylog}(n/s)^{-1/2}(s/n)^{1/2}$, we have

$$\{\hat\theta(x) - \theta(x)\}/\sigma_n(x) \to N(0, 1).$$
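The last step of the proof can be unpacked as a three-term decomposition. The sketch below is not taken from the paper. It assumes that $\hat\theta(x)$ is the forest estimate with estimated nuisance parameters, that $\tilde\theta(x)$ is its counterpart with the true nuisance parameters, that $\tilde\theta^*(x)$ is the pseudo-forest output defined at the start of the proof, and that Lemma 3.1 bounds $\hat\theta(x) - \tilde\theta(x)$ by $O_p\{kB\max(b_n, c_n, d_n)\}$. The accents are a notational assumption made here for readability.

```latex
\begin{align*}
\frac{\hat\theta(x) - \theta(x)}{\sigma_n(x)}
  &= \underbrace{\frac{\tilde\theta^*(x) - \theta(x)}{\sigma_n(x)}}_{\to\, N(0,1)}
   + \underbrace{\frac{\tilde\theta(x) - \tilde\theta^*(x)}{\sigma_n(x)}}_{(\mathrm{I})}
   + \underbrace{\frac{\hat\theta(x) - \tilde\theta(x)}{\sigma_n(x)}}_{(\mathrm{II})}, \\[4pt]
(\mathrm{I})
  &= \mathrm{polylog}(n/s)^{1/2}\,(n/s)^{1/2}
     \{\tilde\theta(x) - \tilde\theta^*(x)\}, \\[4pt]
(\mathrm{II})
  &= \frac{O_p\{kB\max(b_n, c_n, d_n)\}}
          {\mathrm{polylog}(n/s)^{-1/2}\,(s/n)^{1/2}}
   = o_p(1).
\end{align*}
```

Term (I) is $o_p(1)$ whenever the Lemma 4 bound vanishes faster than $\mathrm{polylog}(n/s)^{-1/2}$, and term (II) is $o_p(1)$ under the rate condition stated in the proof. Slutsky's theorem then carries the normal limit of the first term over to the full expression.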

References

Arlot, S. and Genuer, R. (2014), “Analysis of purely random forests bias,” arXiv

preprint arXiv:1407.3939.

Athey, S., Tibshirani, J., and Wager, S. (2019), “Generalized Random Forests,” The

Annals of Statistics, 47(2).

Athey, S. and Wager, S. (2019), “Estimating Treatment Effects with Causal Forests:

An Application,” Observational Studies, 5, 36–51.

Biau, G. (2012), “Analysis of a random forests model,” Journal of Machine Learning

Research, 13, 1063–1095.

Biau, G., Devroye, L., and Lugosi, G. (2008), “Consistency of random forests and

other averaging classifiers,” Journal of Machine Learning Research, 9, 2015–2033.

Breiman, L. (2001), “Random forests,” Machine learning, 45, 5–32.

Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. A. (1984), Classification and

regression trees, CRC press.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey,

W., and Robins, J. (2018), “Double/debiased machine learning for treatment and

structural parameters,” The Econometrics Journal, 21, 1–68.

Ciampi, A., Thiffault, J., Nakache, J.-P., and Asselain, B. (1986), “Stratification by

stepwise regression, correspondence analysis and recursive partition: a comparison

of three methods of analysis for survival data with covariates,” Computational

Statistics & Data Analysis, 4, 185 – 204.

Cui, Y., Zhu, R., and Kosorok, M. (2017a), “Tree based weighted learning for esti-

mating individualized treatment rules with censored data,” Electronic Journal of

Statistics, 11, 3927–3953.

Cui, Y., Zhu, R., Zhou, M., and Kosorok, M. R. (2017b), “Consistency of survival

tree and forest models: splitting bias and correction,” arXiv:1707.09631.

Fan, C., Lu, W., Song, R., and Zhou, Y. (2017), “Concordance-assisted learning

for estimating optimal individualized treatment regimes,” Journal of the Royal

Statistical Society: Series B (Statistical Methodology), 79, 1565–1582.

Fleming, T. R. and Harrington, D. P. (2011), Counting processes and survival analy-

sis, vol. 169, John Wiley & Sons.

Foster, J. C., Taylor, J. M., and Ruberg, S. J. (2011), “Subgroup identification from

randomized clinical trial data,” Statistics in medicine, 30, 2867–2880.

Friedberg, R., Tibshirani, J., Athey, S., and Wager, S. (2018), “Local linear forests,”

arXiv preprint arXiv:1807.11408.

Hammer, S. M., Katzenstein, D. A., Hughes, M. D., Gundacker, H., Schooley, R. T.,

Haubrich, R. H., Henry, W. K., Lederman, M. M., Phair, J. P., Niu, M., et al.

(1996), “A trial comparing nucleoside monotherapy with combination therapy in

HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter,”

New England Journal of Medicine, 335, 1081–1090.

Hothorn, T., Bühlmann, P., Dudoit, S., Molinaro, A., and Van Der Laan, M. J.

(2006a), “Survival ensembles,” Biostatistics, 7, 355–373.

Hothorn, T., Hornik, K., and Zeileis, A. (2006b), “Unbiased Recursive Partitioning: A

Conditional Inference Framework,” Journal of Computational and Graphical Statis-

tics, 15, 651–674.

Hothorn, T., Lausen, B., Benner, A., and Radespiel-Tröger, M. (2004), “Bagging

survival trees,” Statistics in medicine, 23, 77–91.

Ishwaran, H. and Kogalur, U. (2019), Random Forests for Survival, Regression, and

Classification (RF-SRC), R package version 2.8.0.

Ishwaran, H., Kogalur, U. B., Blackstone, E. H., and Lauer, M. S. (2008), “Random

survival forests,” The annals of applied statistics, 841–860.

Künzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. (2019), “Metalearners for

estimating heterogeneous treatment effects using machine learning,” Proceedings of

the National Academy of Sciences, 116, 4156–4165.

Laber, E. and Zhao, Y. (2015), “Tree-based methods for individualized treatment

regimes,” Biometrika, 102, 501–514.

Leblanc, M. and Crowley, J. (1993), “Survival Trees by Goodness of Split,” Journal

of the American Statistical Association, 88, 457–467.

Lin, Y. and Jeon, Y. (2006), “Random forests and adaptive nearest neighbors,” Jour-

nal of the American Statistical Association, 101, 578–590.

Lu, M., Sadiq, S., Feaster, D., and Ishwaran, H. (2018), “Estimating Individual

Treatment Effect in Observational Data Using Random Forest Methods,” Journal

of Computational and Graphical Statistics, 1–11.

Lu, W., Zhang, H. H., and Zeng, D. (2013), “Variable selection for optimal treatment

decision,” Statistical methods in medical research, 22, 493–504.

Meinshausen, N. (2006), “Quantile regression forests,” Journal of Machine Learning

Research, 7, 983–999.

Mentch, L. and Hooker, G. (2014), “Ensemble trees and CLTs: Statistical inference for

supervised learning,” stat, 1050, 25.

— (2016), “Quantifying Uncertainty in Random Forests via Confidence Intervals and

Hypothesis Tests,” Journal of Machine Learning Research, 17, 1–41.

Murphy, S. A. (2003), “Optimal dynamic treatment regimes,” Journal of the Royal

Statistical Society: Series B (Statistical Methodology), 65, 331–355.

Nie, X. and Wager, S. (2017), “Quasi-Oracle Estimation of Heterogeneous Treatment

Effects,” arXiv:1712.04912.

Oprescu, M., Syrgkanis, V., and Wu, Z. S. (2019), “Orthogonal Random Forest for

Causal Inference,” in International Conference on Machine Learning, pp. 4932–

Qian, M. and Murphy, S. A. (2011), “Performance guarantees for individualized treat-

ment rules,” Annals of statistics, 39, 1180.

Robins, J. M., Li, L., Mukherjee, R., Tchetgen, E. T., van der Vaart, A., et al. (2017),

“Minimax estimation of a functional on a structured high-dimensional model,” The

Annals of Statistics, 45, 1951–1987.

Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994), “Estimation of regression

coefficients when some regressors are not always observed,” Journal of the American

Statistical Association, 89, 846–866.

Robinson, P. M. (1988), “Root-N-Consistent Semiparametric Regression,” Economet-

rica, 56, 931–954.

Schick, A. (1986), “On asymptotically efficient estimation in semiparametric models,”

The Annals of Statistics, 14, 1139–1151.

Segal, M. R. (1988), “Regression Trees for Censored Data,” Biometrics, 44, 35–47.

Sexton, J. and Laake, P. (2009), “Standard errors for bagged and random forest

estimators,” Computational Statistics & Data Analysis, 53, 801–811.

Steingrimsson, J. A., Diao, L., Molinaro, A. M., and Strawderman, R. L. (2016),

“Doubly robust survival trees,” Statistics in medicine, 35, 3595–3612.

Steingrimsson, J. A., Diao, L., and Strawderman, R. L. (2019), “Censoring Unbiased

Regression Trees and Ensembles,” Journal of the American Statistical Association,

114, 370–383.

Sun, Q., Zhu, R., Wang, T., and Zeng, D. (2019), “Counting process-based dimension

reduction methods for censored outcomes,” Biometrika, 106, 181–196.

Tsiatis, A. (2007), Semiparametric Theory and Missing Data, Springer Series in

Statistics, Springer New York.

Tsiatis, A. A., Davidian, M., Zhang, M., and Lu, X. (2008), “Covariate adjustment

for two-sample treatment comparisons in randomized clinical trials: a principled

yet flexible approach,” Statistics in medicine, 27, 4658–4677.

Wager, S. and Athey, S. (2018), “Estimation and inference of heterogeneous treatment

effects using random forests,” Journal of the American Statistical Association, 113,

1228–1242.

Wager, S., Hastie, T., and Efron, B. (2014), “Confidence Intervals for Random Forests:

The Jackknife and the Infinitesimal Jackknife,” Journal of Machine Learning Re-

search, 15, 1625–1651.

Wager, S. and Walther, G. (2015), “Adaptive Concentration of Regression Trees, with

Application to Random Forests,” arXiv preprint arXiv:1503.06388.

Wright, M. N. and Ziegler, A. (2017), “ranger: A Fast Implementation of Random

Forests for High Dimensional Data in C++ and R,” Journal of Statistical Software,

77, 1–17.

Zhang, B., Tsiatis, A. A., Laber, E. B., and Davidian, M. (2012), “A robust method

for estimating optimal treatment regimes,” Biometrics, 68, 1010–1018.

Zhang, M., Tsiatis, A. A., and Davidian, M. (2008), “Improving efficiency of in-

ferences in randomized clinical trials using auxiliary covariates,” Biometrics, 64,

707–715.

Zhao, Y., Zeng, D., Rush, A. J., and Kosorok, M. R. (2012), “Estimating individu-

alized treatment rules using outcome weighted learning,” Journal of the American

Statistical Association, 107, 1106–1118.

Zhao, Y.-Q., Zeng, D., Laber, E. B., Song, R., Yuan, M., and Kosorok, M. R. (2015),

“Doubly robust learning for estimating individualized treatment with censored

data,” Biometrika, 102, 151–168.

Zhu, R. (2018), Reinforcement Learning Trees, R package version 3.2.2.

Zhu, R. and Kosorok, M. R. (2012), “Recursively imputed survival trees,” Journal of

the American Statistical Association, 107, 331–340.

Zhu, R., Zhao, Y.-Q., Chen, G., Ma, S., and Zhao, H. (2017), “Greedy outcome

weighted tree learning of optimal personalized treatment rules,” Biometrics, 73,

391–400.
