Review Article Entering the Era of Data Science:...

Review ArticleEntering the Era of Data Science Targeted Learning and theIntegration of Statistics and Computational Data Analysis

Mark J van der Laan1 and Richard J C M Starmans2

1 University of California Berkeley 108 Haviland Hall Berkeley CA 94720-7360 USA2Department of Computer Science Utrecht University The Netherlands

Correspondence should be addressed to Mark J van der Laan laanberkeleyedu

Received 16 February 2014 Revised 9 July 2014 Accepted 10 July 2014 Published 10 September 2014

Academic Editor Chin-Shang Li

Copyright copy 2014 M J van der Laan and R J C M Starmans This is an open access article distributed under the CreativeCommons Attribution License which permits unrestricted use distribution and reproduction in any medium provided theoriginal work is properly cited

This outlook paper reviews the research of van der Laanrsquos group on Targeted Learning a subfield of statistics that is concernedwith the construction of data adaptive estimators of user-supplied target parameters of the probability distribution of the dataand corresponding confidence intervals aiming at only relying on realistic statistical assumptions Targeted Learning fully utilizesthe state of the art in machine learning tools while still preserving the important identity of statistics as a field that is concernedwith both accurate estimation of the true target parameter value and assessment of uncertainty in order to make sound statisticalconclusions We also provide a philosophical historical perspective on Targeted Learning also relating it to the new developmentsin Big Data We conclude with some remarks explaining the immediate relevance of Targeted Learning to the current Big Datamovement

1 Introduction

In Section 2 we start out with reviewing some basic statisticalconcepts such as data probability distribution statisticalmodel and target parameter allowing us to define the fieldTargeted Learning a subfield of statistics that develops dataadaptive estimators of user supplied target parameters ofdata distributions based on high dimensional data underrealistic assumptions (eg incorporating the state of the art inmachine learning) while preserving statistical inferenceThisalso allows us to clarify how Targeted Learning distinguishesfrom typical current practice in data analysis that relies onunrealistic assumptions and describe the key ingredients oftargeted minimum loss based estimation (TMLE) a generaltool to achieve the goals set out by Targeted Learninga substitution estimator construction of initial estimatorthrough super-learning targeting of the initial estimator toachieve asymptotic linearity with known influence curveby solving the efficient influence curve estimating equa-tion and statistical inference in terms of a normal limitingdistribution

Targeted Learning resurrects the pillars of statistics suchas the facts that a model represents actual knowledge aboutthe data generating experiment and that a target parameterrepresents the feature of the data generating distributionwe want to learn from the data In this manner TargetedLearning defines a truth and sets a scientific standard for esti-mation procedures while current practice typically defines aparameter as a coefficient in a misspecified parametric model(eg logistic linear regression repeatedmeasures generalizedlinear regression) or small unrealistic semi parametric regres-sion models (eg Cox proportional hazards regression)where different choices of such misspecified models yielddifferent answers This lack of truth in current practicesupported by statements such as ldquoAll models are wrong butsome are usefulrdquo allows a user to make arbitrary choiceseven though these choices result in different answers to thesame estimation problem In fact this lack of truth in currentpractice presents a fundamental drive behind the epidemicof false positives and lack of power to detect true positivesour field is suffering from In addition this lack of truthmakes many of us question the scientific integrity of the field

Hindawi Publishing CorporationAdvances in StatisticsVolume 2014 Article ID 502678 19 pageshttpdxdoiorg1011552014502678

2 Advances in Statistics

we call statistics and makes it impossible to teach statisticsas a scientific discipline even though the foundations ofstatistics including a very rich theory are purely scientificThat is our field has suffered from a disconnect betweenthe theory of statistics and the practice of statistics whilepractice should be driven by relevant theory and theoreticaldevelopments should be driven by practice For example atheorem establishing consistency and asymptotic normalityof a maximum likelihood estimator for a parametric modelthat is known to be misspecified is not a relevant theoremfor practice since the true data generating distribution is notcaptured by this theorem

Defining the statistical model to actually contain the trueprobability distribution has enormous implications for thedevelopment of valid estimators For example maximumlikelihood estimators are now ill defined due to the curse ofdimensionality of the model In addition even regularizedmaximum likelihood estimators are seriously flawed a gen-eral problem with maximum likelihood based estimators isthat the maximum likelihood criterion only cares about howwell the density estimator fits the true density resulting ina wrong trade-off for the actual target parmaeter of interestFrom a practical perspective whenwe useAIC BIC or cross-validated log-likelihood to select variables in our regressionmodel then that procedure is ignorant of the specific featureof the data distribution we want to estimate That is in largestatistical models it is immediately apparent that estimatorsneed to be targeted towards their goal just like a human beinglearns the answer to a specific question in a targeted mannerand maximum likelihood based estimators fail to do that

In Section 3we review the roadmap for Targeted Learningof a causal quantity involving defining a causal model andcausal quantity of interest establishing an estimand of thedata distribution that equals the desired causal quantity underadditional causal assumptions applying the pure statisticalTargeted Learning of the relevant estimandbased on a statisti-cal model compatible with the causal model but for sure con-taining the true data distribution and careful interpretationof the results In Section 4 we proceed with describing ourproposed targeted minimum loss-based estimation (TMLE)template which represents a concrete template for construc-tion of targeted efficient substitution estimators which arenot only asymptotically consistent asymptotically normallydistributed and asymptotically efficient but also tailoredto have robust finite sample performance Subsequently inSection 5 we review some of our most important advancesin Targeted Learning demonstrating the remarkable powerand flexibility of this TMLE methodology and in Section 6we describe future challenges and areas of research InSection 7 we provide a historical philosophical perspectiveof Targeted Learning Finally in Section 8 we conclude withsome remarks puttingTargeted Learning in the context of themodern era of Big Data

We refer to our papers and book on Targeted Learningfor overviews of relevant parts of the literature that put ourspecific contributions within the field of Targeted Learningin the context of the current literature thereby allowing usto focus on Targeted Learning itself in the current outlookpaper

2 Targeted Learning

Our research takes place in a subfield of statistics we namedTargeted Learning [1 2] In statistics the data (119874

1 119874

119899)

on 119899 units is viewed as a realization of a random variableor equivalently an outcome of a particular experiment andthereby has a probability distribution 119875119899

0 often called the

data distribution For example one might observe 119874119894=

(119882119894 119860119894 119884119894) on a subject 119894 where 119882

119894are baseline character-

istics of the subject 119860119894is a binary treatment or exposure the

subject received and119884119894is a binary outcome of interest such as

an indicator of death 119894 = 1 119899 Throughout this paper wewill use this data structure to demonstrate the concepts andestimation procedures

21 Statistical Model A statistical model M119899 is defined as aset of possible probability distributions for the data distribu-tion and thus represents the available statistical knowledgeabout the true data distribution 119875119899

0 In Targeted Learning

this core-definition of the statistical model is fully respectedin the sense that one should define the statistical model tocontain the true data distribution 119875119899

0isin M119899 So contrary to

the often conveniently used slogan ldquoAllmodels are wrong butsome are usefulrdquo and erosion over time of the original truemeaning of a statistical model throughout applied researchTargeted Learning defines the model for what it actually is[3] If there is truly no statistical knowledge available thenthe statistical model is defined as all data distributions Apossible statistical model is the model that assumes that(1198741 119874

119899) are 119899 independent and identically distributed

random variables with completely unknown probability dis-tribution 119875

0 representing the case that the sampling of the

data involved repeating the same experiment independentlyIn our example this would mean that we assume that(119882119894 119860119894 119884119894) are independent with a completely unspecified

common probability distribution For example if 119882 is 10-dimensional while 119860 and 119884 are two-dimensional then 119875

0

is described by a 12-dimensional density and this statisticalmodel does not put any restrictions on this 12-dimensionaldensity One could factorize this density of (119882119860 119884) asfollows

1199010(119882119860 119884) = 119901

1198820(119882) 119901

119860|1198820(119860 | 119882) 119901

119884|1198601198820(119884 | 119860119882)

(1)

where 1199011198820

is the density of the marginal distribution of119882119901119860|1198820

is the conditional density of 119860 given119882 and 119901119884|1198601198820

is the conditional density of 119884 given 119860 119882 In this modeleach of these factors is unrestricted On the other handsuppose now that the data is generated by a randomizedcontrolled trial in which we randomly assign treatment 119860 isin0 1 with probability 05 to a subject In that case theconditional density of119860 given119882 is known but themarginaldistribution of the covariates and the conditional distributionof the outcome given covariates and treatment might still beunrestricted Even in an observational study onemight knowthat treatment decisions were only based on a small subset ofthe available covariates119882 so that it is known that 119901

119860|1198820(1 |

119882) only depends on119882 through these few covariates In the

Advances in Statistics 3

case that death 119884 = 1 represents a rare event it might alsobe known that the probability of death 119875

119884|119860119882(1 | 119860119882) is

known to be between 0 and some small number (eg 003)This restriction should then be included in the modelM

In various applications careful understanding of theexperiment that generated the data might show that eventhese rather large statistical models assuming the data gen-erating experiment equals the independent repetition of acommon experiment are too small to be true see [4ndash8]for models in which (119874

1 119874

119899) is a joint random variable

described by a single experiment which nonetheless involvesa variety of conditional independence assumptions That isthe typical statement that 119874

1 119874

119899are independent and

identically distributed (iid)might already represent awrongstatistical model For example in a community randomizedtrial it is often the case that the treatments are assigned bythe following type of algorithm based on the characteristics(1198821 119882

119899) one first applies an algorithm that aims to split

the 119899 communities in 1198992 pairs that are similar with respect tobaseline characteristics subsequently one randomly assignstreatment and control to each pair Clearly even when thecommunities would have been randomly sampled from atarget population of communities the treatment assignmentmechanism creates dependence so that the data generatingexperiment cannot be described as an independent repetitionof experiments see [7] for a detailed presentation

In a study in which one observes a single community of 119899interconnected individuals one might have that the outcome119884119894for subject 119894 is not only affected by the subjectrsquos past

(119882119894 119860119894) but also affected by the covariate and treatment

of friends of subject 119894 Knowing the friends of each sub-ject 119894 would now impose strong conditional independenceassumptions on the density of the data (119874

1 119874

119899) but one

cannot assume that the data is a result of 119899 independentexperiments in fact as in the community randomized trialexample such data sets have sample size 1 since the data canonly be described as the result of a single experiment [8]

In group sequential randomized trials one often mayuse a randomization probability for a next recruited 119894thsubject that depends on the observed data of the previouslyrecruited and observed subjects 119874

1 119874

119894minus1 which makes

the treatment assignment 119860119894a function of 119874

1 119874

119894minus1

Even when the subjects are sampled randomly from a targetpopulation this type of dependence between treatment 119860

119894

and the past data1198741 119874

119894minus1implies that the data is the result

of a single large experiment (again the sample size equals 1)[4ndash6]

Indeed many realistic statistical models only involveindependence and conditional independence assumptionsand known bounds (eg it is known that the observedclinical outcome is bounded between [0 1] or the conditionalprobability of death is bounded between 0 and a smallnumber) Either way if the data distribution is described bya sequence of independent (and possibly identical) exper-iments or by a single experiment satisfying a variety ofconditional independence restrictions parametric modelsthough representing common practice are practically alwaysinvalid statisticalmodels since such knowledge about the datadistribution is essentially never available

An important by-product of requiring that the statisticalmodel needs to be truthful is that one is forced to obtain asmuch knowledge about the experiment before committing toa model which is precisely the role a good statistician shouldplay On the other hand if one commits to a parametricmodel then why would one still bother trying to find out thetruth about the data generating experiment

22 Target Parameter The target parameter is defined as amappingΨ M119899 rarr R119889 that maps the data distribution intothe desired finite dimensional feature of the data distributionone wants to learn from the data 120595119899

0= Ψ(119875

119899

0) This choice of

target parameter requires careful thought independent fromthe choice of statistical model and is not a choice made outof convenience The use of parametric or semiparametricmodels such as the Cox-proportional hazards model is oftenaccompanied with the implicit statement that the unknowncoefficients represent the parameter of interest Even inthe unrealistic scenario that these small statistical modelswould be true there is absolutely no reason why the veryparametrization of the data distribution should correspondwith the target parameter of interest Instead the statisticalmodel M119899 and the choice of target parameter Ψ M119899 rarr

R119889 are two completely separate choices and by no meansone should imply the other That is the statistical knowledgeabout the experiment that generated the data and definingwhat we hope to learn from the data are two importantkey steps in science that should not be convoluted The truetarget parameter value 120595119899

0is obtained by applying the target

parameter mapping Ψ to the true data distribution 1198751198990and

represents the estimand of interestFor example if 119874

119894= (119882

119894 119860119894 119884119894) are independent and

have common probability distribution 1198750 then one might

define the target parameter as an average of the conditional119882-specific treatment effects

1205950= Ψ (119875

0)

= 11986411987501198641198750(119884 | 119860 = 1119882) minus 119864

1198750(119884 | 119860 = 0119882)

(2)

By using that 119884 is binary this can also be written as follows

1205950= int119908

119875119884|1198601198820

(1 | 119860 = 1119882 = 119908)

minus119875119884|1198601198820

(1 | 119860 = 0119882 = 119908) 1198751198820(119889119908)

(3)

where 119875119884|1198601198820

(1 | 119860 = 119886119882 = 119908) denotes the true con-ditional probability of death given treatment 119860 = 119886 andcovariate119882 = 119908

For example suppose that the true conditional probabil-ity of death is given by some logistic function

119875119884|119860119882

(1 | 119860119882) =1

1 + exp (minus1198980(119860119882))

(4)

for some function1198980of treatments119860 and119882 The reader can

plug in a possible form for 1198980such as 119898

0(119886 119908) = 03119886 +


021199081+01119908

11199082+119886119908111990821199083 Given this function119898

0 the true

value 1205950is computed by the above formula as follows

1205950= int119908

(1

1 + exp (minus1198980(1 119908))

minus1

1 + exp (minus1198980(0 119908))

)

times 1198751198820(119889119908)

(5)

This parameter 1205950has a clear statistical interpretaion as

the average of all the 119908-specific additive treatment effects119875119884|1198601198820

(1 | 119860 = 1119882 = 119908) minus 119875119884|1198601198820

(1 | 119860 = 0119882 = 119908)

23 The Important Role of Models Also Involving NontestableAssumptions However this particular statistical estimand120595

0

has an even richer interpretation if one is willing to makeadditional so called causal (nontestable) assumptions Letus assume that 119882 119860 119884 are generated by a set of so calledstructural equations

119882 = 119891119882(119880119882)

119860 = 119891119860(119882119880

119860)

119884 = 119891119884(119882119860119880

119884)

(6)

where 119880 = (119880119882 119880119860 119880119884) are random inputs following a par-

ticular unknown probability distribution while the functions119891119882 119891119860 119891119884

deterministically map the realization of therandom input 119880 = 119906 sequentially into a realization of119882 =

119891119882(119906119882) 119860 = 119891

119860(119882 119906119860) 119884 = 119891

119884(119882119860 119906

119910) One might

not make any assumptions about the form of these functions119891119882 119891119860 119891119884 In that case these causal assumptions put no

restrictions on the probability distribution of (119882119860 119884) butthrough these assumptions we have parametrized 119875

0by a

choice of functions (119891119882 119891119860 119891119884) and a choice of distribution

of119880 Pearl [9] refers to such assumptions as a structural causalmodel for the distribution of (119882119860 119884)

This structural causal model allows one to define acorresponding postintervention probability distribution thatcorresponds with replacing 119860 = 119891

119860(119882119880

119860) by our desired

intervention on the intervention node119860 For example a staticintervention 119860 = 1 results in a new system of equations119882 = 119891

119882(119880119882) 119860 = 1 119884

1= 119891119884(119882 1 119880

119884) where this new

random variable 1198841is called a counterfactual outcome or

potential outcome corresponding with intervention 119860 = 1Similarly one can define 119884

0= 119891119884(119882 0 119880

119884) Thus 119884

0(1198841)

represent the outcome on the subject one would have seen ifthe subject would have been assigned treatment 119860 = 0 (119860 =1) One might now define the causal effect of interest as11986401198841minus 11986401198840 that is the difference between the expected

outcome of 1198841and the expected outcome of 119884

0 If one also

assumes that119860 is independent of119880119884 given119882 which is often

referred to as the assumption of no unmeasured confoundingor the randomization assumption then it follows that 120595

0=

11986401198841minus 11986401198840 That is under the structural causal model

including this no unmeasured confounding assumption 1205950

can not only be interpreted purely statistically as an averageof conditional treatment effects but it actually equals themarginal additive causal effect

In general causal models or more generally sets ofnontestable assumptions can be used to define underlyingtarget quantities of interest and corresponding statisticaltarget parameters that equal this target quantity under theseassumptions Well known classes of such models are modelsfor censored data in which the observed data is representedas a many to one mapping on the full data of interest andcensoring variable and the target quantity is a parameter ofthe full data distribution Similarly causal inference modelsrepresent the observed data as a mapping on counterfactualsand the observed treatment (either explicitly as in theNeyman-Rubin model or implicitly as in the Pearl structuralcausal models) and one defines the target quantity as aparameter of the distribution of the counterfactuals One isnow often concerned with providing sets of assumptions onthe underlying distribution (ie of the full-data) that allowidentifiability of the target quantity from the observed datadistribution (eg coarsening at random or randomizationassumption) These nontestable assumptions do not changethe statistical modelM and as a consequence once one hasdefined the relevant estimand120595

0 do not affect the estimation

problem either

24 Estimation Problem The estimation problem is definedby the statistical model (ie (119874

1 119874

119899) sim 119875119899

0isin M119899) and

choice of target parameter (ie Ψ M119899 rarr R) TargetedLearning is now the field concerned with the developmentof estimators of the target parameter that are asymptot-ically consistent as the number of units 119899 converges toinfinity and whose appropriately standardized version (egradic119899(120595119899minus 1205951198990)) converges in probability distribution to some

limit probability distribution (eg normal distribution) sothat one can construct confidence intervals that for largeenough sample size 119899contain with a user supplied highprobability the true value of the target parameter In the casethat 119874

1 119874

119899simiid1198750 a common method for establishing

asymptotic normality of an estimator is to demonstratethat the estimator minus truth can be approximated by anempirical mean of a function of 119874

119894 Such an estimator is

called asymptotically linear at 1198750 Formally an estimator

120595119899is asymptotically linear under iid sampling from 119875

0if

120595119899minus 1205950= (1119899)sum

119899

119894=1IC(1198750)(119874119894) + 119900119875(1radic119899) where 119874 rarr

IC(1198750)(119874) is the so called influence curve at 119875

0 In that

case the central limit theorem teaches us that radic119899(120595119899minus 1205950)

converges to a normal distribution 119873(0 1205902) with variance1205902 = 119864

1198750IC(1198750)(119874)2 defined as the variance of the influence

curve An asymptotic 095 confidence interval for 1205950is then

given by 120595119899plusmn 196120590

119899radic119899 where 1205902

119899is the sample variance of

an estimate IC119899(119874119894) of the true influence curve IC(119875

0)(119874119894)

119894 = 1 119899The empirical mean of the influence curve IC(119875

0) of an

estimator 120595119899represents the first order linear approximation

of the estimator as a functional of the empirical distributionand the derivation of the influence curve is a by-productof the application of the so called functional delta-methodfor statistical inference based on functionals of the empiricaldistribution [10ndash12] That is the influence curve IC(119875

0)(119874)

of an estimator viewed as a mapping from the empirical


distribution 119875119899into the estimated value Ψ(119875

119899) is defined

as the directional derivative at 1198750in the direction (119875

119899=1minus

1198750) where 119875

119899=1is the empirical distribution at a single

observation 119874

25 Targeted Learning Respects Both Local and Global Con-straints of the Statistical Model Targeted Learning is not justsatisfied with asymptotic performance such as asymptoticefficiency Asymptotic efficiency requires fully respectingthe local statistical constraints for shrinking neighborhoodsaround the true data distribution implied by the statisticalmodel defined by the so called tangent space generated byall scores of parametric submodels through 119875119899

0[13] but it

does not require respecting the global constraints on the datadistribution implied by the statistical model (eg see [14])Instead Targeted Learning pursues the development of suchasymptotically efficient estimators that also have excellentand robust practical performance by also fully respectingthe global constraints of the statistical model In additionTargeted Learning is also concerned with the developmentof confidence intervals with good practical coverage For thatpurpose our proposed methodology for Targeted Learningso called targeted minimum loss based estimation discussedbelow does not only result in asymptotically efficient esti-mators but the estimators (1) utilize unified cross-validationto make practically sound choices for estimator constructionthat actually work well with the very data set at hand [15ndash19] (2) focus on the construction of substitution estimatorsthat by definition also fully respect the global constraintsof the statistical model and (3) use influence curve theoryto construct targeted computer friendly estimators of theasymptotic distribution such as the normal limit distributionbased on an estimator of the asymptotic variance of theestimator

Let us succinctly review the immediate relevance toTargeted Learning of the above mentioned basic conceptsinfluence curve efficient influence curve substitution esti-mator cross-validation and super-learning For the sake ofdiscussion let us consider the case that the 119899 observations areindependent and identically distributed 119874

119894simiid1198750 isinM and

Ψ M rarr R119889 can now be defined as a parameter on thecommon distribution of 119874

119894 but each of the concepts has a

generalization to dependent data as well (eg see [8])

26 Targeted Learning Is Based on a Substitution EstimatorSubstitution estimators are estimators that can be describedas the target parametermapping applied to an estimator of thedata distribution that is an element of the statistical modelMore generally if the target parameter is represented as amapping on a part 119876

0= 119876(119875

0) of the data distribution 119875

0

(eg factor of likelihood) then a substitution estimator canbe represented asΨ(119876

119899) where119876

119899is an estimator of119876

0that is

contained in the parameter space 119876(119875) 119875 isinM implied bythe statistical modelM Substitution estimators are known tobe particularly robust by fully respecting that the true targetparameter is obtained by evaluating the target parametermapping on this statistical model For example substitutionestimators are guaranteed to respect known bounds on the

target parameter (eg it is a probability or difference betweentwo probabilities) as well as known bounds on the datadistribution implied by the modelM

In our running example we can define 1198760= (1198761198820 1198760)

where1198761198820

is the probability distribution of119882 under 1198750 and

1198760(119860119882) = 119864

1198750(119884 | 119860119882) is the conditional mean of the

outcome given the treatment and covariates and representthe target parameter

1205950= Ψ (119876

0) = 1198641198761198820

1198760(1119882) minus 119876

0(0119882) (7)

as a function of the conditional mean 1198760and the probability

distribution 1198761198820

of 119882 The model M might restrict 1198760to

be between 0 and a small number delta lt 1 but otherwiseputs no restrictions on 119876

0 A substitution estimator is now

obtained by plugging in the empirical distribution 119876119882119899

for1198761198820

and a data adaptive estimator 0 lt 119876119899lt 120575 of the

regression 1198760

120595119899= Ψ (119876

119882119899 119876119899) =

1

119899

119899

sum119894=1

119876119899(1119882119894) minus 119876119899(0119882119894) (8)

Not every type of estimator is a substitution estimator Forexample an inverse probability of treatment type estimator of1205950could be defined as

120595119899119868119875119879119882

=1

119899

119899

sum119894=1

2119860119894minus 1

119866119899(119860119894| 119882119894)119884119894 (9)

where119866119899(sdot | 119882) is an estimator of the conditional probability

of treatment 1198660(sdot | 119882) This is clearly not a substitution esti-

mator In particular if 119866119899(119860119894| 119882119894) is very small for some

observations this estimator might not be between minus1 and 1and thus completely ignores known constraints

27 Targeted Estimator Relies on Data Adaptive Estimator ofNuisance Parameter The construction of targeted estimatorsof the target parameter requires construction of an estimatorof infinite dimensional nuisance parameters specifically theinitial estimator of the relevant part 119876

0of the data dis-

tribution in the TMLE and the estimator of the nuisanceparameter 119866

0= 119866(119875

0) that is needed to target the fit of this

relevant part in the TMLE In our running example we have1198760= (1198761198820 1198760) and the 119866

0is the conditional distribution of

119860 given119882

28 Targeted Learning Uses Super-Learning to Estimate theNuisance Parameter In order to optimize these estimators ofthe nuisance parameters (119876

0 1198660) we use a so called super-

learner that is guaranteed to asymptotically outperform anyavailable procedure by simply including it in the library ofestimators that is used to define the super-learner

The super-learner is defined by a library of estimators ofthe nuisance parameter and uses cross-validation to selectthe best weighted combination of these estimators Theasymptotic optimality of the super-learner is implied bythe oracle inequality for the cross-validation selector that


compares the performance of the estimator that minimizesthe cross-validated risk over all possible candidate estimatorswith the oracle selector that simply selects the best possiblechoice (as if one has available an infinite validation sample)The only assumption this asymptotic optimality relies uponis that the loss function used in cross-validation is uniformlybounded and that the number of algorithms in the librarydoes not increase at a faster rate than a polynomial powerin sample size when sample size converges to infinity [15ndash19] However cross-validation is a method that goes beyondoptimal asymptotic performance since the cross-validatedrisk measures the performance of the estimator on the verysample it is based uponmaking it a practically very appealingmethod for estimator selection

In our running example we have that 1198760= arg min

1198761198641198750

119871(119876)(119874) where 119871(119876) = (119884 minus 119876(119860119882))2 is the squared errorloss or one can also use the log-likelihood loss 119871(119876)(119874) =minus119884 log119876(119860119882)+(1minus119884) log(1minus119876(119860119882)) Usually there area variety of possible loss functions one could use to define thesuper-learner the choice could be based on the dissimilarityimplied by the loss function [15] but probably should itselfbe data adaptively selected in a targeted manner The cross-validated risk of a candidate estimator of119876

0is then defined as

the empirical mean over a validation sample of the loss of thecandidate estimator fitted on the training sample averagedacross different spits of the sample in a validation and trainingsample A typical way to obtain such sample splits is socalled119881-fold cross-validation inwhich one first partitions thesample in 119881 subsets of equal size and each of the 119881 subsetsplays the role of a validation sample while its complementof 119881 minus 1 subsets equals the corresponding training sampleThus 119881-fold cross-validation results in 119881 sample splits intoa validation sample and corresponding training sampleA possible candidate estimator is a maximum likelihoodestimator based on a logistic linear regression workingmodelfor 119875(119884 = 1 | 119860119882) Different choices of such logistic linearregression working models result in different possible candi-date estimators So in this manner one can already generatea rich library of candidate estimators However the statisticsand machine learning literature has also generated lots ofdata adaptive estimators based on smoothing data adaptiveselection of basis functions and so on resulting in anotherlarge collection of possible candidate estimators that can beadded to the library Given a library of candidate estimatorsthe super-learner selects the estimator that minimizes thecross-validated risk over all the candidate estimators Thisselected estimator is now applied to the whole sample to giveour final estimate 119876

119899of 1198760 One can enrich the collection of

candidate estimators by taking any weighted combination ofan initial library of candidate estimators thereby generatinga whole parametric family of candidate estimators

Similarly one can define a super-learner of the condi-tional distribution of 119860 given119882

The super-learnerrsquos performance improves by enlargingthe library Even though for a given data set one of the can-didate estimators will do as well as the super-learner acrossa variety of data sets the super-learner beats an estimatorthat is betting on particular subsets of the parameter space

containing the truth or allowing good approximations ofthe truth The use of super-learner provides on importantstep in creating a robust estimator whose performance isnot relying on being lucky but on generating a rich libraryso that a weighted combination of the estimators provides agood approximation of the truth wherever the truth mightbe located in the parameter space

29 Asymptotic Efficiency An asymptotically efficient esti-mator of the target parameter is an estimator that can berepresented as the target parameter value plus an empiricalmean of a so called (mean zero) efficient influence curve119863lowast(1198750)(119874) up till a second order term that is asymptotically

negligible [13] That is an estimator is efficient if and only ifit is asymptotically linear with influence curve119863(119875

0) equal to

the efficient influence curve119863lowast(1198750)

120595119899minus 1205950=1

119899

119899

sum119894=1

119863lowast(1198750) (119874119894) + 119900119875(1

radic119899) (10)

The efficient influence curve is also called the canonicalgradient and is indeed defined as the canonical gradient ofthe pathwise derivative of the target parameter Ψ M rarr

R Specifically one defines a rich family of one-dimensionalsubmodels 119875(120598) 120598 through 119875 at 120598 = 0 and onerepresents the pathwise derivative (119889119889120598)Ψ(119875(120598))|

120598=0as an

inner product (the covariance operator in theHilbert space offunctions of 119874 with mean zero and inner product ⟨ℎ

1 ℎ2⟩ =

119864119875ℎ1(119874)ℎ2(119874)119864119875119863(119875)(119874)119878(119875)(119874) where 119878(119875) is the score

of the path 119875(120598) 120598 and 119863(119875) is a so called gradientThe unique gradient that is also in the closure of the linearspan of all scores generated by the family of one-dimensionalsubmodels through 119875 also called the tangent space at 119875 isnow the canonical gradient119863lowast(119875) at 119875 Indeed the canonicalgradient can be computed as the projection of any givengradient 119863(119875) onto the tangent space in the Hilbert space11987120(119875) An interesting result in efficiency theory is that an

influence curve of a regular asymptotically linear estimatoris a gradient

In our running example it can be shown that the efficientinfluence curve of the additive treatment effect Ψ M rarr R

is given by

119863lowast (1198750) (119874) =

2119860 minus 1

1198660(119860119882)

(119884 minus 1198760(119860119882))

+1198760(1119882) minus 119876

0(0119882) minus Ψ (119876

0)

(11)

As noted earlier the influence curve IC(1198750) of an estima-

tor 120595119899also characterizes the limit variance 1205902

0= 1198750IC(1198750)2 of

the mean zero normal limit distribution ofradic119899(120595119899minus1205950) This

variance 12059020can be estimated with 1119899sum119899

119894=1IC119899(119874119894)2 where

IC119899is an estimator of the influence curve IC(119875

0) Efficiency

theory teaches us that for any regular asymptotically linearestimator 120595

119899its influence curve has a variance that is larger

than or equal to the variance of the efficient influence curve1205902lowast0

= 1198750119863lowast(1198750)2 which is also called the generalized

Cramer-Rao lower bound In our running example the


asymptotic variance of an efficient estimator is thus estimatedwith the sample variance of an estimate119863lowast

119899(119874119894) of119863lowast(119875

0)(119874119894)

obtained by plugging in the estimator 119866119899of 1198660and the

estimator 119876119899of 1198760 and Ψ(119876

0) is replaced by Ψ(119876

119899) =

(1119899)sum119899

119894=1(119876119899(1119882119894) minus 119876119899(0119882119894))

210 Targeted Estimator Solves the Efficient Influence CurveEquation The efficient influence curve is a function of119874 thatdepends on 119875

0through 119876

0and possible nuisance parameter

1198660 and it can be calculated as the canonical gradient of the

pathwise derivative of the target parameter mapping alongpaths through 119875

0 It is also called the efficient score Thus

given the statistical model and target parameter mappingone can calculate the efficient influence curve whose variancedefines the best possible asymptotic variance of an estimatoralso referred to as the generalized Cramer-Rao lower boundfor the asymptotic variance of a regular estimator The prin-cipal building block for achieving asymptotic efficiency of asubstitution estimator Ψ(119876

119899) beyond 119876

119899being an excellent

estimator of 1198760as achieved with super-learning is that the

estimator 119876119899solves the so called efficient influence curve

equation sum119899119894=1119863lowast(119876

119899 119866119899)(119874119894) = 0 for a good estimator

119866119899of 1198660 This property cannot be expected to hold for

a super-learner and that is why the TMLE discussed inSection 4 involves an additional update of the super-learnerthat guarantees that it solves this efficient influence curveequation

For example maximum likelihood estimators solve allscore equations including this efficient score equation thattargets the target parameter butmaximum likelihood estima-tors for large semi parametricmodelsM typically do not existfor finite sample sizes Fortunately for efficient estimationof the target parameter one should only be concerned withsolving this particular efficient score tailored for the targetparameter Using the notation 119875119891 equiv int119891(119900)119889119875(119900) forthe expectation operator one way to understand why theefficient influence curve equation indeed targets the truetarget parameter value is that there are many cases in which1198750119863lowast(119875) = Ψ(119875

0) minus Ψ(119875) and in general as a consequence

of119863lowast(119875) being a canonical gradient1198750119863lowast(119875) = Ψ (119875

0) minus Ψ (119875) + 119877 (119875 119875

0) (12)

where 119877(119875 1198750) = 119900(119875 minus 119875

0) is a term involving second

order differences (119875 minus 1198750)2 This key property explains why

solving 1198750119863lowast(119875) = 0 targets Ψ(119875) to be close to Ψ(119875

0) and

thus explains why solving 119875119899119863lowast(119876

119899 119866119899) = 0 targets 119876

119899to fit

Ψ(1198760)

In our running example we have 119877(119875 1198750) = 119877

1(119875 1198750) minus

1198770(119875 1198750) where 119877

119886(119875 1198750) = int

119908((119866 minus 119866

0)(119886 | 119882)119866(119886 |

119882))(119876 minus1198760)(119886119882)119889119875

1198820(119908) So in our example the remain-

der 119877(119875 1198750) only involves a cross-product difference (119866 minus

1198660)(119876 minus 119876

0) In particular the remainder equals zero if

either 119866 = 1198660or 119876 = 119876

0 which is often referred to as

double robustness of the efficient influence curvewith respectto (119876 119866) in the causal and censored data literature (seeeg [20]) This property translates into double robustness ofestimators that solve the efficient influence curve estimatingequation

Due to this identity (12) an estimator that solves119875119899119863lowast() = 0 and is in a local neighborhood of 119875

0so that

119877( 1198750) = 119900

119875(1radic119899) approximately solves Ψ() minus Ψ(119875

0) asymp

(119875119899minus 1198750)119863lowast() where the latter behaves as a mean zero

centered empirical mean with minimal variance that will beapproximately normally distributed This is formalized in anactual proof of asymptotic efficiency in the next subsection

211 Targeted Estimator Is Asymptotically Linear and EfficientIn fact combining 119875

119899119863lowast(119876

119899 119866119899) = 0 with (12) at 119875 =

(119876119899 119866119899) yields

Ψ (119876119899) minus Ψ (119876

0) = (119875

119899minus 1198750)119863lowast(119876119899 119866119899) + 119877119899 (13)

where 119877119899is a second order term Thus if second order

differences such as (119876119899minus 1198760)2 (119876119899minus 1198760)(119866119899minus 1198660) and

(119866119899minus 1198660)2 converge to zero at a rate faster than 1radic119899 then

it follows that 119877119899= 119900119875(1radic119899) To make this assumption as

reasonable as possible one should use super-learning for both119876119899and 119866

119899 In addition empirical process theory teaches us

that (119875119899minus1198750)119863lowast(119876

119899 119892119899) = (119875

119899minus1198750)119863lowast(119876

0 1198920) + 119900119875(1radic119899) if

1198750119863lowast(119876

119899 119866119899)minus119863lowast(119876

0 1198660)2 converges to zero in probability

as 119899 converges to infinity (a consistency condition) and if119863lowast(119876119899 119866119899) falls in a so called Donsker class of functions

119874 rarr 119891(119874) [11] An important Donsker class is the classof all 119889-variate real valued functions that have a uniformsectional variation norm that is bounded by some universal119872 lt infin that is the variation norm of the function itselfand the variation norm of its sections are all bounded by this119872 lt infin This Donsker class condition essentially excludesestimators 119876

119899 119866119899that heavily overfit the data so that their

variation norms converge to infinity as 119899 converges to infinitySo under this Donsker class condition 119877

119899= 119900119875(1radic119899) and

the consistency condition we have

120595119899minus 1205950=1

119899

119899

sum119894=1

119863lowast(1198760 1198660) (119874119894) + 119900119875(1

radic119899) (14)

That is 120595119899is asymptotically efficient In addition the right-

hand side converges to a normal distribution with mean zeroand variance equal to the variance of the efficient influencecurve So in spite of the fact that the efficient influence curveequation only represents a finite dimensional equation foran infinite dimensional object 119876

119899 it implies consistency of

Ψ(119876119899) up till a second order term 119877

119899and even asymptotic

efficiency if 119877119899= 119900119875(1radic119899) under some weak regularity

conditions

3 Road Map for Targeted Learning of CausalQuantity or Other Underlying Full-DataTarget Parameters

This is a good moment to review the roadmap for TargetedLearningWe have formulated a roadmap for Targeted Learn-ing of a causal quantity that provides a transparent roadmap[2 9 21] involving the following steps

(i) defining a full-data model such as a causal modeland a parameterization of the observed data distri-bution in terms of the full-data distribution (eg the


Neyman-Rubin-Robins counterfactual model [22ndash28]) or the structural causal model [9]

(ii) defining the target quantity of interest as a targetparameter of the full-data distribution

(iii) establishing identifiability of the target quantity fromthe observed data distribution under possible addi-tional assumptions that are not necessarily believedto be reasonable

(iv) committing to the resulting estimand and the statisti-cal model that is believed to contain the true 119875

0

(v) a subroadmap for the TMLE discussed below toconstruct an asymptotically efficient substitution esti-mator of the statistical target parameter

(vi) establishing an asymptotic distribution and corre-sponding estimator of this limit distribution to con-struct a confidence interval

(vii) honest interpretation of the results possibly includinga sensitivity analysis [29ndash32]

That is the statistical target parameters of interestare often constructed through the following process Oneassumes an underlying model of probability distributionswhich we will call the full-data model and one definesthe data distribution in terms of this full-data distributionThis can be thought of as modeling that is one obtains aparameterization M = 119875

120579 120579 isin Θ for the statistical

model M for some underlying parameter space Θ andparameterization 120579 rarr 119875

120579 The target quantity of interest

is defined as some parameter of the full-data distributionthat is of 120579

0 Under certain assumptions one establishes that

the target quantity can be represented as a parameter of thedata distribution a so called estimand such a result is calledan identifiability result for the target quantity One mightnow decide to use this estimand as the target parameterand develop a TMLE for this target parameter Under thenontestable assumptions the identifiability result relied uponthe estimand can be interpreted as the target quantity ofinterest but importantly it can always be interpreted as astatistical feature of the data distribution (due to the statisticalmodel being true) possibly of independent interest In thismanner one can define estimands that are equal to a causalquantity of interest defined in an underlying (counterfactual)world The TMLE of this estimand which is only defined bythe statistical model and the target parameter mapping andthus ignorant of the nontestable assumptions that allowedthe causal interpretation of the estimand provides now anestimator of this causal quantity In this manner TargetedLearning is in complete harmony with the developmentof models such as causal and censored data models andidentification results for underlying quantities the latterjust provides us with a definition of a target parametermapping and statistical model and thereby the pure statisticalestimation problem that needs to be addressed

4 Targeted Minimum Loss Based Estimation(TMLE)

TheTMLE [1 2 4] is defined according to the following stepsFirstly one writes the target parametermapping as amappingapplied to a part of the data distribution 119875

0 say 119876

0= 119876(119875

0)

that can be represented as the minimizer of a criterion at thetrue data distribution 119875

0over all candidate values 119876(119875)

119875 isinM for this part of the data distribution we refer to thiscriterion as the risk 119877

1198750(119876) of the candidate value 119876

Typically the risk at a candidate parameter value 119876 canbe defined as the expectation under the data distribution ofa loss function (119874 119876) 997891rarr 119871(119876)(119874) that maps the unit datastructure and the candidate parameter value in a real valuenumber 119877

1198750(119876) = 119864

1198750119871(119876)(119874) Examples of loss functions

are the squared error loss for a conditional mean and the log-likelihood loss for a (conditional) densityThis representationof1198760as aminimizer of a risk allows us to estimate it with (eg

loss-based) super-learningSecondly one computes the efficient influence curve

(119874 119875) 997891rarr 119863lowast(119876(119875) 119866(119875))(119874) identified by the canonical

gradient of the pathwise derivative of the target parametermapping along paths through a data distribution 119875 wherethis efficient influence curve does only depend on 119875 through119876(119875) and some nuisance parameter119866(119875) Given an estimator119866119899 one now defines a path 119876

119899119866119899(120598) 120598 with Euclidean

parameter 120598 through the super-learner 119876119899whose score

119889

119889120598119871 (119876119899119866119899

(120598))10038161003816100381610038161003816100381610038161003816120598=0

(15)

at 120598 = 0 spans the efficient influence curve 119863lowast(119876119899 119866119899) at

the initial estimator (119876119899 119866119899) this is called a least favorable

parametric submodel through the super-learnerIn our running example we have 119876 = (119876119876

119882) so

that it suffices to construct a path through 119876 and 119876119882

withcorresponding loss functions and show that their scores spanthe efficient influence curve (11) We can define the path119876119866(120598) = 119876+120598119862(119866) 120598 where119862(119866)(119874) = (2119860minus1)119866(119860 | 119882)

and loss function119871(119876)(119874) = minus119884 log119876(119860119882)+(1minus119884) log(1minus119876(119860119882)) Note that

119889

119889120598119871 (119876119866(120598)) (119874)

10038161003816100381610038161003816100381610038161003816120598=0

= 119863lowast119884(119876 119866) =

2119860 minus 1

119866 (119860119882)(119884 minus 119876 (119860119882))

(16)

We also define the path 119876119882(120598) = (1 + 120598119863lowast

119882(119876 119876119882))119876119882

with loss function 119871(119876119882)(119882) = minus log119876

119882(119882) where

119863lowast119882(119876)(119874) = 119876(1119882) minus 119876(0119882) minus Ψ(119876) Note that

119889

119889120598119871 (119876119882(120598))10038161003816100381610038161003816100381610038161003816120598=0

= 119863lowast

119882(119876) (17)

Thus if we define the sum loss function119871(119876) = 119871(119876)+119871(119876119882)

then119889

119889120598119871 (119876119866(120598) 119876

119882(120598))

10038161003816100381610038161003816100381610038161003816120598=0= 119863lowast(119876 119866) (18)


This proves that indeed these proposed paths through 119876and 119876

119882and corresponding loss functions span the efficient

influence curve 119863lowast(119876 119866) = 119863lowast119882(119876) + 119863lowast

119884(119876 119866) at (119876 119866) as

requiredThe dimension of 120598 can be selected to be equal to the

dimension of the target parameter 1205950 but by creating extra

components in 120598 one can arrange to solve additional scoreequations beyond the efficient score equation providingimportant additional flexibility and power to the procedureIn our running example we can use an 120598

1for the path

through 119876 and a separate 1205982for the path through 119876

119882 In

this case the TMLE update119876lowast119899will solve two score equations

119875119899119863lowast119882(119876lowast119899) = 0 and 119875

119899119863lowast119884(119876lowast

119899 119866119899) = 0 and thus in

particular 119875119899119863lowast(119876lowast

119899 119866119899) = 0 In this example the main

benefit of using a bivariate 120598 = (1205981 1205982) is that the TMLE does

not update 119876119882119899

(if selected to be the empirical distribution)and converges in a single step

One fits the unknown parameter 120598 of this path byminimizing the empirical risk 120598 rarr 119875

119899119871(119876119899119866119899(120598)) along this

path through the super-learner resulting in an estimator 120598119899

This defines now an update of the super-learner fit defined as1198761

119899= 119876119899119866119899(120598119899) This updating process is iterated till 120598

119899asymp 0

The final update we will denote with 119876lowast119899 the TMLE of 119876

0

and the target parameter mapping applied to 119876lowast119899defines the

TMLE of the target parameter 1205950 This TMLE 119876lowast

119899solves the

efficient influence curve equation sum119899119894=1119863lowast(119876lowast

119899 119866119899)(119874119894) = 0

providing the basis in combination with statistical propertiesof (119876lowast119899 119866119899) for establishing that the TMLE Ψ(119876lowast

119899) is asymp-

totically consistent normally distributed and asymptoticallyefficient as shown above

In our running example we have 1205981119899= arg min

120598119875119899119871

(1198760

119899119866119899(120598)) while 120598

2119899= arg min119875

119899119871(119876119882119899(1205982)) equals zero

That is the TMLE does not update 119876119882119899

since the empiricaldistribution is already a nonparametric maximum likelihoodestimator solving all score equations In this case 119876lowast

119899= 1198761

119899

since the convergence of the TMLE-algorithm occurs in onestep and of course 119876lowast

119882119899= 119876119882119899

is just the initial empiricaldistribution function of 119882

1 119882

119899 The TMLE of 120595

0is the

substitution estimator Ψ(119876lowast119899)

5 Advances in Targeted Learning

As apparent from the above presentation TMLE is a generalmethod that can be developed for all types of challengingestimation problems It is a matter of representing the targetparameters as a parameter of a smaller119876

0 defining a path and

loss function with generalized score that spans the efficientinfluence curve and the corresponding iterative targetedminimum loss-based estimation algorithm

We have used this framework to develop TMLE in alarge number of estimation problems that assumes that1198741 119874

119899simiid1198750 Specifically we developed TMLE of a

large variety of effects (eg causal) of single and multipletime point interventions on an outcome of interest thatmay be subject to right-censoring interval censoring case-control sampling and time-dependent confounding see forexample [4 33ndash63 63ndash72]

An original example of a particular type of TMLE(based on a double robust parametric regression model) forestimation of a causal effect of a point-treatment interventionwas presented in [73] andwe refer to [47] for a detailed reviewof this earlier literature and its relation to TMLE

It is beyond the scope of this overview paper to getinto a review of some of these examples For a generalcomprehensive book on Targeted Learning which includesmany of these applications on TMLE and more we refer to[2]

To provide the reader with a sense consider generalizingour running example to a general longitudinal data structure119874 = (119871(0) 119860(0) 119871(119870) 119860(119870) 119884) where 119871(0) are base-line covariates 119871(119896) are time dependent covariates realizedbetween intervention nodes 119860(119896 minus 1) and 119860(119896) and 119884 isthe final outcome of interest The intervention nodes couldinclude both censoring variables and treatment variables thedesired intervention for the censoring variables is always ldquonocensoringrdquo since the outcome 119884 is only of interest when itis not subject to censoring (in which case it might just be aforward imputed value eg)

One may now assume a structural causal model ofthe type discussed earlier and be interested in the meancounterfactual outcome under a particular intervention onall the intervention nodes where these interventions couldbe static dynamic or even stochastic Under the so calledsequential randomization assumption this target quantity isidentified by the so called G-computation formula for thepostintervention distribution corresponding with a stochas-tic intervention 119892lowast

1198750119892lowast (119874) =

119870+1

prod119896=0

1198750119871(119896)|119871(119896minus1)119860(119896minus1)

(119871 (119896) | 119871 (119896 minus 1) 119860 (119896 minus 1))

119870

prod119896=0

119892lowast

119896(119860 (119896) | 119860 (119896 minus 1) 119871 (119896))

(19)

Note that this postintervention distribution is nothing elsebut the actual distribution of 119874 factorized according to thetime-ordering but with the true conditional distributionsof 119860(119896) given parenethesis 119871(119896) 119860(119896 minus 1)) replaced bythe desired stochastic intervention The statistical targetparameter is thus 119864

1198750119892lowast119884119892lowast that is the mean outcome under

this postintervention distribution A big challenge in theliterature has been to develop robust efficient estimators ofthis estimand and more generally one likes to estimatethis mean outcome under a user supplied class of stochasticinterventions119892lowast Such robust efficient substitution estimatorshave now been developed using the TMLE framework [5860] where the latter is a TMLE inspired by important doublerobust estimators established in earlier work of [74] Thiswork thus includes causal effects defined byworkingmarginalstructural models for static and dynamic treatment regimenstime to event outcomes and incorporating right-censoring

In many data sets one is interested in assessing the effectof one variable on an outcome controlling for many othervariables across a large collection of variables For example


one might want to know the effect of a single nucleotidepolymorphism (SNP) on a trait of a subject across a wholegenome controlling each time for a large collection of otherSNPs in the neighborhood of the SNP in question Or oneis interested in assessing the effect of a mutation in the HIVvirus on viral load drop (measure of drug resistance) whentreated with a particular drug class controlling for the othermutations in the HIV virus and for characteristics of the sub-ject in question Therefore it is important to carefully definethe effect of interest for each variable If the variable is binaryone could use the target parameter Ψ(119875) = 119864

119875119864119875(119884 | 119860 = 1

119882) minus 119864119875(119884 | 119860 = 0119882) in our running example but with

119860 now being the SNP in question and119882 being the variablesone wants to control for while 119884 is the outcome of interestWe often refer to such a measure as a particular variableimportance measure Of course one now defines such avariable importance measure for each variable When thevariable is continuous the above measure is not appropriateIn that case one might define the variable importance asthe projection of 119864

119875(119884 | 119860119882) minus 119864

119875(119884 | 119860 = 0119882)

onto a linear model such as 120573119860 and use 120573 as the variableimportance measure of interest [75] but one could think ofa variety of other interesting effect measures Either way foreach variable one uses a TMLE of the corresponding variableimportancemeasureThe stackedTMLE across all variables isnow an asymptotically linear estimator of the stacked variableimportance measure with stacked influence curve and thusapproximately follows amultivariate normal distribution thatcan be estimated from the data One can now carry outmultiple testing procedures controlling a desired family wisetype I error rate and construct simultaneous confidenceintervals for the stacked variable importance measure basedon this multivariate normal limit distribution In this man-ner one uses Targeted Learning to target a large familyof target parameters while still providing honest statisticalinference taking into account multiple testingThis approachdeals with a challenge in machine learning in which onewants estimators of a prediction function that simultaneouslyyield good estimates of the variable importance measuresExamples of such efforts are random forest and LASSObut both regression methods fail to provide reliable variableimportancemeasures and fail to provide any type of statisticalinference The truth is that if the goal is not prediction but toobtain a good estimate of the variable importance measuresacross the variables then one should target the estimatorof the prediction function towards the particular variableimportance measure for each variable separately and onlythen one obtains valid estimators and statistical inference ForTMLE of effects of variables across a large set of variables a socalled variable importance analysis including the applicationto genomic data sets we refer to [48 75ndash79]

Software has been developed in the form of general R-packages implementing super-learning and TMLE for gen-eral longitudinal data structures these packages are publiclyavailable on CRAN under the function names tmle() ltmle()and superlearner()

Beyond the development of TMLE in this large varietyof complex statistical estimation problems as usual thecareful study of real world applications resulted in new

challenges for the TMLE and in response to that we havedeveloped general TMLE that have additional propertiesdealing with these challenges In particular we have shownthat TMLE has the flexibility and capability to enhance thefinite sample performance of TMLE under the followingspecific challenges that come with real data applications

Dealing with Rare Outcomes If the outcome is rare then thedata is still sparse even though the sample size might be quitelarge When the data is sparse with respect to the questionof interest the incorporation of global constraints of thestatistical model becomes extremely important and canmakea real difference in a data analysis Consider our runningexample and suppose that 119884 is the indicator of a rare eventIn such cases it is often known that the probability of 119884 = 1conditional on a treatment and covariate configurationshould not exceed a certain value 120575 gt 0 for example themarginal prevalence is known and it is known that thereare no subpopulations that increase the relative risk by morethan a certain factor relative marginal prevalence So thestatistical model should now include the global constraintthat 119876

0(119860119882) lt 120575 for some known 120575 gt 0 A TMLE

should now be based on an initial estimator 119876119899satisfying

this constraint and the least favorable submodel 119876119899119866119899(120598)

120598 should also satisfy this constraint for each 120598 so that itis a real submodel In [80] such a TMLE is constructedand it is demonstrated to very significantly enhance itspractical performance for finite sample sizes Even though aTMLE ignoring this constraint would still be asymptoticallyefficient by ignoring this important knowledge its practicalperformance for finite samples suffers

Targeted Estimation of Nuisance Parameter 1198660in TMLE Even

though an asymptotically consistent estimator119866119899of1198660yields

an asymptotically efficient TMLE the practical performanceof the TMLE might be enhanced by tuning this estimator119866119899not only with respect to to its performance in estimating

1198660 but also with respect to how well the resulting TMLE

fits 1205950 Consider our running example Suppose that among

the components of119882 there is a119882119895that is an almost perfect

predictor of 119860 but has no effect on the outcome 119884 Inclusionof such a covariate 119882

119895in the fit of 119866

119899makes sense if the

sample size is very large and one tries to remove someresidual confounding due to not adjusting for 119882

119895 but in

most finite samples adjustment for 119882119895in 119866119899will hurt the

practical performance of TMLE and effort should be putin variables that are stronger confounders than 119882

119895 We

developed a method for building an estimator119866119899that uses as

criterion the change in fit between initial estimator of1198760and

the updated estimator (ie the TMLE) and thereby selectsvariables that result in the maximal increase in fit duringthe TMLE updating step However eventually as sample sizeconverges to infinity all variables will be adjusted for sothat asymptotically the resulting TMLE is still efficient Thisversion of TMLE is called the collaborative TMLE since itfits 1198660in collaboration with the initial estimator 119876

119899[2 44ndash

46 58] Finite sample simulations and data analyses haveshown remarkable important finite sample gains of C-TMLErelative to TMLE (see above references)


Cross-Validated TMLE The asymptotic efficiency of TMLErelies on a so called Donsker class condition For examplein our running example it requires that 119876

119899and 119866

119899are not

too erratic functions of (119860119882) This condition is not justtheoretical but one can observe its effects in finite samples byevaluating the TMLE when using a heavily overfitted initialestimator This makes sense since if we use an overfittedinitial estimator there is little reason to think that the 120598

119899

that maximizes the fit of the update of the initial estimatoralong the least favorable parametric model will still do agood job Instead one should use the fit of 120598 that maximizeshonest estimate of the fit of the resulting update of the initialestimator as measured by the cross-validated empirical meanof the loss function This insight results in a so called cross-validated TMLE and we have proven that one can establishasymptotic linearity of this CV-TMLE without a Donskerclass condition [2 81] thus the CV-TMLE is asymptoticallylinear under weak conditions compared to the TMLE

Guaranteed Minimal Performance of TMLE If the initialestimator 119876

119899is inconsistent but 119866

119899is consistent then the

TMLE is still consistent for models and target parameters inwhich the efficient influence curve is double robust Howeverthere might be other estimators that will now asymptoticallybeat the TMLE since the TMLE is not efficient anymoreThe desire for estimators to have a guarantee to beat certainuser supplied estimators was formulated and implementedfor double robust estimating equation based estimators in[82] Such a property can also be arranged within theTMLE framework by incorporating additional fluctuationparameters in its least favorable submodel though the initialestimator so that the TMLE solves additional score equationsthat guarantee that it beats a user supplied estimator evenunder heavy misspecification of the initial estimator 119876

119899[58

68]

Targeted Selection of Initial Estimator in TMLE In situationswhere it is unreasonable to expect that the initial estimator119876

119899

will be close to the true1198760 such as in randomized controlled

trials in which the sample size is small one may improve theefficiency of the TMLE by using a criterion for tuning theinitial estimator that directly evaluates the efficiency of theresulting TMLE of 120595

0 This general insight was formulated as

empirical efficiencymaximization in [83] and further workedout in the TMLE context in chapter 12 and Appendix of [2]

Double Robust Inference If the efficient influence curve isdouble robust then the TMLE remains consistent if either119876119899or 119866119899is consistent However if one uses a data adaptive

consistent estimator of 1198660(and thus with bias larger than

1radic119899) and 119876119899is inconsistent then the bias of 119866

119899might

directly map into a bias for the resulting TMLE of 1205950of

the same order As a consequence the TMLE might have abias with respect to 120595

0that is larger than 119874(1radic119899) so that

it is not asymptotically linear However one can incorporateadditional fluctuation parameters in the least favorable sub-model (by also fluctuating 119866

119899) to guarantee that the TMLE

remains asymptotically linear with known influence curvewhen either 119876

119899or 119866119899is inconsistent but we do not know

which one [84] So these enhancements of TMLE result inTMLE that are asymptotically linear under weaker conditionsthan a standard TMLE just like theCV-TMLE that removed acondition for asymptotic linearity These TMLE now involvenot only targeting 119876

119899but also targeting 119866

119899to guarantee that

when 119876119899is misspecified the required smooth function of 119866

119899

will behave as a TMLE and if119866119899is misspecified the required

smooth functional of 119876119899is still asymptotically linear The

same method was used to develop an IPTW estimator thattargets119866

119899so that the IPTW estimator is asymptotically linear

with known influence curve even when the initial estimatorof 1198660is estimated with a highly data adaptive estimator

Super-Learning Based on CV-TMLE of the Conditional Riskof a Candidate Estimator Super-learner relies on a cross-validated estimate of the risk of a candidate estimator Theoracle inequalities of the cross-validation selector assumedthat the cross-validated risk is simply an empirical meanover the validation sample of a loss function at the candidateestimator based on training sample averaged across differentsample splits where we generalized these results to lossfunctions that depend on an unknown nuisance parameter(which are thus estimated in the cross-validated risk)

For example suppose that in our running example 119860 iscontinuous and we are concerned with estimation of thedose-response curve (119864

0119884119886 119886) where 119864

0119884119886= 11986401198640(119884 | 119860 =

119886119882) Onemight define the risk of a candidate dose responsecurve as a mean squared error with respect to the true curve1198640119884119886 However this risk of a candidate curve is itself an

unknown real valued target parameter On the contrary tostandard prediction or density estimation this risk is notsimply a mean of a known loss function and the proposedunknown loss functions indexed by a nuisance parameter canhave large valuesmaking the cross-validated risk a nonrobustestimator Therefore we have proposed to estimate this con-ditional risk of candidate curve with TMLE and similarly theconditional risk of a candidate estimator with a CV-TMLEOne can now develop a super-learner that uses CV-TMLE asan estimate of the conditional risk of a candidate estimator[53 85] We applied this to construct a super-learner of thecausal dose response curve for a continuous valued treatmentand we obtained a corresponding oracle inequality for theperformance of the cross-validation selector [53]

6 Eye on the Future of Targeted Learning

We hope that the above clarifies that Targeted Learning isan ongoing exciting research area that is able to addressimportant practical challenges Each new application con-cerning learning from data can be formulated in termsof a statistical estimation problem with a large statisticalmodel and a target parameter One can now use the generalframework of super-learning and TMLE to develop efficienttargeted substitution estimators and corresponding statisticalinference As is apparent from the previous section thegeneral structure of TMLE and super-learning appears tobe flexible enough to handleadapt to any new challengesthat come up allowing researchers in Targeted Learning tomake important progress in tackling real world problems


By being honest in the formulation typically new challengescome up asking for expert input from a variety of researchersranging from subject-matter scientists computer scientiststo statisticians Targeted Learning requires multidisciplinaryteams since it asks for careful knowledge about data exper-iment the questions of interest possible informed guessesfor estimation that can be incorporated as candidates in thelibrary of the super-learner and input from the state of the artin computer science to produce scalable software algorithmsimplementing the statistical procedures

There are a variety of important areas of research inTargeted Learning we began to explore

Variance EstimationThe asymptotic variance of an estimatorsuch as the TMLE that is the variance of the influence curveof the estimator is just another target parameter of great inter-est It is commonpractice to estimate this asymptotic variancewith an empirical sample variance of the estimated influencecurves However in the context of sparsity influence curvescan be large making such an estimator highly nonrobustIn particular such a sample mean type estimator will notrespect the global constraints of the model Again this is notjust a theoretical issue since we have observed that in sparsedata situations standard estimators of the asymptotic varianceoften underestimate the variance of the estimator therebyresulting in overly optimistic confidence intervals This spar-sity can be due to rare outcomes or strong confoundingor highly informative censoring for example and naturallyoccurs evenwhen sample sizes are large Careful inspection ofthese variance estimators shows that the essential problem isthat these variance estimators are not substitution estimatorsTherefore we are in the process to apply TMLE to improvethe estimators of the asymptotic variance of TMLE of a targetparameter thereby improving the finite sample coverage ofour confidence intervals especially in sparse-data situations

Dependent Data Contrary to experiments that involve ran-dom sampling from a target population if one observes thereal world over time then naturally there is no way to arguethat the experiment can be represented as a collection ofindependent experiments let alone identical independentexperiments An environment over time and space is a singleorganism that cannot be separated out into independentunits without making very artificial assumptions and losingvery essential information the world needs to be seen as awhole to see truth Data collection in our societies is movingmore and more towards measuring total populations overtime resulting in what we often refer to as Big Data andthese populations consist of interconnected units Even inrandomized controlled settings where one randomly samplesunits from a target population one often likes to look at thepast data and change the sampling design in response to theobserved past in order to optimize the data collection withrespect to certain goals Once again this results in a sequenceof experiments that cannot be viewed as independent exper-iments the next experiment is only defined once one knowsthe data generated by the past experiments

Therefore we believe that our models that assume inde-pendence even though they are so much larger than the

models used in current practice are still not realistic modelsin many applications of interest On the other hand evenwhen the sample size equals 1 things are not hopeless if oneis willing to assume that the likelihood of the data factorizesin many factors due to conditional independence assump-tions and stationarity assumptions that state that conditionaldistributions might be constant across time or that differentunits are subject to the same laws for generating their data asa function of their parent variables In more recent researchwe have started to develop TMLE for statistical models thatdo not assume that the unit-specific data structures areindependent handling adaptive pair matching in communityrandomized controlled trials group sequential adaptive ran-domization designs and studies that collect data on units thatare interconnected through a causal network [4ndash8]

Data Adaptive Target Parameters It is common practice thatpeople first look at data before determining their choice oftarget parameter they want to learn even though it is taughtthat this is unacceptable practice since it makes the 119875 valuesand confidence intervals unreliable But maybe we shouldview this common practice as a sign that a priori specificationof the target parameter (and null hypothesis) limits thelearning from data too much and by enforcing it we onlyforce data analysts to cheat Current teaching would tell usthat one is only allowed to do this by splitting the sample useone part of the sample to generate a target parameter and usethe other part of the sample to estimate this target parameterand obtain confidence intervals Clearly this means thatone has to sacrifice a lot of sample size for being allowedto look at the data first Another possible approach forallowing us to obtain inference for a data driven parameteris to a priori formulate a large class of target parametersand use multiple testing or simultaneous confidence intervaladjustments However also with this approach one has to paya big price through the multiple testing adjustment and onestill needs to a priori list the target parameters

For that purpose acknowledging that one likes to minethe data to find interesting questions that are supportedby the data we developed statistical inference based onCV-TMLE for a large class of target parameters that aredefined as functions of the data [86] This allows one todefine an algorithm that when applied to the data generatesan interesting target parameter while we provide formalstatistical inference in terms of confidence intervals for thisdata adaptive target parameterThis provides a much broaderclass of a priori specified statistical analyses than currentpractice which requires a priori specification of the targetparameter while still providing valid statistical inferenceWe believe that this is a very promising direction for futureresearch opening up many new applications which wouldnormally be overlooked

Optimal Individualized Treatment One is often interestedin learning the best rule for treating a subject in responseto certain time-dependent measurements on that subjectwhere best rule might be defined as the rule that optimizesthe expected outcome Such a rule is called an individual-ized treatment rule or dynamic treatment regimen and an


optimal treatment rule is defined as the rule that minimizesthe mean outcome for a certain outcome (eg indicator ofdeath or other health measurement) We started to addressdata adaptive learning of the best possible treatment rule bydeveloping super-learners of this important target parameterwhile still providing statistical inference (and thus confidenceintervals) for the mean of the outcome in the counterfactualworld in which one applies this optimal dynamic treatmentto everybody in the target population [87] In particular thisproblem itself provides amotivation for a data adaptive targetparameter namely the mean outcome under a treatmentrule fitted based on the data Optimal dynamic treatmenthas been an important area in statistics and computerscience but we target this problem within the framework ofTargeted Learning thereby avoiding reliance on unrealisticassumptions that cannot be defended and will heavily affectthe true optimality of the fitted rules

Statistical Inference Based on Higher Order InferenceAnotherkey assumption the asymptotic efficiency or asymptoticlinearity of TMLE relies upon is the remaindersecond orderterm 119877

119899= 119900119875(1radic119899) For example in our running example

this means that the product of the rate at which the super-learner estimators of 119876

0and 119866

0converge to their target

converges to zero at a faster rate than 1radic119899 The densityestimation literature proves that if the density is manytimes differentiable then it is possible to construct densityestimators whose bias is driven by the last term of a higherorder Tailor expansion of the density around a point Robinset al [88] have developed theory based on higher orderinfluence functions under the assumption that the targetparameter is higher order pathwise differentiable Just asdensity estimators exploiting underlying smoothness thistheory also aims to construct estimators of higher orderpathwise differentiable target parameters whose bias is drivenby the last term of the higher order Tailor expansion ofthe target parameter The practical implementation of theproposed estimators has been challenging and is sufferingfrom lack of robustness Targeted Learning based on thesehigher order expansions (thus incorporating not only thefirst order efficient influence function but also the higherorder influence functions that define the Tailor expansion ofthe target parameter) appears to be a natural area of futureresearch to further build on these advances

Online TMLE Trading Off Statistical Optimality and Comput-ing Cost We will be more and more confronted with onlinedata bases that continuously grow and are massive in sizeNonetheless one wants to know if the new data changes theinference about target parameters of interest and one wantsto know it right away Recomputing the TMLE based on theold data augmented with the new chunk of data would beimmensely computer intensive Therefore we are confrontedwith the challenge on constructing an estimator that is ableto update a current estimator without having to recomputethe estimator but instead one wants to update it based oncomputations with the new data only More generally oneis interested in high quality statistical procedures that arescalable We started doing research in such online TMLE that

preserve all or most of the good properties of TMLE but canbe continuously updated where the number of computationsrequired for this update is only a function of the size of thenew chunk of data

7 Historical Philosophical Perspective onTargeted Learning A Reconciliation withMachine Learning

In the previous sections the main characteristics of TMLESLmethodology have been outlined We introduced the mostimportant fundamental ideas and statistical concepts urgedthe need for revision of current data-analytic practice andshowed some recent advances and application areas Alsoresearch in progress on such issues as dependent data anddata adaptive target parameters has been brought forward Inthis section we put the methodology in a broader historical-philosophical perspective trying to support the claim that itsrelevance exceeds the realms of statistics in a strict sense andeven those of methodology To this aim we will discuss boththe significance of TMLESL for contemporary epistemologyand its implications for the current debate on Big Data andthe generally advocated emerging new discipline of DataScience Some of these issues have been elaborated moreextensively in [3 89ndash91] where we have put the present stateof statistical data analysis in a historical and philosophicalperspective with the purpose to clarify understand andaccount for the current situation in statistical data analysisand relate the main ideas underlying TMLESL to it

First and foremost itmust be emphasized that rather thanextending the toolkit of the data analyst TMLESL establishesa newmethodology Froma technical point of view it offers anintegrative approach to data analysis or statistical learning bycombining inferential statistics with techniques derived fromthe field of computational intelligence This field includessuch related and usually eloquently phrased disciplines likemachine learning data mining knowledge discovery indatabases and algorithmic data analysis From a conceptualor methodological point of view it sheds new light on severalstages of the research process including such items as theresearch question assumptions and background knowledgemodeling and causal inference and validation by anchoringthese stages or elements of the research process in statisticaltheory According to TMLESL all these elements should berelated to or defined in terms of (properties of) the datagenerating distribution and to this aim themethodology pro-vides both clear heuristics and formal underpinnings Amongother things this means that the concept of a statistical modelis reestablished in a prudent and parsimonious way allowinghumans to include only their true realistic knowledge in themodel In addition the scientific question and backgroundknowledge are to be translated into a formal causal modeland target causal parameter using the causal graphs andcounterfactual (potential outcome) frameworks includingspecifying a working marginal structural model And evenmore significantly TMLESL reassigns to the very conceptof estimation canonical as it has always been in statisticalinference the leading role in any theory ofapproach to


learning from data whether it deals with establishing causalrelations classifying or clustering time series forecastingor multiple testing Indeed inferential statistics arose atthe background of randomness and variation in a worldrepresented or encoded by probability distributions and ithas therefore always presumed and exploited the sample-population dualism which underlies the very idea of esti-mation Nevertheless the whole concept of estimation seemsto be discredited and disregarded in contemporary dataanalytical practice

In fact the current situation in data analysis is ratherparadoxical and inconvenient From a foundational perspec-tive the field consists of several competing schools with some-times incompatible principles approaches or viewpointsSome of these can be traced back to Karl Pearsons goodness-of-fit-approach to data-analysis or to the Fisherian traditionof significance testing and ML-estimation Some principlesand techniques have been derived from the Neyman-Pearsonschool of hypothesis testing such as the comparison betweentwo alternative hypotheses and the identification of twokinds of errors of usual unequal importance that should bedealt with And last but not least the toolkit contains allkinds of ideas taken from the Bayesian paradigm whichrigorously pulls statistics into the realms of epistemology Weonly have to refer here to the subjective interpretation ofprobability and the idea that hypotheses should be analyzedin a probabilistic way by assigning probabilities to thesehypotheses thus abandoning the idea that the parameter isa fixed unknown quantity and thus moving the knowledgeabout the hypotheses from the meta-language into the objectlanguage of probability calculus In spite of all this theburgeoning statistical textbook market offers many primersand even advanced studies which wrongly suggest a uniformand united field with foundations that are fixed and onwhichfull agreement has been reached It offers a toolkit based onthe alleged unification of ideas andmethods derived from theaforementioned traditions As pointed out in [3] this situationis rather inconvenient from a philosophical point of view fortwo related reasons

First nearly all scientific disciplines have experienced aprobabilistic revolution since the late 19th century Increas-ingly key notions are probabilistic research methods entiretheories are probabilistic if not the underlying worldview isprobabilistic that is they are all dominated by and rooted inprobability theory and statisticsWhen the probabilistic revo-lution emerged in the late 19th century this transition becamerecognizable in old established sciences like physics (kineticgas theory statistical mechanics of Bolzmann Maxwelland Gibbs) but especially in new emerging disciplineslike the social sciences (Quetelet and later Durkheim)biology (evolution genetics zoology) agricultural scienceand psychology Biology even came to maturity due toclose interaction with statistics Today this trend has onlyfurther strengthened and as a result there is a plethora offields of application of statistics ranging from biostatisticsgeostatistics epidemiology and econometrics to actuarialscience statistical finance quality control and operationalresearch in industrial engineering and management scienceProbabilistic approaches have also intruded many branches

of computer science most noticeably they dominate artificialintelligence

Secondly at a more abstract level probabilistic approa-ches also dominate epistemology the branch of philosophycommitted to classical questions on the relation betweenknowledge and reality like What is reality Does it existmind-independent Do we have access to it If yes how Doour postulated theoretical entities exist How do they corre-spond to reality Canwemake true statements about it If yeswhat is truth and how is it connected to reality The analysesconducted to address these issues are usually intrinsicallyprobabilistic As a result these approaches dominate keyissues and controversies in epistemology such as the scientificrealism debate the structure of scientific theories Bayesianconfirmation theory causality models of explanation andnatural laws All too often scientific reasoning seems nearlysynonymous with probabilistic reasoning In view of thefact that scientific inference more and more depends onprobabilistic reasoning and that statistical analysis is not aswell-founded as might be expected the issue addressed inthis chapter is of crucial importance for epistemology [3]

Despite these philosophical objections against the hybridcharacter of inferential statistics its successes were enormousin the first decades of the twentieth century In newly estab-lished disciplines like psychology and economics significancetesting and maximum likelihood estimation were appliedwith methodological rigor in order to enhance prestige andapply scientific method to their field Although criticism thata mere chasing of low 119875 values and naive use of parametricstatistics did not do justice to specific characteristics of thesciences involved emerging from the start of the applicationof statistics the success story was immense However thisrise of the inference experts like Gigerenzer calls them inThe Rise of Statistical Thinking was just a phase or stagein the development of statistics and data analysis whichmanifests itself as a Hegelian triptych that unmistakably isnow being completed in the era of Big Data After this thesisof a successful but ununified field of inferential statistics anantithesis in the Hegelian sense of the word was unavoidableand it was this antithesis that gave rise to the current situationin data-analytical practice as well Apart from the alreadymentioned Bayesian revolt the rise of nonparametric statis-tics in the thirties must be mentioned here as an intrinsicallystatistical criticism that heralds this antithesis The majorcaesura in this process however was the work of John Tukeyin the sixties and seventies of the previous century After along career in statistics and other mathematical disciplinesTukeywrote ExplorativeData analysis in 1978This study is inmany ways a remarkable unorthodox book First it containsno axioms theorems lemmas or proofs and even barelyformulas There are no theoretical distributions significancetests 119875 values hypothesis tests parameter estimation andconfidence intervals No inferential or confirmatory statisticsbut just the understanding of data looking for patternsrelationships and structures in data and visualizing theresults According to Tukey the statistician is a detectiveas a contemporary Sherlock Holmes he must strive forsigns and ldquocluesrdquo Tukeymaintains this metaphor consistentlythroughout the book and wants to provide the data analyst


with a toolbox full of methods for understanding frequencydistributions smoothing techniques scale transformationsand above all many graphical techniques for explorationstorage and summary illustrations of data The unorthodoxapproach of Tukey in EDA reveals not so much a contrarianspirit but rather a fundamental dissatisfaction with theprevailing statistical practice and the underlying paradigm ofinferentialconfirmatory statistics [90]

In EDA Tukey endeavors to emphasize the importanceof confirmatory classical statistics but this looks for themain part a matter of politeness and courtesy In fact hehad already put his cards on the table in 1962 in the famousopening passage from The Future of Data Analysis ldquofor along time I have thought that I was a statistician interestedin inferences from the particular to the general But as Ihave watched mathematical statistics evolve I have had causeto wonder and to doubt And when I have pondered aboutwhy such techniques as the spectrum analysis of time serieshave proved so useful it has become clear that their ldquodealingwith fluctuationsrdquo aspects are in many circumstances oflesser importance than the aspects that would already havebeen required to deal effectively with the simpler case ofvery extensive data where fluctuations would no longer be aproblem All in all I have come to feel that my central interestis in data analysis which I take to include among other thingsprocedures for analyzing data techniques for interpreting theresults of such procedures ways of planning the gatheringof data to make its analysis easier more precise or moreaccurate and all the machinery and results of mathematicalstatisticswhich apply to analyzing data Data analysis is alarger andmore varied field than inference or allocationrdquo Alsoin other writings Tukey makes a sharp distinction betweenstatistics and data analysis

First Tukey gave unmistakable impulse to the emancipa-tion of the descriptivevisual approach after pioneering workof William Playfair (18th century) and Florence Nightingale(19th century) on graphical techniques that were soonovershadowed by the ldquoinferentialrdquo coup which marked theprobabilistic revolution Furthermore it is somewhat ironicthat many consider Tukey a pioneer of computational fieldssuch as data mining and machine learning although hehimself preferred a small role for the computer in his analysisand kept it in the background More importantly howeverbecause of his alleged antitheoretical stance Tukey is some-times considered the man who tried to reverse or undo theFisherian revolution and an exponent or forerunner of todayrsquoserosion of models the view that all models are wrong theclassical notion of truth is obsolete and pragmatic criteria aspredictive success in data analysis must prevail Also the ideacurrently frequently uttered in the data analytical traditionthat the presence of Big Data will makemuch of the statisticalmachinery superfluous is an import aspect of the here verybriefly sketched antithesis Before we come to the intendedsynthesis the final stage of the Hegelian triptych let us maketwo remarks concerning Tukeyrsquos heritage Although it almostsounds like a cliche yet it must be noted that EDA techniquesnowadays are routinely applied in all statistical packagesalong with in itself sometimes hybrid inferential methodsIn the current empirical methodology EDA is integrated

with inferential statistics at different stages of the researchprocess Secondly it could be argued that Tukey did notso much undermine the revolution initiated by Galton andPearson but understood the ultimate consequences of it Itwas Galton who had shown that variation and change areintrinsic in nature and that we have to look for the deviantthe special or the peculiar It was Pearson who did realize thatthe constraints of the normal distribution (Laplace Quetelet)had to be abandoned andwhodistinguished different familiesof distributions as an alternative Galtonrsquos heritage was justslightly under pressure hit by the successes of the parametricFisherian statistics on strongmodel assumptions and it couldwell be stated that this was partially reinstated by Tukey

Unsurprisingly the final stage of the Hegelian triptychstrives for some convergence if not synthesis The 19thcentury dialectical German philosopher GFW Hegel arguedthat history is a process of becoming or development inwhich a thesis evokes and binds itself to an antithesis inaddition both are placed at a higher level to be completedand to result in a fulfilling synthesis Applied to the lessmetaphysically oriented present problem this dialecticalprinciple seems particularly relevant in the era of Big Datawhich makes a reconciliation between inferential statisticsand computational science imperative Big Data sets highdemands and offers challenges to both For example it setshigh standards for data management storage and retrievaland has great influence on the research of efficiency ofmachine learning algorithms But it is also accompanied bynew problems pitfalls and challenges for statistical inferenceand its underlying mathematical theory Examples includethe effects of wrongly specifiedmodels the problems of smallhigh-dimensional datasets (microarray data) the search forcausal relationships in nonexperimental data quantifyinguncertainty efficiency theory and so on The fact that manydata-intensive empirical sciences are highly dependent onmachine learning algorithms and statistics makes bridgingthe gap of course for practical reasons compelling

In addition it seems that Big Data itself also transformsthe nature of knowledge the way of acquiring knowledgeresearchmethodology nature and status ofmodels and theo-ries In the reflections of all the briefly sketched contradictionoften emerges and in the popular literature the differences areusually enhanced leading to annexation of Big Data by oneof the two disciplines

Of course the gap between both has many aspects bothphilosophical and technical that have been left out hereHowever it must be emphasized that for the main part Tar-geted Learning intends to support the reconciliation betweeninferential statistics and computational intelligence It startswith the specification of a nonparametric and semiparametricmodel that contains only the realistic background knowledgeand focuses on the parameter of interest which is consideredas a property of the as yet unknown true data-generatingdistribution From a methodological point of view it is aclear imperative that model and parameter of interest mustbe specified in advance The (empirical) research questionmust be translated in terms of the parameter of interest anda rehabilitation of the concept model is achieved Then Tar-geted Learning involves a flexible data-adaptive estimation


procedure that proceeds in two steps First an initial estimateis searched on the basis of the relevant part of the truedistribution that is needed to evaluate the target parameterThis initial estimator is found bymeans of the super learning-algorithm In short this is based on a library of manydiverse analytical techniques ranging from logistic regressionto ensemble techniques random forest and support vectormachines Because the choice of one of these techniques byhuman intervention is highly subjective and the variation inthe results of the various techniques usually substantial SLuses a sort of weighted sum of the values calculated by meansof cross-validation Based on these initial estimators thesecond stage of the estimation procedure can be initiatedTheinitial fit is updated with the goal of an optimal bias-variancetrade-off for the parameter of interest This is accomplishedwith a targeted maximum likelihood estimator of the fluc-tuation parameter of a parametric submodel selected by theinitial estimatorThe statistical inference is then completed bycalculating standard errors on the basis of ldquoinfluence-curvetheoryrdquo or resampling techniquesThis parameter estimationretains a crucial place in the data analysis If one wants to dojustice to variation and change in the phenomena then youcannot deny Fisherrsquos unshakable insight that randomness isintrinsic and implies that the estimator of the parameter ofinterest itself has a distributionThus Fisher proved himself tobe a dualist inmaking the explicit distinction between sampleand population Neither Big Data nor full census research orany other attempt to take into account the whole of realityor a world encoded or encrypted in data can compensatefor it Although many aspects have remained undiscussedin this contribution we hope to have shown that TMLESLcontributes to the intended reconciliation between inferentialstatistics and computational science and that both ratherthan being in contradiction should be integrating parts inany concept of Data Science

8 Concluding Remark TargetedLearning and Big Data

The expansion of available data has resulted in a new fieldoften referred to as Big Data Some advocate that Big Datachanges the perspective on statistics for example sincewe measure everything why do we still need statisticsClearly Big Data refers to measuring (possibly very) highdimensional data on a very large number of units The truthis that there will never be enough data so that careful designof studies and interpretation of data is not needed anymore

To start with lots of bad data are useless so one will needto respect the experiment that generated the data in order tocarefully define the target parameter and its interpretationand design of experiments is as important as ever so that thetarget parameters of interest can actually be learned

Even though the standard error of a simple samplemean might be so small that there is no need for confidenceintervals one is often interested in much more complexstatistical target parameters For example consider theaverage treatment effect of our running example whichis not a very complex parameter relative to many other

parameters of interest such as an optimal individualizedtreatment rule Evaluation of the average treatment effectbased on a sample (ie substitution estimator obtained byplugging in the empirical distribution of the sample) wouldrequire computing the mean outcome for each possible strataof treatment and covariates Even with 119899 = 1012 observationsmost of these strata will be empty for reasonable dimensionsof the covariates so that this pure empirical estimator isnot defined As a consequence we will need smoothing(ie super learning) and really we will also need TargetedLearning for unbiased estimation and valid statisticalinference

Targeted Learning was developed in response to highdimensional data in which reasonably sized parametricmodels are simply impossible to formulate and are immenselybiased anyway The high dimension of the data only empha-sizes the need for realistic (and thereby large semiparameric)models target parameters defined as features of the data dis-tribution instead of coefficients in these parametric modelsand Targeted Learning

Themassive dimension of the data doesmake it appealingto not be necessarily restricted by a priori specification of thetarget parameters of interest so that Targeted Learning of dataadaptive target parameters discussed above is particularlyimportant future area of research providing an importantadditional flexibility without giving up on statistical infer-ence

One possible consequence of the building of large databases that collect data on total populations is that the datamight correspond with observing a single process like acommunity of individuals over time in which case onecannot assume that the data is the realization of a collectionof independent experiments the typical assumption moststatistical methods rely upon That is data cannot be rep-resented as random samples from some target populationsince we sample all units of the target population In thesecases it is important to document the connections betweenthe units so that one can pose statistical models that relyon the a variety of conditional independence assumptionsas in causal inference for networks developed in [8] Thatis we need Targeted Learning for dependent data whosedata distribution is modeled through realistic conditionalindependence assumptions

Such statistical models do not allow for statistical infer-ence based on simple methods such as the bootstrap (iesample size is 1) so that asymptotic theory for estimatorsbased on influence curves and the state of the art advancesin weak convergence theory is more crucial than ever Thatis the state of the art in probability theory will only be moreimportant in this new era of Big Data Specifically one willneed to establish convergence in distribution of standardizedestimators in these settings in which the data correspondswith the realization of one gigantic randomvariable forwhichthe statistical model assumes a lot of structure in terms ofconditional independence assumptions

Of course Targeted Learning with Big Data will requirethe programming of scalable algorithms putting fundamen-tal constraints on the type of super-learners and TMLE


Clearly Big Data does require integration of differentdisciplines fully respecting the advances made in the dif-ferent fields such as computer science statistics probabilitytheory and scientific knowledge that allows us to reduce thesize of the statistical model and to target the relevant targetparameters Funding agencies need to recognize this so thatmoney can be spent in the best possible way the best possibleway is not to give up on theoretical advances but the theoryhas to be relevant to address the real challenges that comewith real data The Biggest Mistake we can make in this BigData Era is to give up on deep statistical and probabilisticreasoning and theory and corresponding education of ournext generations and somehow think that it is just a matterof applying algorithms to data

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors thank the reviewers for their very helpful com-ments which improved the paper substantially This researchwas supported by an NIH Grant 2R01AI074345

References

[1] M J van der Laan and D Rubin ldquoTargeted maximum likeli-hood learningrdquo International Journal of Biostatistics vol 2 no1 2006

[2] M J van der Laan and S Rose Targeted Learning CausalInference for Observational and Experimental Data SpringerNew York NY USA 2011

[3] R J CM Starmans ldquoModels inference and truth Probabilisticreasoning in the information erardquo in Targeted Learning CausalInference for Observational and Experimental Studies M J vander Laan and S Rose Eds pp 1ndash20 Springer New York NYUSA 2011

[4] M J van der Laan ldquoEstimation based on case-control designswith known prevalence probabilityrdquo The International Journalof Biostatistics vol 4 no 1 2008

[5] A Chambaz and M J van der Laan ldquoTargeting the optimaldesign in randomized clinical trials with binary outcomes andno covariate theoretical studyrdquo The International Journal ofBiostatistics vol 7 no 1 pp 1ndash32 2011 Working paper 258httpbiostatsbepresscomucbbiostat

[6] A Chambaz and M J van der Laan ldquoTargeting the optimaldesign in randomized clinical trials with binary outcomesand no covariate simulation studyrdquo International Journal ofBiostatistics vol 7 no 1 article 33 2011 Working paper 258httpwwwbepresscomucbbiostat

[7] M J van der Laan L B Balzer and M L Petersen ldquoAdaptivematching in randomized trials and observational studiesrdquoJournal of Statistical Research vol 46 no 2 pp 113ndash156 2013

[8] M J van der Laan ldquoCausal inference for networksrdquo Tech Rep300 University of California Berkeley Calif USA 2012

[9] J PearlCausality Models Reasoning and Inference CambridgeUniversity Press Cambridge NY USA 2nd edition 2009

[10] R D Gill ldquoNon- and semi-parametric maximum likelihoodestimators and the von Mises method (part 1)rdquo ScandinavianJournal of Statistics vol 16 pp 97ndash128 1989

[11] A W van der Vaart and J A Wellner Weak Convergence andEmprical Processes Springer Series in Statistics Springer NewYork NY USA 1996

[12] R D Gill M J van der Laan and J A Wellner ldquoInefficientestimators of the bivariate survival function for three modelsrdquoAnnales de lrsquoInstitut Henri Poincare vol 31 no 3 pp 545ndash5971995

[13] P J Bickel C A J Klaassen Y Ritov and J Wellner Efficientand Adaptive Estimation for Semiparametric Models Springer1997

[14] S Gruber and M J van der Laan ldquoA targeted maximumlikelihood estimator of a causal effect on a bounded continuousoutcomerdquo International Journal of Biostatistics vol 6 no 1article 26 2010

[15] M J van der Laan and S Dudoit ldquoUnified cross-validationmethodology for selection among estimators and a generalcross-validated adaptive epsilon-net estimator finite sampleoracle inequalities and examplesrdquo Technical Report Divisionof Biostatistics University of California Berkeley Calif USA2003

[16] A W van der Vaart S Dudoit and M J van der LaanldquoOracle inequalities for multi-fold cross validationrdquo Statisticsand Decisions vol 24 no 3 pp 351ndash371 2006

[17] M J van der Laan S Dudoit and A W van der VaartldquoThe cross-validated adaptive epsilon-net estimatorrdquo Statisticsamp Decisions vol 24 no 3 pp 373ndash395 2006

[18] M J van der Laan E Polley and A Hubbard ldquoSuper learnerrdquoStatistical Applications in Genetics andMolecular Biology vol 6no 1 article 25 2007

[19] E C Polley S Rose and M J van der Laan ldquoSuper learningrdquoin Targeted Learning Causal Inference for Observational andExperimental Data M J van der Laan and S Rose EdsSpringer New York NY USA 2012

[20] M J van der Laan and J M Robins Unified Methods forCensored Longitudinal Data and Causality NewYork NYUSASpringer 2003

[21] M L Petersen and M J van der Laan A General Roadmapfor the Estimation of Causal Effects Division of Biostatis ticsUniversity of California Berkeley Calif USA 2012

[22] J Splawa-Neyman ldquoOn the application of probability theory toagricultural experimentsrdquo Statistical Science vol 5 no 4 pp465ndash480 1990

[23] D B Rubin ldquoEstimating causal effects of treatments in ran-domized and non-randomized studiesrdquo Journal of EducationalPsychology vol 64 pp 688ndash701 1974

[24] D B Rubin Matched Sampling for Causal Effects CambridgeUniversity Press Cambridge Mass USA 2006

[25] P W Holland ldquoStatistics and causal inferencerdquo Journal of theAmerican Statistical Association vol 81 no 396 pp 945ndash9601986

[26] J Robins ldquoAnew approach to causal inference inmortality stud-ies with a sustained exposure periodmdashapplication to controlof the healthy worker survivor effectrdquoMathematical Modellingvol 7 no 9ndash12 pp 1393ndash1512 1986

[27] J M Robins ldquoAddendum to ldquoA new approach to causal infer-ence in mortality studies with a sustained exposure periodmdashapplication to control of the healthy worker survivor effectrdquordquoComputers amp Mathematics with Applications vol 14 no 9ndash12pp 923ndash945 1987


[28] J Robins ldquoA graphical approach to the identification andestimation of causal parameters in mortality studies withsustained exposure periodsrdquo Journal of Chronic Diseases vol40 supplement 2 pp 139Sndash161S 1987

[29] A Rotnitzky D Scharfstein T L Su and J Robins ldquoMethodsfor conducting sensitivity analysis of trials with potentiallynonignorable competing causes of censoringrdquo Biometrics vol57 no 1 pp 103ndash113 2001

[30] J M Robins A Rotnitzky and D O Scharfstein ldquoSensitivityanalysis for se lection bias and unmeasured confounding in missing data and causal inference modelsrdquo in Statistical Modelsin Epidemiology the Environment and Clinical Trials IMAVolumes in Mathematics and Its Applications Springer BerlinGermany 1999

[31] D O Scharfstein A Rotnitzky and J Robins ldquoAdjustingfor nonignorable drop-out using semiparametric nonresponsemodelsrdquo Journal of the American Statistical Association vol 94no 448 pp 1096ndash1146 1999

[32] I Diaz and M J van der Laan ldquoSensitivity analysis forcausal inference under unmeasured confounding and mea-surement error problemsrdquo Tech Rep Division of Biostatis-tics University of California Berkeley Calif USA 2012httpwwwbepresscomucbbiostatpaper303

[33] O Bembom and M J van der Laan ldquoA practical illustration ofthe im-portance of realistic individualized treatment rules incausal inferencerdquo Electronic Journal of Statistics vol 1 pp 574ndash596 2007

[34] S Rose and M J van der Laan ldquoSimple optimal weighting ofcases and controls in case-control studiesrdquo The InternationalJournal of Biostatistics vol 4 no 1 2008

[35] S Rose and M J van der Laan ldquoWhy match Investigatingmatched case-control study designs with causal effect estima-tionrdquo The International Journal of Biostatistics vol 5 no 1article 1 2009

[36] S Rose and M J van der Laan ldquoA targeted maximum likeli-hood estimator for two-stage designsrdquo International Journal ofBiostatistics vol 7 no 1 21 pages 2011

[37] K L Moore and M J van der Laan ldquoApplication of time-to-event methods in the assessment of safety in clinical trialsrdquoin Design Summarization Analysis amp Interpretation of ClinicalTrials with Time-to-Event Endpoints E Karl Ed Chapman andHall 2009

[38] K L Moore and M J van der Laan ldquoCovariate adjustment inrandomized trials with binary outcomes targeted maximumlikelihood estimationrdquo Statistics in Medicine vol 28 no 1 pp39ndash64 2009

[39] K L Moore and M J van der Laan ldquoIncreasing powerin randomized trials with right censored outcomes throughcovariate adjustmentrdquo Journal of Biopharmaceutical Statisticsvol 19 no 6 pp 1099ndash1131 2009

[40] O Bembom M L Petersen S-Y Rhee et al ldquoBiomarker dis-covery using targetedmaximum likelihood estimation applica-tion to the treatment of antiretroviral resistant HIV infectionrdquoStatistics in Medicine vol 28 pp 152ndash172 2009

[41] R Neugebauer M J Silverberg and M J van der LaanldquoObservational study and individualized antiretroviral therapyinitiation rules for reducing cancer incidence in HIV-infectedpatientsrdquo Tech Rep 272 Division of Biostatistics University ofCalifornia Berkeley Calif USA 2010

[42] E C Polley and M J van der Laan ldquoPredicting optimaltreatment assignment based on prognostic factors in cancer

patientsrdquo in Design Summarization Analysis amp Interpretationof Clinical Trials with Time-to-Event Endpoints K E Peace EdChapman amp Hall 2009

[43] M Rosenblum S G Deeks M van der Laan and D RBangsberg ldquoThe risk of virologic failure decreaseswith durationof HIV suppression at greater than 50 adherence to antiretro-viral therapyrdquo PLoS ONE vol 4 no 9 Article ID e7196 2009

[44] M J van der Laan and S Gruber ldquoCollaborative double robusttargeted maximum likelihood estimationrdquo The InternationalJournal of Biostatistics vol 6 no 1 article 17 2010

[45] O M Stitelman andM J van der Laan ldquoCollaborative targetedmaximum like-lihood for time to event datardquo Tech Rep 260Division of Biostatistics University of California BerkeleyCalif USA 2010

[46] S Gruber and M J van der Laan ldquoAn application of collabo-rative targeted maximum likelihood estimation in causal infer-ence and genomicsrdquo The International Journal of Biostatisticsvol 6 no 1 2010

[47] M Rosenblum and M J van der Laan ldquoTargeted maximumlikelihood estimation of the parameter of a marginal structuralmodelrdquo International Journal of Biostatistics vol 6 no 2 2010

[48] HWang S Rose andM J van der Laan ldquoFinding quantitativetrait loci genes with collaborative targetedmaximum likelihoodlearningrdquo Statistics amp Probability Letters vol 81 no 7 pp 792ndash796 2011

[49] I D Munoz and M J van der Laan ldquoSuper learner basedconditional density estimation with application to marginalstructural modelsrdquo International Journal of Biostatistics vol 7no 1 article 38 2011

[50] I D Munoz and M van der Laan ldquoPopulation interventioncausal effects based on stochastic interventionsrdquoBiometrics vol68 no 2 pp 541ndash549 2012

[51] I Diaz and M J van der Laan ldquoSensitivity analysis for causalinference under unmeasured confounding and measurementerror problemsrdquo International Journal of Biostatistics vol 9 no2 pp 149ndash160 2013

[52] I Diaz and M J van der Laan ldquoAssessing the causal effect ofpolicies an example using stochastic interventionsrdquo Interna-tional Journal of Biostatistics vol 9 no 2 pp 161ndash174 2013

[53] I Diaz and J Mark van der Laan ldquoTargeted data adaptiveestimation of the causal dosemdashresponse curverdquo Journal ofCausal Inference vol 1 no 2 pp 171ndash192 2013

[54] O M Stitelman and M J van der Laan ldquoTargeted maximumlikelihood estimation of effect modification parameters insurvival analysisrdquoThe International Journal of Biostatistics vol7 no 1 article 19 2011

[55] M J van der Laan ldquoTargetedmaximum likelihood based causalinference Part Irdquo International Journal of Biostatistics vol 6 no2 Art pages 2010

[56] O M Stitelman and M J van der Laan ldquoTargeted maximumlikelihood estimation of time-to-event parameters with time-dependent covariatesrdquo Tech Rep Division of BiostatisticsUniversity of California Berkeley Calif USA 2011

[57] M Schnitzer E Moodie M J van der Laan R Platt and MKlei ldquoModeling the impact of hepatitis C viral clearance on end-stage liver disease in an HIV co-infected cohort with TargetedMaximum Likelihood Estimationrdquo Biometrics vol 70 no 1 pp144ndash152 2014

[58] S Gruber and M J van der Laan ldquoTargeted minimum lossbased estimator that outperforms a given estimatorrdquoThe Inter-national Journal of Biostatistics vol 8 article 11 no 1 2012


[59] S Gruber and M J van der Laan ldquoConsistent causal effectestimation under dual misspecification and implications forconfounder selection procedurerdquo Statistical Methods in MedicalResearch 2012

[60] M Petersen J Schwab S Gruber N Blaser M Schomaker andM J van der Laan ldquoTargetedminimum loss based estimation ofmarginal structural workingmodelsrdquo Tech Rep 312 Universityof California Berkeley Calif USA 2013

[61] J Brooks M J van der Laan D E Singer and A S GoldquoTargeted minimum loss-based estimation of causal effects inright-censored survival data with time-dependent covariateswarfarin stroke and death in atrial fibrillationrdquo Journal ofCausal Inference vol 1 no 2 pp 235ndash254 2013

[62] J Brooks M J van der Laan and A S Go ldquoTargetedmaximumlikelihood estimation for prediction calibrationrdquo InternationalJournal of Biostatistics vol 8 article 30 no 1 2012

[63] S Sapp M J van der Laan and K Page ldquoTargeted estimationof variable importance measures with interval-censored out-comesrdquo Tech Rep 307 University of California Berkeley CalifUSA 2013

[64] R Neugebauer J A Schmittdiel and M J van der LaanldquoTargeted learning in real-world comparative effectivenessresearch with time-varying interventionsrdquo Tech RepHHSA29020050016I The Agency for Healthcare Researchand Quality 2013

[65] S D Lendle M S Subbaraman and M J van der LaanldquoIdentification and efficient estimation of the natural directeffect among the untreatedrdquo Biometrics vol 69 no 2 pp 310ndash317 2013

[66] S D Lendle B Fireman and M J van der Laan ldquoTargetedmaximum likelihood estimation in safety analysisrdquo Journal ofClinical Epidemiology vol 66 no 8 pp S91ndashS98 2013

[67] M S Subbaraman S Lendle M van der Laan L A Kaskutasand J Ahern ldquoCravings as a mediator and moderator ofdrinking outcomes in the COMBINE studyrdquoAddiction vol 108no 10 pp 1737ndash1744 2013

[68] S D Lendle B Fireman and M J van der Laan ldquoBalancingscore adjusted targeted minimum loss-based estimationrdquo 2013

[69] W Zheng M L Petersen and M J van der Laan ldquoEstimatingthe effect of a community-based intervention with two commu-nitiesrdquo Journal of Causal Inference vol 1 no 1 pp 83ndash106 2013

[70] W Zheng and M J van der Laan ldquoTargeted maximum likeli-hood estimation of natural direct effectsrdquo International Journalof Biostatistics vol 8 no 1 2012

[71] W Zheng and M J van der Laan ldquoCausal mediation in asurvival setting with time-dependent mediatorsrdquo TechnicalReport 295 Division of Biostatistics University of CaliforniaBerkeley Calif USA 2012

[72] M Carone M Petersen and M J van der Laan ldquoTargetedminimum loss based estimation of a casual effect using intervalcensored time to event datardquo in Interval Censored Time to EventData Methods and Applications D-G Chen J Sun and K E Peace Eds Chapman amp HallCRC New York NY USA 2012

[73] D O Scharfstein A Rotnitzky and J M Robins ldquoAdjustingfor nonignorable drop-out using semiparametric nonresponsemodels (with discussion and rejoinder)rdquo Journal of the Ameri-can Statistical Association vol 94 pp 1096ndash1120 1999

[74] H Bang and JM Robins ldquoDoubly robust estimation inmissingdata and causal inference modelsrdquo Biometrics vol 61 no 4 pp962ndash972 2005

[75] A Chambaz N Pierre and M J van der Laan ldquoEstimation ofa non-parametric variable importance measure of a continuousexposurerdquo Electronic Journal of Statistic vol 6 pp 1059ndash10992012

[76] C Tuglus and M J van der Laan ldquoTargeted methods forbiomarker discovery the search for a standardrdquo UC BerkeleyWorking Paper Series 2008 httpwwwbepresscomucbbios-tatpaper233

[77] C Tuglus and M J van der Laan ldquoModified FDR controllingprocedure for multi-stage analysesrdquo Statistical Applications inGenetics and Molecular Biology vol 8 no 1 article 12 2009

[78] C Tuglus and M J van der Laan ldquoTargeted methods forbiomarker discoveriesrdquo in Targeted Learning Causal InferenceforObservational andExperimentalDataM J van der Laan andS Rose Eds chapter 22 Springer New York NY USA 2011

[79] HWang S Rose andM J van der Laan ldquoFinding quantitativetrait loci genesrdquo in Targeted Learning Causal Inference forObservational and Experimental Data MJ van der Laan andS Rose Eds Springer New York NY USA 2011 chapter 23

[80] L B Balzer and M J van der Laan ldquoEstimating effects onrare outcomes knowledge is powerrdquo Tech Rep 310 Divisionof Biostatistics University of California Berkeley Calif USA2013

[81] W Zheng and M J van der Laan ldquoCross-validated targetedminimum loss based estimationrdquo in Targeted Learning CausalInference for Observational and Experimental Studies M J vander Laan and S Rose Eds Springer New York NY USA 2011

[82] A Rotnitzky Q Lei M Sued and J M Robins ldquoImproveddouble-robust estimation in missing data and causal inferencemodelsrdquo Biometrika vol 99 no 2 pp 439ndash456 2012

[83] D B Rubin and M J van der Laan ldquoEmpirical efficiencymaximization improved locally efficient covariate adjustmentin randomized experiments and survival analysisrdquoThe Interna-tional Journal of Biostatistics vol 4 no 1 article 5 2008

[84] M J van der Laan ldquoStatistical inference when using dataadaptive estimators of nuisance parametersrdquo Tech Rep 302Division of Biostatistics University of California BerkeleyCalif USA 2012

[85] M J van der and M L Petersen ldquoTargeted learningrdquo inEnsemble Machine Learning pp 117ndash156 Springer New YorkNY USA 2012

[86] M J van der Laan A E Hubbard and S Kherad ldquoStatisticalinference for data adaptive target parametersrdquo Tech Rep 314University of California Berkeley Calif USA June 2013

[87] M J van der Laan ldquoTargeted learning of an optimal dynamictreatment and statistical inference for its mean outcomerdquo TechRep 317 University of California at Berkeley 2013 To appear inJournal of Causal Inference

[88] JM Robins L Li E Tchetgen andAW van derVaart ldquoHigherorder influence functions and minimax estimation of non-linear functionalsrdquo in Essays in Honor of David A Freedman IMS Collections Probability and Statistics pp 335ndash421 SpringerNew York NY USA 2008

[89] S Rose R J C M Starmans and M J van der Laan ldquoTar-geted learning for causality and statistical analysis in medicalresearchrdquo Tech Rep 297 Division of Biostatistics University ofCalifornia Berkeley Calif USA 2011

[90] R J C M Starmans ldquoPicasso Hegel and the era of big datardquoStator vol 2 no 24 2013 (Dutch)

[91] R J CM Starmans andM J van der Laan ldquoInferential statisticsversusmachine learning a prelude to reconciliationrdquo Stator vol2 no 24 2013 (Dutch)

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of


Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of


Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of


Mathematical PhysicsAdvances in

Complex AnalysisJournal of


OptimizationJournal of


CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of


Operations ResearchAdvances in

Journal of


Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences


The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014


Algebra

Discrete Dynamics in Nature and Society



Decision SciencesAdvances in

Discrete MathematicsJournal of


Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of


we call statistics and makes it impossible to teach statisticsas a scientific discipline even though the foundations ofstatistics including a very rich theory are purely scientificThat is our field has suffered from a disconnect betweenthe theory of statistics and the practice of statistics whilepractice should be driven by relevant theory and theoreticaldevelopments should be driven by practice For example atheorem establishing consistency and asymptotic normalityof a maximum likelihood estimator for a parametric modelthat is known to be misspecified is not a relevant theoremfor practice since the true data generating distribution is notcaptured by this theorem

Defining the statistical model to actually contain the trueprobability distribution has enormous implications for thedevelopment of valid estimators For example maximumlikelihood estimators are now ill defined due to the curse ofdimensionality of the model In addition even regularizedmaximum likelihood estimators are seriously flawed a gen-eral problem with maximum likelihood based estimators isthat the maximum likelihood criterion only cares about howwell the density estimator fits the true density resulting ina wrong trade-off for the actual target parmaeter of interestFrom a practical perspective whenwe useAIC BIC or cross-validated log-likelihood to select variables in our regressionmodel then that procedure is ignorant of the specific featureof the data distribution we want to estimate That is in largestatistical models it is immediately apparent that estimatorsneed to be targeted towards their goal just like a human beinglearns the answer to a specific question in a targeted mannerand maximum likelihood based estimators fail to do that

In Section 3we review the roadmap for Targeted Learningof a causal quantity involving defining a causal model andcausal quantity of interest establishing an estimand of thedata distribution that equals the desired causal quantity underadditional causal assumptions applying the pure statisticalTargeted Learning of the relevant estimandbased on a statisti-cal model compatible with the causal model but for sure con-taining the true data distribution and careful interpretationof the results In Section 4 we proceed with describing ourproposed targeted minimum loss-based estimation (TMLE)template which represents a concrete template for construc-tion of targeted efficient substitution estimators which arenot only asymptotically consistent asymptotically normallydistributed and asymptotically efficient but also tailoredto have robust finite sample performance Subsequently inSection 5 we review some of our most important advancesin Targeted Learning demonstrating the remarkable powerand flexibility of this TMLE methodology and in Section 6we describe future challenges and areas of research InSection 7 we provide a historical philosophical perspectiveof Targeted Learning Finally in Section 8 we conclude withsome remarks puttingTargeted Learning in the context of themodern era of Big Data

We refer to our papers and book on Targeted Learningfor overviews of relevant parts of the literature that put ourspecific contributions within the field of Targeted Learningin the context of the current literature thereby allowing usto focus on Targeted Learning itself in the current outlookpaper

2 Targeted Learning

Our research takes place in a subfield of statistics we namedTargeted Learning [1 2] In statistics the data (119874

1 119874

119899)

on 119899 units is viewed as a realization of a random variableor equivalently an outcome of a particular experiment andthereby has a probability distribution 119875119899

0 often called the

data distribution For example one might observe 119874119894=

(119882119894 119860119894 119884119894) on a subject 119894 where 119882

119894are baseline character-

istics of the subject 119860119894is a binary treatment or exposure the

subject received and119884119894is a binary outcome of interest such as

an indicator of death 119894 = 1 119899 Throughout this paper wewill use this data structure to demonstrate the concepts andestimation procedures

21 Statistical Model A statistical model M119899 is defined as aset of possible probability distributions for the data distribu-tion and thus represents the available statistical knowledgeabout the true data distribution 119875119899

0 In Targeted Learning

this core-definition of the statistical model is fully respectedin the sense that one should define the statistical model tocontain the true data distribution 119875119899

0isin M119899 So contrary to

the often conveniently used slogan ldquoAllmodels are wrong butsome are usefulrdquo and erosion over time of the original truemeaning of a statistical model throughout applied researchTargeted Learning defines the model for what it actually is[3] If there is truly no statistical knowledge available thenthe statistical model is defined as all data distributions Apossible statistical model is the model that assumes that(1198741 119874

119899) are 119899 independent and identically distributed

random variables with completely unknown probability dis-tribution 119875

0 representing the case that the sampling of the

data involved repeating the same experiment independentlyIn our example this would mean that we assume that(119882119894 119860119894 119884119894) are independent with a completely unspecified

common probability distribution For example if 119882 is 10-dimensional while 119860 and 119884 are two-dimensional then 119875

0

is described by a 12-dimensional density and this statisticalmodel does not put any restrictions on this 12-dimensionaldensity One could factorize this density of (119882119860 119884) asfollows

1199010(119882119860 119884) = 119901

1198820(119882) 119901

119860|1198820(119860 | 119882) 119901

119884|1198601198820(119884 | 119860119882)

(1)

where 1199011198820

is the density of the marginal distribution of119882119901119860|1198820

is the conditional density of 119860 given119882 and 119901119884|1198601198820

is the conditional density of 119884 given 119860 119882 In this modeleach of these factors is unrestricted On the other handsuppose now that the data is generated by a randomizedcontrolled trial in which we randomly assign treatment 119860 isin0 1 with probability 05 to a subject In that case theconditional density of119860 given119882 is known but themarginaldistribution of the covariates and the conditional distributionof the outcome given covariates and treatment might still beunrestricted Even in an observational study onemight knowthat treatment decisions were only based on a small subset ofthe available covariates119882 so that it is known that 119901

119860|1198820(1 |

119882) only depends on119882 through these few covariates In the



119884|119860119882(1 | 119860119882) is



1 119874



1 119874








1 119874

119899) but one



1 119874



1 119874

119894minus1


119894







0= Ψ(119875

119899

0) This choice of






119894= (119882




1205950= Ψ (119875

0)

= 11986411987501198641198750(119884 | 119860 = 1119882) minus 119864

1198750(119884 | 119860 = 0119882)

(2)


1205950= int119908

119875119884|1198601198820

(1 | 119860 = 1119882 = 119908)

minus119875119884|1198601198820

(1 | 119860 = 0119882 = 119908) 1198751198820(119889119908)

(3)

where 119875119884|1198601198820



119875119884|119860119882

(1 | 119860119882) =1

1 + exp (minus1198980(119860119882))

(4)



0(119886 119908) = 03119886 +


021199081+01119908


0 the true


1205950= int119908

(1

1 + exp (minus1198980(1 119908))

minus1

1 + exp (minus1198980(0 119908))

)

times 1198751198820(119889119908)

(5)



(1 | 119860 = 1119882 = 119908) minus 119875119884|1198601198820

(1 | 119860 = 0119882 = 119908)


0


119882 = 119891119882(119880119882)

119860 = 119891119860(119882119880

119860)

119884 = 119891119884(119882119860119880

119884)

(6)




119891119882(119906119882) 119860 = 119891

119860(119882 119906119860) 119884 = 119891

119884(119882119860 119906

119910) One might



0by a




119860(119882119880



119882(119880119882) 119860 = 1 119884

1= 119891119884(119882 1 119880




0= 119891119884(119882 0 119880

119884) Thus 119884

0(1198841)



0 If one also



0=






problem either


1 119874

119899) sim 119875119899

0isin M119899) and



1 119874






0if

120595119899minus 1205950= (1119899)sum

119899



0 In that





given by 120595119899plusmn 196120590

119899radic119899 where 1205902



0)(119874119894)


0) of an



0)(119874)




119899) is defined


119899=1minus

1198750) where 119875


observation 119874


0[13] but it








0= 119876(119875


0


119899) where119876


0that is




where1198761198820


1198760(119860119882) = 119864



1205950= Ψ (119876

0) = 1198641198761198820

1198760(1119882) minus 119876

0(0119882) (7)







for1198761198820


regression 1198760

120595119899= Ψ (119876

119882119899 119876119899) =

1

119899

119899

sum119894=1

119876119899(1119882119894) minus 119876119899(0119882119894) (8)


120595119899119868119875119879119882

=1

119899

119899

sum119894=1

2119860119894minus 1

119866119899(119860119894| 119882119894)119884119894 (9)






0of the data dis-


0= 119866(119875




119860 given119882








1198761198641198750


0is then defined as









0) equal to


120595119899minus 1205950=1

119899

119899

sum119894=1

119863lowast(1198750) (119874119894) + 119900119875(1

radic119899) (10)



120598=0as an


1 ℎ2⟩ =





is given by

119863lowast (1198750) (119874) =

2119860 minus 1

1198660(119860119882)

(119884 minus 1198760(119860119882))

+1198760(1119882) minus 119876

0(0119882) minus Ψ (119876

0)

(11)



0= 1198750IC(1198750)2 of



119894=1IC119899(119874119894)2 where


0) Efficiency








119899(119874119894) of119863lowast(119875

0)(119874119894)




119899) =

(1119899)sum119899

119894=1(119876119899(1119882119894) minus 119876119899(0119882119894))


0through 119876






119899) beyond 119876











0) minus Ψ (119875) + 119877 (119875 119875

0) (12)

where 119877(119875 1198750) = 119900(119875 minus 119875




0) and


119899 119866119899) = 0 targets 119876

119899to fit

Ψ(1198760)


1(119875 1198750) minus

1198770(119875 1198750) where 119877

119886(119875 1198750) = int

119908((119866 minus 119866

0)(119886 | 119882)119866(119886 |

119882))(119876 minus1198760)(119886119882)119889119875



1198660)(119876 minus 119876


either 119866 = 1198660or 119876 = 119876




0so that

119877( 1198750) = 119900


0) asymp




119899119863lowast(119876

119899 119866119899) = 0 with (12) at 119875 =

(119876119899 119866119899) yields

Ψ (119876119899) minus Ψ (119876

0) = (119875

119899minus 1198750)119863lowast(119876119899 119866119899) + 119877119899 (13)








119899 119892119899) = (119875

119899minus1198750)119863lowast(119876

0 1198920) + 119900119875(1radic119899) if

1198750119863lowast(119876

119899 119866119899)minus119863lowast(119876






119899= 119900119875(1radic119899) and


120595119899minus 1205950=1

119899

119899

sum119894=1

119863lowast(1198760 1198660) (119874119894) + 119900119875(1

radic119899) (14)







conditions









0













0 say 119876

0= 119876(119875

0)






1198750(119876) = 119864






119899119866119899(120598) 120598 with Euclidean


119889

119889120598119871 (119876119899119866119899

(120598))10038161003816100381610038161003816100381610038161003816120598=0

(15)




119882) so




119889

119889120598119871 (119876119866(120598)) (119874)

10038161003816100381610038161003816100381610038161003816120598=0

= 119863lowast119884(119876 119866) =

2119860 minus 1

119866 (119860119882)(119884 minus 119876 (119860119882))

(16)


119882(119876 119876119882))119876119882


119882(119882) where


119889

119889120598119871 (119876119882(120598))10038161003816100381610038161003816100381610038161003816120598=0

= 119863lowast

119882(119876) (17)


then119889

119889120598119871 (119876119866(120598) 119876

119882(120598))

10038161003816100381610038161003816100381610038161003816120598=0= 119863lowast(119876 119866) (18)





119884(119876 119866) at (119876 119866) as




1for the path


119882 In



119899119863lowast119884(119876lowast

119899 119866119899) = 0 and thus in




not update 119876119882119899



119899119871(119876119899119866119899(120598)) along this




119899asymp 0


0



119899solves the


119899 119866119899)(119874119894) = 0


119899) is asymp-



120598119875119899119871

(1198760

119899119866119899(120598)) while 120598

2119899= arg min119875

119899119871(119876119882119899(1205982)) equals zero



119899= 1198761

119899


119882119899= 119876119882119899


1 119882

119899 The TMLE of 120595

0is the













1198750119892lowast (119874) =

119870+1

prod119896=0

1198750119871(119896)|119871(119896minus1)119860(119896minus1)

(119871 (119896) | 119871 (119896 minus 1) 119860 (119896 minus 1))

119870

prod119896=0

119892lowast

119896(119860 (119896) | 119860 (119896 minus 1) 119871 (119896))

(19)







119875119864119875(119884 | 119860 = 1



119875(119884 | 119860119882) minus 119864

119875(119884 | 119860 = 0119882)

















119895in the fit of 119866



119895 but in



119895 We




119899[2 44ndash




119899and 119866

119899are not


119899






119899[58

68]


119899








119899might












119899








0119884119886 119886) where 119864

0119884119886= 11986401198640(119884 | 119860 =




















0and 119866







































Acknowledgments


References






































































































Volume 2014




Journal of











Journal of


Function Spaces






Algebra











119884|119860119882(1 | 119860119882) is



1 119874



1 119874








1 119874

119899) but one



1 119874



1 119874

119894minus1


119894







0= Ψ(119875

119899

0) This choice of






119894= (119882




1205950= Ψ (119875

0)

= 11986411987501198641198750(119884 | 119860 = 1119882) minus 119864

1198750(119884 | 119860 = 0119882)

(2)


1205950= int119908

119875119884|1198601198820

(1 | 119860 = 1119882 = 119908)

minus119875119884|1198601198820

(1 | 119860 = 0119882 = 119908) 1198751198820(119889119908)

(3)

where 119875119884|1198601198820



119875119884|119860119882

(1 | 119860119882) =1

1 + exp (minus1198980(119860119882))

(4)



0(119886 119908) = 03119886 +


021199081+01119908


0 the true


1205950= int119908

(1

1 + exp (minus1198980(1 119908))

minus1

1 + exp (minus1198980(0 119908))

)

times 1198751198820(119889119908)

(5)



(1 | 119860 = 1119882 = 119908) minus 119875119884|1198601198820

(1 | 119860 = 0119882 = 119908)


0


119882 = 119891119882(119880119882)

119860 = 119891119860(119882119880

119860)

119884 = 119891119884(119882119860119880

119884)

(6)




119891119882(119906119882) 119860 = 119891

119860(119882 119906119860) 119884 = 119891

119884(119882119860 119906

119910) One might



0by a




119860(119882119880



119882(119880119882) 119860 = 1 119884

1= 119891119884(119882 1 119880




0= 119891119884(119882 0 119880

119884) Thus 119884

0(1198841)



0 If one also



0=






problem either


1 119874

119899) sim 119875119899

0isin M119899) and



1 119874






0if

120595119899minus 1205950= (1119899)sum

119899



0 In that





given by 120595119899plusmn 196120590

119899radic119899 where 1205902



0)(119874119894)


0) of an



0)(119874)




119899) is defined


119899=1minus

1198750) where 119875


observation 119874


0[13] but it








0= 119876(119875


0


119899) where119876


0that is




where1198761198820


1198760(119860119882) = 119864



1205950= Ψ (119876

0) = 1198641198761198820

1198760(1119882) minus 119876

0(0119882) (7)







for1198761198820


regression 1198760

120595119899= Ψ (119876

119882119899 119876119899) =

1

119899

119899

sum119894=1

119876119899(1119882119894) minus 119876119899(0119882119894) (8)


120595119899119868119875119879119882

=1

119899

119899

sum119894=1

2119860119894minus 1

119866119899(119860119894| 119882119894)119884119894 (9)






0of the data dis-


0= 119866(119875




119860 given119882








1198761198641198750


0is then defined as









0) equal to


120595119899minus 1205950=1

119899

119899

sum119894=1

119863lowast(1198750) (119874119894) + 119900119875(1

radic119899) (10)



120598=0as an


1 ℎ2⟩ =





is given by

119863lowast (1198750) (119874) =

2119860 minus 1

1198660(119860119882)

(119884 minus 1198760(119860119882))

+1198760(1119882) minus 119876

0(0119882) minus Ψ (119876

0)

(11)



0= 1198750IC(1198750)2 of



119894=1IC119899(119874119894)2 where


0) Efficiency








119899(119874119894) of119863lowast(119875

0)(119874119894)




119899) =

(1119899)sum119899

119894=1(119876119899(1119882119894) minus 119876119899(0119882119894))


0through 119876






119899) beyond 119876











0) minus Ψ (119875) + 119877 (119875 119875

0) (12)

where 119877(119875 1198750) = 119900(119875 minus 119875




0) and


119899 119866119899) = 0 targets 119876

119899to fit

Ψ(1198760)


1(119875 1198750) minus

1198770(119875 1198750) where 119877

119886(119875 1198750) = int

119908((119866 minus 119866

0)(119886 | 119882)119866(119886 |

119882))(119876 minus1198760)(119886119882)119889119875



1198660)(119876 minus 119876


either 119866 = 1198660or 119876 = 119876




0so that

119877( 1198750) = 119900


0) asymp




119899119863lowast(119876

119899 119866119899) = 0 with (12) at 119875 =

(119876119899 119866119899) yields

Ψ (119876119899) minus Ψ (119876

0) = (119875

119899minus 1198750)119863lowast(119876119899 119866119899) + 119877119899 (13)








119899 119892119899) = (119875

119899minus1198750)119863lowast(119876

0 1198920) + 119900119875(1radic119899) if

1198750119863lowast(119876

119899 119866119899)minus119863lowast(119876






119899= 119900119875(1radic119899) and


120595119899minus 1205950=1

119899

119899

sum119894=1

119863lowast(1198760 1198660) (119874119894) + 119900119875(1

radic119899) (14)







conditions









0













0 say 119876

0= 119876(119875

0)






1198750(119876) = 119864






119899119866119899(120598) 120598 with Euclidean


119889

119889120598119871 (119876119899119866119899

(120598))10038161003816100381610038161003816100381610038161003816120598=0

(15)




119882) so




119889

119889120598119871 (119876119866(120598)) (119874)

10038161003816100381610038161003816100381610038161003816120598=0

= 119863lowast119884(119876 119866) =

2119860 minus 1

119866 (119860119882)(119884 minus 119876 (119860119882))

(16)


119882(119876 119876119882))119876119882


119882(119882) where


119889

119889120598119871 (119876119882(120598))10038161003816100381610038161003816100381610038161003816120598=0

= 119863lowast

119882(119876) (17)


then119889

119889120598119871 (119876119866(120598) 119876

119882(120598))

10038161003816100381610038161003816100381610038161003816120598=0= 119863lowast(119876 119866) (18)





119884(119876 119866) at (119876 119866) as




1for the path


119882 In



119899119863lowast119884(119876lowast

119899 119866119899) = 0 and thus in




not update 119876119882119899



119899119871(119876119899119866119899(120598)) along this




119899asymp 0


0



119899solves the


119899 119866119899)(119874119894) = 0


119899) is asymp-



120598119875119899119871

(1198760

119899119866119899(120598)) while 120598

2119899= arg min119875

119899119871(119876119882119899(1205982)) equals zero



119899= 1198761

119899


119882119899= 119876119882119899


1 119882

119899 The TMLE of 120595

0is the













1198750119892lowast (119874) =

119870+1

prod119896=0

1198750119871(119896)|119871(119896minus1)119860(119896minus1)

(119871 (119896) | 119871 (119896 minus 1) 119860 (119896 minus 1))

119870

prod119896=0

119892lowast

119896(119860 (119896) | 119860 (119896 minus 1) 119871 (119896))

(19)







119875119864119875(119884 | 119860 = 1



119875(119884 | 119860119882) minus 119864

119875(119884 | 119860 = 0119882)

















119895in the fit of 119866



119895 but in



119895 We




119899[2 44ndash




119899and 119866

119899are not


119899






119899[58

68]


119899








119899might












119899








0119884119886 119886) where 119864

0119884119886= 11986401198640(119884 | 119860 =




















0and 119866







































Acknowledgments


References






































































































Volume 2014




Journal of











Journal of


Function Spaces






Algebra










021199081+01119908


0 the true


1205950= int119908

(1

1 + exp (minus1198980(1 119908))

minus1

1 + exp (minus1198980(0 119908))

)

times 1198751198820(119889119908)

(5)



(1 | 119860 = 1119882 = 119908) minus 119875119884|1198601198820

(1 | 119860 = 0119882 = 119908)


0


119882 = 119891119882(119880119882)

119860 = 119891119860(119882119880

119860)

119884 = 119891119884(119882119860119880

119884)

(6)




119891119882(119906119882) 119860 = 119891

119860(119882 119906119860) 119884 = 119891

119884(119882119860 119906

119910) One might



0by a




119860(119882119880



119882(119880119882) 119860 = 1 119884

1= 119891119884(119882 1 119880




0= 119891119884(119882 0 119880

119884) Thus 119884

0(1198841)



0 If one also



0=






problem either


1 119874

119899) sim 119875119899

0isin M119899) and



1 119874






0if

120595119899minus 1205950= (1119899)sum

119899



0 In that





given by 120595119899plusmn 196120590

119899radic119899 where 1205902



0)(119874119894)


0) of an



0)(119874)




119899) is defined


119899=1minus

1198750) where 119875


observation 119874


0[13] but it








0= 119876(119875


0


119899) where119876


0that is




where1198761198820


1198760(119860119882) = 119864



1205950= Ψ (119876

0) = 1198641198761198820

1198760(1119882) minus 119876

0(0119882) (7)







for1198761198820


regression 1198760

120595119899= Ψ (119876

119882119899 119876119899) =

1

119899

119899

sum119894=1

119876119899(1119882119894) minus 119876119899(0119882119894) (8)


120595119899119868119875119879119882

=1

119899

119899

sum119894=1

2119860119894minus 1

119866119899(119860119894| 119882119894)119884119894 (9)






0of the data dis-


0= 119866(119875




119860 given119882








1198761198641198750


0is then defined as









0) equal to


120595119899minus 1205950=1

119899

119899

sum119894=1

119863lowast(1198750) (119874119894) + 119900119875(1

radic119899) (10)



120598=0as an


1 ℎ2⟩ =





is given by

119863lowast (1198750) (119874) =

2119860 minus 1

1198660(119860119882)

(119884 minus 1198760(119860119882))

+1198760(1119882) minus 119876

0(0119882) minus Ψ (119876

0)

(11)



0= 1198750IC(1198750)2 of



119894=1IC119899(119874119894)2 where


0) Efficiency








119899(119874119894) of119863lowast(119875

0)(119874119894)




119899) =

(1119899)sum119899

119894=1(119876119899(1119882119894) minus 119876119899(0119882119894))


0through 119876






119899) beyond 119876











0) minus Ψ (119875) + 119877 (119875 119875

0) (12)

where 119877(119875 1198750) = 119900(119875 minus 119875




0) and


119899 119866119899) = 0 targets 119876

119899to fit

Ψ(1198760)


1(119875 1198750) minus

1198770(119875 1198750) where 119877

119886(119875 1198750) = int

119908((119866 minus 119866

0)(119886 | 119882)119866(119886 |

119882))(119876 minus1198760)(119886119882)119889119875



1198660)(119876 minus 119876


either 119866 = 1198660or 119876 = 119876




0so that

119877( 1198750) = 119900


0) asymp




119899119863lowast(119876

119899 119866119899) = 0 with (12) at 119875 =

(119876119899 119866119899) yields

Ψ (119876119899) minus Ψ (119876

0) = (119875

119899minus 1198750)119863lowast(119876119899 119866119899) + 119877119899 (13)








119899 119892119899) = (119875

119899minus1198750)119863lowast(119876

0 1198920) + 119900119875(1radic119899) if

1198750119863lowast(119876

119899 119866119899)minus119863lowast(119876






119899= 119900119875(1radic119899) and


120595119899minus 1205950=1

119899

119899

sum119894=1

119863lowast(1198760 1198660) (119874119894) + 119900119875(1

radic119899) (14)







conditions









0













0 say 119876

0= 119876(119875

0)






1198750(119876) = 119864






119899119866119899(120598) 120598 with Euclidean


119889

119889120598119871 (119876119899119866119899

(120598))10038161003816100381610038161003816100381610038161003816120598=0

(15)




119882) so




119889

119889120598119871 (119876119866(120598)) (119874)

10038161003816100381610038161003816100381610038161003816120598=0

= 119863lowast119884(119876 119866) =

2119860 minus 1

119866 (119860119882)(119884 minus 119876 (119860119882))

(16)


119882(119876 119876119882))119876119882


119882(119882) where


119889

119889120598119871 (119876119882(120598))10038161003816100381610038161003816100381610038161003816120598=0

= 119863lowast

119882(119876) (17)


then119889

119889120598119871 (119876119866(120598) 119876

119882(120598))

10038161003816100381610038161003816100381610038161003816120598=0= 119863lowast(119876 119866) (18)





119884(119876 119866) at (119876 119866) as




1for the path


119882 In



119899119863lowast119884(119876lowast

119899 119866119899) = 0 and thus in




not update 119876119882119899



119899119871(119876119899119866119899(120598)) along this




119899asymp 0


0



119899solves the


119899 119866119899)(119874119894) = 0


119899) is asymp-



120598119875119899119871

(1198760

119899119866119899(120598)) while 120598

2119899= arg min119875

119899119871(119876119882119899(1205982)) equals zero



119899= 1198761

119899


119882119899= 119876119882119899


1 119882

119899 The TMLE of 120595

0is the













1198750119892lowast (119874) =

119870+1

prod119896=0

1198750119871(119896)|119871(119896minus1)119860(119896minus1)

(119871 (119896) | 119871 (119896 minus 1) 119860 (119896 minus 1))

119870

prod119896=0

119892lowast

119896(119860 (119896) | 119860 (119896 minus 1) 119871 (119896))

(19)







119875119864119875(119884 | 119860 = 1



119875(119884 | 119860119882) minus 119864

119875(119884 | 119860 = 0119882)

















119895in the fit of 119866



119895 but in



119895 We




119899[2 44ndash




119899and 119866

119899are not


119899






119899[58

68]


119899








119899might












119899








0119884119886 119886) where 119864

0119884119886= 11986401198640(119884 | 119860 =




















0and 119866







































Acknowledgments


References






































































































Volume 2014




Journal of











Journal of


Function Spaces






Algebra











119899) is defined


119899=1minus

1198750) where 119875


observation 119874


0[13] but it








0= 119876(119875


0


119899) where119876


0that is




where1198761198820


1198760(119860119882) = 119864



1205950= Ψ (119876

0) = 1198641198761198820

1198760(1119882) minus 119876

0(0119882) (7)







for1198761198820


regression 1198760

120595119899= Ψ (119876

119882119899 119876119899) =

1

119899

119899

sum119894=1

119876119899(1119882119894) minus 119876119899(0119882119894) (8)


120595119899119868119875119879119882

=1

119899

119899

sum119894=1

2119860119894minus 1

119866119899(119860119894| 119882119894)119884119894 (9)






0of the data dis-


0= 119866(119875




119860 given119882








1198761198641198750


0is then defined as









0) equal to


120595119899minus 1205950=1

119899

119899

sum119894=1

119863lowast(1198750) (119874119894) + 119900119875(1

radic119899) (10)



120598=0as an


1 ℎ2⟩ =





is given by

119863lowast (1198750) (119874) =

2119860 minus 1

1198660(119860119882)

(119884 minus 1198760(119860119882))

+1198760(1119882) minus 119876

0(0119882) minus Ψ (119876

0)

(11)



0= 1198750IC(1198750)2 of



119894=1IC119899(119874119894)2 where


0) Efficiency








119899(119874119894) of119863lowast(119875

0)(119874119894)




119899) =

(1119899)sum119899

119894=1(119876119899(1119882119894) minus 119876119899(0119882119894))


0through 119876






119899) beyond 119876











0) minus Ψ (119875) + 119877 (119875 119875

0) (12)

where 119877(119875 1198750) = 119900(119875 minus 119875




0) and


119899 119866119899) = 0 targets 119876

119899to fit

Ψ(1198760)


1(119875 1198750) minus

1198770(119875 1198750) where 119877

119886(119875 1198750) = int

119908((119866 minus 119866

0)(119886 | 119882)119866(119886 |

119882))(119876 minus1198760)(119886119882)119889119875



1198660)(119876 minus 119876


either 119866 = 1198660or 119876 = 119876




0so that

119877( 1198750) = 119900


0) asymp




119899119863lowast(119876

119899 119866119899) = 0 with (12) at 119875 =

(119876119899 119866119899) yields

Ψ (119876119899) minus Ψ (119876

0) = (119875

119899minus 1198750)119863lowast(119876119899 119866119899) + 119877119899 (13)








119899 119892119899) = (119875

119899minus1198750)119863lowast(119876

0 1198920) + 119900119875(1radic119899) if

1198750119863lowast(119876

119899 119866119899)minus119863lowast(119876






119899= 119900119875(1radic119899) and


120595119899minus 1205950=1

119899

119899

sum119894=1

119863lowast(1198760 1198660) (119874119894) + 119900119875(1

radic119899) (14)







conditions









0













0 say 119876

0= 119876(119875

0)






1198750(119876) = 119864






119899119866119899(120598) 120598 with Euclidean


119889

119889120598119871 (119876119899119866119899

(120598))10038161003816100381610038161003816100381610038161003816120598=0

(15)




119882) so




119889

119889120598119871 (119876119866(120598)) (119874)

10038161003816100381610038161003816100381610038161003816120598=0

= 119863lowast119884(119876 119866) =

2119860 minus 1

119866 (119860119882)(119884 minus 119876 (119860119882))

(16)


119882(119876 119876119882))119876119882


119882(119882) where


119889

119889120598119871 (119876119882(120598))10038161003816100381610038161003816100381610038161003816120598=0

= 119863lowast

119882(119876) (17)


then119889

119889120598119871 (119876119866(120598) 119876

119882(120598))

10038161003816100381610038161003816100381610038161003816120598=0= 119863lowast(119876 119866) (18)





119884(119876 119866) at (119876 119866) as




1for the path


119882 In



119899119863lowast119884(119876lowast

119899 119866119899) = 0 and thus in




not update 119876119882119899



119899119871(119876119899119866119899(120598)) along this




119899asymp 0


0



119899solves the


119899 119866119899)(119874119894) = 0


119899) is asymp-



120598119875119899119871

(1198760

119899119866119899(120598)) while 120598

2119899= arg min119875

119899119871(119876119882119899(1205982)) equals zero



119899= 1198761

119899


119882119899= 119876119882119899


1 119882

119899 The TMLE of 120595

0is the













1198750119892lowast (119874) =

119870+1

prod119896=0

1198750119871(119896)|119871(119896minus1)119860(119896minus1)

(119871 (119896) | 119871 (119896 minus 1) 119860 (119896 minus 1))

119870

prod119896=0

119892lowast

119896(119860 (119896) | 119860 (119896 minus 1) 119871 (119896))

(19)







119875119864119875(119884 | 119860 = 1



119875(119884 | 119860119882) minus 119864

119875(119884 | 119860 = 0119882)

















119895in the fit of 119866



119895 but in



119895 We




119899[2 44ndash




119899and 119866

119899are not


119899






119899[58

68]


119899








119899might












119899








0119884119886 119886) where 119864

0119884119886= 11986401198640(119884 | 119860 =




















0and 119866







































Acknowledgments


References






































































































Volume 2014




Journal of











Journal of


Function Spaces






Algebra












1198761198641198750


0is then defined as









0) equal to


120595119899minus 1205950=1

119899

119899

sum119894=1

119863lowast(1198750) (119874119894) + 119900119875(1

radic119899) (10)



120598=0as an


1 ℎ2⟩ =





is given by

119863lowast (1198750) (119874) =

2119860 minus 1

1198660(119860119882)

(119884 minus 1198760(119860119882))

+1198760(1119882) minus 119876

0(0119882) minus Ψ (119876

0)

(11)



0= 1198750IC(1198750)2 of



119894=1IC119899(119874119894)2 where


0) Efficiency








119899(119874119894) of119863lowast(119875

0)(119874119894)




119899) =

(1119899)sum119899

119894=1(119876119899(1119882119894) minus 119876119899(0119882119894))


0through 119876






119899) beyond 119876











0) minus Ψ (119875) + 119877 (119875 119875

0) (12)

where 119877(119875 1198750) = 119900(119875 minus 119875




0) and


119899 119866119899) = 0 targets 119876

119899to fit

Ψ(1198760)


1(119875 1198750) minus

1198770(119875 1198750) where 119877

119886(119875 1198750) = int

119908((119866 minus 119866

0)(119886 | 119882)119866(119886 |

119882))(119876 minus1198760)(119886119882)119889119875



1198660)(119876 minus 119876


either 119866 = 1198660or 119876 = 119876




0so that

119877( 1198750) = 119900


0) asymp




119899119863lowast(119876

119899 119866119899) = 0 with (12) at 119875 =

(119876119899 119866119899) yields

Ψ (119876119899) minus Ψ (119876

0) = (119875

119899minus 1198750)119863lowast(119876119899 119866119899) + 119877119899 (13)








119899 119892119899) = (119875

119899minus1198750)119863lowast(119876

0 1198920) + 119900119875(1radic119899) if

1198750119863lowast(119876

119899 119866119899)minus119863lowast(119876






119899= 119900119875(1radic119899) and


120595119899minus 1205950=1

119899

119899

sum119894=1

119863lowast(1198760 1198660) (119874119894) + 119900119875(1

radic119899) (14)







conditions









0













0 say 119876

0= 119876(119875

0)






1198750(119876) = 119864






119899119866119899(120598) 120598 with Euclidean


119889

119889120598119871 (119876119899119866119899

(120598))10038161003816100381610038161003816100381610038161003816120598=0

(15)




119882) so




119889

119889120598119871 (119876119866(120598)) (119874)

10038161003816100381610038161003816100381610038161003816120598=0

= 119863lowast119884(119876 119866) =

2119860 minus 1

119866 (119860119882)(119884 minus 119876 (119860119882))

(16)


119882(119876 119876119882))119876119882


119882(119882) where


119889

119889120598119871 (119876119882(120598))10038161003816100381610038161003816100381610038161003816120598=0

= 119863lowast

119882(119876) (17)


then119889

119889120598119871 (119876119866(120598) 119876

119882(120598))

10038161003816100381610038161003816100381610038161003816120598=0= 119863lowast(119876 119866) (18)





119884(119876 119866) at (119876 119866) as




1for the path


119882 In



119899119863lowast119884(119876lowast

119899 119866119899) = 0 and thus in




not update 119876119882119899



119899119871(119876119899119866119899(120598)) along this




119899asymp 0


0



119899solves the


119899 119866119899)(119874119894) = 0


119899) is asymp-



120598119875119899119871

(1198760

119899119866119899(120598)) while 120598

2119899= arg min119875

119899119871(119876119882119899(1205982)) equals zero



119899= 1198761

119899


119882119899= 119876119882119899


1 119882

119899 The TMLE of 120595

0is the













1198750119892lowast (119874) =

119870+1

prod119896=0

1198750119871(119896)|119871(119896minus1)119860(119896minus1)

(119871 (119896) | 119871 (119896 minus 1) 119860 (119896 minus 1))

119870

prod119896=0

119892lowast

119896(119860 (119896) | 119860 (119896 minus 1) 119871 (119896))

(19)







119875119864119875(119884 | 119860 = 1



119875(119884 | 119860119882) minus 119864

119875(119884 | 119860 = 0119882)

















119895in the fit of 119866



119895 but in



119895 We




119899[2 44ndash




119899and 119866

119899are not


119899






119899[58

68]


119899








119899might












119899








0119884119886 119886) where 119864

0119884119886= 11986401198640(119884 | 119860 =




















0and 119866







































Acknowledgments


References






































































































Volume 2014




Journal of











Journal of


Function Spaces






Algebra











119899(119874119894) of119863lowast(119875

0)(119874119894)




119899) =

(1119899)sum119899

119894=1(119876119899(1119882119894) minus 119876119899(0119882119894))


0through 119876






119899) beyond 119876











0) minus Ψ (119875) + 119877 (119875 119875

0) (12)

where 119877(119875 1198750) = 119900(119875 minus 119875




0) and


119899 119866119899) = 0 targets 119876

119899to fit

Ψ(1198760)


1(119875 1198750) minus

1198770(119875 1198750) where 119877

119886(119875 1198750) = int

119908((119866 minus 119866

0)(119886 | 119882)119866(119886 |

119882))(119876 minus1198760)(119886119882)119889119875



1198660)(119876 minus 119876


either 119866 = 1198660or 119876 = 119876




0so that

119877( 1198750) = 119900


0) asymp




119899119863lowast(119876

119899 119866119899) = 0 with (12) at 119875 =

(119876119899 119866119899) yields

Ψ (119876119899) minus Ψ (119876

0) = (119875

119899minus 1198750)119863lowast(119876119899 119866119899) + 119877119899 (13)








119899 119892119899) = (119875

119899minus1198750)119863lowast(119876

0 1198920) + 119900119875(1radic119899) if

1198750119863lowast(119876

119899 119866119899)minus119863lowast(119876






119899= 119900119875(1radic119899) and


120595119899minus 1205950=1

119899

119899

sum119894=1

119863lowast(1198760 1198660) (119874119894) + 119900119875(1

radic119899) (14)







conditions









0













0 say 119876

0= 119876(119875

0)






1198750(119876) = 119864






119899119866119899(120598) 120598 with Euclidean


119889

119889120598119871 (119876119899119866119899

(120598))10038161003816100381610038161003816100381610038161003816120598=0

(15)




119882) so




119889

119889120598119871 (119876119866(120598)) (119874)

10038161003816100381610038161003816100381610038161003816120598=0

= 119863lowast119884(119876 119866) =

2119860 minus 1

119866 (119860119882)(119884 minus 119876 (119860119882))

(16)


119882(119876 119876119882))119876119882


119882(119882) where


119889

119889120598119871 (119876119882(120598))10038161003816100381610038161003816100381610038161003816120598=0

= 119863lowast

119882(119876) (17)


then119889

119889120598119871 (119876119866(120598) 119876

119882(120598))

10038161003816100381610038161003816100381610038161003816120598=0= 119863lowast(119876 119866) (18)





119884(119876 119866) at (119876 119866) as




1for the path


119882 In



119899119863lowast119884(119876lowast

119899 119866119899) = 0 and thus in




not update 119876119882119899



119899119871(119876119899119866119899(120598)) along this




119899asymp 0


0



119899solves the


119899 119866119899)(119874119894) = 0


119899) is asymp-



120598119875119899119871

(1198760

119899119866119899(120598)) while 120598

2119899= arg min119875

119899119871(119876119882119899(1205982)) equals zero



119899= 1198761

119899


119882119899= 119876119882119899


1 119882

119899 The TMLE of 120595

0is the













1198750119892lowast (119874) =

119870+1

prod119896=0

1198750119871(119896)|119871(119896minus1)119860(119896minus1)

(119871 (119896) | 119871 (119896 minus 1) 119860 (119896 minus 1))

119870

prod119896=0

119892lowast

119896(119860 (119896) | 119860 (119896 minus 1) 119871 (119896))

(19)







119875119864119875(119884 | 119860 = 1



119875(119884 | 119860119882) minus 119864

119875(119884 | 119860 = 0119882)

















119895in the fit of 119866



119895 but in



119895 We




119899[2 44ndash




119899and 119866

119899are not


119899






119899[58

68]


119899








119899might












119899








0119884119886 119886) where 119864

0119884119886= 11986401198640(119884 | 119860 =




















0and 119866







































Acknowledgments


References






































































































Volume 2014




Journal of











Journal of


Function Spaces






Algebra














0













0 say 119876

0= 119876(119875

0)






1198750(119876) = 119864






119899119866119899(120598) 120598 with Euclidean


119889

119889120598119871 (119876119899119866119899

(120598))10038161003816100381610038161003816100381610038161003816120598=0

(15)




119882) so




119889

119889120598119871 (119876119866(120598)) (119874)

10038161003816100381610038161003816100381610038161003816120598=0

= 119863lowast119884(119876 119866) =

2119860 minus 1

119866 (119860119882)(119884 minus 119876 (119860119882))

(16)


119882(119876 119876119882))119876119882


119882(119882) where


119889

119889120598119871 (119876119882(120598))10038161003816100381610038161003816100381610038161003816120598=0

= 119863lowast

119882(119876) (17)


then119889

119889120598119871 (119876119866(120598) 119876

119882(120598))

10038161003816100381610038161003816100381610038161003816120598=0= 119863lowast(119876 119866) (18)





119884(119876 119866) at (119876 119866) as




1for the path


119882 In



119899119863lowast119884(119876lowast

119899 119866119899) = 0 and thus in




not update 119876119882119899



119899119871(119876119899119866119899(120598)) along this




119899asymp 0


0



119899solves the


119899 119866119899)(119874119894) = 0


119899) is asymp-



120598119875119899119871

(1198760

119899119866119899(120598)) while 120598

2119899= arg min119875

119899119871(119876119882119899(1205982)) equals zero



119899= 1198761

119899


119882119899= 119876119882119899


1 119882

119899 The TMLE of 120595

0is the













1198750119892lowast (119874) =

119870+1

prod119896=0

1198750119871(119896)|119871(119896minus1)119860(119896minus1)

(119871 (119896) | 119871 (119896 minus 1) 119860 (119896 minus 1))

119870

prod119896=0

119892lowast

119896(119860 (119896) | 119860 (119896 minus 1) 119871 (119896))

(19)







119875119864119875(119884 | 119860 = 1



119875(119884 | 119860119882) minus 119864

119875(119884 | 119860 = 0119882)

















119895in the fit of 119866



119895 but in



119895 We




119899[2 44ndash




119899and 119866

119899are not


119899






119899[58

68]


119899








119899might












119899








0119884119886 119886) where 119864

0119884119886= 11986401198640(119884 | 119860 =




















0and 119866







































Acknowledgments


References






































































































Volume 2014




Journal of











Journal of


Function Spaces






Algebra













119884(119876 119866) at (119876 119866) as




1for the path


119882 In



119899119863lowast119884(119876lowast

119899 119866119899) = 0 and thus in




not update 119876119882119899



119899119871(119876119899119866119899(120598)) along this




119899asymp 0


0



119899solves the


119899 119866119899)(119874119894) = 0


119899) is asymp-



120598119875119899119871

(1198760

119899119866119899(120598)) while 120598

2119899= arg min119875

119899119871(119876119882119899(1205982)) equals zero



119899= 1198761

119899


119882119899= 119876119882119899


1 119882

119899 The TMLE of 120595

0is the













1198750119892lowast (119874) =

119870+1

prod119896=0

1198750119871(119896)|119871(119896minus1)119860(119896minus1)

(119871 (119896) | 119871 (119896 minus 1) 119860 (119896 minus 1))

119870

prod119896=0

119892lowast

119896(119860 (119896) | 119860 (119896 minus 1) 119871 (119896))

(19)







119875119864119875(119884 | 119860 = 1



119875(119884 | 119860119882) minus 119864

119875(119884 | 119860 = 0119882)

















119895in the fit of 119866



119895 but in



119895 We




119899[2 44ndash




119899and 119866

119899are not


119899






119899[58

68]


119899








119899might












119899








0119884119886 119886) where 119864

0119884119886= 11986401198640(119884 | 119860 =




















0and 119866







































Acknowledgments


References






































































































Volume 2014




Journal of











Journal of


Function Spaces






Algebra











119875119864119875(119884 | 119860 = 1



119875(119884 | 119860119882) minus 119864

119875(119884 | 119860 = 0119882)

















119895in the fit of 119866



119895 but in



119895 We




119899[2 44ndash




119899and 119866

119899are not


119899






119899[58

68]


119899








119899might












119899








0119884119886 119886) where 119864

0119884119886= 11986401198640(119884 | 119860 =




















0and 119866







































Acknowledgments


References






































































































Volume 2014




Journal of











Journal of


Function Spaces






Algebra











119899and 119866

119899are not


119899






119899[58

68]


119899








119899might












119899








0119884119886 119886) where 119864

0119884119886= 11986401198640(119884 | 119860 =




















0and 119866







































Acknowledgments


References






































































































Volume 2014




Journal of











Journal of


Function Spaces






Algebra
























0and 119866







































Acknowledgments


References






































































































Volume 2014




Journal of











Journal of


Function Spaces






Algebra














0and 119866







































Acknowledgments


References






































































































Volume 2014




Journal of











Journal of


Function Spaces






Algebra








































Acknowledgments


References






































































































Volume 2014




Journal of











Journal of


Function Spaces






Algebra

































Acknowledgments


References






































































































Volume 2014




Journal of











Journal of


Function Spaces






Algebra

























Acknowledgments


References






































































































Volume 2014




Journal of











Journal of


Function Spaces






Algebra













Acknowledgments


References






































































































Volume 2014




Journal of











Journal of


Function Spaces






Algebra



















































































Volume 2014




Journal of











Journal of


Function Spaces






Algebra


















































Volume 2014




Journal of











Journal of


Function Spaces






Algebra
















Volume 2014




Journal of











Journal of


Function Spaces






Algebra









Date post:	14-Jul-2020
Category:	Documents
Upload:	others
View:	2 times
Download:	0 times

Review Article Entering the Era of Data Science:...

Documents